Commit graph

396 commits

Author SHA1 Message Date
FAROVITUS
2b92eefa41 import G965FXXU7DTAA OSRC
*First release for Android (Q).

Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-02-04 13:50:09 +02:00
Greg Kroah-Hartman
cdbe07ad26 This is the 4.9.55 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlnfOyYACgkQONu9yGCS
 aT4WIhAAp2/bzmMGPUIQNwxCoeHWQUA8SIPPtjFSO3XKfFeKB8CFcxo4LUTlfEah
 OJmTcg2ajSwWmHcE9LUq42cW6AqA77I3NFkUrB2QMPzL6AirTVF1qrCbqHfvslb4
 0ur/jIW8ElAoaT6hzV7BCszlAfpMS+WLjafdAYRdZjRX9GfJp77pxg79aYyPz2P7
 RC4MNdQ7nhK7eeEqmUHV5CDs+aywU/H40ylZwmkk3detijVxZqEYPItrXUxFs7Ms
 fMJv1u/lfwmHC9ljUdZzHsKBGkNMN+lZeeNW7dOCsDRjuo/XZpuqLE8TDX7n5qcu
 E9+p28xXWq7o0fTsX+wEkkm70CPWCHg/7dMPeHBGtwBJWBWk01gvQhk2f2N20U+o
 eqSTQeuo+UT3J8/T5jDR0DmyYO1YZ4euGA6uczpEDMFK5y+qeeL/zLC/5OmETa7W
 0GgtyQDkEAFLoHZhgAFYtKIgs30XuY3vkc9DNu6U+G4GbbpLhqmWb6C+t71osE96
 2y5t+gJxi8nA91y33gWeIWF1pVit2jqFYA9/P5Ful48CAYykQ2sp1H8R2IwNewKP
 s+beYNQfVS6grz1tkKA282Fqxnd+OUNnrf1k1f7vsUIYq2wbjbqduEbEQ6nhpg0Z
 xa57uZO72RUH0Psd2iT4ylGKLklQUMuRRB6V50by8HgFemsjf74=
 =rBh3
 -----END PGP SIGNATURE-----

Merge 4.9.55 into android-4.9

Changes in 4.9.55
	USB: gadgetfs: Fix crash caused by inadequate synchronization
	USB: gadgetfs: fix copy_to_user while holding spinlock
	usb: gadget: udc: atmel: set vbus irqflags explicitly
	usb: gadget: udc: renesas_usb3: fix for no-data control transfer
	usb: gadget: udc: renesas_usb3: fix Pn_RAMMAP.Pn_MPKT value
	usb: gadget: udc: renesas_usb3: Fix return value of usb3_write_pipe()
	usb-storage: unusual_devs entry to fix write-access regression for Seagate external drives
	usb-storage: fix bogus hardware error messages for ATA pass-thru devices
	usb: renesas_usbhs: fix the BCLR setting condition for non-DCP pipe
	usb: renesas_usbhs: fix usbhsf_fifo_clear() for RX direction
	ALSA: usb-audio: Check out-of-bounds access by corrupted buffer descriptor
	usb: pci-quirks.c: Corrected timeout values used in handshake
	USB: cdc-wdm: ignore -EPIPE from GetEncapsulatedResponse
	USB: dummy-hcd: fix connection failures (wrong speed)
	USB: dummy-hcd: fix infinite-loop resubmission bug
	USB: dummy-hcd: Fix erroneous synchronization change
	USB: devio: Don't corrupt user memory
	usb: gadget: mass_storage: set msg_registered after msg registered
	USB: g_mass_storage: Fix deadlock when driver is unbound
	USB: uas: fix bug in handling of alternate settings
	USB: core: harden cdc_parse_cdc_header
	usb: Increase quirk delay for USB devices
	USB: fix out-of-bounds in usb_set_configuration
	xhci: fix finding correct bus_state structure for USB 3.1 hosts
	xhci: Fix sleeping with spin_lock_irq() held in ASmedia 1042A workaround
	xhci: set missing SuperSpeedPlus Link Protocol bit in roothub descriptor
	Revert "xhci: Limit USB2 port wake support for AMD Promontory hosts"
	iio: adc: twl4030: Fix an error handling path in 'twl4030_madc_probe()'
	iio: adc: twl4030: Disable the vusb3v1 rugulator in the error handling path of 'twl4030_madc_probe()'
	iio: ad_sigma_delta: Implement a dedicated reset function
	staging: iio: ad7192: Fix - use the dedicated reset function avoiding dma from stack.
	iio: core: Return error for failed read_reg
	IIO: BME280: Updates to Humidity readings need ctrl_reg write!
	iio: ad7793: Fix the serial interface reset
	iio: adc: mcp320x: Fix readout of negative voltages
	iio: adc: mcp320x: Fix oops on module unload
	uwb: properly check kthread_run return value
	uwb: ensure that endpoint is interrupt
	staging: vchiq_2835_arm: Fix NULL ptr dereference in free_pagelist
	mm, oom_reaper: skip mm structs with mmu notifiers
	lib/ratelimit.c: use deferred printk() version
	lsm: fix smack_inode_removexattr and xattr_getsecurity memleak
	ALSA: compress: Remove unused variable
	Revert "ALSA: echoaudio: purge contradictions between dimension matrix members and total number of members"
	ALSA: usx2y: Suppress kernel warning at page allocation failures
	mlxsw: spectrum: Prevent mirred-related crash on removal
	net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
	sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
	tcp: update skb->skb_mstamp more carefully
	bpf/verifier: reject BPF_ALU64|BPF_END
	tcp: fix data delivery rate
	udpv6: Fix the checksum computation when HW checksum does not apply
	ip6_gre: skb_push ipv6hdr before packing the header in ip6gre_header
	net: phy: Fix mask value write on gmii2rgmii converter speed register
	ip6_tunnel: do not allow loading ip6_tunnel if ipv6 is disabled in cmdline
	net/sched: cls_matchall: fix crash when used with classful qdisc
	tcp: fastopen: fix on syn-data transmit failure
	net: emac: Fix napi poll list corruption
	packet: hold bind lock when rebinding to fanout hook
	bpf: one perf event close won't free bpf program attached by another perf event
	isdn/i4l: fetch the ppp_write buffer in one shot
	net_sched: always reset qdisc backlog in qdisc_reset()
	net: qcom/emac: specify the correct size when mapping a DMA buffer
	vti: fix use after free in vti_tunnel_xmit/vti6_tnl_xmit
	l2tp: Avoid schedule while atomic in exit_net
	l2tp: fix race condition in l2tp_tunnel_delete
	tun: bail out from tun_get_user() if the skb is empty
	net: dsa: Fix network device registration order
	packet: in packet_do_bind, test fanout with bind_lock held
	packet: only test po->has_vnet_hdr once in packet_snd
	net: Set sk_prot_creator when cloning sockets to the right proto
	netlink: do not proceed if dump's start() errs
	ip6_gre: ip6gre_tap device should keep dst
	ip6_tunnel: update mtu properly for ARPHRD_ETHER tunnel device in tx path
	tipc: use only positive error codes in messages
	net: rtnetlink: fix info leak in RTM_GETSTATS call
	socket, bpf: fix possible use after free
	powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
	powerpc/tm: Fix illegal TM state in signal handler
	percpu: make this_cpu_generic_read() atomic w.r.t. interrupts
	driver core: platform: Don't read past the end of "driver_override" buffer
	Drivers: hv: fcopy: restore correct transfer length
	stm class: Fix a use-after-free
	ftrace: Fix kmemleak in unregister_ftrace_graph
	HID: i2c-hid: allocate hid buffers for real worst case
	HID: wacom: leds: Don't try to control the EKR's read-only LEDs
	HID: wacom: Always increment hdev refcount within wacom_get_hdev_data
	HID: wacom: bits shifted too much for 9th and 10th buttons
	rocker: fix rocker_tlv_put_* functions for KASAN
	netlink: fix nla_put_{u8,u16,u32} for KASAN
	iwlwifi: mvm: use IWL_HCMD_NOCOPY for MCAST_FILTER_CMD
	iwlwifi: add workaround to disable wide channels in 5GHz
	scsi: sd: Do not override max_sectors_kb sysfs setting
	brcmfmac: add length check in brcmf_cfg80211_escan_handler()
	brcmfmac: setup passive scan if requested by user-space
	drm/i915/bios: ignore HDMI on port A
	nvme-pci: Use PCI bus address for data/queues in CMB
	mmc: core: add driver strength selection when selecting hs400es
	sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
	vfs: deny copy_file_range() for non regular files
	ext4: fix data corruption for mmap writes
	ext4: Don't clear SGID when inheriting ACLs
	ext4: don't allow encrypted operations without keys
	f2fs: don't allow encrypted operations without keys
	KVM: x86: fix singlestepping over syscall
	Linux 4.9.55

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-10-12 22:31:24 +02:00
Peter Zijlstra
ba15518c26 sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream.

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308e ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12 11:51:25 +02:00
Greg Kroah-Hartman
85e1c0178a This is the 4.9.48 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmw6M8ACgkQONu9yGCS
 aT4HnBAAukEkkEFHfzSuqHXfIpyUn0amfvAqXk5VPXmCibi2NUlOPpYRYFSC2mUo
 l1+ksacxvNhaRdMYOyde7B8pklcRPRhhf1klLYMZKn+FtY6cjnh6Ft37AmcOuh4r
 mCufTKTi4Lhhy2sUDXg0FjdpgwA0p4yTT0OwnEoUXcquawGNHYooEA9wu2VMMMSg
 5ytUTls04wR7TVszZ0XHO+5tiGcjIxjPFkhE7dLUcGA7y/NUyv8t7fFH+RfGZ+Zq
 KVMG/5unwMNXK9FMot1NO6/C9/geXjb8mx/6RLGm5+97Wwe9Iw+sGytlF86+nu+N
 4Cjq3bOWf092eGF63WC0OE5w5KsDw4gnzdSsjAcOfLmtPgvsE8jOYM7JULFNJ69M
 a7U6DGpc3YoeHHsWPl3A9YzV4NdeJTod87HRrPFgHNR/2gR+/jf+twZyQeAYEeps
 370um9BxZLp6e9mPVTmhXiIO4712uAEkA8ZaZq7qnhH51BE4u72/KQxKgMVYkaVI
 qTi4fVx1GbEg1J0N78aXMQrVcFmDkkM3aPZqT4YB9BIjz5Sy9A0HXT4paP6TpG0j
 xX6srdYLabOZxT439ZkW1kCjDMC5ogOGSFQQoSI5sWZtKtE1G9fb8dH5SN4I29ou
 SMuq5gHkU3pJYZOCD3T9W2AelNFIacdeyhwflJCctogNTuYdDXI=
 =AiX1
 -----END PGP SIGNATURE-----

Merge 4.9.48 into android-4.9

Changes in 4.9.48
	irqchip: mips-gic: SYNC after enabling GIC region
	i2c: ismt: Don't duplicate the receive length for block reads
	i2c: ismt: Return EMSGSIZE for block reads with bogus length
	crypto: algif_skcipher - only call put_page on referenced and used pages
	mm, uprobes: fix multiple free of ->uprobes_state.xol_area
	mm, madvise: ensure poisoned pages are removed from per-cpu lists
	ceph: fix readpage from fscache
	cpumask: fix spurious cpumask_of_node() on non-NUMA multi-node configs
	cpuset: Fix incorrect memory_pressure control file mapping
	alpha: uapi: Add support for __SANE_USERSPACE_TYPES__
	CIFS: Fix maximum SMB2 header size
	CIFS: remove endian related sparse warning
	wl1251: add a missing spin_lock_init()
	lib/mpi: kunmap after finishing accessing buffer
	xfrm: policy: check policy direction value
	drm/ttm: Fix accounting error when fail to get pages for pool
	kvm: arm/arm64: Force reading uncached stage2 PGD
	epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
	Linux 4.9.48

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-09-07 10:32:23 +02:00
Waiman Long
309e4dbfaf cpuset: Fix incorrect memory_pressure control file mapping
commit 1c08c22c874ac88799cab1f78c40f46110274915 upstream.

The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7dbdb199d3 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-07 08:35:40 +02:00
Greg Kroah-Hartman
02f29ab1b9 This is the 4.9.42 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmN0iUACgkQONu9yGCS
 aT6gZRAAhYbpsz2XRFSQ5H/Sk8xwJuwtLB2at3Y1CLCb+lLhlRsV3l4pD+4KRoEl
 fU01l5s5ZalZJekGfEfEOQOfoJHsCxzFSzKzP06/GA5u6DbwtHUE2SjNWe84j6Ct
 Hx0jN90yj7S8vy2umROux+fVvZQ4Xay4TDCWhBeXgOFXevwC/G9D2LWE2NIYwbDH
 Ighahrhs21FZc9wbah0L04bRBAR7+ALLq1sO8ebKwl8eFzAkcEwI/yS48cnjGlgW
 9HW5MmY1BYTnRCrXaw5L0Vf5zH6obT7amrLNljNYN6vN62DRoOfwQh4QcblnIAoi
 L+HdZilifZ970RwQ2As3vy63/Kk3b207ht4mriTCyGXM9MY6bRovYv1wDAUlv7aD
 GlA8Q7xwsiJ4sG4i5LABjly+QeWymZ2b0kVWYpneJuBuj/gWVDhh1lfT+nOCAcJ6
 ROUY6d64ghKPBomkqlMSC+7sH7QKa0/W9WDQCLxtnmjcAkeElpGNGu/m/Thhvi2I
 NDq2sbMAeGJquXBXIN8W4NPy0puOn0wjqFI7LE61ujSiAxT8973uDtNlrmB/eCAf
 zD9yJsKELS20PKToren4hYYuRM2XlKh9gVIOWB2pShfzvSO7807ZyVcoI8/bVgZe
 I2BH6Dt3t+qqWR7B5/qvxxmNCv3HNMpNUzy/z+fXEf8/U3zqiTM=
 =n7pX
 -----END PGP SIGNATURE-----

Merge 4.9.42 into android-4.9

Changes in 4.9.42
	parisc: Handle vma's whose context is not current in flush_cache_range
	cgroup: create dfl_root files on subsys registration
	cgroup: fix error return value from cgroup_subtree_control()
	libata: array underflow in ata_find_dev()
	workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
	iwlwifi: dvm: prevent an out of bounds access
	brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice
	NFSv4: Fix EXCHANGE_ID corrupt verifier issue
	mmc: sdhci-of-at91: force card detect value for non removable devices
	device property: Make dev_fwnode() public
	mmc: core: Fix access to HS400-ES devices
	mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
	cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
	ALSA: hda - Fix speaker output from VAIO VPCL14M1R
	drm/amdgpu: Fix undue fallthroughs in golden registers initialization
	ASoC: do not close shared backend dailink
	KVM: async_pf: make rcu irq exit if not triggered from idle task
	mm/page_alloc: Remove kernel address exposure in free_reserved_area()
	timers: Fix overflow in get_next_timer_interrupt
	powerpc/tm: Fix saving of TM SPRs in core dump
	powerpc/64: Fix __check_irq_replay missing decrementer interrupt
	iommu/amd: Enable ga_log_intr when enabling guest_mode
	gpiolib: skip unwanted events, don't convert them to opposite edge
	ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
	ext4: fix overflow caused by missing cast in ext4_resize_fs()
	ARM: dts: armada-38x: Fix irq type for pca955
	ARM: dts: tango4: Request RGMII RX and TX clock delays
	media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl
	iscsi-target: Fix initial login PDU asynchronous socket close OOPs
	mmc: dw_mmc: Use device_property_read instead of of_property_read
	mmc: core: Use device_property_read instead of of_property_read
	media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds
	f2fs: sanity check checkpoint segno and blkoff
	Btrfs: fix early ENOSPC due to delalloc
	saa7164: fix double fetch PCIe access condition
	tcp_bbr: cut pacing rate only if filled pipe
	tcp_bbr: introduce bbr_bw_to_pacing_rate() helper
	tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helper
	tcp_bbr: remove sk_pacing_rate=0 transient during init
	tcp_bbr: init pacing rate on first RTT sample
	ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check()
	net: Zero terminate ifr_name in dev_ifname().
	ipv6: avoid overflow of offset in ip6_find_1stfragopt
	net: dsa: b53: Add missing ARL entries for BCM53125
	ipv4: initialize fib_trie prior to register_netdev_notifier call.
	rtnetlink: allocate more memory for dev_set_mac_address()
	mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled
	openvswitch: fix potential out of bound access in parse_ct
	packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
	ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()
	net: ethernet: nb8800: Handle all 4 RGMII modes identically
	dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly
	dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly
	dccp: fix a memleak for dccp_feat_init err process
	sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}()
	sctp: fix the check for _sctp_walk_params and _sctp_walk_errors
	net/mlx5: Consider tx_enabled in all modes on remap
	net/mlx5: Fix command bad flow on command entry allocation failure
	net/mlx5e: Fix outer_header_zero() check size
	net/mlx5e: Fix wrong delay calculation for overflow check scheduling
	net/mlx5e: Schedule overflow check work to mlx5e workqueue
	net: phy: Correctly process PHY_HALTED in phy_stop_machine()
	xen-netback: correctly schedule rate-limited queues
	sparc64: Measure receiver forward progress to avoid send mondo timeout
	sparc64: Fix exception handling in UltraSPARC-III memcpy.
	wext: handle NULL extra data in iwe_stream_add_point better
	sh_eth: fix EESIPR values for SH77{34|63}
	sh_eth: R8A7740 supports packet shecksumming
	net: phy: dp83867: fix irq generation
	tg3: Fix race condition in tg3_get_stats64().
	x86/boot: Add missing declaration of string functions
	spi: spi-axi: Free resources on error path
	ASoC: rt5645: set sel_i2s_pre_div1 to 2
	netfilter: use fwmark_reflect in nf_send_reset
	phy state machine: failsafe leave invalid RUNNING state
	ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
	clk/samsung: exynos542x: mark some clocks as critical
	scsi: qla2xxx: Get mutex lock before checking optrom_state
	drm/virtio: fix framebuffer sparse warning
	ARM: dts: sun8i: Support DTB build for NanoPi M1
	ARM: dts: sunxi: Change node name for pwrseq pin on Olinuxino-lime2-emmc
	iw_cxgb4: do not send RX_DATA_ACK CPLs after close/abort
	nbd: blk_mq_init_queue returns an error code on failure, not NULL
	virtio_blk: fix panic in initialization error path
	ARM: 8632/1: ftrace: fix syscall name matching
	mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
	lib/Kconfig.debug: fix frv build failure
	signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
	mm: don't dereference struct page fields of invalid pages
	net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
	ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
	net: account for current skb length when deciding about UFO
	net: phy: Fix PHY unbind crash
	workqueue: implicit ordered attribute should be overridable
	Linux 4.9.42

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-08-11 13:55:02 -07:00
Dima Zavin
45a636ec18 cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
commit 89affbf5d9ebb15c6460596822e8857ea2f9e735 upstream.

In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.

This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial.  The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.

The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch.  This branch can be
rewritted dynamically (via cpuset_inc) if a new cpuset is created.  The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites.  If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.

This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.

The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always return a valid seqcount if are enabling cpusets.  Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.

The relevant stack traces of the two stuck threads:

  CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
  RIP: smp_call_function_many+0x1f9/0x260
  Call Trace:
    smp_call_function+0x3b/0x70
    on_each_cpu+0x2f/0x90
    text_poke_bp+0x87/0xd0
    arch_jump_label_transform+0x93/0x100
    __jump_label_update+0x77/0x90
    jump_label_update+0xaa/0xc0
    static_key_slow_inc+0x9e/0xb0
    cpuset_css_online+0x70/0x2e0
    online_css+0x2c/0xa0
    cgroup_apply_control_enable+0x27f/0x3d0
    cgroup_mkdir+0x2b7/0x420
    kernfs_iop_mkdir+0x5a/0x80
    vfs_mkdir+0xf6/0x1a0
    SyS_mkdir+0xb7/0xe0
    entry_SYSCALL_64_fastpath+0x18/0xad

  ...

  CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
  Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
  task: ffff8818087c0000 task.stack: ffffc90000030000
  RIP: int3+0x39/0x70
  Call Trace:
    <#DB> ? ___slab_alloc+0x28b/0x5a0
    <EOE> ? copy_process.part.40+0xf7/0x1de0
    __slab_alloc.isra.80+0x54/0x90
    copy_process.part.40+0xf7/0x1de0
    copy_process.part.40+0xf7/0x1de0
    kmem_cache_alloc_node+0x8a/0x280
    copy_process.part.40+0xf7/0x1de0
    _do_fork+0xe7/0x6c0
    _raw_spin_unlock_irq+0x2d/0x60
    trace_hardirqs_on_caller+0x136/0x1d0
    entry_SYSCALL_64_fastpath+0x5/0xad
    do_syscall_64+0x27/0x350
    SyS_clone+0x19/0x20
    do_syscall_64+0x60/0x350
    entry_SYSCALL64_slow_path+0x25/0x25

Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11 08:49:29 -07:00
Greg Kroah-Hartman
da3493c028 This is the 4.9.32 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllBNNMACgkQONu9yGCS
 aT4rJA//XOtxp2nr01OqffSItCX0vq7vMofbjo7spI2LF3JyxKkx0hje0im82XoD
 cYhZHyeK9lCHRZt2O541xOp3ITconecRvlt6uq49zKNvlFAelnp0rRTOktluJK/n
 Xm7EP2KLHeHXv+rvCfzU0hDC9fvoZe82NIF7kddAWHX4D/K3C6Jw6FVc4/QPl2MG
 11R6pylPtMGObgPEmfDlohBitcC3KawXAhIcyTTAr3rcuO2Wm00H+uF0VOpyD9hZ
 S5A8ecuyZVQw3mIKAJ9vOwkVoln1E+/P/OltWEXElkBZONHpI4LB2IE7Nmwnx+pE
 5oO3afNZIYOejX+GQvK2Apc8VONDV2VZRoIstzl+uaKDNGFaSLIu8AQ/UQfmoqTE
 Pwrp/FxgDGjq6NOf/+MqNhiqS9433Xgmt4HT5xctzS0zFNGzLIokdjRH3x7n6cez
 2JvdtRvJfdNq31kh6xQE/8oXuZktRSw3GxC5K4Dw3NRh02/9KNAgEjNPEMNXziGw
 RhLhyXl8i5eGAu1gxruojzRyCvaMS6S8ZcAvSSoJze0kj8jfkZtQbBswvS1LF9EI
 FFemshd1sQxUeZiXGYPuUUfyDvxvs7KRIqAYux8lsIhSQTu16REqsZRrzidpDP+K
 xtH61YBWhwDHZKIoT+Jchmn8QCytI77MvFdGuTl+KRPV4OyyeJs=
 =NE4C
 -----END PGP SIGNATURE-----

Merge 4.9.32 into android-4.9

Changes in 4.9.32
	bnx2x: Fix Multi-Cos
	vxlan: eliminate cached dst leak
	ipv6: xfrm: Handle errors reported by xfrm6_find_1stfragopt()
	cxgb4: avoid enabling napi twice to the same queue
	tcp: disallow cwnd undo when switching congestion control
	vxlan: fix use-after-free on deletion
	ipv6: Fix leak in ipv6_gso_segment().
	net: ping: do not abuse udp_poll()
	net/ipv6: Fix CALIPSO causing GPF with datagram support
	net: ethoc: enable NAPI before poll may be scheduled
	net: stmmac: fix completely hung TX when using TSO
	net: bridge: start hello timer only if device is up
	sparc64: Add __multi3 for gcc 7.x and later.
	sparc64: mm: fix copy_tsb to correctly copy huge page TSBs
	sparc: Machine description indices can vary
	sparc64: reset mm cpumask after wrap
	sparc64: combine activate_mm and switch_mm
	sparc64: redefine first version
	sparc64: add per-cpu mm of secondary contexts
	sparc64: new context wrap
	sparc64: delete old wrap code
	arch/sparc: support NR_CPUS = 4096
	serial: ifx6x60: fix use-after-free on module unload
	ptrace: Properly initialize ptracer_cred on fork
	crypto: asymmetric_keys - handle EBUSY due to backlog correctly
	KEYS: fix dereferencing NULL payload with nonzero length
	KEYS: fix freeing uninitialized memory in key_update()
	KEYS: encrypted: avoid encrypting/decrypting stack buffers
	crypto: drbg - wait for crypto op not signal safe
	crypto: gcm - wait for crypto op not signal safe
	drm/amdgpu/ci: disable mclk switching for high refresh rates (v2)
	nfsd4: fix null dereference on replay
	nfsd: Fix up the "supattr_exclcreat" attributes
	efi: Don't issue error message when booted under Xen
	kvm: async_pf: fix rcu_irq_enter() with irqs enabled
	KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
	arm64: KVM: Preserve RES1 bits in SCTLR_EL2
	arm64: KVM: Allow unaligned accesses at EL2
	arm: KVM: Allow unaligned accesses at HYP
	KVM: async_pf: avoid async pf injection when in guest mode
	KVM: arm/arm64: vgic-v3: Do not use Active+Pending state for a HW interrupt
	KVM: arm/arm64: vgic-v2: Do not use Active+Pending state for a HW interrupt
	dmaengine: usb-dmac: Fix DMAOR AE bit definition
	dmaengine: ep93xx: Always start from BASE0
	dmaengine: ep93xx: Don't drain the transfers in terminate_all()
	dmaengine: mv_xor_v2: handle mv_xor_v2_prep_sw_desc() error properly
	dmaengine: mv_xor_v2: properly handle wrapping in the array of HW descriptors
	dmaengine: mv_xor_v2: do not use descriptors not acked by async_tx
	dmaengine: mv_xor_v2: enable XOR engine after its configuration
	dmaengine: mv_xor_v2: fix tx_submit() implementation
	dmaengine: mv_xor_v2: remove interrupt coalescing
	dmaengine: mv_xor_v2: set DMA mask to 40 bits
	cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
	xen/privcmd: Support correctly 64KB page granularity when mapping memory
	ext4: fix SEEK_HOLE
	ext4: keep existing extra fields when inode expands
	ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
	ext4: fix fdatasync(2) after extent manipulation operations
	drm: Fix oops + Xserver hang when unplugging USB drm devices
	usb: gadget: f_mass_storage: Serialize wake and sleep execution
	usb: chipidea: udc: fix NULL pointer dereference if udc_start failed
	usb: chipidea: debug: check before accessing ci_role
	staging/lustre/lov: remove set_fs() call from lov_getstripe()
	iio: adc: bcm_iproc_adc: swap primary and secondary isr handler's
	iio: light: ltr501 Fix interchanged als/ps register field
	iio: proximity: as3935: fix AS3935_INT mask
	iio: proximity: as3935: fix iio_trigger_poll issue
	mei: make sysfs modalias format similar as uevent modalias
	cpufreq: cpufreq_register_driver() should return -ENODEV if init fails
	target: Re-add check to reject control WRITEs with overflow data
	drm/msm: Expose our reservation object when exporting a dmabuf.
	ahci: Acer SA5-271 SSD Not Detected Fix
	cgroup: Prevent kill_css() from being called more than once
	Input: elantech - add Fujitsu Lifebook E546/E557 to force crc_enabled
	cpuset: consider dying css as offline
	fs: add i_blocksize()
	ufs: restore proper tail allocation
	fix ufs_isblockset()
	ufs: restore maintaining ->i_blocks
	ufs: set correct ->s_maxsize
	ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
	ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
	cxl: Fix error path on bad ioctl
	cxl: Avoid double free_irq() for psl,slice interrupts
	btrfs: use correct types for page indices in btrfs_page_exists_in_range
	btrfs: fix memory leak in update_space_info failure path
	KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
	scsi: qla2xxx: don't disable a not previously enabled PCI device
	scsi: qla2xxx: Modify T262 FW dump template to specify same start/end to debug customer issues
	scsi: qla2xxx: Set bit 15 for DIAG_ECHO_TEST MBC
	scsi: qla2xxx: Fix mailbox pointer error in fwdump capture
	powerpc/sysdev/simple_gpio: Fix oops in gpio save_regs function
	powerpc/numa: Fix percpu allocations to be NUMA aware
	powerpc/hotplug-mem: Fix missing endian conversion of aa_index
	powerpc/kernel: Fix FP and vector register restoration
	powerpc/kernel: Initialize load_tm on task creation
	perf/core: Drop kernel samples even though :u is specified
	drm/vmwgfx: Handle vmalloc() failure in vmw_local_fifo_reserve()
	drm/vmwgfx: limit the number of mip levels in vmw_gb_surface_define_ioctl()
	drm/vmwgfx: Make sure backup_handle is always valid
	drm/nouveau/tmr: fully separate alarm execution/pending lists
	ALSA: timer: Fix race between read and ioctl
	ALSA: timer: Fix missing queue indices reset at SNDRV_TIMER_IOCTL_SELECT
	ASoC: Fix use-after-free at card unregistration
	cpu/hotplug: Drop the device lock on error
	drivers: char: mem: Fix wraparound check to allow mappings up to the end
	serial: sh-sci: Fix panic when serial console and DMA are enabled
	arm64: traps: fix userspace cache maintenance emulation on a tagged pointer
	arm64: hw_breakpoint: fix watchpoint matching for tagged pointers
	arm64: entry: improve data abort handling of tagged pointers
	ARM: 8636/1: Cleanup sanity_check_meminfo
	ARM: 8637/1: Adjust memory boundaries after reservations
	usercopy: Adjust tests to deal with SMAP/PAN
	drm/i915/vbt: don't propagate errors from intel_bios_init()
	drm/i915/vbt: split out defaults that are set when there is no VBT
	cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
	cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
	netfilter: nft_set_rbtree: handle element re-addition after deletion
	Linux 4.9.32

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-14 16:42:56 +02:00
Tejun Heo
829a1cab22 cpuset: consider dying css as offline
commit 41c25707d21716826e3c1f60967f5550610ec1c9 upstream.

In most cases, a cgroup controller don't care about the liftimes of
cgroups.  For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not.  Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period.  The effects of cgroup
removals are delayed when seen from userland.

This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending.  This gets rid of the userland visible
delays.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:06:00 +02:00
Riley Andrews
b3ea9acfdc ANDROID: cpuset: Make cpusets restore on hotplug
This deliberately changes the behavior of the per-cpuset
cpus file to not be effected by hotplug. When a cpu is offlined,
it will be removed from the cpuset/cpus file. When a cpu is onlined,
if the cpuset originally requested that that cpu was part of the cpuset,
that cpu will be restored to the cpuset. The cpus files still
have to be hierachical, but the ranges no longer have to be out of
the currently online cpus, just the physically present cpus.

Change-Id: I22cdf33e7d312117bcefba1aeb0125e1ada289a9
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-01-27 13:55:27 -08:00
Linus Torvalds
f34d3606f7 Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - tracepoints for basic cgroup management operations added

 - kernfs and cgroup path formatting functions updated to behave in the
   style of strlcpy()

 - non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
  cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
  cpuset: fix error handling regression in proc_cpuset_show()
  cgroup: add tracepoints for basic operations
  cgroup: make cgroup_path() and friends behave in the style of strlcpy()
  kernfs: remove kernfs_path_len()
  kernfs: make kernfs_path*() behave in the style of strlcpy()
  kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-14 12:18:50 -07:00
Tejun Heo
679a5e3f12 cpuset: fix error handling regression in proc_cpuset_show()
4c737b41de ("cgroup: make cgroup_path() and friends behave in the
style of strlcpy()") botched the conversion of proc_cpuset_show() and
broke its error handling.  It made the function return 0 on failures
and fail to handle error returns from cgroup_path_ns().  Fix it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-29 15:55:02 +02:00
Wei Yongjun
8a15b81741 cpuset: fix non static symbol warning
Fixes the following sparse warning:

kernel/cpuset.c:2088:6: warning:
 symbol 'cpuset_fork' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-16 11:31:17 -04:00
Joonwoo Park
28b89b9e6f cpuset: handle race between CPU hotplug and cpuset_hotplug_work
A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from more hotplug operations.  For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.

However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.

For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

  ========================  ===========================
   cpu_online_mask           top_cpuset.effective_cpus
  ========================  ===========================
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
      [0,2]                     [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
      [2]                       [0]
  ========================  ===========================

  Now there is no intersection between cpu_online_mask and
  top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
  this moment can cause following:

   Unable to handle kernel NULL pointer dereference at virtual address 000000d0
   ------------[ cut here ]------------
   Kernel BUG at ffffffc0001389b0 [verbose debug info unavailable]
   Internal error: Oops - BUG: 96000005 [#1] PREEMPT SMP
   Modules linked in:
   CPU: 2 PID: 1420 Comm: taskset Tainted: G        W       4.4.8+ #98
   task: ffffffc06a5c4880 ti: ffffffc06e124000 task.ti: ffffffc06e124000
   PC is at guarantee_online_cpus+0x2c/0x58
   LR is at cpuset_cpus_allowed+0x4c/0x6c
   <snip>
   Process taskset (pid: 1420, stack limit = 0xffffffc06e124020)
   Call trace:
   [<ffffffc0001389b0>] guarantee_online_cpus+0x2c/0x58
   [<ffffffc00013b208>] cpuset_cpus_allowed+0x4c/0x6c
   [<ffffffc0000d61f0>] sched_setaffinity+0xc0/0x1ac
   [<ffffffc0000d6374>] SyS_sched_setaffinity+0x98/0xac
   [<ffffffc000085cb0>] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.

Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 3.17+
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-13 11:26:27 -04:00
Tejun Heo
4c737b41de cgroup: make cgroup_path() and friends behave in the style of strlcpy()
cgroup_path() and friends used to format the path from the end and
thus the resulting path usually didn't start at the start of the
passed in buffer.  Also, when the buffer was too small, the partial
result was truncated from the head rather than tail and there was no
way to tell how long the full path would be.  These make the functions
less robust and more awkward to use.

With recent updates to kernfs_path(), cgroup_path() and friends can be
made to behave in strlcpy() style.

* cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
  always return the length of the full path.  If buffer is too small,
  it contains nul terminated truncated output.

* All users updated accordingly.

v2: cgroup_path() usage in kernel/sched/debug.c converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-08-10 11:23:44 -04:00
Zefan Li
06f4e94898 cpuset: make sure new tasks conform to the current config of the cpuset
A new task inherits cpus_allowed and mems_allowed masks from its parent,
but if someone changes cpuset's config by writing to cpuset.cpus/cpuset.mems
before this new task is inserted into the cgroup's task list, the new task
won't be updated accordingly.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2016-08-09 23:58:01 -04:00
Michal Hocko
fec1e5f987 cpuset, mm: fix TIF_MEMDIE check in cpuset_change_task_nodemask
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
it is checking the flag on the current task rather than the given one.

This doesn't make much sense and it is actually wrong.  If the current
task which updates the nodemask of a cpuset got killed by the OOM killer
then a part of the cpuset cgroup processes would have incompatible
nodemask which is surprising to say the least.

The comment suggests the intention was to skip oom victim or an exiting
task so we should be checking the given task.  But even then it would be
layering violation because it is the memory allocator to interpret the
TIF_MEMDIE meaning.  Simply drop both checks.  All tasks in the cpuset
should simply follow the same mask.

Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka
002f290627 cpuset: use static key better and convert to new API
An important function for cpusets is cpuset_node_allowed(), which
optimizes on the fact if there's a single root CPU set, it must be
trivially allowed.  But the check "nr_cpusets() <= 1" doesn't use the
cpusets_enabled_key static key the right way where static keys eliminate
branching overhead with jump labels.

This patch converts it so that static key is used properly.  It's also
switched to the new static key API and the checking functions are
converted to return bool instead of int.  We also provide a new variant
__cpuset_zone_allowed() which expects that the static key check was
already done and they key was enabled.  This is needed for
get_page_from_freelist() where we want to also avoid the relatively
slower check when ALLOC_CPUSET is not set in alloc_flags.

The impact on the page allocator microbenchmark is less than expected
but the cleanup in itself is worthwhile.

                                             4.6.0-rc2                  4.6.0-rc2
                                       multcheck-v1r20               cpuset-v1r20
  Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
  Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
  Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
  Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
  Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
  Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
  Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
  Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
  Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
  Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
  Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
  Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
  Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
  Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
  Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Andrew Morton
0edaf86cf1 include/linux/nodemask.h: create next_node_in() helper
Lots of code does

	node = next_node(node, XXX);
	if (node == MAX_NUMNODES)
		node = first_node(XXX);

so create next_node_in() to do this and use it in various places.

[mhocko@suse.com: use next_node_in() helper]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Tejun Heo
5cf1cacb49 cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
Since e93ad19d05 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
2016-04-25 15:45:14 -04:00
Linus Torvalds
5518f66b5a Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup namespace support from Tejun Heo:
 "These are changes to implement namespace support for cgroup which has
  been pending for quite some time now.  It is very straight-forward and
  only affects what part of cgroup hierarchies are visible.

  After unsharing, mounting a cgroup fs will be scoped to the cgroups
  the task belonged to at the time of unsharing and the cgroup paths
  exposed to userland would be adjusted accordingly"

* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix and restructure error handling in copy_cgroup_ns()
  cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
  Add FS_USERNS_FLAG to cgroup fs
  cgroup: Add documentation for cgroup namespaces
  cgroup: mount cgroupns-root when inside non-init cgroupns
  kernfs: define kernfs_node_dentry
  cgroup: cgroup namespace setns support
  cgroup: introduce cgroup namespaces
  sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
  kernfs: Add API to generate relative kernfs path
2016-03-21 10:05:13 -07:00
Tejun Heo
b38e42e962 cgroup: convert cgroup_subsys flag fields to bool bitfields
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-02-23 10:00:50 -05:00
Aditya Kali
a79a908fd2 cgroup: introduce cgroup namespaces
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
 they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-02-16 13:04:58 -05:00
Tejun Heo
e93ad19d05 cpuset: make mm migration asynchronous
If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes.  This was performed
synchronously in the ->attach() callback, which is synchronized
against process management.  Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.

Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.

This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place.  This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released.  This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization.  CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
2016-01-22 10:22:46 -05:00
Tejun Heo
8075b542cf Merge branch 'for-4.4-fixes' into for-4.5 2015-12-03 10:22:52 -05:00
Tejun Heo
1f7dd3e5a6 cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 10:18:21 -05:00
Arnd Bergmann
d2b4365809 cpuset: Replace all instances of time_t with time64_t
The following patch replaces all instances of time_t with time64_t i.e.
change the type used for representing time from 32-bit to 64-bit. All
32-bit kernels to date use a signed 32-bit time_t type, which can only
represent time until January 2038. Since embedded systems running 32-bit
Linux are going to survive beyond that date, we have to change all
current uses, in a backwards compatible way.

The patch also changes the function get_seconds() that returns a 32-bit
integer to ktime_get_seconds() that returns seconds as 64-bit integer.

The patch changes the type of ticks from time_t to u32. We keep ticks as
32-bits as the function uses 32-bit arithmetic which would prove less
expensive than 64-bit arithmetic and the function is expected to be
called atleast once every 32 seconds.

Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-11-25 14:00:05 -05:00
Linus Torvalds
2e3078af2c Merge branch 'akpm' (patches from Andrew)
Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...
2015-11-05 23:10:54 -08:00
David Rientjes
da39da3a54 mm, oom: remove task_lock protecting comm printing
The oom killer takes task_lock() in a couple of places solely to protect
printing the task's comm.

A process's comm, including current's comm, may change due to
/proc/pid/comm or PR_SET_NAME.

The comm will always be NULL-terminated, so the worst race scenario would
only be during update.  We can tolerate a comm being printed that is in
the middle of an update to avoid taking the lock.

Other locations in the kernel have already dropped task_lock() when
printing comm, so this is consistent.

Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Tejun Heo
27bd4dbb8d cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
Currently, cgroup_has_tasks() tests whether the target cgroup has any
css_set linked to it.  This works because a css_set's refcnt converges
with the number of tasks linked to it and thus there's no css_set
linked to a cgroup if it doesn't have any live tasks.

To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch replaces cgroup_has_tasks() with cgroup_is_populated()
which tests cgroup->nr_populated instead which locally counts the
number of populated css_sets.  Unlike cgroup_has_tasks(),
cgroup_is_populated() is recursive - if any of the descendants is
populated, the cgroup is populated too.  While this changes the
meaning of the test, all the existing users are okay with the change.

While at it, replace the open-coded ->populated_cnt test in
cgroup_events_show() with cgroup_is_populated().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
2015-10-15 16:41:50 -04:00
Tejun Heo
4530eddb59 cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader()
It wasn't explicitly documented but, when a process is being migrated,
cpuset and memcg depend on cgroup_taskset_first() returning the
threadgroup leader; however, this approach is somewhat ghetto and
would no longer work for the planned multi-process migration.

This patch introduces explicit cgroup_taskset_for_each_leader() which
iterates over only the threadgroup leaders and replaces
cgroup_taskset_first() usages for accessing the leader with it.

This prepares both memcg and cpuset for multi-process migration.  This
patch also updates the documentation for cgroup_taskset_for_each() to
clarify the iteration rules and removes comments mentioning task
ordering in tasksets.

v2: A previous patch which added threadgroup leader test was dropped.
    Patch updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-22 12:46:53 -04:00
Tejun Heo
3df9ca0a2b cpuset: migrate memory only for threadgroup leaders
If memory_migrate flag is set, cpuset migrates memory according to the
destnation css's nodemask.  The current implementation migrates memory
whenever any thread of a process is migrated making the behavior
somewhat arbitrary.  Let's tie memory operations to the threadgroup
leader so that memory is migrated only when the leader is migrated.

While this is a behavior change, given the inherent fuziness, this
change is not too likely to be noticed and allows us to clearly define
who owns the memory (always the leader) and helps the planned atomic
multi-process migration.

Note that we're currently migrating memory in migration path proper
while holding all the locks.  In the long term, this should be moved
out to an async work item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
2015-09-22 12:46:53 -04:00
Tejun Heo
7dbdb199d3 cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE
cftype->mode allows controllers to give arbitrary permissions to
interface knobs.  Except for "cgroup.event_control", the existing uses
are spurious.

* Some explicitly specify S_IRUGO | S_IWUSR even though that's the
  default.

* "cpuset.memory_pressure" specifies S_IRUGO while also setting a
  write callback which returns -EACCES.  All it needs to do is simply
  not setting a write callback.

"cgroup.event_control" uses cftype->mode to make the file
world-writable.  It's a misdesigned interface and we don't want
controllers to be tweaking interface file permissions in general.
This patch removes cftype->mode and all its spurious uses and
implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
marked as compatibility-only.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-09-18 17:54:23 -04:00
Tejun Heo
9e10a130d9 cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl()
cgroup_on_dfl() tests whether the cgroup's root is the default
hierarchy; however, an individual controller is only interested in
whether the controller is attached to the default hierarchy and never
tests a cgroup which doesn't belong to the hierarchy that the
controller is attached to.

This patch replaces cgroup_on_dfl() tests in controllers with faster
static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
the only user of cgroup_on_dfl() and the function is moved from the
header file to cgroup.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
2015-09-18 11:56:28 -04:00
Alban Crequy
24ee3cf89b cpuset: use trialcs->mems_allowed as a temp variable
The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.

This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy <alban@endocode.com>
Tested-by: Iago López Galeiras <iago@endocode.com>
Cc: <stable@vger.kernel.org> # 3.17+
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-08-10 11:18:41 -04:00
David Rientjes
6e276d2a51 kernel, cpuset: remove exception for __GFP_THISNODE
Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 16:49:03 -07:00
Rik van Riel
47b8ea7186 cpusets, isolcpus: exclude isolcpus from load balancing in cpusets
Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: cgroups@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: David Rientjes <rientjes@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-03-19 14:28:19 -04:00
Jason Low
283cb41f42 cpuset: Fix cpuset sched_relax_domain_level
The cpuset.sched_relax_domain_level can control how far we do
immediate load balancing on a system. However, it was found on recent
kernels that echo'ing a value into cpuset.sched_relax_domain_level
did not reduce any immediate load balancing.

The reason this occurred was because the update_domain_attr_tree() traversal
did not update for the "top_cpuset". This resulted in nothing being changed
when modifying the sched_relax_domain_level parameter.

This patch is able to address that problem by having update_domain_attr_tree()
allow updates for the root in the cpuset traversal.

Fixes: fc560a26ac ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")
Cc: <stable@vger.kernel.org> # 3.9+
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-03-02 11:55:04 -05:00
Zefan Li
79063bffc8 cpuset: fix a warning when clearing configured masks in old hierarchy
When we clear cpuset.cpus, cpuset.effective_cpus won't be cleared:

  # mount -t cgroup -o cpuset xxx /mnt
  # mkdir /mnt/tmp
  # echo 0 > /mnt/tmp/cpuset.cpus
  # echo > /mnt/tmp/cpuset.cpus
  # cat cpuset.cpus

  # cat cpuset.effective_cpus
  0-15

And a kernel warning in update_cpumasks_hier() is triggered:

 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 4028 at kernel/cpuset.c:894 update_cpumasks_hier+0x471/0x650()

Cc: <stable@vger.kernel.org> # 3.17+
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-03-02 11:55:04 -05:00
Zefan Li
790317e1b2 cpuset: initialize effective masks when clone_children is enabled
If clone_children is enabled, effective masks won't be initialized
due to the bug:

  # mount -t cgroup -o cpuset xxx /mnt
  # echo 1 > cgroup.clone_children
  # mkdir /mnt/tmp
  # cat /mnt/tmp/
  # cat cpuset.effective_cpus

  # cat cpuset.cpus
  0-15

And then this cpuset won't constrain the tasks in it.

Either the bug or the fix has no effect on unified hierarchy, as
there's no clone_chidren flag there any more.

Reported-by: Christian Brauner <christianvanbrauner@gmail.com>
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: <stable@vger.kernel.org> # 3.17+
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
2015-03-02 11:55:04 -05:00
Tejun Heo
e8e6d97c9b cpuset: use %*pb[l] to print bitmaps including cpumasks and nodemasks
printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

* kernel/cpuset.c::cpuset_print_task_mems_allowed() used a static
  buffer which is protected by a dedicated spinlock.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:37 -08:00
Rasmus Villemoes
8f4ab07f4b kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init
The only caller of cpuset_init_current_mems_allowed is the __init
annotated build_all_zonelists_init, so we can also make the former __init.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Cc: Pintu Kumar <pintu.k@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:11 -08:00
Linus Torvalds
2756d373a3 Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup update from Tejun Heo:
 "cpuset got simplified a bit.  cgroup core got a fix on unified
  hierarchy and grew some effective css related interfaces which will be
  used for blkio support for writeback IO traffic which is currently
  being worked on"

* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: implement cgroup_get_e_css()
  cgroup: add cgroup_subsys->css_e_css_changed()
  cgroup: add cgroup_subsys->css_released()
  cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
  cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
  cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
  cpuset: lock vs unlock typo
  cpuset: simplify cpuset_node_allowed API
  cpuset: convert callback_mutex to a spinlock
2014-12-11 18:57:19 -08:00
Juri Lelli
f82f80426f sched/deadline: Ensure that updates to exclusive cpusets don't break AC
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.

This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:48:00 +01:00
Juri Lelli
7f51412a41 sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:

 - No check is performed when the user tries to attach a task to
   an exlusive cpuset (recall that exclusive cpusets have an
   associated maximum allowed bandwidth).

 - Bandwidths of source and destination cpusets are not correctly
   updated after a task is migrated between them.

This patch fixes both things at once, as they are opposite faces
of the same coin.

The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.

Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-28 10:47:58 +01:00
Dan Carpenter
cea74465e2 cpuset: lock vs unlock typo
This will deadlock instead of unlocking.

Fixes: f73eae8d8384 ('cpuset: simplify cpuset_node_allowed API')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-27 11:53:29 -04:00
Vladimir Davydov
344736f29b cpuset: simplify cpuset_node_allowed API
Current cpuset API for checking if a zone/node is allowed to allocate
from looks rather awkward. We have hardwall and softwall versions of
cpuset_node_allowed with the softwall version doing literally the same
as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
If it isn't, the softwall version may check the given node against the
enclosing hardwall cpuset, which it needs to take the callback lock to
do.

Such a distinction was introduced by commit 02a0e53d82 ("cpuset:
rework cpuset_zone_allowed api"). Before, we had the only version with
the __GFP_HARDWALL flag determining its behavior. The purpose of the
commit was to avoid sleep-in-atomic bugs when someone would mistakenly
call the function without the __GFP_HARDWALL flag for an atomic
allocation. The suffixes introduced were intended to make the callers
think before using the function.

However, since the callback lock was converted from mutex to spinlock by
the previous patch, the softwall check function cannot sleep, and these
precautions are no longer necessary.

So let's simplify the API back to the single check.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-27 11:15:27 -04:00
Vladimir Davydov
8447a0fee9 cpuset: convert callback_mutex to a spinlock
The callback_mutex is only used to synchronize reads/updates of cpusets'
flags and cpu/node masks. These operations should always proceed fast so
there's no reason why we can't use a spinlock instead of the mutex.

Converting the callback_mutex into a spinlock will let us call
cpuset_zone_allowed_softwall from atomic context. This, in turn, makes
it possible to simplify the code by merging the hardwall and asoftwall
checks into the same function, which is the business of the next patch.

Suggested-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-27 11:15:26 -04:00
Linus Torvalds
b211e9d7c8 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Just a handful of cleanup patches"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: remove redundant variable in cgroup_mount()"
  cgroup: remove redundant variable in cgroup_mount()
  cgroup: fix missing unlock in cgroup_release_agent()
  cgroup: remove CGRP_RELEASABLE flag
  perf/cgroup: Remove perf_put_cgroup()
  cgroup: remove redundant check in cgroup_ino()
  cpuset: simplify proc_cpuset_show()
  cgroup: simplify proc_cgroup_show()
  cgroup: use a per-cgroup work for release agent
  cgroup: remove bogus comments
  cgroup: remove redundant code in cgroup_rmdir()
  cgroup: remove some useless forward declarations
  cgroup: fix a typo in comment.
2014-10-10 07:24:40 -04:00
Zefan Li
2ad654bc5e cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Cc: <stable@vger.kernel.org> # 2.6.31+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-24 22:16:06 -04:00