Commit graph

931 commits

Author SHA1 Message Date
xxmustafacooTR
71e430b7c9
gaming_control, mali: more controls, optimizations 2024-09-27 17:20:00 +03:00
Diep Quynh
609163c324
drivers: Introduce brand new kernel gaming mode
How to trigger gaming mode? Just open a game that is supported in games list
How to exit? Kill the game from recents, or simply back to homescreen

What does this gaming mode do?
- It limits big cluster maximum frequency to 2,0GHz, and little cluster
  to values matching GPU frequencies as below:
+ 338MHz: 455MHz
+ 385MHz and above: 1053MHz
- As for the cluster freq limits, it overcomes heating issue while playing
  heavy games, as well as saves battery juice

Big thanks to [kerneltoast] for the following commits on his wahoo kernel
5ac1e81d3d
e13e2c4554

Gaming control's idea was based on these

Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
2024-09-27 17:19:55 +03:00
Alexei Starovoitov
28ebd88f5c
BACKPORT: bpf: introduce BPF_PROG_QUERY command
introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup

Change-Id: I05e0ed5f6eddc30f4a18216d4541448816fd1ae5
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-25 16:54:46 +03:00
Alexei Starovoitov
1437a0899f
UPSTREAM: bpf: multi program support for cgroup+bpf
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.

The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
  the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
  that cgroup program gets run in addition to the program in this cgroup.

NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.

Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.

To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0)
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 121213201
Bug: 138317270
Test: build & boot cuttlefish
Change-Id: If17b11a773f73d45ea565a947fc1bf7e158db98d
2024-09-25 16:54:37 +03:00
Mustafa Gökmen
4fce632291
Revert "BACKPORT: bpf: multi program support for cgroup+bpf"
This reverts commit 148f111e98.
2024-09-25 16:54:36 +03:00
FAROVITUS
eb6fae6224 Merge 4.9.217 branch 'android-4.9-q' into tw10-android-4.9-q 2020-03-23 16:23:34 +02:00
Greg Kroah-Hartman
d2adefff96 This is the 4.9.217 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl50ed8ACgkQONu9yGCS
 aT7A8A//WF8aq34T1LLR70ESa6fJVj2dEzCQA94qvUCPHDJfBj1x00R8OGvZfKEc
 E7YPI8MPrvMmXnyxImge2UcqCJhmnis+gk1fLn+N3Yi5nfZWrxC1dnShCnCTeye1
 tA+BuhyzOXHDmLqrktBcT3TCTrIUWK8lehR8m7jlkAh7wG6Z2QmgKlaiq5fRAVE9
 Z0je4fw0HxdVrCM84BJiM1M6w3pa+AYUkSBfqvU+fgm8xfOWGUPZFEitx9O1PwaH
 m0UOS51gM5ZSzpAMDBlXYnmUgPxUTlpwqIL77zgMFzYcpKbf2rVphFRxmacWWqQ2
 MWnt6vn/eatDeJ6TFAF08APy63SUAR7BvssSYoySz8Y4m6LbL4yUfCWYkL9lhLFH
 dJL6TTjUwFw1Rd7UVzHongEr24NASCS37MiR8V/3RUiDqIYtrlaJKxhhVdTKg7mm
 xRKA+QBXkGUfwFJm2mL+3E3D7/Xgl4kNo22lp1YgamTgc50/UMB5JEf8Z3aqa/qh
 ie/4oKnmaa+SSgOxQmvqS3lWF7RYU+axvc9K0FK5CYSXnc6Jfgtq0yDB5twslj05
 HdCwNJX9KOL0NZAzQKs+FLjnhOtnRigANMG7KMm5nmS3vwsxQqIkm0b6tPhgBcd/
 k4QQm8h6C5Tdi+R39fGQOgtBK0gVVJl+n3XfA3gbmunvXhSo6IQ=
 =fFH1
 -----END PGP SIGNATURE-----

Merge 4.9.217 into android-4.9-q

Changes in 4.9.217
	NFS: Remove superfluous kmap in nfs_readdir_xdr_to_array
	phy: Revert toggling reset changes.
	net: phy: Avoid multiple suspends
	cgroup, netclassid: periodically release file_lock on classid updating
	gre: fix uninit-value in __iptunnel_pull_header
	ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface
	net: macsec: update SCI upon MAC address change.
	net: nfc: fix bounds checking bugs on "pipe"
	r8152: check disconnect status after long sleep
	bnxt_en: reinitialize IRQs when MTU is modified
	fib: add missing attribute validation for tun_id
	nl802154: add missing attribute validation
	nl802154: add missing attribute validation for dev_type
	macsec: add missing attribute validation for port
	net: fq: add missing attribute validation for orphan mask
	team: add missing attribute validation for port ifindex
	team: add missing attribute validation for array index
	nfc: add missing attribute validation for SE API
	nfc: add missing attribute validation for vendor subcommand
	ipvlan: add cond_resched_rcu() while processing muticast backlog
	ipvlan: do not add hardware address of master to its unicast filter list
	ipvlan: egress mcast packets are not exceptional
	ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()
	ipvlan: don't deref eth hdr before checking it's set
	macvlan: add cond_resched() during multicast processing
	net: fec: validate the new settings in fec_enet_set_coalesce()
	slip: make slhc_compress() more robust against malicious packets
	bonding/alb: make sure arp header is pulled before accessing it
	cgroup: memcg: net: do not associate sock with unrelated cgroup
	net: phy: fix MDIO bus PM PHY resuming
	virtio-blk: fix hw_queue stopped on arbitrary error
	iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint
	workqueue: don't use wq_select_unbound_cpu() for bound works
	drm/amd/display: remove duplicated assignment to grph_obj_type
	cifs_atomic_open(): fix double-put on late allocation failure
	gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
	KVM: x86: clear stale x86_emulate_ctxt->intercept value
	ARC: define __ALIGN_STR and __ALIGN symbols for ARC
	efi: Fix a race and a buffer overflow while reading efivars via sysfs
	iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint
	iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page
	nl80211: add missing attribute validation for critical protocol indication
	nl80211: add missing attribute validation for beacon report scanning
	nl80211: add missing attribute validation for channel switch
	netfilter: cthelper: add missing attribute validation for cthelper
	mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()
	iommu/vt-d: Fix the wrong printing in RHSA parsing
	iommu/vt-d: Ignore devices with out-of-spec domain number
	ipv6: restrict IPV6_ADDRFORM operation
	efi: Add a sanity check to efivar_store_raw()
	batman-adv: Fix double free during fragment merge error
	batman-adv: Fix transmission of final, 16th fragment
	batman-adv: Initialize gw sel_class via batadv_algo
	batman-adv: Fix rx packet/bytes stats on local ARP reply
	batman-adv: Use default throughput value on cfg80211 error
	batman-adv: Accept only filled wifi station info
	batman-adv: fix TT sync flag inconsistencies
	batman-adv: Avoid spurious warnings from bat_v neigh_cmp implementation
	batman-adv: Always initialize fragment header priority
	batman-adv: Fix check of retrieved orig_gw in batadv_v_gw_is_eligible
	batman-adv: Fix lock for ogm cnt access in batadv_iv_ogm_calc_tq
	batman-adv: Fix internal interface indices types
	batman-adv: Avoid race in TT TVLV allocator helper
	batman-adv: Fix TT sync flags for intermediate TT responses
	batman-adv: prevent TT request storms by not sending inconsistent TT TLVLs
	batman-adv: Fix debugfs path for renamed hardif
	batman-adv: Fix debugfs path for renamed softif
	batman-adv: Avoid storing non-TT-sync flags on singular entries too
	batman-adv: Fix multicast TT issues with bogus ROAM flags
	batman-adv: Prevent duplicated gateway_node entry
	batman-adv: Fix duplicated OGMs on NETDEV_UP
	batman-adv: Avoid free/alloc race when handling OGM2 buffer
	batman-adv: Avoid free/alloc race when handling OGM buffer
	batman-adv: Don't schedule OGM for disabled interface
	batman-adv: update data pointers after skb_cow()
	batman-adv: Avoid probe ELP information leak
	batman-adv: Use explicit tvlv padding for ELP packets
	perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag
	ACPI: watchdog: Allow disabling WDAT at boot
	HID: apple: Add support for recent firmware on Magic Keyboards
	HID: i2c-hid: add Trekstor Surfbook E11B to descriptor override
	cfg80211: check reg_rule for NULL in handle_channel_custom()
	net: ks8851-ml: Fix IRQ handling and locking
	mac80211: rx: avoid RCU list traversal under mutex
	signal: avoid double atomic counter increments for user accounting
	jbd2: fix data races at struct journal_head
	ARM: 8957/1: VDSO: Match ARMv8 timer in cntvct_functional()
	ARM: 8958/1: rename missed uaccess .fixup section
	mm: slub: add missing TID bump in kmem_cache_alloc_bulk()
	ipv4: ensure rcu_read_lock() in cipso_v4_error()
	Linux 4.9.217

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia7aeed273cd7548dc8d0dfaaad8b96bedfe499b1
2020-03-20 11:01:08 +01:00
Shakeel Butt
529f4b7ad3 cgroup: memcg: net: do not associate sock with unrelated cgroup
[ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ]

We are testing network memory accounting in our setup and noticed
inconsistent network memory usage and often unrelated cgroups network
usage correlates with testing workload. On further inspection, it
seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
irq context specially for cgroup v1.

mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
and kind of assumes that this can only happen from sk_clone_lock()
and the source sock object has already associated cgroup. However in
cgroup v1, where network memory accounting is opt-in, the source sock
can be unassociated with any cgroup and the new cloned sock can get
associated with unrelated interrupted cgroup.

Cgroup v2 can also suffer if the source sock object was created by
process in the root cgroup or if sk_alloc() is called in irq context.
The fix is to just do nothing in interrupt.

WARNING: Please note that about half of the TCP sockets are allocated
from the IRQ context, so, memory used by such sockets will not be
accouted by the memcg.

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
Hardware name: ...
Call Trace:
 <IRQ>
 dump_stack+0x57/0x75
 mem_cgroup_sk_alloc+0xe9/0xf0
 sk_clone_lock+0x2a7/0x420
 inet_csk_clone_lock+0x1b/0x110
 tcp_create_openreq_child+0x23/0x3b0
 tcp_v6_syn_recv_sock+0x88/0x730
 tcp_check_req+0x429/0x560
 tcp_v6_rcv+0x72d/0xa40
 ip6_protocol_deliver_rcu+0xc9/0x400
 ip6_input+0x44/0xd0
 ? ip6_protocol_deliver_rcu+0x400/0x400
 ip6_rcv_finish+0x71/0x80
 ipv6_rcv+0x5b/0xe0
 ? ip6_sublist_rcv+0x2e0/0x2e0
 process_backlog+0x108/0x1e0
 net_rx_action+0x26b/0x460
 __do_softirq+0x104/0x2a6
 do_softirq_own_stack+0x2a/0x40
 </IRQ>
 do_softirq.part.19+0x40/0x50
 __local_bh_enable_ip+0x51/0x60
 ip6_finish_output2+0x23d/0x520
 ? ip6table_mangle_hook+0x55/0x160
 __ip6_finish_output+0xa1/0x100
 ip6_finish_output+0x30/0xd0
 ip6_output+0x73/0x120
 ? __ip6_finish_output+0x100/0x100
 ip6_xmit+0x2e3/0x600
 ? ipv6_anycast_cleanup+0x50/0x50
 ? inet6_csk_route_socket+0x136/0x1e0
 ? skb_free_head+0x1e/0x30
 inet6_csk_xmit+0x95/0xf0
 __tcp_transmit_skb+0x5b4/0xb20
 __tcp_send_ack.part.60+0xa3/0x110
 tcp_send_ack+0x1d/0x20
 tcp_rcv_state_process+0xe64/0xe80
 ? tcp_v6_connect+0x5d1/0x5f0
 tcp_v6_do_rcv+0x1b1/0x3f0
 ? tcp_v6_do_rcv+0x1b1/0x3f0
 __release_sock+0x7f/0xd0
 release_sock+0x30/0xa0
 __inet_stream_connect+0x1c3/0x3b0
 ? prepare_to_wait+0xb0/0xb0
 inet_stream_connect+0x3b/0x60
 __sys_connect+0x101/0x120
 ? __sys_getsockopt+0x11b/0x140
 __x64_sys_connect+0x1a/0x20
 do_syscall_64+0x51/0x200
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
Fixes: 2d75807383 ("mm: memcontrol: consolidate cgroup socket tracking")
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20 09:07:43 +01:00
FAROVITUS
af1d3ae977 Merge 4.9.212 branch 'android-4.9-q' into tw10-android-4.9-q
Documentation/filesystems/fscrypt.rst
	arch/arm/common/Kconfig
	arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
	arch/arm64/boot/dts/amd/amd-seattle-soc.dtsi
	arch/arm64/boot/dts/arm/juno-clocks.dtsi
	arch/arm64/boot/dts/broadcom/ns2.dtsi
	arch/arm64/boot/dts/lg/lg1312.dtsi
	arch/arm64/boot/dts/lg/lg1313.dtsi
	arch/arm64/boot/dts/marvell/armada-37xx.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2180.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2597.dtsi
	arch/arm64/boot/dts/nvidia/tegra210.dtsi
	arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
	arch/arm64/boot/dts/qcom/msm8996.dtsi
	arch/arm64/configs/ranchu64_defconfig
	arch/arm64/include/asm/cpucaps.h
	arch/arm64/kernel/cpufeature.c
	arch/arm64/kernel/traps.c
	arch/arm64/mm/mmu.c
	crypto/Makefile
	crypto/ablkcipher.c
	crypto/blkcipher.c
	crypto/testmgr.h
	crypto/zstd.c
	drivers/android/binder.c
	drivers/android/binder_alloc.c
	drivers/char/random.c
	drivers/clocksource/exynos_mct.c
	drivers/dma/pl330.c
	drivers/hid/hid-sony.c
	drivers/hid/uhid.c
	drivers/hid/usbhid/hiddev.c
	drivers/i2c/i2c-core.c
	drivers/md/dm-crypt.c
	drivers/media/v4l2-core/videobuf2-v4l2.c
	drivers/mmc/host/dw_mmc.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/usb/r8152.c
	drivers/scsi/scsi_logging.c
	drivers/scsi/sd.c
	drivers/scsi/ufs/ufshcd-pci.c
	drivers/scsi/ufs/ufshcd-pltfrm.c
	drivers/staging/android/Kconfig
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion_priv.h
	drivers/staging/android/ion/ion_system_heap.c
	drivers/staging/android/lowmemorykiller.c
	drivers/tty/serial/samsung.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/host/xhci-hub.c
	drivers/video/fbdev/core/fbmon.c
	drivers/video/fbdev/core/modedb.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/keyinfo.c
	fs/ext4/ialloc.c
	fs/ext4/namei.c
	fs/ext4/xattr.c
	fs/f2fs/checkpoint.c
	fs/f2fs/data.c
	fs/f2fs/debug.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/inline.c
	fs/f2fs/inode.c
	fs/f2fs/namei.c
	fs/f2fs/node.c
	fs/f2fs/recovery.c
	fs/f2fs/segment.c
	fs/f2fs/segment.h
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/fat/dir.c
	fs/fat/fatent.c
	fs/file.c
	fs/namespace.c
	fs/pnode.c
	fs/proc/inode.c
	fs/proc/root.c
	fs/proc/task_mmu.c
	fs/sdcardfs/dentry.c
	fs/sdcardfs/derived_perm.c
	fs/sdcardfs/file.c
	fs/sdcardfs/inode.c
	fs/sdcardfs/lookup.c
	fs/sdcardfs/main.c
	fs/sdcardfs/sdcardfs.h
	fs/sdcardfs/super.c
	include/linux/blk_types.h
	include/linux/cpuhotplug.h
	include/linux/cred.h
	include/linux/fb.h
	include/linux/power_supply.h
	include/linux/sched.h
	include/linux/zstd.h
	include/trace/events/sched.h
	include/uapi/linux/android/binder.h
	init/Kconfig
	init/main.c
	kernel/bpf/hashtab.c
	kernel/cpu.c
	kernel/cred.c
	kernel/fork.c
	kernel/locking/spinlock_debug.c
	kernel/panic.c
	kernel/printk/printk.c
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/fair.c
	kernel/sched/rt.c
	kernel/sched/walt.c
	kernel/sched/walt.h
	kernel/trace/trace.c
	lib/bug.c
	lib/list_debug.c
	lib/vsprintf.c
	lib/zstd/bitstream.h
	lib/zstd/compress.c
	lib/zstd/decompress.c
	lib/zstd/fse.h
	lib/zstd/fse_compress.c
	lib/zstd/fse_decompress.c
	lib/zstd/huf_compress.c
	lib/zstd/huf_decompress.c
	lib/zstd/zstd_internal.h
	mm/debug.c
	mm/filemap.c
	mm/rmap.c
	net/core/filter.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/sysfs_net_ipv4.c
	net/ipv4/tcp_input.c
	net/ipv4/tcp_output.c
	net/ipv4/udp.c
	net/ipv6/netfilter/nf_conntrack_reasm.c
	net/netfilter/Kconfig
	net/netfilter/Makefile
	net/netfilter/xt_qtaguid.c
	net/netfilter/xt_qtaguid_internal.h
	net/xfrm/xfrm_policy.c
	net/xfrm/xfrm_state.c
	scripts/checkpatch.pl
	security/selinux/hooks.c
	sound/core/compress_offload.c
2020-02-12 12:32:38 +02:00
FAROVITUS
2b92eefa41 import G965FXXU7DTAA OSRC
*First release for Android (Q).

Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-02-04 13:50:09 +02:00
Alexei Starovoitov
148f111e98 BACKPORT: bpf: multi program support for cgroup+bpf
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.

The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
  the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
  that cgroup program gets run in addition to the program in this cgroup.

NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.

Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.

To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0)
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 121213201
Bug: 138317270
Test: build & boot cuttlefish
Change-Id: I06b71c850b9f3e052b106abab7a4a3add012a3f8
2019-12-12 15:48:18 -08:00
Suren Baghdasaryan
a163d3fb8a FROMLIST: psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
raises above user-defined threshold within user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

(not upstream yet, latest version published at: https://lore.kernel.org/patchwork/patch/1052418/)

Conflicts:
        include/linux/psi.h
        kernel/cgroup.c
        kernel/sched/psi.c

(1. replaced __poll_t with unsigned int
2. replaced EPOLLERR/EPOLLPRI with POLLERR/POLLPRI (values are the same)
3. include <linux/cgroup-defs.h> in include/linux/psi.h)

Bug: 127712811
Bug: 129157727
Test: lmkd in PSI mode
Change-Id: I1688f047e98e1f109627dad72a33d2f70e575268
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-23 00:29:15 +00:00
Johannes Weiner
0ee3cb39dc BACKPORT: kernel: cgroup: add poll file operation
Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes.  To allow polling for
custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

(cherry picked from commit: dc50537bdd1a0804fa2cbc990565ee9a944e66fa)

Conflicts:
        include/linux/cgroup-defs.h
        kernel/cgroup.c

1. made changes in kernel/cgroup.c instead of kernel/cgroup/cgroup.c
2. replaced __poll_t with unsigned int

Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test

Change-Id: Ie3d914197d1f150e1d83c6206865566a7cbff1b4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 14:15:40 -07:00
Tejun Heo
59735bfc7e UPSTREAM: cgroup add cftype->open/release() callbacks
Pipe the newly added kernfs->open/release() callbacks through cftype.
While at it, as cleanup operations now can be performed from
->release() instead of ->seq_stop(), make the latter optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>

(cherry picked from commit e90cbebc3fa5caea4c8bfeb0d0157a0cee53efc7)

Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test

Change-Id: Iff9794cbbc2c7067c24cb2f767bbdeffa26b5180
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-22 14:14:49 -07:00
Johannes Weiner
e868a99c44 BACKPORT: psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 2ce7135adc9ad081aa3c49744144376ac74fea60)

Conflicts:
        Documentation/cgroup-v2.txt
        include/linux/cgroup.h
        kernel/cgroup/cgroup.c

(1. manual merge from Documentation/admin-guide/cgroup-v2.rst
2. manually merged changes from kernel/cgroup/cgroup.c into kernel/cgroup.c
3. manual merge in css_free_work_fn to allow psi support only for cgroup v2
4. manual merge in cgroup_create to allow psi support only for cgroup v2)

Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I2ffb9d50ba87f8b7655bed215a625784e098879c
2019-03-22 14:10:44 -07:00
Tejun Heo
86c5c11aa0 BACKPORT: cgroup: misc changes
Misc trivial changes to prepare for future changes.  No functional
difference.

* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

* Rename cgroup_stats_show() to cgroup_stat_show() for consistency
  with the file name.

Signed-off-by: Tejun Heo <tj@kernel.org>

(cherry picked from commit 3e48930cc74f0c212ee1838f89ad0ca7fcf2fea1)

Conflicts:
        kernel/cgroup/cgroup.c

(1. manual merge because kernel/cgroup/cgroup.c is under kernel/cgroup.c
2. cgroup_stats_show change is skipped because the function dos not exist)

Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I756ee3dcf0d0f3da69cd1b58e644271625053538
2019-03-22 14:08:16 -07:00
Greg Kroah-Hartman
d589c0d406 This is the 4.9.133 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlvBnGMACgkQONu9yGCS
 aT6KZA//cXQIPaITOcikXVN9LTbfNI32k0803qJi2PPr0lajb6RETnfYbQH+h+Ta
 d77uEJ7e15U7bxaoLDKJhYe8SvtqMHq6hhqNXwg+woFgAkoMU4pGmAocywLoMpS/
 INIc4KwEPKD4JkQYQF1sk3nB22K9DiXn/OjoXTlxyAYuCnvdu99xwIpvaI8GN1Yo
 SWiMDvmhu1Nj+vD79VDV5UhigI/+lHIaCasJ3fSVyFx6kE2XYBz7qcNGULGzI+pl
 c6qQpds4UIdVp04d/JQNOurANA6oHYWexrV1Q8zENMwpAOEoDVMDjC/Fi8iIxfCK
 EfgNABr+5INg2gtPZ1YFBad96vgrxWo1gS+Rdaq71zO5luoWktqG/Kv99WnMIOZf
 TiS0+6GRKfL06PyfFu/Bsx8Dzp/GVCsYWQEEPNhp5PYUOoIRzI8XiJcofRNlxL0p
 IiQ0RUansCaN7VN8Z3GzBqDOGikuhxAPrTTAhTAXABrXcrSoAEVhUpmf5UMItnIj
 cp8AoYqqQxA81PMTLLK0K78B3Mkkdk9BIB82y39WV28j6dB6ZzfPFg4ikdIwS+hm
 ZNck/2IwhuuPGugDVVX206zH5ATQaS0Vg+BQNqw/Y03/U6SKSlUs6HB4c0X1quy6
 jwskBvuADXXlWFuzMlenlAGg1Shk4lwsCGN3F9ziSkrSI6azBRs=
 =aHnt
 -----END PGP SIGNATURE-----

Merge 4.9.133 into android-4.9

Changes in 4.9.133
	mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
	fbdev/omapfb: fix omapfb_memory_read infoleak
	xen-netback: fix input validation in xenvif_set_hash_mapping()
	x86/vdso: Fix asm constraints on vDSO syscall fallbacks
	x86/vdso: Fix vDSO syscall fallback asm constraint regression
	PCI: Reprogram bridge prefetch registers on resume
	mac80211: fix setting IEEE80211_KEY_FLAG_RX_MGMT for AP mode keys
	PM / core: Clear the direct_complete flag on errors
	dm cache metadata: ignore hints array being too small during resize
	dm cache: fix resize crash if user doesn't reload cache table
	xhci: Add missing CAS workaround for Intel Sunrise Point xHCI
	usb: xhci-mtk: resume USB3 roothub first
	USB: serial: simple: add Motorola Tetra MTP6550 id
	tty: Drop tty->count on tty_reopen() failure
	of: unittest: Disable interrupt node tests for old world MAC systems
	ext4: add corruption check in ext4_xattr_set_entry()
	ext4: always verify the magic number in xattr blocks
	cgroup: Fix deadlock in cpu hotplug path
	ath10k: fix use-after-free in ath10k_wmi_cmd_send_nowait
	ath10k: fix kernel panic issue during pci probe
	powerpc/fadump: Return error when fadump registration fails
	ARC: clone syscall to setp r25 as thread pointer
	x86/mm: Expand static page table for fixmap space
	f2fs: fix invalid memory access
	ucma: fix a use-after-free in ucma_resolve_ip()
	ubifs: Check for name being NULL while mounting
	ath10k: fix scan crash due to incorrect length calculation
	ebtables: arpreply: Add the standard target sanity check
	x86/fpu: Remove use_eager_fpu()
	x86/fpu: Remove struct fpu::counter
	Revert "perf: sync up x86/.../cpufeatures.h"
	x86/fpu: Finish excising 'eagerfpu'
	Linux 4.9.133

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-10-13 10:42:50 +02:00
Prateek Sood
35b80e75a6 cgroup: Fix deadlock in cpu hotplug path
commit 116d2f7496c51b2e02e8e4ecdd2bdf5fb9d5a641 upstream.

Deadlock during cgroup migration from cpu hotplug path when a task T is
being moved from source to destination cgroup.

kworker/0:0
cpuset_hotplug_workfn()
   cpuset_hotplug_update_tasks()
      hotplug_update_tasks_legacy()
        remove_tasks_in_empty_cpuset()
          cgroup_transfer_tasks() // stuck in iterator loop
            cgroup_migrate()
              cgroup_migrate_add_task()

In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T.
Task T will not migrate to destination cgroup. css_task_iter_start()
will keep pointing to task T in loop waiting for task T cg_list node
to be removed.

Task T
do_exit()
  exit_signals() // sets PF_EXITING
  exit_task_namespaces()
    switch_task_namespaces()
      free_nsproxy()
        put_mnt_ns()
          drop_collected_mounts()
            namespace_unlock()
              synchronize_rcu()
                _synchronize_rcu_expedited()
                  schedule_work() // on cpu0 low priority worker pool
                  wait_event() // waiting for work item to execute

Task T inserted a work item in the worklist of cpu0 low priority
worker pool. It is waiting for expedited grace period work item
to execute. This work item will only be executed once kworker/0:0
complete execution of cpuset_hotplug_workfn().

kworker/0:0 ==> Task T ==>kworker/0:0

In case of PF_EXITING task being migrated from source to destination
cgroup, migrate next available task in source cgroup.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
[AmitP: Upstream commit cherry-pick failed, so I picked the
        backported changes from CAF/msm-4.9 tree instead:
        https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?id=49b74f1696417b270c89cd893ca9f37088928078]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:18:56 +02:00
Greg Kroah-Hartman
02f29ab1b9 This is the 4.9.42 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlmN0iUACgkQONu9yGCS
 aT6gZRAAhYbpsz2XRFSQ5H/Sk8xwJuwtLB2at3Y1CLCb+lLhlRsV3l4pD+4KRoEl
 fU01l5s5ZalZJekGfEfEOQOfoJHsCxzFSzKzP06/GA5u6DbwtHUE2SjNWe84j6Ct
 Hx0jN90yj7S8vy2umROux+fVvZQ4Xay4TDCWhBeXgOFXevwC/G9D2LWE2NIYwbDH
 Ighahrhs21FZc9wbah0L04bRBAR7+ALLq1sO8ebKwl8eFzAkcEwI/yS48cnjGlgW
 9HW5MmY1BYTnRCrXaw5L0Vf5zH6obT7amrLNljNYN6vN62DRoOfwQh4QcblnIAoi
 L+HdZilifZ970RwQ2As3vy63/Kk3b207ht4mriTCyGXM9MY6bRovYv1wDAUlv7aD
 GlA8Q7xwsiJ4sG4i5LABjly+QeWymZ2b0kVWYpneJuBuj/gWVDhh1lfT+nOCAcJ6
 ROUY6d64ghKPBomkqlMSC+7sH7QKa0/W9WDQCLxtnmjcAkeElpGNGu/m/Thhvi2I
 NDq2sbMAeGJquXBXIN8W4NPy0puOn0wjqFI7LE61ujSiAxT8973uDtNlrmB/eCAf
 zD9yJsKELS20PKToren4hYYuRM2XlKh9gVIOWB2pShfzvSO7807ZyVcoI8/bVgZe
 I2BH6Dt3t+qqWR7B5/qvxxmNCv3HNMpNUzy/z+fXEf8/U3zqiTM=
 =n7pX
 -----END PGP SIGNATURE-----

Merge 4.9.42 into android-4.9

Changes in 4.9.42
	parisc: Handle vma's whose context is not current in flush_cache_range
	cgroup: create dfl_root files on subsys registration
	cgroup: fix error return value from cgroup_subtree_control()
	libata: array underflow in ata_find_dev()
	workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
	iwlwifi: dvm: prevent an out of bounds access
	brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice
	NFSv4: Fix EXCHANGE_ID corrupt verifier issue
	mmc: sdhci-of-at91: force card detect value for non removable devices
	device property: Make dev_fwnode() public
	mmc: core: Fix access to HS400-ES devices
	mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
	cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
	ALSA: hda - Fix speaker output from VAIO VPCL14M1R
	drm/amdgpu: Fix undue fallthroughs in golden registers initialization
	ASoC: do not close shared backend dailink
	KVM: async_pf: make rcu irq exit if not triggered from idle task
	mm/page_alloc: Remove kernel address exposure in free_reserved_area()
	timers: Fix overflow in get_next_timer_interrupt
	powerpc/tm: Fix saving of TM SPRs in core dump
	powerpc/64: Fix __check_irq_replay missing decrementer interrupt
	iommu/amd: Enable ga_log_intr when enabling guest_mode
	gpiolib: skip unwanted events, don't convert them to opposite edge
	ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
	ext4: fix overflow caused by missing cast in ext4_resize_fs()
	ARM: dts: armada-38x: Fix irq type for pca955
	ARM: dts: tango4: Request RGMII RX and TX clock delays
	media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl
	iscsi-target: Fix initial login PDU asynchronous socket close OOPs
	mmc: dw_mmc: Use device_property_read instead of of_property_read
	mmc: core: Use device_property_read instead of of_property_read
	media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds
	f2fs: sanity check checkpoint segno and blkoff
	Btrfs: fix early ENOSPC due to delalloc
	saa7164: fix double fetch PCIe access condition
	tcp_bbr: cut pacing rate only if filled pipe
	tcp_bbr: introduce bbr_bw_to_pacing_rate() helper
	tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helper
	tcp_bbr: remove sk_pacing_rate=0 transient during init
	tcp_bbr: init pacing rate on first RTT sample
	ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check()
	net: Zero terminate ifr_name in dev_ifname().
	ipv6: avoid overflow of offset in ip6_find_1stfragopt
	net: dsa: b53: Add missing ARL entries for BCM53125
	ipv4: initialize fib_trie prior to register_netdev_notifier call.
	rtnetlink: allocate more memory for dev_set_mac_address()
	mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled
	openvswitch: fix potential out of bound access in parse_ct
	packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
	ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()
	net: ethernet: nb8800: Handle all 4 RGMII modes identically
	dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly
	dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly
	dccp: fix a memleak for dccp_feat_init err process
	sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}()
	sctp: fix the check for _sctp_walk_params and _sctp_walk_errors
	net/mlx5: Consider tx_enabled in all modes on remap
	net/mlx5: Fix command bad flow on command entry allocation failure
	net/mlx5e: Fix outer_header_zero() check size
	net/mlx5e: Fix wrong delay calculation for overflow check scheduling
	net/mlx5e: Schedule overflow check work to mlx5e workqueue
	net: phy: Correctly process PHY_HALTED in phy_stop_machine()
	xen-netback: correctly schedule rate-limited queues
	sparc64: Measure receiver forward progress to avoid send mondo timeout
	sparc64: Fix exception handling in UltraSPARC-III memcpy.
	wext: handle NULL extra data in iwe_stream_add_point better
	sh_eth: fix EESIPR values for SH77{34|63}
	sh_eth: R8A7740 supports packet shecksumming
	net: phy: dp83867: fix irq generation
	tg3: Fix race condition in tg3_get_stats64().
	x86/boot: Add missing declaration of string functions
	spi: spi-axi: Free resources on error path
	ASoC: rt5645: set sel_i2s_pre_div1 to 2
	netfilter: use fwmark_reflect in nf_send_reset
	phy state machine: failsafe leave invalid RUNNING state
	ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
	clk/samsung: exynos542x: mark some clocks as critical
	scsi: qla2xxx: Get mutex lock before checking optrom_state
	drm/virtio: fix framebuffer sparse warning
	ARM: dts: sun8i: Support DTB build for NanoPi M1
	ARM: dts: sunxi: Change node name for pwrseq pin on Olinuxino-lime2-emmc
	iw_cxgb4: do not send RX_DATA_ACK CPLs after close/abort
	nbd: blk_mq_init_queue returns an error code on failure, not NULL
	virtio_blk: fix panic in initialization error path
	ARM: 8632/1: ftrace: fix syscall name matching
	mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
	lib/Kconfig.debug: fix frv build failure
	signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
	mm: don't dereference struct page fields of invalid pages
	net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
	ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
	net: account for current skb length when deciding about UFO
	net: phy: Fix PHY unbind crash
	workqueue: implicit ordered attribute should be overridable
	Linux 4.9.42

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-08-11 13:55:02 -07:00
Tejun Heo
445ee6cdd9 cgroup: fix error return value from cgroup_subtree_control()
commit 3c74541777302eec43a0d1327c4d58b8659a776b upstream.

While refactoring, f7b2814bb9 ("cgroup: factor out
cgroup_{apply|finalize}_control() from
cgroup_subtree_control_write()") broke error return value from the
function.  The return value from the last operation is always
overridden to zero.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11 08:49:28 -07:00
Tejun Heo
4a99eac8d2 cgroup: create dfl_root files on subsys registration
commit 7af608e4f9530372aec6e940552bf76595f2e265 upstream.

On subsystem registration, css_populate_dir() is not called on the new
root css, so the interface files for the subsystem on cgrp_dfl_root
aren't created on registration.  This is a residue from the days when
cgrp_dfl_root was used only as the parking spot for unused subsystems,
which no longer is true as it's used as the root for cgroup2.

This is often fine as later operations tend to create them as a part
of mount (cgroup1) or subtree_control operations (cgroup2); however,
it's not difficult to mount cgroup2 with the controller interface
files missing as Waiman found out.

Fix it by invoking css_populate_dir() on the root css on subsys
registration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11 08:49:28 -07:00
Greg Kroah-Hartman
da3493c028 This is the 4.9.32 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllBNNMACgkQONu9yGCS
 aT4rJA//XOtxp2nr01OqffSItCX0vq7vMofbjo7spI2LF3JyxKkx0hje0im82XoD
 cYhZHyeK9lCHRZt2O541xOp3ITconecRvlt6uq49zKNvlFAelnp0rRTOktluJK/n
 Xm7EP2KLHeHXv+rvCfzU0hDC9fvoZe82NIF7kddAWHX4D/K3C6Jw6FVc4/QPl2MG
 11R6pylPtMGObgPEmfDlohBitcC3KawXAhIcyTTAr3rcuO2Wm00H+uF0VOpyD9hZ
 S5A8ecuyZVQw3mIKAJ9vOwkVoln1E+/P/OltWEXElkBZONHpI4LB2IE7Nmwnx+pE
 5oO3afNZIYOejX+GQvK2Apc8VONDV2VZRoIstzl+uaKDNGFaSLIu8AQ/UQfmoqTE
 Pwrp/FxgDGjq6NOf/+MqNhiqS9433Xgmt4HT5xctzS0zFNGzLIokdjRH3x7n6cez
 2JvdtRvJfdNq31kh6xQE/8oXuZktRSw3GxC5K4Dw3NRh02/9KNAgEjNPEMNXziGw
 RhLhyXl8i5eGAu1gxruojzRyCvaMS6S8ZcAvSSoJze0kj8jfkZtQbBswvS1LF9EI
 FFemshd1sQxUeZiXGYPuUUfyDvxvs7KRIqAYux8lsIhSQTu16REqsZRrzidpDP+K
 xtH61YBWhwDHZKIoT+Jchmn8QCytI77MvFdGuTl+KRPV4OyyeJs=
 =NE4C
 -----END PGP SIGNATURE-----

Merge 4.9.32 into android-4.9

Changes in 4.9.32
	bnx2x: Fix Multi-Cos
	vxlan: eliminate cached dst leak
	ipv6: xfrm: Handle errors reported by xfrm6_find_1stfragopt()
	cxgb4: avoid enabling napi twice to the same queue
	tcp: disallow cwnd undo when switching congestion control
	vxlan: fix use-after-free on deletion
	ipv6: Fix leak in ipv6_gso_segment().
	net: ping: do not abuse udp_poll()
	net/ipv6: Fix CALIPSO causing GPF with datagram support
	net: ethoc: enable NAPI before poll may be scheduled
	net: stmmac: fix completely hung TX when using TSO
	net: bridge: start hello timer only if device is up
	sparc64: Add __multi3 for gcc 7.x and later.
	sparc64: mm: fix copy_tsb to correctly copy huge page TSBs
	sparc: Machine description indices can vary
	sparc64: reset mm cpumask after wrap
	sparc64: combine activate_mm and switch_mm
	sparc64: redefine first version
	sparc64: add per-cpu mm of secondary contexts
	sparc64: new context wrap
	sparc64: delete old wrap code
	arch/sparc: support NR_CPUS = 4096
	serial: ifx6x60: fix use-after-free on module unload
	ptrace: Properly initialize ptracer_cred on fork
	crypto: asymmetric_keys - handle EBUSY due to backlog correctly
	KEYS: fix dereferencing NULL payload with nonzero length
	KEYS: fix freeing uninitialized memory in key_update()
	KEYS: encrypted: avoid encrypting/decrypting stack buffers
	crypto: drbg - wait for crypto op not signal safe
	crypto: gcm - wait for crypto op not signal safe
	drm/amdgpu/ci: disable mclk switching for high refresh rates (v2)
	nfsd4: fix null dereference on replay
	nfsd: Fix up the "supattr_exclcreat" attributes
	efi: Don't issue error message when booted under Xen
	kvm: async_pf: fix rcu_irq_enter() with irqs enabled
	KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
	arm64: KVM: Preserve RES1 bits in SCTLR_EL2
	arm64: KVM: Allow unaligned accesses at EL2
	arm: KVM: Allow unaligned accesses at HYP
	KVM: async_pf: avoid async pf injection when in guest mode
	KVM: arm/arm64: vgic-v3: Do not use Active+Pending state for a HW interrupt
	KVM: arm/arm64: vgic-v2: Do not use Active+Pending state for a HW interrupt
	dmaengine: usb-dmac: Fix DMAOR AE bit definition
	dmaengine: ep93xx: Always start from BASE0
	dmaengine: ep93xx: Don't drain the transfers in terminate_all()
	dmaengine: mv_xor_v2: handle mv_xor_v2_prep_sw_desc() error properly
	dmaengine: mv_xor_v2: properly handle wrapping in the array of HW descriptors
	dmaengine: mv_xor_v2: do not use descriptors not acked by async_tx
	dmaengine: mv_xor_v2: enable XOR engine after its configuration
	dmaengine: mv_xor_v2: fix tx_submit() implementation
	dmaengine: mv_xor_v2: remove interrupt coalescing
	dmaengine: mv_xor_v2: set DMA mask to 40 bits
	cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
	xen/privcmd: Support correctly 64KB page granularity when mapping memory
	ext4: fix SEEK_HOLE
	ext4: keep existing extra fields when inode expands
	ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
	ext4: fix fdatasync(2) after extent manipulation operations
	drm: Fix oops + Xserver hang when unplugging USB drm devices
	usb: gadget: f_mass_storage: Serialize wake and sleep execution
	usb: chipidea: udc: fix NULL pointer dereference if udc_start failed
	usb: chipidea: debug: check before accessing ci_role
	staging/lustre/lov: remove set_fs() call from lov_getstripe()
	iio: adc: bcm_iproc_adc: swap primary and secondary isr handler's
	iio: light: ltr501 Fix interchanged als/ps register field
	iio: proximity: as3935: fix AS3935_INT mask
	iio: proximity: as3935: fix iio_trigger_poll issue
	mei: make sysfs modalias format similar as uevent modalias
	cpufreq: cpufreq_register_driver() should return -ENODEV if init fails
	target: Re-add check to reject control WRITEs with overflow data
	drm/msm: Expose our reservation object when exporting a dmabuf.
	ahci: Acer SA5-271 SSD Not Detected Fix
	cgroup: Prevent kill_css() from being called more than once
	Input: elantech - add Fujitsu Lifebook E546/E557 to force crc_enabled
	cpuset: consider dying css as offline
	fs: add i_blocksize()
	ufs: restore proper tail allocation
	fix ufs_isblockset()
	ufs: restore maintaining ->i_blocks
	ufs: set correct ->s_maxsize
	ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
	ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
	cxl: Fix error path on bad ioctl
	cxl: Avoid double free_irq() for psl,slice interrupts
	btrfs: use correct types for page indices in btrfs_page_exists_in_range
	btrfs: fix memory leak in update_space_info failure path
	KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
	scsi: qla2xxx: don't disable a not previously enabled PCI device
	scsi: qla2xxx: Modify T262 FW dump template to specify same start/end to debug customer issues
	scsi: qla2xxx: Set bit 15 for DIAG_ECHO_TEST MBC
	scsi: qla2xxx: Fix mailbox pointer error in fwdump capture
	powerpc/sysdev/simple_gpio: Fix oops in gpio save_regs function
	powerpc/numa: Fix percpu allocations to be NUMA aware
	powerpc/hotplug-mem: Fix missing endian conversion of aa_index
	powerpc/kernel: Fix FP and vector register restoration
	powerpc/kernel: Initialize load_tm on task creation
	perf/core: Drop kernel samples even though :u is specified
	drm/vmwgfx: Handle vmalloc() failure in vmw_local_fifo_reserve()
	drm/vmwgfx: limit the number of mip levels in vmw_gb_surface_define_ioctl()
	drm/vmwgfx: Make sure backup_handle is always valid
	drm/nouveau/tmr: fully separate alarm execution/pending lists
	ALSA: timer: Fix race between read and ioctl
	ALSA: timer: Fix missing queue indices reset at SNDRV_TIMER_IOCTL_SELECT
	ASoC: Fix use-after-free at card unregistration
	cpu/hotplug: Drop the device lock on error
	drivers: char: mem: Fix wraparound check to allow mappings up to the end
	serial: sh-sci: Fix panic when serial console and DMA are enabled
	arm64: traps: fix userspace cache maintenance emulation on a tagged pointer
	arm64: hw_breakpoint: fix watchpoint matching for tagged pointers
	arm64: entry: improve data abort handling of tagged pointers
	ARM: 8636/1: Cleanup sanity_check_meminfo
	ARM: 8637/1: Adjust memory boundaries after reservations
	usercopy: Adjust tests to deal with SMAP/PAN
	drm/i915/vbt: don't propagate errors from intel_bios_init()
	drm/i915/vbt: split out defaults that are set when there is no VBT
	cpufreq: schedutil: move cached_raw_freq to struct sugov_policy
	cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
	netfilter: nft_set_rbtree: handle element re-addition after deletion
	Linux 4.9.32

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-06-14 16:42:56 +02:00
Waiman Long
dff4c8bb13 cgroup: Prevent kill_css() from being called more than once
commit 33c35aa4817864e056fd772230b0c6b552e36ea2 upstream.

The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14 15:06:00 +02:00
Alexei Starovoitov
1ee2b4b803 BACKPORT: bpf: introduce BPF_F_ALLOW_OVERRIDE flag
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.

Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X

2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.

3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.

In the future this behavior may be extended with a chain of
non-overridable programs.

Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.

Add several testcases and adjust libbpf.

Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28
       ("cgroup: add support for eBPF programs")
[AmitP: Refactored original patch for android-4.9 where libbpf sources
        are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock,
        and test_cgrp2_sock2 sample tests do not exist.]
(cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03)
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2017-05-30 17:27:28 -07:00
Daniel Mack
f791c42b63 UPSTREAM: cgroup: add support for eBPF programs
Cherry-pick from commit 3007098494bec614fb55dee7bc0410bb7db5ad18

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug: 30950746
Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28
2017-05-22 15:30:56 -07:00
Dmitry Shmidt
5b9202d62b Revert "ANDROID: [RFC]cgroup: Change from CAP_SYS_NICE to CAP_SYS_RESOURCE for cgroup migration permissions"
This reverts commit 8cc698d951.

Change-Id: Iad523b2f7fa83c461a5e965272319fd8f65ef10b
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-04-25 12:42:05 -07:00
Greg Kroah-Hartman
a2659b2b78 This is the 4.9.24 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlj5tWEACgkQONu9yGCS
 aT4noA//ZKztJudzNoXOaYThKQrbKTJeB+cL4PusLqd7gv9yKt7IJD+5+V/+jYRS
 N5GlwNOS02r6tduhCz9I/XjRKzilvC2WHt3Akf8nPC2dadhzcOW3UPvT0IEoIrHy
 FkTnUn/1TvBA2I1uB8k93nk/jfx7RYrO+4axiyZ0XQCMJTc7M4MmPTGw2jrzxPug
 Ogy7taNpY+By7PxJyU4rH8Lh2jiIYGYDrRWExZIcL7A7IlWBPbVkagzn2PPgQbg4
 XX250JZ7bQyCGcrVxlTqaBScEkdICZTUqI3d3HQUwF39QBq9z+VlqP13T5j7XMea
 d7jHZswWClQj4xeGWq+iiLR0VF5OVCTkbhl61YTI9rPFEaewSwMiGrpHiidKTaOK
 KAxa34dwrz1IVI/p/a9DbZBj7IpwiH3jF3rEvx1ieEbTGHxLBi/COGK3aCES5TtX
 mjz/WjReo4aR3fudgNFJqNsmJmgEJ3yAO7Lk3tglBmYOgUpQhbRwClPjYvXVphzH
 ClYE1KCMMV4BTcDUqEfoBL46Qd2141E9Wxgmh+KYB+14X0QqQg+qRyZYyOvUGIuD
 A7J7MbXyBuM0dXS2t9olLbHb+tllGz6lBHANf8g141ot8jjlCBoP5yjBP6W0gZP8
 UsKSH/sLEPGMO3KD4D2lLixkoENilS+3DhJwmrrqZAhhaqlMPEI=
 =/rwk
 -----END PGP SIGNATURE-----

Merge 4.9.24 into android-4.9

Changes in 4.9.24:
	cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
	tcmu: Fix possible overwrite of t_data_sg's last iov[]
	tcmu: Fix wrongly calculating of the base_command_size
	tcmu: Skip Data-Out blocks before gathering Data-In buffer for BIDI case
	thp: fix MADV_DONTNEED vs. MADV_FREE race
	thp: fix MADV_DONTNEED vs clear soft dirty race
	zsmalloc: expand class bit
	orangefs: free superblock when mount fails
	drm/nouveau/mpeg: mthd returns true on success now
	drm/nouveau/mmu/nv4a: use nv04 mmu rather than the nv44 one
	drm/etnaviv: fix missing unlock on error in etnaviv_gpu_submit()
	CIFS: reconnect thread reschedule itself
	CIFS: store results of cifs_reopen_file to avoid infinite wait
	Input: xpad - add support for Razer Wildcat gamepad
	perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
	x86/efi: Don't try to reserve runtime regions
	x86/signals: Fix lower/upper bound reporting in compat siginfo
	x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
	x86/vdso: Ensure vdso32_enabled gets set to valid values only
	x86/vdso: Plug race between mapping and ELF header setup
	acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
	ACPI / scan: Set the visited flag for all enumerated devices
	parisc: fix bugs in pa_memcpy
	efi/libstub: Skip GOP with PIXEL_BLT_ONLY format
	efi/fb: Avoid reconfiguration of BAR that covers the framebuffer
	iscsi-target: Fix TMR reference leak during session shutdown
	iscsi-target: Drop work-around for legacy GlobalSAN initiator
	scsi: sr: Sanity check returned mode data
	scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
	scsi: qla2xxx: Add fix to read correct register value for ISP82xx.
	scsi: sd: Fix capacity calculation with 32-bit sector_t
	target: Avoid mappedlun symlink creation during lun shutdown
	xen, fbfront: fix connecting to backend
	new privimitive: iov_iter_revert()
	make skb_copy_datagram_msg() et.al. preserve ->msg_iter on error
	libnvdimm: fix blk free space accounting
	libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
	can: ifi: use correct register to read rx status
	pwm: rockchip: State of PWM clock should synchronize with PWM enabled state
	cpufreq: Bring CPUs up even if cpufreq_online() failed
	irqchip/irq-imx-gpcv2: Fix spinlock initialization
	ftrace: Fix removing of second function probe
	char: lack of bool string made CONFIG_DEVPORT always on
	Revert "MIPS: Lantiq: Fix cascaded IRQ setup"
	kvm: fix page struct leak in handle_vmon
	zram: do not use copy_page with non-page aligned address
	ftrace: Fix function pid filter on instances
	crypto: algif_aead - Fix bogus request dereference in completion function
	crypto: ahash - Fix EINPROGRESS notification callback
	parisc: Fix get_user() for 64-bit value on 32-bit kernel
	ath9k: fix NULL pointer dereference
	dvb-usb-v2: avoid use-after-free
	ext4: fix inode checksum calculation problem if i_extra_size is small
	mm: memcontrol: use special workqueue for creating per-memcg caches
	drm/nouveau/disp/mcp7x: disable dptmds workaround
	nbd: use loff_t for blocksize and nbd_set_size args
	nbd: fix 64-bit division
	ASoC: Intel: select DW_DMAC_CORE since it's mandatory
	platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
	x86/xen: Fix APIC id mismatch warning on Intel
	ACPI / EC: Use busy polling mode when GPE is not enabled
	rtc: tegra: Implement clock handling
	mm: Tighten x86 /dev/mem with zeroing reads
	dvb-usb: don't use stack for firmware load
	dvb-usb-firmware: don't do DMA on stack
	cxusb: Use a dma capable buffer also for reading
	virtio-console: avoid DMA from stack
	net: ipv6: check route protocol when deleting routes
	sctp: deny peeloff operation on asocs with threads sleeping on it
	Linux 4.9.24

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-04-21 09:48:33 +02:00
Tejun Heo
f44236a1b0 cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:31:18 +02:00
Dmitry Shmidt
dcb6110067 This is the 4.9.9 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlicFX0ACgkQONu9yGCS
 aT7a2A//UGRG6hYhQwzHp25vRQmfXnHcet1cB9D2ADEY3EqrfWgcZcHWF4IbGtJ2
 OWGtsgp3eiOlgYZiQBj1VG90M+8RixM06utcjlQzISsFaFGDPIeAX+IDY3rG51qR
 Wpo0p9ujV5HMsUJEXZ1OPbqrsgKqKOQpecq9nSZJR5bhdb3SkvgQK8VFbavko60J
 5oUWM7n0CRy7VMNjv5UekUss7cNdc4OyqUbCcekYlmGtJ3y5EC9O0k07t7jMjKu8
 49MgoXKVEk/7nYapGV3AJISuZxHXgirM9HSHsKXDZoFG94hjSlVHrDBDIykTQQ4t
 7XKnCHveB+tGrE+pdDkFtYrN7k6pMdRMxq1NPA2eMpWzsvH4aQWaWXVMESkMIwLP
 x5C29xkOa9MmOTM4oqCrJ66CGar8Kcz156TEn4MR8Q4TSKoA4AHO85C8jCYOB+ID
 18KJVrNLOgybYWr/Ci2GM/lx4A5mD5yTDm++AZRt4hBBo5K9Ns/d9OatlVtTmzgT
 ollAUkl593/nJoIqA4o0mbaHrwawI9Gn3fgJQyEkiy0IksMkTzj2XcmwAbzOGWaz
 mIEWLPg9OYZXHQxqNTAGdHPYl2NEWqv5mE0A7B+3qhUsu8VoU3LWET6wTEzl5bHC
 obaygJ+r/K1pvey+P9BH35ModRSy/qA3YmENEQmuF3//BUGAP4g=
 =9AU7
 -----END PGP SIGNATURE-----

Merge tag 'v4.9.9' into android-4.9-aosp

This is the 4.9.9 stable release
2017-02-09 10:49:40 -08:00
Tejun Heo
1d88791d5e cgroup: don't online subsystems before cgroup_name/path() are operational
commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream.

While refactoring cgroup creation, a5bca21520 ("cgroup: factor out
cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
before the new cgroup is associated with it kernfs_node.  This is fine
for cgroup proper but cgroup_name/path() depend on the associated
kernfs_node and if a subsystem makes the new cgroup_subsys_state
visible, which they're allowed to after onlining, it can lead to NULL
dereference.

The current code performs cgroup creation and subsystem onlining in
cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
visible afterwards.  There's no reason to online the subsystems early
and we can simply drop cgroup_apply_control_enable() call from
cgroup_create() so that the subsystems are onlined and made visible at
the same time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: a5bca21520 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-09 08:08:28 +01:00
John Stultz
8cc698d951 ANDROID: [RFC]cgroup: Change from CAP_SYS_NICE to CAP_SYS_RESOURCE for cgroup migration permissions
Try to better match what we're pushing upstream, use CAP_SYS_RESOURCE
instead of CAP_SYS_NICE, which shoudln't affect Android as Zygote and
system_server already use CAP_SYS_RESOURCE.

Change-Id: I9b7ba2d9be1a469c9636497a6287f840891a91a8
Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-01-27 13:55:31 -08:00
Dmitry Torokhov
579a63bf28 CHROMIUM: cgroups: relax permissions on moving tasks between cgroups
Android expects system_server to be able to move tasks between different
cgroups/cpusets, but does not want to be running as root. Let's relax
permission check so that processes can move other tasks if they have
CAP_SYS_NICE in the affected task's user namespace.

BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat

Change-Id: Ia919c66ab6ed6a6daf7c4cf67feb38b13b1ad09b
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394927
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
2017-01-27 13:55:30 -08:00
Linus Torvalds
f34d3606f7 Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - tracepoints for basic cgroup management operations added

 - kernfs and cgroup path formatting functions updated to behave in the
   style of strlcpy()

 - non-critical bug fixes

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
  cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
  cpuset: fix error handling regression in proc_cpuset_show()
  cgroup: add tracepoints for basic operations
  cgroup: make cgroup_path() and friends behave in the style of strlcpy()
  kernfs: remove kernfs_path_len()
  kernfs: make kernfs_path*() behave in the style of strlcpy()
  kernfs: add dummy implementation of kernfs_path_from_node()
2016-10-14 12:18:50 -07:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Ingo Molnar
0b429e18c2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:54:46 +02:00
Tejun Heo
e0223003e6 cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
4c737b41de ("cgroup: make cgroup_path() and friends behave in the
style of strlcpy()") broke error handling in proc_cgroup_show() and
cgroup_release_agent() by not handling negative return values from
cgroup_path_ns_locked().  Fix it.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-09-29 15:55:16 +02:00
Linus Torvalds
8ab293e3a1 Merge branch 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Three late fixes for cgroup: Two cpuset ones, one trivial and the
  other pretty obscure, and a cgroup core fix for a bug which impacts
  cgroup v2 namespace users"

* 'for-4.8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix invalid controller enable rejections with cgroup namespace
  cpuset: fix non static symbol warning
  cpuset: handle race between CPU hotplug and cpuset_hotplug_work
2016-09-27 16:43:11 -07:00
Tejun Heo
9157056da8 cgroup: fix invalid controller enable rejections with cgroup namespace
On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908fd2 ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Cc: Aditya Kali <adityakali@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
2016-09-23 16:55:49 -04:00
Eric W. Biederman
7872559664 Merge branch 'nsfs-ioctls' into HEAD
From: Andrey Vagin <avagin@openvz.org>

Each namespace has an owning user namespace and now there is not way
to discover these relationships.

Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.

Why we may want to know relationships between namespaces?

One use would be visualization, in order to understand the running
system.  Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?

One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.

There [1] was a discussion about which interface to choose to determing
relationships between namespaces.

Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble.  I think this may actually a case for creating ioctls
> for these two cases.  Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.

Here is an implementaions of these ioctl-s.

$ man man7/namespaces.7
...
Since  Linux  4.X,  the  following  ioctl(2)  calls are supported for
namespace file descriptors.  The correct syntax is:

      fd = ioctl(ns_fd, ioctl_type);

where ioctl_type is one of the following:

NS_GET_USERNS
      Returns a file descriptor that refers to an owning user names‐
      pace.

NS_GET_PARENT
      Returns  a  file descriptor that refers to a parent namespace.
      This ioctl(2) can be used for pid  and  user  namespaces.  For
      user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
      meaning.

In addition to generic ioctl(2) errors, the following  specific  ones
can occur:

EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.

EPERM  The  requested  namespace  is outside of the current namespace
      scope.

[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101

Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
  outside of the init namespace, so we can return EPERM in this case too.
  > The fewer special cases the easier the code is to get
  > correct, and the easier it is to read. // Eric

Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
  grabs a reference.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22 20:00:36 -05:00
Andrey Vagin
bcac25a58b kernel: add a helper to get an owning user namespace for a namespace
Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.

Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22 19:59:39 -05:00
Eric W. Biederman
df75e7748b userns: When the per user per user namespace limit is reached return ENOSPC
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong.  I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC.  It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-09-22 13:25:56 -05:00
Ingo Molnar
7cf0f1426a Merge branch 'locking/urgent' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 15:21:48 +02:00
Johannes Weiner
d979a39d72 cgroup: duplicate cgroup reference when cloning sockets
When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Peter Zijlstra
3942a9bd7b locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()
The current percpu-rwsem read side is entirely free of serializing insns
at the cost of having a synchronize_sched() in the write path.

The latency of the synchronize_sched() is too high for cgroups. The
commit 1ed1328792 talks about the write path being a fairly cold path
but this is not the case for Android which moves task to the foreground
cgroup and back around binder IPC calls from foreground processes to
background processes, so it is significantly hotter than human initiated
operations.

Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
problem, hopefully it should not be that slow after another commit:

  80127a3968 ("locking/percpu-rwsem: Optimize readers and reduce global impact").

We could just add rcu_sync_enter() into cgroup_init() but we do not want
another synchronize_sched() at boot time, so this patch adds the new helper
which doesn't block but currently can only be called before the first use.

Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Colin Cross <ccross@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rom Lemarchand <romlem@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Link: http://lkml.kernel.org/r/20160811165413.GA22807@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 15:36:59 +02:00
Tejun Heo
ed1777de25 cgroup: add tracepoints for basic operations
Debugging what goes wrong with cgroup setup can get hairy.  Add
tracepoints for cgroup hierarchy mount, cgroup creation/destruction
and task migration operations for better visibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-08-10 11:23:44 -04:00
Tejun Heo
4c737b41de cgroup: make cgroup_path() and friends behave in the style of strlcpy()
cgroup_path() and friends used to format the path from the end and
thus the resulting path usually didn't start at the start of the
passed in buffer.  Also, when the buffer was too small, the partial
result was truncated from the head rather than tail and there was no
way to tell how long the full path would be.  These make the functions
less robust and more awkward to use.

With recent updates to kernfs_path(), cgroup_path() and friends can be
made to behave in strlcpy() style.

* cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
  always return the length of the full path.  If buffer is too small,
  it contains nul terminated truncated output.

* All users updated accordingly.

v2: cgroup_path() usage in kernel/sched/debug.c converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-08-10 11:23:44 -04:00
Eric W. Biederman
d08311dd6f cgroupns: Add a limit on the number of cgroup namespaces
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-08-08 14:42:03 -05:00
Linus Torvalds
574c7e2333 Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull more cgroup updates from Tejun Heo:
 "I forgot to include the patches which got applied to for-4.7-fixes
  late during last cycle.

  Eric's three patches fix bugs introduced with the namespace support"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroupns: Only allow creation of hierarchies in the initial cgroup namespace
  cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns
  cgroupns: Fix the locking in copy_cgroup_ns
2016-07-29 14:29:04 -07:00
Linus Torvalds
468fc7ed55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

 2) Make DSA binding more sane, from Andrew Lunn.

 3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

 4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

 5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface.  From Brenden Blanco and
    others.

 6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

 7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

 8) Simplify netlink conntrack entry layout, from Florian Westphal.

 9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

11) Support qdisc packet injection in pktgen, from John Fastabend.

12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

13) Add NV congestion control support to TCP, from Lawrence Brakmo.

14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  xgene: Fix build warning with ACPI disabled.
  be2net: perform temperature query in adapter regardless of its interface state
  l2tp: Correctly return -EBADF from pppol2tp_getname.
  net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
  net: ipmr/ip6mr: update lastuse on entry change
  macsec: ensure rx_sa is set when validation is disabled
  tipc: dump monitor attributes
  tipc: add a function to get the bearer name
  tipc: get monitor threshold for the cluster
  tipc: make cluster size threshold for monitoring configurable
  tipc: introduce constants for tipc address validation
  net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
  MAINTAINERS: xgene: Add driver and documentation path
  Documentation: dtb: xgene: Add MDIO node
  dtb: xgene: Add MDIO node
  drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
  drivers: net: xgene: Use exported functions
  drivers: net: xgene: Enable MDIO driver
  drivers: net: xgene: Add backward compatibility
  drivers: net: phy: xgene: Add MDIO driver
  ...
2016-07-27 12:03:20 -07:00
Linus Torvalds
b55b048718 Merge branch 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too exciting.

   - updates to the pids controller so that pid limit breaches can be
     noticed and monitored from userland.

   - cleanups and non-critical bug fixes"

* 'for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: remove duplicated include from cgroup.c
  cgroup: Use lld instead of ld when printing pids controller events_limit
  cgroup: Add pids controller event when fork fails because of pid limit
  cgroup: allow NULL return from ss->css_alloc()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
2016-07-26 14:34:17 -07:00