Commit graph

762 commits

Author SHA1 Message Date
xxmustafacooTR
71e430b7c9
gaming_control, mali: more controls, optimizations 2024-09-27 17:20:00 +03:00
Diep Quynh
609163c324
drivers: Introduce brand new kernel gaming mode
How to trigger gaming mode? Just open a game that is supported in games list
How to exit? Kill the game from recents, or simply back to homescreen

What does this gaming mode do?
- It limits big cluster maximum frequency to 2,0GHz, and little cluster
  to values matching GPU frequencies as below:
+ 338MHz: 455MHz
+ 385MHz and above: 1053MHz
- As for the cluster freq limits, it overcomes heating issue while playing
  heavy games, as well as saves battery juice

Big thanks to [kerneltoast] for the following commits on his wahoo kernel
5ac1e81d3d
e13e2c4554

Gaming control's idea was based on these

Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
2024-09-27 17:19:55 +03:00
Diep Quynh
665c080acf
sched: ems: Update EMS to latest beyond2lte source
Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
2024-09-27 17:17:49 +03:00
Diep Quynh
13a21ca507
universal9810: Reset scheduler to android-4.9-q state
We will adapt new scheduler mechanism after this

Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
2024-09-27 17:17:26 +03:00
FAROVITUS
af1d3ae977 Merge 4.9.212 branch 'android-4.9-q' into tw10-android-4.9-q
Documentation/filesystems/fscrypt.rst
	arch/arm/common/Kconfig
	arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
	arch/arm64/boot/dts/amd/amd-seattle-soc.dtsi
	arch/arm64/boot/dts/arm/juno-clocks.dtsi
	arch/arm64/boot/dts/broadcom/ns2.dtsi
	arch/arm64/boot/dts/lg/lg1312.dtsi
	arch/arm64/boot/dts/lg/lg1313.dtsi
	arch/arm64/boot/dts/marvell/armada-37xx.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2180.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2597.dtsi
	arch/arm64/boot/dts/nvidia/tegra210.dtsi
	arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
	arch/arm64/boot/dts/qcom/msm8996.dtsi
	arch/arm64/configs/ranchu64_defconfig
	arch/arm64/include/asm/cpucaps.h
	arch/arm64/kernel/cpufeature.c
	arch/arm64/kernel/traps.c
	arch/arm64/mm/mmu.c
	crypto/Makefile
	crypto/ablkcipher.c
	crypto/blkcipher.c
	crypto/testmgr.h
	crypto/zstd.c
	drivers/android/binder.c
	drivers/android/binder_alloc.c
	drivers/char/random.c
	drivers/clocksource/exynos_mct.c
	drivers/dma/pl330.c
	drivers/hid/hid-sony.c
	drivers/hid/uhid.c
	drivers/hid/usbhid/hiddev.c
	drivers/i2c/i2c-core.c
	drivers/md/dm-crypt.c
	drivers/media/v4l2-core/videobuf2-v4l2.c
	drivers/mmc/host/dw_mmc.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/usb/r8152.c
	drivers/scsi/scsi_logging.c
	drivers/scsi/sd.c
	drivers/scsi/ufs/ufshcd-pci.c
	drivers/scsi/ufs/ufshcd-pltfrm.c
	drivers/staging/android/Kconfig
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion_priv.h
	drivers/staging/android/ion/ion_system_heap.c
	drivers/staging/android/lowmemorykiller.c
	drivers/tty/serial/samsung.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/host/xhci-hub.c
	drivers/video/fbdev/core/fbmon.c
	drivers/video/fbdev/core/modedb.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/keyinfo.c
	fs/ext4/ialloc.c
	fs/ext4/namei.c
	fs/ext4/xattr.c
	fs/f2fs/checkpoint.c
	fs/f2fs/data.c
	fs/f2fs/debug.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/inline.c
	fs/f2fs/inode.c
	fs/f2fs/namei.c
	fs/f2fs/node.c
	fs/f2fs/recovery.c
	fs/f2fs/segment.c
	fs/f2fs/segment.h
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/fat/dir.c
	fs/fat/fatent.c
	fs/file.c
	fs/namespace.c
	fs/pnode.c
	fs/proc/inode.c
	fs/proc/root.c
	fs/proc/task_mmu.c
	fs/sdcardfs/dentry.c
	fs/sdcardfs/derived_perm.c
	fs/sdcardfs/file.c
	fs/sdcardfs/inode.c
	fs/sdcardfs/lookup.c
	fs/sdcardfs/main.c
	fs/sdcardfs/sdcardfs.h
	fs/sdcardfs/super.c
	include/linux/blk_types.h
	include/linux/cpuhotplug.h
	include/linux/cred.h
	include/linux/fb.h
	include/linux/power_supply.h
	include/linux/sched.h
	include/linux/zstd.h
	include/trace/events/sched.h
	include/uapi/linux/android/binder.h
	init/Kconfig
	init/main.c
	kernel/bpf/hashtab.c
	kernel/cpu.c
	kernel/cred.c
	kernel/fork.c
	kernel/locking/spinlock_debug.c
	kernel/panic.c
	kernel/printk/printk.c
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/fair.c
	kernel/sched/rt.c
	kernel/sched/walt.c
	kernel/sched/walt.h
	kernel/trace/trace.c
	lib/bug.c
	lib/list_debug.c
	lib/vsprintf.c
	lib/zstd/bitstream.h
	lib/zstd/compress.c
	lib/zstd/decompress.c
	lib/zstd/fse.h
	lib/zstd/fse_compress.c
	lib/zstd/fse_decompress.c
	lib/zstd/huf_compress.c
	lib/zstd/huf_decompress.c
	lib/zstd/zstd_internal.h
	mm/debug.c
	mm/filemap.c
	mm/rmap.c
	net/core/filter.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/sysfs_net_ipv4.c
	net/ipv4/tcp_input.c
	net/ipv4/tcp_output.c
	net/ipv4/udp.c
	net/ipv6/netfilter/nf_conntrack_reasm.c
	net/netfilter/Kconfig
	net/netfilter/Makefile
	net/netfilter/xt_qtaguid.c
	net/netfilter/xt_qtaguid_internal.h
	net/xfrm/xfrm_policy.c
	net/xfrm/xfrm_state.c
	scripts/checkpatch.pl
	security/selinux/hooks.c
	sound/core/compress_offload.c
2020-02-12 12:32:38 +02:00
FAROVITUS
2b92eefa41 import G965FXXU7DTAA OSRC
*First release for Android (Q).

Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-02-04 13:50:09 +02:00
Greg Kroah-Hartman
e9766ef8f1 This is the 4.9.197 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl2o0kcACgkQONu9yGCS
 aT4nAQ//UOJDaAeVD3HlVI52sv8ZS7lCck8MCrVXeOWfgP5BPgVTcQairhBxtCiz
 eiIfQclHWTqoZygCHAf2mGNgSymYT13w0EJ/pnrGIBs66+mWTJcZ/U31iD8ftGo+
 LBC8nUL4ocmLN+eNFRhDpFjRksFZnYAcE5L3A3Y+oMat/QplYT/R+T6dgnN08VHI
 oBDNen1y8Q+4PG3LnBBtS1PjvnSJMVPxX4MkRfBD+wjSNZdPCDqQ/piuZCmeBSsy
 nkkW/fEsWAKbsXT3sybYTmjJpaJZvrhTGi/E/LCFAsAE1iQTawovC4xdT4LwGbID
 49qg/Fg9l8485zv7sTJOxjJBQKeePKVWuU0iSuOg9d1lc62ilO0A1NQ61Ys8wqwn
 DulsHART0d47sQzTwitMBJxX45iW+z+EUSi4WXTZSi1VpAvi8pIYHsnsPLaAOJXR
 k/vJDQJv0zCZD488pHpMIg3CQHCgavYh7V3bpcpkUcjI/YnnBVGjwn1X7ZT88FTh
 q9KK3mhz5ZfpNu7ZCrmn7uXdMOa8SL7r6NBhNjUAAoU4bOIg782AxIMOzBYJ9p0w
 BqcmgKcIaClBMW+q1zYuDh1l221G2VAPvwjaTr9EZ2jEjr5EyqWwZU+icOvZG+vl
 ETfH4NbRy5Zi5BinHkgYu2GRhk4MmJm/9QNpd/fu14sdnOwQUZs=
 =rLBB
 -----END PGP SIGNATURE-----

Merge 4.9.197 into android-4.9-q

Changes in 4.9.197
	KVM: s390: Test for bad access register and size at the start of S390_MEM_OP
	s390/topology: avoid firing events before kobjs are created
	s390/cio: avoid calling strlen on null pointer
	s390/cio: exclude subchannels with no parent from pseudo check
	KVM: nVMX: handle page fault in vmread fix
	ASoC: Define a set of DAPM pre/post-up events
	powerpc/powernv: Restrict OPAL symbol map to only be readable by root
	can: mcp251x: mcp251x_hw_reset(): allow more time after a reset
	crypto: qat - Silence smp_processor_id() warning
	usercopy: Avoid HIGHMEM pfn warning
	timer: Read jiffies once when forwarding base clk
	watchdog: imx2_wdt: fix min() calculation in imx2_wdt_set_timeout
	ieee802154: atusb: fix use-after-free at disconnect
	cfg80211: initialize on-stack chandefs
	ima: always return negative code for error
	fs: nfs: Fix possible null-pointer dereferences in encode_attrs()
	9p: avoid attaching writeback_fid on mmap with type PRIVATE
	xen/pci: reserve MCFG areas earlier
	ceph: fix directories inode i_blkbits initialization
	ceph: reconnect connection if session hang in opening state
	drm/amdgpu: Check for valid number of registers to read
	thermal: Fix use-after-free when unregistering thermal zone device
	fuse: fix memleak in cuse_channel_open
	sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
	kernel/elfcore.c: include proper prototypes
	tools lib traceevent: Do not free tep->cmdlines in add_new_comm() on failure
	perf tools: Fix segfault in cpu_cache_level__read()
	perf stat: Fix a segmentation fault when using repeat forever
	perf stat: Reset previous counts on repeat with interval
	crypto: caam - fix concurrency issue in givencrypt descriptor
	coresight: etm4x: Use explicit barriers on enable/disable
	cfg80211: add and use strongly typed element iteration macros
	cfg80211: Use const more consistently in for_each_element macros
	nl80211: validate beacon head
	ASoC: sgtl5000: Improve VAG power and mute control
	panic: ensure preemption is disabled during panic()
	USB: rio500: Remove Rio 500 kernel driver
	USB: yurex: Don't retry on unexpected errors
	USB: yurex: fix NULL-derefs on disconnect
	USB: usb-skeleton: fix runtime PM after driver unbind
	USB: usb-skeleton: fix NULL-deref on disconnect
	xhci: Fix false warning message about wrong bounce buffer write length
	xhci: Prevent device initiated U1/U2 link pm if exit latency is too long
	xhci: Check all endpoints for LPM timeout
	usb: xhci: wait for CNR controller not ready bit in xhci resume
	xhci: Increase STS_SAVE timeout in xhci_suspend()
	USB: adutux: remove redundant variable minor
	USB: adutux: fix use-after-free on disconnect
	USB: adutux: fix NULL-derefs on disconnect
	USB: adutux: fix use-after-free on release
	USB: iowarrior: fix use-after-free on disconnect
	USB: iowarrior: fix use-after-free on release
	USB: iowarrior: fix use-after-free after driver unbind
	USB: usblp: fix runtime PM after driver unbind
	USB: chaoskey: fix use-after-free on release
	USB: ldusb: fix NULL-derefs on driver unbind
	serial: uartlite: fix exit path null pointer
	USB: serial: keyspan: fix NULL-derefs on open() and write()
	USB: serial: ftdi_sio: add device IDs for Sienna and Echelon PL-20
	USB: serial: option: add Telit FN980 compositions
	USB: serial: option: add support for Cinterion CLS8 devices
	USB: serial: fix runtime PM after driver unbind
	USB: usblcd: fix I/O after disconnect
	USB: microtek: fix info-leak at probe
	USB: dummy-hcd: fix power budget for SuperSpeed mode
	usb: renesas_usbhs: gadget: Do not discard queues in usb_ep_set_{halt,wedge}()
	usb: renesas_usbhs: gadget: Fix usb_ep_set_{halt,wedge}() behavior
	USB: legousbtower: fix slab info leak at probe
	USB: legousbtower: fix deadlock on disconnect
	USB: legousbtower: fix potential NULL-deref on disconnect
	USB: legousbtower: fix open after failed reset request
	USB: legousbtower: fix use-after-free on release
	staging: vt6655: Fix memory leak in vt6655_probe
	iio: adc: ad799x: fix probe error handling
	iio: light: opt3001: fix mutex unlock race
	efivar/ssdt: Don't iterate over EFI vars if no SSDT override was specified
	perf llvm: Don't access out-of-scope array
	perf inject jit: Fix JIT_CODE_MOVE filename
	CIFS: Gracefully handle QueryInfo errors during open
	CIFS: Force revalidate inode when dentry is stale
	CIFS: Force reval dentry if LOOKUP_REVAL flag is set
	kernel/sysctl.c: do not override max_threads provided by userspace
	staging: fbtft: Stop using BL_CORE_DRIVER1
	Staging: fbtft: fix memory leak in fbtft_framebuffer_alloc
	MIPS: Disable Loongson MMI instructions for kernel build
	Fix the locking in dcache_readdir() and friends
	media: stkwebcam: fix runtime PM after driver unbind
	tracing/hwlat: Report total time spent in all NMIs during the sample
	tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
	tracing: Get trace_array reference for available_tracers files
	x86/asm: Fix MWAITX C-state hint value
	xfs: clear sb->s_fs_info on mount failure
	Linux 4.9.197

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-10-17 13:54:42 -07:00
Michal Hocko
5a4a1217c0 kernel/sysctl.c: do not override max_threads provided by userspace
commit b0f53dbc4bc4c371f38b14c391095a3bb8a0bb40 upstream.

Partially revert 16db3d3f11 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.

set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory.  This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change.  It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.

Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction.  While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.

This became more severe when we switched x86 from 4k to 8k kernel
stacks.  Starting since 6538b8ea88 ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.

In the particular case

  3.12
  kernel.threads-max = 515561

  4.4
  kernel.threads-max = 200000

Neither of the two values is really insane on 32GB machine.

I am not sure we want/need to tune the max_thread value further.  If
anything the tuning should be removed altogether if proven not useful in
general.  But we definitely need a way to override this auto-tuning.

Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f11 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-17 13:42:44 -07:00
Joel Fernandes (Google)
af1070fbf2 UPSTREAM: pidfd: add polling support
This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>

(cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-09-03 13:45:29 -07:00
Christian Brauner
3941f126e3 BACKPORT: fork: do not release lock that wasn't taken
Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_begin() first.

During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is
taken.

Here's the relevant splat:

entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:566
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:ffff888094117b48 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
  percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
  cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
  copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
  copy_process kernel/fork.c:1772 [inline]
  _do_fork+0x25d/0xfd0 kernel/fork.c:2338
  __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
  __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
  __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
  do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
  do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
Signed-off-by: Christian Brauner <christian@brauner.io>

(cherry picked from commit c3b7112df86b769927a60a6d7175988ca3d60f09)

Conflicts:
        kernel/fork.c

(1. Replaced cgroup_threadgroup_change_end with threadgroup_change_end)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: Ib9ecb1e5c0c6e2d062b89c25109ec571570eb497
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-09-03 13:45:09 -07:00
Christian Brauner
0e020c19bb BACKPORT: clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>

(cherry picked from commit b3e5838252665ee4cfa76b82bdf1198dca81e5be)

Conflicts:
        kernel/fork.c

(1. Replaced proc_pid_ns() with its direct implementation.)

Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I3c804a92faea686e5bf7f99df893fe3a5d87ddf7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-09-03 13:44:48 -07:00
Greg Kroah-Hartman
0eb90dd8f7 This is the 4.9.187 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1GilkACgkQONu9yGCS
 aT6ibw//QcoRk+6jjWIzW2R83tvcSpDREIYKF5xDPULqmaUKv7qcbGqdCT4OI0Vf
 CySao6jvmN+1V+5OlQHHw1qmithHF/Y4yEVi+eL0CcMgB6qA2tQxeA0ffFu0Fzxe
 oKrAIDANAJ2FKymqbIzTJU8AChNuy+Rc+C60L9O0ZhBHykLYvO7/VepTxdO2aMGs
 a9tdZyXnOAr/ZgKx+uN1F8Rx2ZorNfFmP2rDxEZ/B8pVrXsOjMAcxWRsZBv+w9mc
 zWaMPEL1vIR/kG35L18l2EFwZ++uIGvfGR2HxhNRUlTqN4m/3trATIc0eRn5PyAC
 qBlVdcwUeJKavBqK3cgHvK9CzWQdSMxNefk9A0H66ZGpfSuNCiFjw54AXEiRzx3t
 OvzUvDjfa36jsKW7yiYXdLbEa52gT14UDzmSIzlAYQLBplEadHikJ0FxCFzAWpFt
 C13gG1e4G0ZKEHH8wZMuRdanHbN87c/0Sm4rgTLMwCO6I1PeILcUwIq9El9M4RVL
 MmHSXLKduaHbPJWx+Qopl6NxjqTe/VR8paT6q1QnU00/4kP15aIVE08vZSiGSNP+
 Gp6xbv12skAx1m1zP7K82oGdwnCVCJUqlC9wafcjsaxcoCVBWvIgPkyMWAzUeFzc
 Ub2yIS9iulUeYOxPXBZKWCqgwpl7kovGztf4gwX2Kuy50mXSZQg=
 =ozfF
 -----END PGP SIGNATURE-----

Merge 4.9.187 into android-4.9-q

Changes in 4.9.187
	MIPS: ath79: fix ar933x uart parity mode
	MIPS: fix build on non-linux hosts
	arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly
	dmaengine: imx-sdma: fix use-after-free on probe error path
	ath10k: Do not send probe response template for mesh
	ath9k: Check for errors when reading SREV register
	ath6kl: add some bounds checking
	ath: DFS JP domain W56 fixed pulse type 3 RADAR detection
	batman-adv: fix for leaked TVLV handler.
	media: dvb: usb: fix use after free in dvb_usb_device_exit
	crypto: talitos - fix skcipher failure due to wrong output IV
	media: marvell-ccic: fix DMA s/g desc number calculation
	media: vpss: fix a potential NULL pointer dereference
	media: media_device_enum_links32: clean a reserved field
	net: stmmac: dwmac1000: Clear unused address entries
	net: stmmac: dwmac4/5: Clear unused address entries
	signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig
	af_key: fix leaks in key_pol_get_resp and dump_sp.
	xfrm: Fix xfrm sel prefix length validation
	media: mc-device.c: don't memset __user pointer contents
	media: staging: media: davinci_vpfe: - Fix for memory leak if decoder initialization fails.
	net: phy: Check against net_device being NULL
	crypto: talitos - properly handle split ICV.
	crypto: talitos - Align SEC1 accesses to 32 bits boundaries.
	tua6100: Avoid build warnings.
	locking/lockdep: Fix merging of hlocks with non-zero references
	media: wl128x: Fix some error handling in fm_v4l2_init_video_device()
	cpupower : frequency-set -r option misses the last cpu in related cpu list
	net: fec: Do not use netdev messages too early
	net: axienet: Fix race condition causing TX hang
	s390/qdio: handle PENDING state for QEBSM devices
	perf cs-etm: Properly set the value of 'old' and 'head' in snapshot mode
	perf test 6: Fix missing kvm module load for s390
	gpio: omap: fix lack of irqstatus_raw0 for OMAP4
	gpio: omap: ensure irq is enabled before wakeup
	regmap: fix bulk writes on paged registers
	bpf: silence warning messages in core
	rcu: Force inlining of rcu_read_lock()
	blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration
	xfrm: fix sa selector validation
	perf evsel: Make perf_evsel__name() accept a NULL argument
	vhost_net: disable zerocopy by default
	ipoib: correcly show a VF hardware address
	EDAC/sysfs: Fix memory leak when creating a csrow object
	ipsec: select crypto ciphers for xfrm_algo
	media: i2c: fix warning same module names
	ntp: Limit TAI-UTC offset
	timer_list: Guard procfs specific code
	acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
	media: coda: fix mpeg2 sequence number handling
	media: coda: increment sequence offset for the last returned frame
	mt7601u: do not schedule rx_tasklet when the device has been disconnected
	x86/build: Add 'set -e' to mkcapflags.sh to delete broken capflags.c
	mt7601u: fix possible memory leak when the device is disconnected
	ath10k: fix PCIE device wake up failed
	perf tools: Increase MAX_NR_CPUS and MAX_CACHES
	libata: don't request sense data on !ZAC ATA devices
	clocksource/drivers/exynos_mct: Increase priority over ARM arch timer
	rslib: Fix decoding of shortened codes
	rslib: Fix handling of of caller provided syndrome
	ixgbe: Check DDM existence in transceiver before access
	crypto: asymmetric_keys - select CRYPTO_HASH where needed
	EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec
	bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()
	iwlwifi: mvm: Drop large non sta frames
	net: usb: asix: init MAC address buffers
	gpiolib: Fix references to gpiod_[gs]et_*value_cansleep() variants
	Bluetooth: hci_bcsp: Fix memory leak in rx_skb
	Bluetooth: 6lowpan: search for destination address in all peers
	Bluetooth: Check state in l2cap_disconnect_rsp
	Bluetooth: validate BLE connection interval updates
	gtp: fix Illegal context switch in RCU read-side critical section.
	gtp: fix use-after-free in gtp_newlink()
	xen: let alloc_xenballooned_pages() fail if not enough memory free
	scsi: NCR5380: Reduce goto statements in NCR5380_select()
	scsi: NCR5380: Always re-enable reselection interrupt
	scsi: mac_scsi: Increase PIO/PDMA transfer length threshold
	crypto: ghash - fix unaligned memory access in ghash_setkey()
	crypto: arm64/sha1-ce - correct digest for empty data in finup
	crypto: arm64/sha2-ce - correct digest for empty data in finup
	crypto: chacha20poly1305 - fix atomic sleep when using async algorithm
	crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
	Input: gtco - bounds check collection indent level
	regulator: s2mps11: Fix buck7 and buck8 wrong voltages
	arm64: tegra: Update Jetson TX1 GPU regulator timings
	iwlwifi: pcie: don't service an interrupt that was masked
	tracing/snapshot: Resize spare buffer if size changed
	NFSv4: Handle the special Linux file open access mode
	lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
	ALSA: seq: Break too long mutex context in the write loop
	ALSA: hda/realtek: apply ALC891 headset fixup to one Dell machine
	media: v4l2: Test type instead of cfg->type in v4l2_ctrl_new_custom()
	media: coda: Remove unbalanced and unneeded mutex unlock
	KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
	arm64: tegra: Fix AGIC register range
	fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.
	drm/nouveau/i2c: Enable i2c pads & busses during preinit
	padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
	9p/virtio: Add cleanup path in p9_virtio_init
	PCI: Do not poll for PME if the device is in D3cold
	Btrfs: add missing inode version, ctime and mtime updates when punching hole
	libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
	take floppy compat ioctls to sodding floppy.c
	floppy: fix div-by-zero in setup_format_params
	floppy: fix out-of-bounds read in next_valid_format
	floppy: fix invalid pointer dereference in drive_name
	floppy: fix out-of-bounds read in copy_buffer
	coda: pass the host file in vma->vm_file on mmap
	gpu: ipu-v3: ipu-ic: Fix saturation bit offset in TPMEM
	crypto: ccp - Validate the the error value used to index error messages
	PCI: hv: Delete the device earlier from hbus->children for hot-remove
	PCI: hv: Fix a use-after-free bug in hv_eject_device_work()
	crypto: caam - limit output IV to CBC to work around CTR mode DMA issue
	um: Allow building and running on older hosts
	um: Fix FP register size for XSTATE/XSAVE
	parisc: Ensure userspace privilege for ptraced processes in regset functions
	parisc: Fix kernel panic due invalid values in IAOQ0 or IAOQ1
	powerpc/32s: fix suspend/resume when IBATs 4-7 are used
	powerpc/watchpoint: Restore NV GPRs while returning from exception
	eCryptfs: fix a couple type promotion bugs
	intel_th: msu: Fix single mode with disabled IOMMU
	Bluetooth: Add SMP workaround Microsoft Surface Precision Mouse bug
	usb: Handle USB3 remote wakeup for LPM enabled devices correctly
	dm bufio: fix deadlock with loop device
	compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
	compiler.h: Add read_word_at_a_time() function.
	lib/strscpy: Shut up KASAN false-positives in strscpy()
	ext4: allow directory holes
	bnx2x: Prevent load reordering in tx completion processing
	bnx2x: Prevent ptp_task to be rescheduled indefinitely
	caif-hsi: fix possible deadlock in cfhsi_exit_module()
	igmp: fix memory leak in igmpv3_del_delrec()
	ipv4: don't set IPv6 only flags to IPv4 addresses
	net: bcmgenet: use promisc for unsupported filters
	net: dsa: mv88e6xxx: wait after reset deactivation
	net: neigh: fix multiple neigh timer scheduling
	net: openvswitch: fix csum updates for MPLS actions
	nfc: fix potential illegal memory access
	rxrpc: Fix send on a connected, but unbound socket
	sky2: Disable MSI on ASUS P6T
	vrf: make sure skb->data contains ip header to make routing
	macsec: fix use-after-free of skb during RX
	macsec: fix checksumming after decryption
	netrom: fix a memory leak in nr_rx_frame()
	netrom: hold sock when setting skb->destructor
	bonding: validate ip header before check IPPROTO_IGMP
	tcp: Reset bytes_acked and bytes_received when disconnecting
	net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
	net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
	net: bridge: stp: don't cache eth dest pointer before skb pull
	perf/x86/amd/uncore: Rename 'L2' to 'LLC'
	perf/x86/amd/uncore: Get correct number of cores sharing last level cache
	perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
	NFSv4: Fix open create exclusive when the server reboots
	nfsd: increase DRC cache limit
	nfsd: give out fewer session slots as limit approaches
	nfsd: fix performance-limiting session calculation
	nfsd: Fix overflow causing non-working mounts on 1 TB machines
	drm/panel: simple: Fix panel_simple_dsi_probe
	usb: core: hub: Disable hub-initiated U1/U2
	tty: max310x: Fix invalid baudrate divisors calculator
	pinctrl: rockchip: fix leaked of_node references
	tty: serial: cpm_uart - fix init when SMC is relocated
	drm/bridge: tc358767: read display_props in get_modes()
	drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz
	memstick: Fix error cleanup path of memstick_init
	tty/serial: digicolor: Fix digicolor-usart already registered warning
	tty: serial: msm_serial: avoid system lockup condition
	serial: 8250: Fix TX interrupt handling condition
	drm/virtio: Add memory barriers for capset cache.
	phy: renesas: rcar-gen2: Fix memory leak at error paths
	drm/rockchip: Properly adjust to a true clock in adjusted_mode
	tty: serial_core: Set port active bit in uart_port_activate
	usb: gadget: Zero ffs_io_data
	powerpc/pci/of: Fix OF flags parsing for 64bit BARs
	PCI: sysfs: Ignore lockdep for remove attribute
	kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS
	PCI: xilinx-nwl: Fix Multi MSI data programming
	iio: iio-utils: Fix possible incorrect mask calculation
	recordmcount: Fix spurious mcount entries on powerpc
	mfd: core: Set fwnode for created devices
	mfd: arizona: Fix undefined behavior
	mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk
	um: Silence lockdep complaint about mmap_sem
	powerpc/4xx/uic: clear pending interrupt after irq type/pol change
	RDMA/i40iw: Set queue pair state when being queried
	serial: sh-sci: Terminate TX DMA during buffer flushing
	serial: sh-sci: Fix TX DMA buffer flushing and workqueue races
	kallsyms: exclude kasan local symbols on s390
	perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning
	RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
	powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
	f2fs: avoid out-of-range memory access
	mailbox: handle failed named mailbox channel request
	powerpc/eeh: Handle hugepages in ioremap space
	sh: prevent warnings when using iounmap
	mm/kmemleak.c: fix check for softirq context
	9p: pass the correct prototype to read_cache_page
	mm/mmu_notifier: use hlist_add_head_rcu()
	locking/lockdep: Fix lock used or unused stats error
	locking/lockdep: Hide unused 'class' variable
	usb: wusbcore: fix unbalanced get/put cluster_id
	usb: pci-quirks: Correct AMD PLL quirk detection
	x86/sysfb_efi: Add quirks for some devices with swapped width and height
	x86/speculation/mds: Apply more accurate check on hypervisor platform
	hpet: Fix division by zero in hpet_time_div()
	ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1
	ALSA: hda - Add a conexant codec entry to let mute led work
	powerpc/tm: Fix oops on sigreturn on systems without TM
	access: avoid the RCU grace period for the temporary subjective credentials
	ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt
	tcp: reset sk_send_head in tcp_write_queue_purge
	arm64: dts: marvell: Fix A37xx UART0 register size
	i2c: qup: fixed releasing dma without flush operation completion
	arm64: compat: Provide definition for COMPAT_SIGMINSTKSZ
	ISDN: hfcsusb: checking idx of ep configuration
	media: au0828: fix null dereference in error path
	media: cpia2_usb: first wake up, then free in disconnect
	media: radio-raremono: change devm_k*alloc to k*alloc
	Bluetooth: hci_uart: check for missing tty operations
	sched/fair: Don't free p->numa_faults with concurrent readers
	drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
	ceph: hold i_ceph_lock when removing caps for freeing inode
	Linux 4.9.187

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-08-04 09:50:32 +02:00
Jann Horn
837ffc9723 sched/fair: Don't free p->numa_faults with concurrent readers
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.

When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04 09:33:45 +02:00
Johannes Weiner
3df0e59afa UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2)

Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I54a65620b3ed6f8172fdec789a237a99f8c82156
2019-03-22 14:10:35 -07:00
Greg Kroah-Hartman
c7b283dd04 This is the 4.9.150 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlw6/vsACgkQONu9yGCS
 aT7Qmw//aTFzpEg2elxZ2o2U+363AGxtHcA8o/4pInzdfeKnnUL3fcdFd2EyGcuW
 33w9mCRqSTgy5xln7E0PFX6s2NkfDTlMIrnywu0Bp6zhT2mkljueYK3FhboiJb3b
 mZazz+PoOAQQ6OtRiGtJnjuM0YCrnwZKkHZgM0IQ1rctl5f9h6eukqEL/Hr2lNgd
 /AmSwP+AgGNSStPMsyNWkbXGsWNfOxT/VIksBqAZ4giaIAgi89OHlDHhAe49W/Am
 Ak/R7Rd+ny7jb0NGTONd1J/B67sO0FZ7InayfzyMeJ4l9q/TimKdcXHg2nUoh1h/
 dqIE1GmqVhIPsKAbNYmMgFrdBgmQY3LGL2nB5h5V77u86fX70jsXqwKJMJsQcy+F
 Uiqx6C3TU/yFW4Rg4mZfO6vliP97H6H3ohUQUetu1Pt5Hx4hxSiGc40mZPr80pBf
 3d2ntaRdh7tSM8ypyO3Cp8CHG3Lzy1/T5x5tFYi7kMdZ5P86jCaZupOrCIjje6fz
 pGwsvJA/9pNr/DK2KbaFpYbzVNXqRCHgUsLbsbrJ4u+Cwzca5+khTQnGWaUy1exY
 O6ulxr6wNyvDPxZHnbbZF7Ikra+6i0KoPWAv9NeNJdH9/zldpKAK6RRhFg4EikTW
 5VuXm+805MyPc+UI63MU5Hk83MDl7hYRd9WhKWdbpfgbEYpfzac=
 =rFBu
 -----END PGP SIGNATURE-----

Merge 4.9.150 into android-4.9

Changes in 4.9.150
	pinctrl: meson: fix pull enable register calculation
	powerpc: Fix COFF zImage booting on old powermacs
	ARM: imx: update the cpu power up timing setting on i.mx6sx
	ARM: dts: imx7d-nitrogen7: Fix the description of the Wifi clock
	Input: restore EV_ABS ABS_RESERVED
	checkstack.pl: fix for aarch64
	xfrm: Fix bucket count reported to userspace
	netfilter: seqadj: re-load tcp header pointer after possible head reallocation
	scsi: bnx2fc: Fix NULL dereference in error handling
	Input: omap-keypad - fix idle configuration to not block SoC idle states
	netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel
	bnx2x: Clear fip MAC when fcoe offload support is disabled
	bnx2x: Remove configured vlans as part of unload sequence.
	bnx2x: Send update-svid ramrod with retry/poll flags enabled
	scsi: target: iscsi: cxgbit: fix csk leak
	scsi: target: iscsi: cxgbit: add missing spin_lock_init()
	drivers: net: xgene: Remove unnecessary forward declarations
	w90p910_ether: remove incorrect __init annotation
	net: hns: Incorrect offset address used for some registers.
	net: hns: All ports can not work when insmod hns ko after rmmod.
	net: hns: Some registers use wrong address according to the datasheet.
	net: hns: Fixed bug that netdev was opened twice
	net: hns: Clean rx fbd when ae stopped.
	net: hns: Free irq when exit from abnormal branch
	net: hns: Avoid net reset caused by pause frames storm
	net: hns: Fix ntuple-filters status error.
	net: hns: Add mac pcs config when enable|disable mac
	SUNRPC: Fix a race with XPRT_CONNECTING
	lan78xx: Resolve issue with changing MAC address
	vxge: ensure data0 is initialized in when fetching firmware version information
	net: netxen: fix a missing check and an uninitialized use
	serial/sunsu: fix refcount leak
	scsi: zfcp: fix posting too many status read buffers leading to adapter shutdown
	libceph: fix CEPH_FEATURE_CEPHX_V2 check in calc_signature()
	fork: record start_time late
	hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
	mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
	mm, devm_memremap_pages: kill mapping "System RAM" support
	sunrpc: fix cache_head leak due to queued request
	sunrpc: use SVC_NET() in svcauth_gss_* functions
	MIPS: math-emu: Write-protect delay slot emulation pages
	crypto: x86/chacha20 - avoid sleeping with preemption disabled
	vhost/vsock: fix uninitialized vhost_vsock->guest_cid
	IB/hfi1: Incorrect sizing of sge for PIO will OOPs
	ALSA: cs46xx: Potential NULL dereference in probe
	ALSA: usb-audio: Avoid access before bLength check in build_audio_procunit()
	ALSA: usb-audio: Fix an out-of-bound read in create_composite_quirks
	dlm: fixed memory leaks after failed ls_remove_names allocation
	dlm: possible memory leak on error path in create_lkb()
	dlm: lost put_lkb on error path in receive_convert() and receive_unlock()
	dlm: memory leaks on error path in dlm_user_request()
	gfs2: Get rid of potential double-freeing in gfs2_create_inode
	gfs2: Fix loop in gfs2_rbm_find
	b43: Fix error in cordic routine
	powerpc/tm: Set MSR[TS] just prior to recheckpoint
	9p/net: put a lower bound on msize
	rxe: fix error completion wr_id and qp_num
	iommu/vt-d: Handle domain agaw being less than iommu agaw
	ceph: don't update importing cap's mseq when handing cap export
	genwqe: Fix size check
	intel_th: msu: Fix an off-by-one in attribute store
	power: supply: olpc_battery: correct the temperature units
	drm/vc4: Set ->is_yuv to false when num_planes == 1
	bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
	Linux 4.9.150

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-01-13 10:34:18 +01:00
David Herrmann
0ea6030b55 fork: record start_time late
commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.

Technically, we could record the start_time anytime during fork(2).  But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space.  For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).

By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.

Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned.  Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded.  This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.

Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13 10:03:51 +01:00
Greg Kroah-Hartman
ba01a4255d This is the 4.9.128 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluitdUACgkQONu9yGCS
 aT6k7g//fxvMjnxYwVZ6LQjFkBakhP4gU/unxKppIFb0Iti+LBpx3Fg5JbtpRM8y
 oCAAMb9F5Dxl38691BqBpCdihmrGurh9EDn9ZZDsoUtipQ7UQozmGDbbc4fV0cQW
 /2Fs7GjFp1XopOAvFp7z/Xn5Xi8fSxgWg6l6aQ5VzsUtK69FM4TmNPJ2x1adow+N
 UsewbUkLkD9dMIrrgIajeCsErRYnJHy6FuY/pWSXVuY1gyTlCW/lsxKU2OoS5l5Z
 u/ZxprWl7z6FjSfzHc4AYrBR6t2zCAV2DNZhX5lDU/EHzTAttq0sSaeEfWzqYrJM
 q/bStS8eEWj+MkqfrppWRc1I/ZwUrjMB1P8OJEW8UUFTjAcRvlBz/Z1poOCRkBRp
 flsohLfao2/dbPxT0oLQMidBRX9t90TQLxwGY2SJ2/dL6uU/sapQDo8/iwyCZvC6
 u57DLf82P9tUvZgnUOoNuCPSqiZQG/tGPGFRGU0EgHVtoE7S285FvRJZoNlRH+oT
 EyAHtavzbgsoJb0oP12eOBfq0dwvmXt6i7ypzPoy/Fh+zQTb5Fk1fH+4GF+gFqWY
 //CpKO+zIdeFzWZ/r8eAkS+uLgvgywNbZV6N8EJYqviajKfU5A/RX5GsQ4oaFfk6
 ++Anyn58UYYSkw7ov5ynLNmNr+tlMjKUuRXw0u9rtAhcTQcnHYc=
 =YQcR
 -----END PGP SIGNATURE-----

Merge 4.9.128 into android-4.9

Changes in 4.9.128
	i2c: xiic: Make the start and the byte count write atomic
	i2c: i801: fix DNV's SMBCTRL register offset
	KVM: s390: vsie: copy wrapping keys to right place
	ALSA: hda - Fix cancel_work_sync() stall from jackpoll work
	cfq: Give a chance for arming slice idle timer in case of group_idle
	kthread: Fix use-after-free if kthread fork fails
	kthread: fix boot hang (regression) on MIPS/OpenRISC
	staging: rt5208: Fix a sleep-in-atomic bug in xd_copy_page
	staging/rts5208: Fix read overflow in memcpy
	IB/rxe: do not copy extra stack memory to skb
	block,blkcg: use __GFP_NOWARN for best-effort allocations in blkcg
	nl80211: fix null-ptr dereference on invalid mesh configuration
	locking/rwsem-xadd: Fix missed wakeup due to reordering of load
	selinux: use GFP_NOWAIT in the AVC kmem_caches
	locking/osq_lock: Fix osq_lock queue corruption
	mm, vmscan: clear PGDAT_WRITEBACK when zone is balanced
	mm: remove seemingly spurious reclaimability check from laptop_mode gating
	ARC: [plat-axs*]: Enable SWAP
	misc: mic: SCIF Fix scif_get_new_port() error handling
	ethtool: Remove trailing semicolon for static inline
	Bluetooth: h5: Fix missing dependency on BT_HCIUART_SERDEV
	gpio: tegra: Move driver registration to subsys_init level
	net: phy: Fix the register offsets in Broadcom iProc mdio mux driver
	scsi: target: fix __transport_register_session locking
	md/raid5: fix data corruption of replacements after originals dropped
	timers: Clear timer_base::must_forward_clk with timer_base::lock held
	misc: ti-st: Fix memory leak in the error path of probe()
	uio: potential double frees if __uio_register_device() fails
	tty: rocket: Fix possible buffer overwrite on register_PCI
	f2fs: do not set free of current section
	perf tools: Allow overriding MAX_NR_CPUS at compile time
	NFSv4.0 fix client reference leak in callback
	macintosh/via-pmu: Add missing mmio accessors
	ath9k: report tx status on EOSP
	ath9k_hw: fix channel maximum power level test
	ath10k: prevent active scans on potential unusable channels
	wlcore: Set rx_status boottime_ns field on rx
	MIPS: Fix ISA virt/bus conversion for non-zero PHYS_OFFSET
	ata: libahci: Correct setting of DEVSLP register
	scsi: 3ware: fix return 0 on the error path of probe
	ath10k: disable bundle mgmt tx completion event support
	Bluetooth: hidp: Fix handling of strncpy for hid->name information
	x86/mm: Remove in_nmi() warning from vmalloc_fault()
	gpio: ml-ioh: Fix buffer underwrite on probe error path
	net: mvneta: fix mtu change on port without link
	f2fs: try grabbing node page lock aggressively in sync scenario
	f2fs: fix to skip GC if type in SSA and SIT is inconsistent
	tpm_tis_spi: Pass the SPI IRQ down to the driver
	tpm/tpm_i2c_infineon: switch to i2c_lock_bus(..., I2C_LOCK_SEGMENT)
	f2fs: fix to do sanity check with reserved blkaddr of inline inode
	MIPS: Octeon: add missing of_node_put()
	MIPS: generic: fix missing of_node_put()
	net: dcb: For wild-card lookups, use priority -1, not 0
	Input: atmel_mxt_ts - only use first T9 instance
	media: s5p-mfc: Fix buffer look up in s5p_mfc_handle_frame_{new, copy_time} functions
	partitions/aix: append null character to print data from disk
	partitions/aix: fix usage of uninitialized lv_info and lvname structures
	media: helene: fix xtal frequency setting at power on
	f2fs: Fix uninitialized return in f2fs_ioc_shutdown()
	iommu/ipmmu-vmsa: Fix allocation in atomic context
	mfd: ti_am335x_tscadc: Fix struct clk memory leak
	f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
	NFSv4.1: Fix a potential layoutget/layoutrecall deadlock
	MIPS: WARN_ON invalid DMA cache maintenance, not BUG_ON
	RDMA/cma: Do not ignore net namespace for unbound cm_id
	xhci: Fix use-after-free in xhci_free_virt_device
	netfilter: x_tables: avoid stack-out-of-bounds read in xt_copy_counters_from_user
	mtd: ubi: wl: Fix error return code in ubi_wl_init()
	autofs: fix autofs_sbi() does not check super block type
	mm: get rid of vmacache_flush_all() entirely
	Linux 4.9.128

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-09-20 11:14:01 +02:00
Vegard Nossum
a70e46bcea kthread: Fix use-after-free if kthread fork fails
commit 4d6501dce079c1eb6bf0b1d8f528a5e81770109e upstream.

If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but
fails in copy_process() between calling dup_task_struct() and setting
p->set_child_tid, then the value of p->set_child_tid will be inherited
from the parent and get prematurely freed by free_kthread_struct().

    kthread()
     - worker_thread()
        - process_one_work()
        |  - call_usermodehelper_exec_work()
        |     - kernel_thread()
        |        - _do_fork()
        |           - copy_process()
        |              - dup_task_struct()
        |                 - arch_dup_task_struct()
        |                    - tsk->set_child_tid = current->set_child_tid // implied
        |              - ...
        |              - goto bad_fork_*
        |              - ...
        |              - free_task(tsk)
        |                 - free_kthread_struct(tsk)
        |                    - kfree(tsk->set_child_tid)
        - ...
        - schedule()
           - __schedule()
              - wq_worker_sleeping()
                 - kthread_data(task)->flags // UAF

The problem started showing up with commit 1da5c46fa965 since it reused
->set_child_tid for the kthread worker data.

A better long-term solution might be to get rid of the ->set_child_tid
abuse. The comment in set_kthread_struct() also looks slightly wrong.

Debugged-by: Jamie Iles <jamie.iles@oracle.com>
Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jamie Iles <jamie.iles@oracle.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-19 22:47:10 +02:00
Greg Kroah-Hartman
be4935d541 This is the 4.9.127 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlucuBEACgkQONu9yGCS
 aT4HlBAAnXzLh/c8dKPdWzU3Bx6L9If7syeW0Gl06AtGsUrVqrggOnumdXRgPV3y
 /4Br6piwrpjHP9P7m/pS03Ol3+DvD4JKhMRFr0VzGwk4e32wZ8cNd7qx5RF/4z9s
 tpebYJVJoRRmefi1EJyj+NAGgDCk+7A1pzW3W2NK/mirQdkHywfLJHhQDyHgmWQy
 jyDpz2ihNdorKHN8ogK7FYQ7JZC9tnooB4GZFObaB8nkaeMaRGD2V2jrHRi8a4tv
 5Bm9HmlGJOqtsstsr1Em6W+UNRKfztfDmYuk+H9xCo8odCSHeyx30nUDxjp6Bo2f
 OvC/x1x6JIVNsoAgzlRZKMJR1Lv0IK3SEVPrecOOSWeardtt92JUQjoLaLmW03xR
 dlnXZyDeO75Qbdgo/5b/d6RiVNGSv0CPDOQSREDMX6Rvf9P8ezceF/iDz0SrH9Zm
 GkjQCQ2yyqrrSID5baB562AHszBa+rdqgp4RBheE+M5s7UALL09tFJquaijZ8jvO
 dMs6f847MeMARFIGLNKff9cvWI2zanTwOPX+E3rcj88rN79jvIsyD/hDzkqHoR7J
 Rhb2K7478gC1N2XTOu97hNmB9qTNIn9U1hvS/f2d7SGyjOU9WR+XgCgDUWZsZnPP
 H4eeblr9iaCUHI/4Q+nWUuQPKnCdGb9Ay9lyq9uqHIQBz+uVZ4I=
 =hV8i
 -----END PGP SIGNATURE-----

Merge 4.9.127 into android-4.9

Changes in 4.9.127
	x86/speculation/l1tf: Fix up pte->pfn conversion for PAE
	act_ife: fix a potential use-after-free
	ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state
	net: bcmgenet: use MAC link status for fixed phy
	net: sched: Fix memory exposure from short TCA_U32_SEL
	qlge: Fix netdev features configuration.
	r8169: add support for NCube 8168 network card
	tcp: do not restart timewait timer on rst reception
	vti6: remove !skb->ignore_df check from vti6_xmit()
	sctp: hold transport before accessing its asoc in sctp_transport_get_next
	vhost: correctly check the iova range when waking virtqueue
	hv_netvsc: ignore devices that are not PCI
	act_ife: move tcfa_lock down to where necessary
	act_ife: fix a potential deadlock
	net: sched: action_ife: take reference to meta module
	cifs: check if SMB2 PDU size has been padded and suppress the warning
	hfsplus: don't return 0 when fill_super() failed
	hfs: prevent crash on exit from failed search
	sunrpc: Don't use stack buffer with scatterlist
	fork: don't copy inconsistent signal handler state to child
	reiserfs: change j_timestamp type to time64_t
	hfsplus: fix NULL dereference in hfsplus_lookup()
	fat: validate ->i_start before using
	scripts: modpost: check memory allocation results
	virtio: pci-legacy: Validate queue pfn
	mm/fadvise.c: fix signed overflow UBSAN complaint
	fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot()
	platform/x86: intel_punit_ipc: fix build errors
	s390/kdump: Fix memleak in nt_vmcoreinfo
	ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()
	mfd: sm501: Set coherent_dma_mask when creating subdevices
	platform/x86: asus-nb-wmi: Add keymap entry for lid flip action on UX360
	RDMA/hns: Fix usage of bitmap allocation functions return values
	irqchip/bcm7038-l1: Hide cpu offline callback when building for !SMP
	net/9p/trans_fd.c: fix race by holding the lock
	net/9p: fix error path of p9_virtio_probe
	powerpc: Fix size calculation using resource_size()
	perf probe powerpc: Fix trace event post-processing
	block: bvec_nr_vecs() returns value for wrong slab
	s390/dasd: fix hanging offline processing due to canceled worker
	s390/dasd: fix panic for failed online processing
	ACPI / scan: Initialize status to ACPI_STA_DEFAULT
	scsi: aic94xx: fix an error code in aic94xx_init()
	PCI: mvebu: Fix I/O space end address calculation
	dm kcopyd: avoid softlockup in run_complete_job
	staging: comedi: ni_mio_common: fix subdevice flags for PFI subdevice
	selftests/powerpc: Kill child processes on SIGINT
	RDS: IB: fix 'passing zero to ERR_PTR()' warning
	smb3: fix reset of bytes read and written stats
	SMB3: Number of requests sent should be displayed for SMB3 not just CIFS
	powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.
	clk: rockchip: Add pclk_rkpwm_pmu to PMU critical clocks in rk3399
	btrfs: replace: Reset on-disk dev stats value after replace
	btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized
	btrfs: Don't remove block group that still has pinned down bytes
	arm64: rockchip: Force CONFIG_PM on Rockchip systems
	ARM: rockchip: Force CONFIG_PM on Rockchip systems
	drm/edid: Add 6 bpc quirk for SDC panel in Lenovo B50-80
	tcp: Revert "tcp: tcp_probe: use spin_lock_bh()"
	debugobjects: Make stack check warning more informative
	x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
	kbuild: make missing $DEPMOD a Warning instead of an Error
	irda: Fix memory leak caused by repeated binds of irda socket
	irda: Only insert new objects into the global database via setsockopt
	Revert "ARM: imx_v6_v7_defconfig: Select ULPI support"
	enic: do not call enic_change_mtu in enic_probe
	Fixes: Commit 2aa6d036b7 ("mm: numa: avoid waiting on freed migrated pages")
	sch_htb: fix crash on init failure
	sch_multiq: fix double free on init failure
	sch_hhf: fix null pointer dereference on init failure
	sch_netem: avoid null pointer deref on init failure
	sch_tbf: fix two null pointer dereferences on init failure
	mei: me: allow runtime pm for platform with D0i3
	s390/lib: use expoline for all bcr instructions
	ASoC: wm8994: Fix missing break in switch
	btrfs: use correct compare function of dirty_metadata_bytes
	arm64: Fix mismatched cache line size detection
	arm64: Handle mismatched cache type
	Linux 4.9.127

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-09-15 12:17:13 +02:00
Jann Horn
015fd7e0a6 fork: don't copy inconsistent signal handler state to child
[ Upstream commit 06e62a46bbba20aa5286102016a04214bb446141 ]

Before this change, if a multithreaded process forks while one of its
threads is changing a signal handler using sigaction(), the memcpy() in
copy_sighand() can race with the struct assignment in do_sigaction().  It
isn't clear whether this can cause corruption of the userspace signal
handler pointer, but it definitely can cause inconsistency between
different fields of struct sigaction.

Take the appropriate spinlock to avoid this.

I have tested that this patch prevents inconsistency between sa_sigaction
and sa_flags, which is possible before this patch.

Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15 09:42:57 +02:00
Greg Kroah-Hartman
92e87041ed This is the 4.9.119 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltsFNgACgkQONu9yGCS
 aT6ZSBAAsAf1VBFKXLJDnCfMdft7Ecrvks3Lb+55qeoPbHgCPPR7ci9R0afaifUn
 46O2RFCO5PJAPSGKnLzT6vRUozNIwXbR5zsu/KA9+4zEiej/4PFlF8Ty8yh7LTRI
 q0OgyRylbxeUGRjFeTJtaiMG5NIJQD6TEgMhEZoCOSPHVwWjrEyC4ka2RUuXetTw
 eXXbm8c+949flMAaxz8plbTUupOQaPiNAaTHzFJR73y5jiRfNTRejgOrawB97BrF
 N8uM73sNp87bQeCxQUd00SDrzhHEfCrYueeMhO/qfaduMSx/iUNPpK0KmzF6RxDy
 MsmlAFjs9AnPiM7/PH6pdAgwSqsEoIzlRF/oEMOJtONLwHitwGLna8x4NP49eiaQ
 7Np6UJNSUJKTyZVar2SRuzKLAsOyoeZOraey9P4OQCKDVdvlLITLSQ2jbqgEn81K
 8U92W0r5E/DI6oE/wubMRgus1xPh8yrlynoaa9+wFtSt7QHvBWEpmhe8RRXzAUNv
 rSJ4Ovj98IS096f2x1y4zvFfzT3k9soC+5Gxb00JUrWcSbDaffCWTCq4WhrmeasB
 I8YeP3lVDSRkkkqwxy1IY5otjYmqwUqhE1wXiai14I+XZ19pIfyRxooFwlCL5wwi
 vBm4Lu+cI8a6lUsM1mFk3sxYpekwY6xJt62W75RorCz36MesbLE=
 =daBd
 -----END PGP SIGNATURE-----

Merge 4.9.119 into android-4.9

Changes in 4.9.119
	scsi: qla2xxx: Fix ISP recovery on unload
	scsi: qla2xxx: Return error when TMF returns
	genirq: Make force irq threading setup more robust
	nohz: Fix local_timer_softirq_pending()
	netlink: Do not subscribe to non-existent groups
	netlink: Don't shift with UB on nlk->ngroups
	netlink: Don't shift on 64 for ngroups
	ext4: fix false negatives *and* false positives in ext4_check_descriptors()
	ACPI / PCI: Bail early in acpi_pci_add_bus() if there is no ACPI handle
	ring_buffer: tracing: Inherit the tracing setting to next ring buffer
	i2c: imx: Fix reinit_completion() use
	Btrfs: fix file data corruption after cloning a range and fsync
	tcp: add tcp_ooo_try_coalesce() helper
	kmemleak: clear stale pointers from task stacks
	fork: unconditionally clear stack on fork
	IB/hfi1: Fix incorrect mixing of ERR_PTR and NULL return values
	jfs: Fix inconsistency between memory allocation and ea_buf->max_size
	Linux 4.9.119

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-08-09 16:01:36 +02:00
Kees Cook
6a19e26f11 fork: unconditionally clear stack on fork
commit e01e80634ecdde1dd113ac43b3adad21b47f3957 upstream.

One of the classes of kernel stack content leaks[1] is exposing the
contents of prior heap or stack contents when a new process stack is
allocated.  Normally, those stacks are not zeroed, and the old contents
remain in place.  In the face of stack content exposure flaws, those
contents can leak to userspace.

Fixing this will make the kernel no longer vulnerable to these flaws, as
the stack will be wiped each time a stack is assigned to a new process.
There's not a meaningful change in runtime performance; it almost looks
like it provides a benefit.

Performing back-to-back kernel builds before:
	Run times: 157.86 157.09 158.90 160.94 160.80
	Mean: 159.12
	Std Dev: 1.54

and after:
	Run times: 159.31 157.34 156.71 158.15 160.81
	Mean: 158.46
	Std Dev: 1.46

Instead of making this a build or runtime config, Andy Lutomirski
recommended this just be enabled by default.

[1] A noisy search for many kinds of stack content leaks can be seen here:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak

I did some more with perf and cycle counts on running 100,000 execs of
/bin/true.

before:
Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
Mean:  221015379122.60
Std Dev: 4662486552.47

after:
Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
Mean:  217745009865.40
Std Dev: 5935559279.99

It continues to look like it's faster, though the deviation is rather
wide, but I'm not sure what I could do that would be less noisy.  I'm
open to ideas!

Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Srivatsa: Backported to 4.9.y ]
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Srinidhi Rao <srinidhir@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09 12:18:00 +02:00
Konstantin Khlebnikov
885b49b4f3 kmemleak: clear stale pointers from task stacks
commit ca182551857cc2c1e6a2b7f1e72090a137a15008 upstream.

Kmemleak considers any pointers on task stacks as references.  This
patch clears newly allocated and reused vmap stacks.

Link: http://lkml.kernel.org/r/150728990124.744199.8403409836394318684.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Srivatsa: Backported to 4.9.y ]
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09 12:18:00 +02:00
Sultan Alsawaf
47bbcd6bf8 ANDROID: Fix massive cpufreq_times memory leaks
Every time _cpu_up() is called for a CPU, idle_thread_get() is called
which then re-initializes a CPU's idle thread that was already
previously created and cached in a global variable in
smpboot.c. idle_thread_get() calls init_idle() which then calls
__sched_fork(). __sched_fork() is where cpufreq_task_times_init() is,
and cpufreq_task_times_init() allocates memory for the task struct's
time_in_state array.

Since idle_thread_get() reuses a task struct instance that was already
previously created, this means that every time it calls init_idle(),
cpufreq_task_times_init() allocates this array again and overwrites
the existing allocation that the idle thread already had.

This causes memory to be leaked every time a CPU is onlined. In order
to fix this, move allocation of time_in_state into _do_fork to avoid
allocating it at all for idle threads. The cpufreq times interface is
intended to be used for tracking userspace tasks, so we can safely
remove it from the kernel's idle threads without killing any
functionality.

But that's not all!

Task structs can be freed outside of release_task(), which creates
another memory leak because a task struct can be freed without having
its cpufreq times allocation freed. To fix this, free the cpufreq
times allocation at the same time that task struct allocations are
freed, in free_task().

Since free_task() can also be called in error paths of copy_process()
after dup_task_struct(), set time_in_state to NULL immediately after
calling dup_task_struct() to avoid possible double free.

Bug description and fix adapted from patch submitted by
Sultan Alsawaf <sultanxda@gmail.com> at
https://android-review.googlesource.com/c/kernel/msm/+/700134

Bug: 110044919
Test: Hikey960 builds, boots & reports /proc/<pid>/time_in_state
correctly
Change-Id: I12fe7611fc88eb7f6c39f8f7629ad27b6ec4722c
Signed-off-by: Connor O'Brien <connoro@google.com>
2018-07-18 13:22:08 +00:00
Hugh Dickins
0994a2cf8f kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
Kaiser only needs to map one page of the stack; and
kernel/fork.c did not build on powerpc (no __PAGE_KERNEL).
It's all cleaner if linux/kaiser.h provides kaiser_map_thread_stack()
and kaiser_unmap_thread_stack() wrappers around asm/kaiser.h's
kaiser_add_mapping() and kaiser_remove_mapping().  And use
linux/kaiser.h in init/main.c to avoid the #ifdefs there.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Dave Hansen
8f0baadf2b kaiser: merged update
Merged fixes and cleanups, rebased to 4.9.51 tree (no 5-level paging).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Richard Fellner
13be4483bb KAISER: Kernel Address Isolation
This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.

More information about the patch can be found on:

        https://github.com/IAIK/KAISER

From: Richard Fellner <richard.fellner@student.tugraz.at>
From: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5

To: <linux-kernel@vger.kernel.org>
To: <kernel-hardening@lists.openwall.com>
Cc: <clementine.maurice@iaik.tugraz.at>
Cc: <moritz.lipp@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <kirill.shutemov@linux.intel.com>
Cc: <anders.fogh@gdata-adan.de>

After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).

With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them.
We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] 4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

[patch based also on
https://raw.githubusercontent.com/IAIK/KAISER/master/KAISER/0001-KAISER-Kernel-Address-Isolation.patch]

Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05 15:46:32 +01:00
Eric Biggers
17c564f629 mm, uprobes: fix multiple free of ->uprobes_state.xol_area
commit 355627f518978b5167256d27492fe0b343aaf2f2 upstream.

Commit 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm().  For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.

Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().

With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function.  For example:

    $ gcc reproducer.c -o reproducer -lpthread
    $ nm reproducer | grep fork_thread
    0000000000400719 t fork_thread
    $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
    $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
    $ ./reproducer

Here is the use-after-free reported by KASAN:

    BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
    Read of size 8 at addr ffff8800320a8b88 by task reproducer/198

    CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
    Call Trace:
     dump_stack+0xdb/0x185
     print_address_description+0x7e/0x290
     kasan_report+0x23b/0x350
     __asan_report_load8_noabort+0x19/0x20
     uprobe_clear_state+0x1c4/0x200
     mmput+0xd6/0x360
     do_exit+0x740/0x1670
     do_group_exit+0x13f/0x380
     get_signal+0x597/0x17d0
     do_signal+0x99/0x1df0
     exit_to_usermode_loop+0x166/0x1e0
     syscall_return_slowpath+0x258/0x2c0
     entry_SYSCALL_64_fastpath+0xbc/0xbe

    ...

    Allocated by task 199:
     save_stack_trace+0x1b/0x20
     kasan_kmalloc+0xfc/0x180
     kmem_cache_alloc_trace+0xf3/0x330
     __create_xol_area+0x10f/0x780
     uprobe_notify_resume+0x1674/0x2210
     exit_to_usermode_loop+0x150/0x1e0
     prepare_exit_to_usermode+0x14b/0x180
     retint_user+0x8/0x20

    Freed by task 199:
     save_stack_trace+0x1b/0x20
     kasan_slab_free+0xa8/0x1a0
     kfree+0xba/0x210
     uprobe_clear_state+0x151/0x200
     mmput+0xd6/0x360
     copy_process.part.8+0x605f/0x65d0
     _do_fork+0x1a5/0xbd0
     SyS_clone+0x19/0x20
     do_syscall_64+0x22f/0x660
     return_from_SYSCALL_64+0x0/0x7a

Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.

Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
Fixes: 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-07 08:35:39 +02:00
Eric Biggers
b65b6ac52e fork: fix incorrect fput of ->exe_file causing use-after-free
commit 2b7e8665b4ff51c034c55df3cff76518d1a9ee3a upstream.

Commit 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
a reference is taken on the mm_struct's ->exe_file.  Since the
->exe_file of the new mm_struct was already set to the old ->exe_file by
the memcpy() in dup_mm(), it was possible for the mmput() in the error
path of dup_mm() to drop a reference to ->exe_file which was never
taken.

This caused the struct file to later be freed prematurely.

Fix it by updating mm_init() to NULL out the ->exe_file, in the same
place it clears other things like the list of mmaps.

This bug was found by syzkaller.  It can be reproduced using the
following C program:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *mmap_thread(void *_arg)
    {
        for (;;) {
            mmap(NULL, 0x1000000, PROT_READ,
                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        }
    }

    static void *fork_thread(void *_arg)
    {
        usleep(rand() % 10000);
        fork();
    }

    int main(void)
    {
        fork();
        fork();
        fork();
        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                pthread_create(&t, NULL, mmap_thread, NULL);
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

No special kernel config options are needed.  It usually causes a NULL
pointer dereference in __remove_shared_vm_struct() during exit, or in
dup_mmap() (which is usually inlined into copy_process()) during fork.
Both are due to a vm_area_struct's ->vm_file being used after it's
already been freed.

Google Bug Id: 64772007

Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
Fixes: 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30 10:21:47 +02:00
Daniel Micay
f157261b55 stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms
commit 5ea30e4e58040cfd6434c2f33dc3ea76e2c15b05 upstream.

The stack canary is an 'unsigned long' and should be fully initialized to
random data rather than only 32 bits of random data.

Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Arjan van Ven <arjan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 15:44:46 +02:00
Kirill Tkhai
2ea2f891fa pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
commit 3fd37226216620c1a468afa999739d5016fbc349 upstream.

Imagine we have a pid namespace and a task from its parent's pid_ns,
which made setns() to the pid namespace. The task is doing fork(),
while the pid namespace's child reaper is dying. We have the race
between them:

Task from parent pid_ns             Child reaper
copy_process()                      ..
  alloc_pid()                       ..
  ..                                zap_pid_ns_processes()
  ..                                  disable_pid_allocation()
  ..                                  read_lock(&tasklist_lock)
  ..                                  iterate over pids in pid_ns
  ..                                    kill tasks linked to pids
  ..                                  read_unlock(&tasklist_lock)
  write_lock_irq(&tasklist_lock);   ..
  attach_pid(p, PIDTYPE_PID);       ..
  ..                                ..

So, just created task p won't receive SIGKILL signal,
and the pid namespace will be in contradictory state.
Only manual kill will help there, but does the userspace
care about this? I suppose, the most users just inject
a task into a pid namespace and wait a SIGCHLD from it.

The patch fixes the problem. It simply checks for
(pid_ns->nr_hashed & PIDNS_HASH_ADDING) in copy_process().
We do it under the tasklist_lock, and can't skip
PIDNS_HASH_ADDING as noted by Oleg:

"zap_pid_ns_processes() does disable_pid_allocation()
and then takes tasklist_lock to kill the whole namespace.
Given that copy_process() checks PIDNS_HASH_ADDING
under write_lock(tasklist) they can't race;
if copy_process() takes this lock first, the new child will
be killed, otherwise copy_process() can't miss
the change in ->nr_hashed."

If allocation is disabled, we just return -ENOMEM
like it's made for such cases in alloc_pid().

v2: Do not move disable_pid_allocation(), do not
introduce a new variable in copy_process() and simplify
the patch as suggested by Oleg Nesterov.
Account the problem with double irq enabling
found by Eric W. Biederman.

Fixes: c876ad7682 ("pidns: Stop pid allocation when init dies")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Mike Rapoport <rppt@linux.vnet.ibm.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Andrei Vagin <avagin@openvz.org>
CC: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Serge Hallyn <serge@hallyn.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 15:44:38 +02:00
Eric W. Biederman
694a95fa6d mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes it's cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as or descendent of mm->user_ns.  Which guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca705 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:13 +01:00
Andy Lutomirski
405c075971 fork: Add task stack refcounting sanity check and prevent premature task stack freeing
If something goes wrong with task stack refcounting and a stack
refcount hits zero too early, warn and leak it rather than
potentially freeing it early (and silently).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f29119c783a9680a4b4656e751b6123917ace94b.1477926663.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:39:17 +01:00
Linus Torvalds
9ffc66941d This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
 possible, hoping to capitalize on any possible variation in CPU operation
 (due to runtime data differences, hardware differences, SMP ordering,
 thermal timing variation, cache behavior, etc).
 
 At the very least, this plugin is a much more comprehensive example for
 how to manipulate kernel code using the gcc plugin internals.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJX/BAFAAoJEIly9N/cbcAmzW8QALFbCs7EFFkML+M/M/9d8zEk
 1QbUs/z8covJTTT1PjSdw7JUrAMulI3S00owpcQVd/PcWjRPU80QwfsXBgIB0tvC
 Kub2qxn6Oaf+kTB646zwjFgjdCecw/USJP+90nfcu2+LCnE8ReclKd1aUee+Bnhm
 iDEUyH2ONIoWq6ta2Z9sA7+E4y2ZgOlmW0iga3Mnf+OcPtLE70fWPoe5E4g9DpYk
 B+kiPDrD9ql5zsHaEnKG1ldjiAZ1L6Grk8rGgLEXmbOWtTOFmnUhR+raK5NA/RCw
 MXNuyPay5aYPpqDHFm+OuaWQAiPWfPNWM3Ett4k0d9ZWLixTcD1z68AciExwk7aW
 SEA8b1Jwbg05ZNYM7NJB6t6suKC4dGPxWzKFOhmBicsh2Ni5f+Az0BQL6q8/V8/4
 8UEqDLuFlPJBB50A3z5ngCVeYJKZe8Bg/Swb4zXl6mIzZ9darLzXDEV6ystfPXxJ
 e1AdBb41WC+O2SAI4l64yyeswkGo3Iw2oMbXG5jmFl6wY/xGp7dWxw7gfnhC6oOh
 afOT54p2OUDfSAbJaO0IHliWoIdmE5ZYdVYVU9Ek+uWyaIwcXhNmqRg+Uqmo32jf
 cP5J9x2kF3RdOcbSHXmFp++fU+wkhBtEcjkNpvkjpi4xyA47IWS7lrVBBebrCq9R
 pa/A7CNQwibIV6YD8+/p
 =1dUK
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugins update from Kees Cook:
 "This adds a new gcc plugin named "latent_entropy". It is designed to
  extract as much possible uncertainty from a running system at boot
  time as possible, hoping to capitalize on any possible variation in
  CPU operation (due to runtime data differences, hardware differences,
  SMP ordering, thermal timing variation, cache behavior, etc).

  At the very least, this plugin is a much more comprehensive example
  for how to manipulate kernel code using the gcc plugin internals"

* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  latent_entropy: Mark functions with __latent_entropy
  gcc-plugins: Add latent_entropy plugin
2016-10-15 10:03:15 -07:00
Emese Revfy
0766f788eb latent_entropy: Mark functions with __latent_entropy
The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:45 -07:00
Emese Revfy
38addce8b6 gcc-plugins: Add latent_entropy plugin
This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot time as
possible, hoping to capitalize on any possible variation in CPU operation
(due to runtime data differences, hardware differences, SMP ordering,
thermal timing variation, cache behavior, etc).

At the very least, this plugin is a much more comprehensive example for
how to manipulate kernel code using the gcc plugin internals.

The need for very-early boot entropy tends to be very architecture or
system design specific, so this plugin is more suited for those sorts
of special cases. The existing kernel RNG already attempts to extract
entropy from reliable runtime variation, but this plugin takes the idea to
a logical extreme by permuting a global variable based on any variation
in code execution (e.g. a different value (and permutation function)
is used to permute the global based on loop count, case statement,
if/then/else branching, etc).

To do this, the plugin starts by inserting a local variable in every
marked function. The plugin then adds logic so that the value of this
variable is modified by randomly chosen operations (add, xor and rol) and
random values (gcc generates separate static values for each location at
compile time and also injects the stack pointer at runtime). The resulting
value depends on the control flow path (e.g., loops and branches taken).

Before the function returns, the plugin mixes this local variable into
the latent_entropy global variable. The value of this global variable
is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
though it does not credit any bytes of entropy to the pool; the contents
of the global are just used to mix the pool.

Additionally, the plugin can pre-initialize arrays with build-time
random contents, so that two different kernel builds running on identical
hardware will not have the same starting values.

Signed-off-by: Emese Revfy <re.emese@gmail.com>
[kees: expanded commit message and code comments]
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-10-10 14:51:44 -07:00
Aaron Lu
6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko
862e3073b3 mm, oom: get rid of signal_struct::oom_victims
After "oom: keep mm of the killed task available" we can safely detect
an oom victim by checking task->signal->oom_mm so we do not need the
signal_struct counter anymore so let's get rid of it.

This alone wouldn't be sufficient for nommu archs because
exit_oom_victim doesn't hide the process from the oom killer anymore.
We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko
7283094ec3 kernel, oom: fix potential pgd_lock deadlock from __mmdrop
Lockdep complains that __mmdrop is not safe from the softirq context:

  =================================
  [ INFO: inconsistent lock state ]
  4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0xa06/0x196e
     lock_acquire+0x139/0x1e1
     _raw_spin_lock+0x32/0x41
     __change_page_attr_set_clr+0x2a5/0xacd
     change_page_attr_set_clr+0x16f/0x32c
     set_memory_nx+0x37/0x3a
     free_init_pages+0x9e/0xc7
     alternative_instructions+0xa2/0xb3
     check_bugs+0xe/0x2d
     start_kernel+0x3ce/0x3ea
     x86_64_start_reservations+0x2a/0x2c
     x86_64_start_kernel+0x17a/0x18d
  irq event stamp: 105916
  hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
  hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
  softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
  softirqs last disabled at (105879): irq_exit+0x6f/0xd1

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(pgd_lock);
    <Interrupt>
      lock(pgd_lock);

   *** DEADLOCK ***

  1 lock held by swapper/1/0:
   #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

  stack backtrace:
  CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
  Call Trace:
   <IRQ>
    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0
   <EOI>
    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

More over commit a79e53d856 ("x86/mm: Fix pgd_lock deadlock") was
explicit about pgd_lock not to be called from the irq context.  This
means that __mmdrop called from free_signal_struct has to be postponed
to a user context.  We already have a similar mechanism for mmput_async
so we can use it here as well.  This is safe because mm_count is pinned
by mm_users.

This fixes bug introduced by "oom: keep mm of the killed task available"

Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko
26db62f179 oom: keep mm of the killed task available
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever.  This is, however,
opening new problems (e.g oom_killer_disable exclusion - see commit
7407054209 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
race")).  exit_oom_victim should be only called from the victim's
context ideally.

One way to achieve this would be to rely on per mm_struct flags.  We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:

  do_exit
    exit_mm
      tsk->mm = NULL;
      mmput
        __mmput
      exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time.  At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach.  We remember tsk->mm into the
signal_struct and bind it to the signal struct life time for all oom
victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct.  As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.

Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway.  In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.

Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Linus Torvalds
14986a34e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This set of changes is a number of smaller things that have been
  overlooked in other development cycles focused on more fundamental
  change. The devpts changes are small things that were a distraction
  until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
  trivial regression fix to autofs for the unprivileged mount changes
  that went in last cycle. A pair of ioctls has been added by Andrey
  Vagin making it is possible to discover the relationships between
  namespaces when referring to them through file descriptors.

  The big user visible change is starting to add simple resource limits
  to catch programs that misbehave. With namespaces in general and user
  namespaces in particular allowing users to use more kinds of
  resources, it has become important to have something to limit errant
  programs. Because the purpose of these limits is to catch errant
  programs the code needs to be inexpensive to use as it always on, and
  the default limits need to be high enough that well behaved programs
  on well behaved systems don't encounter them.

  To this end, after some review I have implemented per user per user
  namespace limits, and use them to limit the number of namespaces. The
  limits being per user mean that one user can not exhause the limits of
  another user. The limits being per user namespace allow contexts where
  the limit is 0 and security conscious folks can remove from their
  threat anlysis the code used to manage namespaces (as they have
  historically done as it root only). At the same time the limits being
  per user namespace allow other parts of the system to use namespaces.

  Namespaces are increasingly being used in application sand boxing
  scenarios so an all or nothing disable for the entire system for the
  security conscious folks makes increasing use of these sandboxes
  impossible.

  There is also added a limit on the maximum number of mounts present in
  a single mount namespace. It is nontrivial to guess what a reasonable
  system wide limit on the number of mount structure in the kernel would
  be, especially as it various based on how a system is using
  containers. A limit on the number of mounts in a mount namespace
  however is much easier to understand and set. In most cases in
  practice only about 1000 mounts are used. Given that some autofs
  scenarious have the potential to be 30,000 to 50,000 mounts I have set
  the default limit for the number of mounts at 100,000 which is well
  above every known set of users but low enough that the mount hash
  tables don't degrade unreaonsably.

  These limits are a start. I expect this estabilishes a pattern that
  other limits for resources that namespaces use will follow. There has
  been interest in making inotify event limits per user per user
  namespace as well as interest expressed in making details about what
  is going on in the kernel more visible"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
  autofs:  Fix automounts by using current_real_cred()->uid
  mnt: Add a per mount namespace limit on the number of mounts
  netns: move {inc,dec}_net_namespaces into #ifdef
  nsfs: Simplify __ns_get_path
  tools/testing: add a test to check nsfs ioctl-s
  nsfs: add ioctl to get a parent namespace
  nsfs: add ioctl to get an owning user namespace for ns file descriptor
  kernel: add a helper to get an owning user namespace for a namespace
  devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
  devpts: Remove sync_filesystems
  devpts: Make devpts_kill_sb safe if fsi is NULL
  devpts: Simplify devpts_mount by using mount_nodev
  devpts: Move the creation of /dev/pts/ptmx into fill_super
  devpts: Move parse_mount_options into fill_super
  userns: When the per user per user namespace limit is reached return ENOSPC
  userns; Document per user per user namespace limits.
  mntns: Add a limit on the number of mount namespaces.
  netns: Add a limit on the number of net namespaces
  cgroupns: Add a limit on the number of cgroup namespaces
  ipcns: Add a  limit on the number of ipc namespaces
  ...
2016-10-06 09:52:23 -07:00
Andy Lutomirski
ac496bf48d fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually
force a global TLB flush.

To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread
stacks per CPU.  This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell script style).

On my silly pthread_create() benchmark, it saves about 2 µs per
pthread_create()+join() with CONFIG_VMAP_STACK=y.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:54 +02:00
Andy Lutomirski
68f24b08ee sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
We currently keep every task's stack around until the task_struct
itself is freed.  This means that we keep the stack allocation alive
for longer than necessary and that, under load, we free stacks in
big batches whenever RCU drops the last task reference.  Neither of
these is good for reuse of cache-hot memory, and freeing in batches
prevents us from usefully caching small numbers of vmalloced stacks.

On architectures that have thread_info on the stack, we can't easily
change this, but on architectures that set THREAD_INFO_IN_TASK, we
can free it as soon as the task is dead.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 09:18:54 +02:00
Ingo Molnar
d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Linus Torvalds
b9677faf45 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rapidio/tsi721: fix incorrect detection of address translation condition
  rapidio/documentation/mport_cdev: add missing parameter description
  kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
  MAINTAINERS: Vladimir has moved
  mm, mempolicy: task->mempolicy must be NULL before dropping final reference
  printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
  treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
  drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
  mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
  lib/test_hash.c: fix warning in preprocessor symbol evaluation
  lib/test_hash.c: fix warning in two-dimensional array init
  kconfig: tinyconfig: provide whole choice blocks to avoid warnings
  kexec: fix double-free when failing to relocate the purgatory
  mm, oom: prevent premature OOM killer invocation for high order request
2016-09-01 18:23:22 -07:00
Michal Hocko
735f2770a7 kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
Commit fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted.  Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).

The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):

: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls.  This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping.  This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.

The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for.  It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.

[Changelog partly based on Andreas' description]
Fixes: fec1d01152 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: William Preston <wpreston@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andreas Schwab <schwab@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:02 -07:00
Linus Torvalds
511a8cdb65 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small patches to fix some bugs with the audit-by-executable
  functionality we introduced back in v4.3 (both patches are marked
  for the stable folks)"

* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix exe_file access in audit_exe_compare
  mm: introduce get_task_exe_file
2016-09-01 15:55:56 -07:00
Mateusz Guzik
cd81a9170e mm: introduce get_task_exe_file
For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-31 16:11:20 -04:00
Andy Lutomirski
ba14a194a4 fork: Add generic vmalloced stack support
If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
__vmalloc_node_range().

Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
for a long time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:11:41 +02:00
Balbir Singh
568ac88821 cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork.  It is also grabbed in write mode during
__cgroups_proc_write().  I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see

systemd

 __switch_to+0x1f8/0x350
 __schedule+0x30c/0x990
 schedule+0x48/0xc0
 percpu_down_write+0x114/0x170
 __cgroup_procs_write.isra.12+0xb8/0x3c0
 cgroup_file_write+0x74/0x1a0
 kernfs_fop_write+0x188/0x200
 __vfs_write+0x6c/0xe0
 vfs_write+0xc0/0x230
 SyS_write+0x6c/0x110
 system_call+0x38/0xb4

This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit.  The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.

 __switch_to+0x1f8/0x350
 __schedule+0x30c/0x990
 schedule+0x48/0xc0
 jbd2_log_wait_commit+0xd4/0x180
 ext4_evict_inode+0x88/0x5c0
 evict+0xf8/0x2a0
 dispose_list+0x50/0x80
 prune_icache_sb+0x6c/0x90
 super_cache_scan+0x190/0x210
 shrink_slab.part.15+0x22c/0x4c0
 shrink_zone+0x288/0x3c0
 do_try_to_free_pages+0x1dc/0x590
 try_to_free_pages+0xdc/0x260
 __alloc_pages_nodemask+0x72c/0xc90
 alloc_pages_current+0xb4/0x1a0
 page_table_alloc+0xc0/0x170
 __pte_alloc+0x58/0x1f0
 copy_page_range+0x4ec/0x950
 copy_process.isra.5+0x15a0/0x1870
 _do_fork+0xa8/0x4b0
 ppc_clone+0x8/0xc

In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.

This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork().  There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup.  This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.

tj: Subject and description edits.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-08-17 09:54:52 -04:00