There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
tsk->exit_state = EXIT_DEAD
pidfd_poll
if (tsk->exit_state)
However nothing prevents the following sequence:
CPU 0 CPU 1
------------------------------------------------
exit_notify
do_notify_parent
do_notify_pidfd
pidfd_poll
if (tsk->exit_state)
tsk->exit_state = EXIT_DEAD
This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.
To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.
Fixes: b53b0b9d9a6 ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b191d6491be67cef2b3fa83015561caca1394ab9)
Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: Ia9419ceac08497523c4d830160df49f582075070
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxbDFsACgkQONu9yGCS
aT4TmQ/+PCgz7TKTu1/Sh7ScmbSjlYFfaQ25K9NzqWDuaTOXCsvrWY4IDdbk7fxZ
WnM8WPpTUfKaRwUVL3T32uwv/uMM8TnXN2r4dB9TH+WF/UK1ovLHQGO5zGWS9erY
dvckRjkRuPe27KYd/sPSA7PRBiWuw1EFqcBIdAuC8OJK1knm3W95C8XVeIkVfey2
/NFIvZELYEOCMzXwPqXbeFG553VtHVGB2GRJqEqZzqZ+JGnUCyBpCgvuwbrQ706l
/Smvz23mmElUGzdTXr4uRPmf2g8bA3a+08/gQM4YPpZYRUMbcCVE3+KaYf3jjJPR
Inj+cKBhF8R6JwkJayzdBJ1KbG6e6DaL2yANqNvoeYCkyP+4GNmoWKebFYklrXr/
jn/btnJtHwSBDVufdgW6uQ039UE66Wfvvi7QrkCgTjebEyf6jU3SnB8Z7HpZZMCg
dwgO6rHKklm6cSJTBZsIDyBd4oj19bTTvOiUcaayQXfeyjlrbWXhH6UOEBze54uT
Ho8yjCq9r2TiO6g9mJOfkkvep1hgNVgabnOyA/papUx3my9sGkUn1Rqg1Pbm9OEF
PV4p853t4+Qe5iBq+yXgacrCg+21q9Xg3ct6QW/c/Quz4UirR9Ayw93C4ej4rJYZ
79/mADmkpH+A1ufP2F80UoWo0IN9oizecfYHOZ+6useUEVgIRik=
=QoXP
-----END PGP SIGNATURE-----
Merge 4.9.155 into android-4.9
Changes in 4.9.155
Fix "net: ipv4: do not handle duplicate fragments as overlapping"
fs: add the fsnotify call to vfs_iter_write
ipv6: Consider sk_bound_dev_if when binding a socket to an address
l2tp: copy 4 more bytes to linear part if necessary
net/mlx4_core: Add masking for a few queries on HCA caps
netrom: switch to sock timer API
net/rose: fix NULL ax25_cb kernel panic
ucc_geth: Reset BQL queue when stopping device
net/mlx5e: Allow MAC invalidation while spoofchk is ON
l2tp: remove l2specific_len dependency in l2tp_core
l2tp: fix reading optional fields of L2TPv3
ipvlan, l3mdev: fix broken l3s mode wrt local routes
CIFS: Do not count -ENODATA as failure for query directory
fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()
ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment
arm64: kaslr: ensure randomized quantities are clean also when kaslr is off
arm64: hyp-stub: Forbid kprobing of the hyp-stub
arm64: hibernate: Clean the __hyp_text to PoC after resume
gfs2: Revert "Fix loop in gfs2_rbm_find"
platform/x86: asus-nb-wmi: Map 0x35 to KEY_SCREENLOCK
platform/x86: asus-nb-wmi: Drop mapping of 0x33 and 0x34 scan codes
mmc: sdhci-iproc: handle mmc_of_parse() errors during probe
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
mm, oom: fix use-after-free in oom_kill_process
mm: hwpoison: use do_send_sig_info() instead of force_sig()
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
cifs: Always resolve hostname before reconnecting
drivers: core: Remove glue dirs from sysfs earlier
fs: don't scan the inode cache before SB_BORN is set
fanotify: fix handling of events on child sub-directory
Linux 4.9.155
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8fb335e078378c8426fabeed1ebee1fbf915690c upstream.
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.
zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list. In this case, we will have one unkillable process with one or
more dead children.
Thanks to Oleg for the advice to release tasks in find_child_reaper().
Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd2322c ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Every time _cpu_up() is called for a CPU, idle_thread_get() is called
which then re-initializes a CPU's idle thread that was already
previously created and cached in a global variable in
smpboot.c. idle_thread_get() calls init_idle() which then calls
__sched_fork(). __sched_fork() is where cpufreq_task_times_init() is,
and cpufreq_task_times_init() allocates memory for the task struct's
time_in_state array.
Since idle_thread_get() reuses a task struct instance that was already
previously created, this means that every time it calls init_idle(),
cpufreq_task_times_init() allocates this array again and overwrites
the existing allocation that the idle thread already had.
This causes memory to be leaked every time a CPU is onlined. In order
to fix this, move allocation of time_in_state into _do_fork to avoid
allocating it at all for idle threads. The cpufreq times interface is
intended to be used for tracking userspace tasks, so we can safely
remove it from the kernel's idle threads without killing any
functionality.
But that's not all!
Task structs can be freed outside of release_task(), which creates
another memory leak because a task struct can be freed without having
its cpufreq times allocation freed. To fix this, free the cpufreq
times allocation at the same time that task struct allocations are
freed, in free_task().
Since free_task() can also be called in error paths of copy_process()
after dup_task_struct(), set time_in_state to NULL immediately after
calling dup_task_struct() to avoid possible double free.
Bug description and fix adapted from patch submitted by
Sultan Alsawaf <sultanxda@gmail.com> at
https://android-review.googlesource.com/c/kernel/msm/+/700134
Bug: 110044919
Test: Hikey960 builds, boots & reports /proc/<pid>/time_in_state
correctly
Change-Id: I12fe7611fc88eb7f6c39f8f7629ad27b6ec4722c
Signed-off-by: Connor O'Brien <connoro@google.com>
Add empty versions of functions to cpufreq_times.h to cut down on use
of #ifdef in .c files.
Test: kernel builds with and without CONFIG_CPU_FREQ_TIMES=y
Change-Id: I49ac364fac3d42bba0ca1801e23b15081094fb12
Signed-off-by: Connor O'Brien <connoro@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlr/39UACgkQONu9yGCS
aT4YOhAAtLZwHeApo0kvcF8PzgMVuQoGmTlKxRoQZNdwP/EwI5mKGR+zfBdBd/iq
5KQ3ECs+q10Ch61dVUh28D0Rj18BHDXdIvFspLNzh32SfwBV43VfnfAB5zG6liVE
83z16XA+nJtR/yu7n4lClouh4CHIGr95H3Z7RAAPcw9oxyPGpoKxlOZ5KLz4sez6
7VifH3/t48RlDZ2VUVBmZup2QXvJl2eqYEIs+T9D22Fo9+28roFabhrl5o5JGz/q
qu6mFwIV+bpvtuVuX8SS/KLrL/zDVX9HokhOXa5IPZ2zJk1MnL82UoHzTPD5jCPr
lPQcZ95wiPShhrPJcavI+9G8I+cK6cltu+pKW+IfocLMwy5w0D8UH2qlI02rxj1h
xJFrS6NK2eB22VvIkxGbtYaT3Sz394sNEQhnB9/SwJj4ajsRNpLSveegWOn3JYzc
7NqyDT4zpQ/oj2ysurA6ks1hWXMKHFrsFJOak0v4dpnE3u5DW69xhRJsQneAUND5
7fmn55f4Sbut0UiMJZ0sRgGWPbleTR+ied48e6cEo2rwjlb7qSmXEYiyQXVBfPYs
eAU6IjUulVk6i7Z209lIUcwG6dGF3/umzuBvjQ3qRBTezoKDYmPt/Ci0dWmIXyVU
hmR/IorgK8jOyFIfKM2pNBELD2/FaBjhFNJAQyKu7DNXcgq07To=
=Ftrq
-----END PGP SIGNATURE-----
Merge 4.9.101 into android-4.9
Changes in 4.9.101
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
bridge: check iface upper dev when setting master via ioctl
dccp: fix tasklet usage
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
llc: better deal with too small mtu
net: ethernet: sun: niu set correct packet size in skb
net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
net/mlx4_en: Verify coalescing parameters are in range
net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
net_sched: fq: take care of throttled flows before reuse
net: support compat 64-bit time in {s,g}etsockopt
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
qmi_wwan: do not steal interfaces from class drivers
r8169: fix powering up RTL8168h
sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
tcp_bbr: fix to zero idle_restart only upon S/ACKed data
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
bonding: do not allow rlb updates to invalid mac
net/mlx5: Avoid cleaning flow steering table twice during error flow
bonding: send learning packets for vlans on slave
tcp: ignore Fast Open on repair mode
sctp: fix the issue that the cookie-ack with auth can't get processed
sctp: delay the authentication for the duplicated cookie-echo chunk
serial: sccnxp: Fix error handling in sccnxp_probe()
futex: Remove duplicated code and fix undefined behaviour
xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
lockd: lost rollback of set_grace_period() in lockd_down_net()
Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
l2tp: revert "l2tp: fix missing print session offset info"
nfp: TX time stamp packets before HW doorbell is rung
proc: do not access cmdline nor environ from file-backed areas
futex: futex_wake_op, fix sign_extend32 sign bits
kernel/exit.c: avoid undefined behaviour when calling wait4()
Linux 4.9.101
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 72339335
Bug: 70951257
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
Contains:
sched/tune: fix accounting for runnable tasks (1/5)
The accounting for tasks into boost groups of different CPUs is currently
broken mainly because:
a) we do not properly track the change of boost group of a RUNNABLE task
b) there are race conditions between migration code and accounting code
This patch provides a fixes to ensure enqueue/dequeue
accounting also for throttled tasks.
Without this patch is can happen that a task is enqueued into a throttled
RQ thus not being accounted for the boosting of the corresponding RQ.
We could argue that a throttled task should not boost a CPU, however:
a) properly implementing CPU boosting considering throttled tasks will
increase a lot the complexity of the solution
b) it's not easy to quantify the benefits introduced by such a more
complex solution
Since task throttling requires the usage of the CFS bandwidth controller,
which is not widely used on mobile systems (at least not by Android kernels
so far), for the time being we go for the simple solution and boost also
for throttled RQs.
sched/tune: fix accounting for runnable tasks (2/5)
This patch provides the code required to enforce proper locking.
A per boost group spinlock has been added to grant atomic
accounting of tasks as well as to serialise enqueue/dequeue operations,
triggered by tasks migrations, with cgroups's attach/detach operations.
sched/tune: fix accounting for runnable tasks (3/5)
This patch adds cgroups {allow,can,cancel}_attach callbacks.
Since a task can be migrated between boost groups while it's running,
the CGroups's attach callbacks have been added to properly migrate
boost contributions of RUNNABLE tasks.
The RQ's lock is used to serialise enqueue/dequeue operations, triggered
by tasks migrations, with cgroups's attach/detach operations. While the
SchedTune's CPU lock is used to grant atrocity of the accounting within
the CPU.
NOTE: the current implementation does not allows a concurrent CPU migration
and CGroups change.
sched/tune: fix accounting for runnable tasks (4/5)
This fixes accounting for exiting tasks by adding a dedicated call early
in the do_exit() syscall, which disables SchedTune accounting as soon as a
task is flagged PF_EXITING.
This flag is set before the multiple dequeue/enqueue dance triggered
by cgroup_exit() which is useful only to inject useless tasks movements
thus increasing possibilities for race conditions with the migration code.
The schedtune_exit_task() call does the last dequeue of a task from its
current boost group. This is a solution more aligned with what happens in
mainline kernels (>v4.4) where the exit_cgroup does not move anymore a dying
task to the root control group.
sched/tune: fix accounting for runnable tasks (5/5)
To avoid accounting issues at startup, this patch disable the SchedTune
accounting until the required data structures have been properly
initialized.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
Exactly because for_each_thread() in autogroup_move_group() can't see it
and update its ->sched_task_group before _put() and possibly free().
So the exiting task needs another sched_move_task() before exit_notify()
and we need to re-introduce the PF_EXITING (or similar) check removed by
the previous change for another reason.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Cc: vlovejoy@redhat.com
Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are no users of exit_oom_victim on !current task anymore so enforce
the API to always work on the current.
Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
context switch, we can avoid the TASK_DEAD special case currently in
__schedule() because that avoids the extra preempt_disable() from
schedule().
In order to facilitate this, create a do_task_dead() helper which we
place in the scheduler code, such that it can access __schedule().
Also add some __noreturn annotations to the functions, there's no
coming back from do_exit().
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Cheng Chao <cs.os.kernel@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: chris@chris-wilson.co.uk
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions. It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:
BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current <-- use after free
depot_save_stack
save_stack
kasan_slab_free
kmem_cache_free
__mpol_put <-- free
do_exit
This patch sets current->mempolicy to NULL before dropping the final
reference.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many targets enable CONFIG_DEBUG_STACK_USAGE, and while the information
is useful, it isn't worthy of pr_warn(). Reduce it to pr_info().
Link: http://lkml.kernel.org/r/1466982072-29836-1-git-send-email-anton@ozlabs.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler updates from Ingo Molnar:
- introduce and use task_rcu_dereference()/try_get_task_struct() to fix
and generalize task_struct handling (Oleg Nesterov)
- do various per entity load tracking (PELT) fixes and optimizations
(Peter Zijlstra)
- cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)
- introduce consolidated cputime output file cpuacct.usage_all and
related refactorings (Zhao Lei)
- ... plus misc fixes and enhancements
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
sched/fair: Rework throttle_count sync
sched/core: Fix sched_getaffinity() return value kerneldoc comment
sched/fair: Reorder cgroup creation code
sched/fair: Apply more PELT fixes
sched/fair: Fix PELT integrity for new tasks
sched/cgroup: Fix cpu_cgroup_fork() handling
sched/fair: Fix PELT integrity for new groups
sched/fair: Fix and optimize the fork() path
sched/cputime: Add steal time support to full dynticks CPU time accounting
sched/cputime: Fix prev steal time accouting during CPU hotplug
KVM: Fix steal clock warp during guest CPU hotplug
sched/debug: Always show 'nr_migrations'
sched/fair: Use task_rcu_dereference()
sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
sched/idle: Optimize the generic idle loop
sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() usecase, as that was somewhat stale/obscure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Generally task_struct is only protected by RCU if it was found on a
RCU protected list (say, for_each_process() or find_task_by_vpid()).
As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means
that we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().
Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().
( Also add try_get_task_struct() to make it easier to use this API
correctly. )
Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Updated comments; added try_get_task_struct()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I see no reason why waitid() can't support other linux-specific flags
allowed in sys_wait4().
In particular this change can help if we reconsider the previous change
("wait/ptrace: assume __WALL if the child is traced") which adds the
"automagical" __WALL for debugger.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following program (simplified version of generated by syzkaller)
#include <pthread.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <stdio.h>
#include <signal.h>
void *thread_func(void *arg)
{
ptrace(PTRACE_TRACEME, 0,0,0);
return 0;
}
int main(void)
{
pthread_t thread;
if (fork())
return 0;
while (getppid() != 1)
;
pthread_create(&thread, NULL, thread_func, NULL);
pthread_join(thread, NULL);
return 0;
}
creates an unreapable zombie if /sbin/init doesn't use __WALL.
This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap the
leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most init's use sys_waitid() which doesn't allow __WALL, so
the necessary user-space fix is not that trivial.
This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.
This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this won't
break something, but at least strace/gdb won't suffer.
We could make a more conservative change. Say, we can take __WCLONE into
account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to call exit_thread from copy_process in a fail path. So make it
accept task_struct as a parameter.
[v2]
* s390: exit_thread_runtime_instr doesn't make sense to be called for
non-current tasks.
* arm: fix the comment in vfp_thread_copy
* change 'me' to 'tsk' for task_struct
* now we can change only archs that actually have exit_thread
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_stopped_code()->task_is_stopped_or_traced() doesn't look right, the
traced task must never be TASK_STOPPED.
We can not add WARN_ON(task_is_stopped(p)), but this is only because
do_wait() can race with PTRACE_ATTACH from another thread.
[akpm@linux-foundation.org: teeny cleanup]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Roland McGrath <roland@hack.frob.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:
- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)
- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)
- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
Currently, __srcu_read_lock() cannot be invoked from restricted
environments because it contains calls to preempt_disable() and
preempt_enable(), both of which can invoke lockdep, which is a bad
idea in some restricted execution modes. This commit therefore moves
the preempt_disable() and preempt_enable() from __srcu_read_lock()
to srcu_read_lock(). It also inserts the preempt_disable() and
preempt_enable() around the call to __srcu_read_lock() in do_exit().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When we warn about a preempt_count leak; reset the preempt_count to
the known good value such that the problem does not ripple forward.
This is most important on x86 which has a per cpu preempt_count that is
not saved/restored (after this series). So if you schedule with an
invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is
messed up too.
Enforcing this invariant limits the borkage to just the one task.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
s,critiera,criteria,
While at it, add a comma, because it makes sense grammatically.
Signed-off-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
There is a helpful comment in do_exit() that states we sync the mm's RSS
info before statistics gathering.
The function that does the statistics gathering is called right above that
comment.
Change the code to obey the comment.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.
While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of exec_domain are gone, now we can get rid
of that abandoned feature.
To not break existing userspace we keep a dummy
/proc/execdomains file which will always contain
"0-0 Linux [kernel]".
Signed-off-by: Richard Weinberger <richard@nod.at>
Commit 5695be142e ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.
Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.
oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.
oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.
The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.
out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.
Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen. But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation. In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen. This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.
The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.
The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_. As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably. I can
only speculate what might happen when a task is still runnable
unexpectedly.
On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.
I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
echo processors > /sys/power/pm_test
while true
do
echo mem > /sys/power/state
sleep 1s
done
- simple module which allocates and frees 20M in 8K chunks. If it sees
freezing(current) then it tries another round of allocation before calling
try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
I think this should be OK because __thaw_task shouldn't interfere with any
locking down wake_up_process. Oleg?
As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.
[ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[ 243.636072] (elapsed 2.837 seconds) done.
[ 243.641985] Trying to disable OOM killer
[ 243.643032] Waiting for concurent OOM victims
[ 243.644342] OOM killer disabled
[ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[ 243.652983] Suspending console(s) (use no_console_suspend to debug)
[ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[ 243.992600] PM: suspend of devices complete after 336.667 msecs
[ 243.993264] PM: late suspend of devices complete after 0.660 msecs
[ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[ 243.994717] ACPI: Preparing to enter system sleep state S3
[ 243.994795] PM: Saving platform NVS memory
[ 243.994796] Disabling non-boot CPUs ...
The first 2 patches are simple cleanups for OOM. They should go in
regardless the rest IMO.
Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.
The main patch is the last one and I would appreciate acks from Tejun and
Rafael. I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code. I have found several
surprises there.
This patch (of 5):
This patch is just a preparatory and it doesn't introduce any functional
change.
Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and
both checks can fail if we race with EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
change in between, gcc needs to reload p->exit_state after
security_task_wait(). In this case ->notask_error will be wrongly
cleared and do_wait() can hang forever if it was the last eligible
child.
Many thanks to Arne who carefully investigated the problem.
Note: this bug is very old but it was pure theoretical until commit
b3ab03160d ("wait: completely ignore the EXIT_DEAD tasks"). Before
this commit "-O2" was probably enough to guarantee that compiler won't
read ->exit_state twice.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Arne Goedeke <el@laramies.com>
Tested-by: Arne Goedeke <el@laramies.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's the big tty/serial driver update for 3.19-rc1.
There are a number of TTY core changes/fixes in here from Peter Hurley
that have all been teted in linux-next for a long time now. There are
also the normal serial driver updates as well, full details in the
changelog below.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlSOD/MACgkQMUfUDdst+ymW+wCfbSzoYMRObIImMPWfoQtxkvvN
rpkAnAtyEP/zZIfkQIuKTSH6FJxocF8V
=WZt3
-----END PGP SIGNATURE-----
Merge tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here's the big tty/serial driver update for 3.19-rc1.
There are a number of TTY core changes/fixes in here from Peter Hurley
that have all been teted in linux-next for a long time now. There are
also the normal serial driver updates as well, full details in the
changelog below"
* tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
serial: pxa: hold port.lock when reporting modem line changes
tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
tty: Deletion of unnecessary checks before two function calls
n_tty: Fix read_buf race condition, increment read_head after pushing data
serial: of-serial: add PM suspend/resume support
Revert "serial: of-serial: add PM suspend/resume support"
Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
serial: 8250: don't attempt a trylock if in sysrq
serial: core: Add big-endian iotype
serial: samsung: use port->fifosize instead of hardcoded values
serial: samsung: prefer to use fifosize from driver data
serial: samsung: fix style problems
serial: samsung: wait for transfer completion before clock disable
serial: icom: fix error return code
serial: tegra: clean up tty-flag assignments
serial: Fix io address assign flow with Fintek PCI-to-UART Product
serial: mxs-auart: fix tx_empty against shift register
serial: mxs-auart: fix gpio change detection on interrupt
serial: mxs-auart: Fix mxs_auart_set_ldisc()
serial: 8250_dw: Use 64-bit access for OCTEON.
...
After the previous change we can add just the exiting EXIT_DEAD task to
the "dead" list and remove another release_task(tsk).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shift "release dead children" loop from forget_original_parent() to its
caller, exit_notify(). It is safe to reap them even if our parent reaps
us right after we drop tasklist_lock, those children no longer have any
connection to the exiting task.
And this allows us to avoid write_lock_irq(tasklist_lock) right after it
was released by forget_original_parent(), we can simply call it with
tasklist_lock held.
While at it, move the comment about forget_original_parent() up to
this function.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that pid_ns logic was isolated we can change forget_original_parent()
to return right after find_child_reaper() when father->children is empty,
there is nothing to reparent in this case.
In particular this avoids find_alive_thread() and this can help if the
whole process exits and it has a lot of PF_EXITING threads at the start of
the thread list, this can easily lead to O(nr_threads ** 2) iterations.
Trivial test case (tested under KVM, 2 CPUs):
static void *tfunc(void *arg)
{
pause();
return NULL;
}
static int child(unsigned int nt)
{
pthread_t pt;
while (nt--)
assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);
pthread_kill(pt, SIGTRAP);
pause();
return 0;
}
int main(int argc, const char *argv[])
{
int stat;
unsigned int nf = atoi(argv[1]);
unsigned int nt = atoi(argv[2]);
while (nf--) {
if (!fork())
return child(nt);
wait(&stat);
assert(stat == SIGTRAP);
}
return 0;
}
$ time ./test 16 16536 shows:
real user sys
- 5m37.628s 0m4.437s 8m5.560s
+ 0m50.032s 0m7.130s 1m4.927s
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the new simple helper to factor out the for_each_thread() code in
find_child_reaper() and find_new_reaper(). It can also simplify the
potential PF_EXITING -> exit_state change, plus perhaps we can change this
code to take SIGNAL_GROUP_EXIT into account.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() does 2 completely different things. Not only it finds a
reaper, it also updates pid_ns->child_reaper or kills the whole namespace
if the caller is ->child_reaper.
Now that has_child_subreaper logic doesn't depend on child_reaper check we
can move that pid_ns code into a separate helper. IMHO this makes the
code more clean, and this allows the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap the "init_task" and same_thread_group() checks. This way it is more
simple to document these checks and we can remove the link to the previous
discussion on lkml.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change find_new_reaper() to use for_each_thread() instead of deprecated
while_each_thread(). We do not bother to check "thread != father" in the
1st loop, we can rely on PF_EXITING check.
Note: this means the minor behavioural change: for_each_thread() starts
from the group leader. But this should be fine, nobody should make any
assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks.
And this can avoid the pointless reparenting to a short-living thread
While zombie leaders are not that common.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:
1. In fact it is safe if "pid_ns->child_reaper == father"; there must
be no children after zap_pid_ns_processes() returns, so it doesn't
matter what we return in this case and even pid_ns->child_reaper is
wrong otherwise: we can't reparent to ->child_reaper == current.
This is not a bug, but this is confusing.
2. It is not safe if we are not pid_ns->child_reaper but from the same
thread group. We drop tasklist_lock before zap_pid_ns_processes(),
so another thread can lock it and choose the new reaper from the
upper namespace if has_child_subreaper == T, and this is obviously
wrong.
This is not that bad, zap_pid_ns_processes() won't return until the
the new reaper reaps all zombies, but this should be fixed anyway.
We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac62706a, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but lets change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.
We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
but returns another "reaper" thread which can be dead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Contrary to what the comment in __exit_signal() says we do account the
group leader. Fix this and explain why.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_zombie() no longer needs tasklist_lock to accumulate the
psig->c* counters, we can drop it right after cmpxchg(exit_state).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>