Patch series "kvmalloc", v5.
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
Most callers, I could find, have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.
[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
This patch (of 9):
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.
This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL). Those need to be
fixed separately.
While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.
[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Angelo G. Del Regno <kholk11@gmail.com>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: DennySPB <dennyspb@gmail.com>
,
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl5ZJvIACgkQONu9yGCS
aT4YXxAAkGwu5Kr2xdjQVfkG/P/v0DBf1QJ1xxUGlOt0eql0Kl9aiDSOoHDwLBpX
L+nCaXdH5miLR/U3bhV95UW2uR9Ma/Q4hSyWv8ESzMvv3ov8zauCaCjRpIY00+0X
2HjEl50SV1ST5fb8U55JIaYp2kNGXLsrnL2/J7NiYq/gv2+2qAQVup3rnTWk4YMB
Zz3XCoyVw5m1Wn3D5sPe3S6z0TQq5wAqzH2hFgHIkOddIEbD+ZBtWWcWVSuVfBu6
eyJBSNz75SV7eLK6MUVQyfKMIA0rQpidGHPO9R9QspS9wUT/PG4veCGsrvIXOQKG
wD/PbVlvGCI0aj87dbvJjd4N24HEFPM85Kq5KcJdFYwSTgpf9h/R/Y3amZ1EzS1Y
+7IzDWAdw2jqWo7Ys78bBoT2BGAHTkRbj93jTVv8cjow88URHQeakIyFPh2+nQg3
+ky04NGy/eWzNRNwBB6BdpukZZ5WVwcmYq7VhDkHLOvqhH5Q2eom+pUKgLszvZCY
U/Nv2wCQB+gVXEAZ1LNqG8uePBkJXZSwCEAoMTQMAC0riiJSAtS1BZPnr+x2ZdJD
OrN7DsFvTeJLsh/SDg+s9mVvO4YotQ5JcBbmGdbXDjuEIaxpncDIo5MM9y5CE/Ej
hW6n8tNPbIZfkzEDsBMkhHbUoDtKLgdZql7dMmeVWIsRi3UfrVM=
=EsN0
-----END PGP SIGNATURE-----
Merge 4.9.215 into android-4.9-q
Changes in 4.9.215
x86/vdso: Use RDPID in preference to LSL when available
KVM: x86: emulate RDPID
ALSA: hda: Use scnprintf() for printing texts for sysfs/procfs
ecryptfs: fix a memory leak bug in parse_tag_1_packet()
ecryptfs: fix a memory leak bug in ecryptfs_init_messaging()
ALSA: usb-audio: Apply sample rate quirk for Audioengine D1
ext4: don't assume that mmp_nodename/bdevname have NUL
ext4: fix checksum errors with indexed dirs
ext4: improve explanation of a mount failure caused by a misconfigured kernel
Btrfs: fix race between using extent maps and merging them
btrfs: log message when rw remount is attempted with unclean tree-log
perf/x86/amd: Add missing L2 misses event spec to AMD Family 17h's event map
padata: Remove broken queue flushing
s390/time: Fix clk type in get_tod_clock
perf/x86/intel: Fix inaccurate period in context switch for auto-reload
hwmon: (pmbus/ltc2978) Fix PMBus polling of MFR_COMMON definitions.
jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()
jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer
btrfs: print message when tree-log replay starts
scsi: qla2xxx: fix a potential NULL pointer dereference
Revert "KVM: VMX: Add non-canonical check on writes to RTIT address MSRs"
drm/gma500: Fixup fbdev stolen size usage evaluation
cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order
brcmfmac: Fix use after free in brcmf_sdio_readframes()
gianfar: Fix TX timestamping with a stacked DSA driver
pinctrl: sh-pfc: sh7264: Fix CAN function GPIOs
pxa168fb: Fix the function used to release some memory in an error handling path
media: i2c: mt9v032: fix enum mbus codes and frame sizes
powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number
gpio: gpio-grgpio: fix possible sleep-in-atomic-context bugs in grgpio_irq_map/unmap()
media: sti: bdisp: fix a possible sleep-in-atomic-context bug in bdisp_device_run()
pinctrl: baytrail: Do not clear IRQ flags on direct-irq enabled pins
efi/x86: Map the entire EFI vendor string before copying it
MIPS: Loongson: Fix potential NULL dereference in loongson3_platform_init()
sparc: Add .exit.data section.
uio: fix a sleep-in-atomic-context bug in uio_dmem_genirq_irqcontrol()
usb: gadget: udc: fix possible sleep-in-atomic-context bugs in gr_probe()
jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info when load journal
x86/sysfb: Fix check for bad VRAM size
tracing: Fix tracing_stat return values in error handling paths
tracing: Fix very unlikely race of registering two stat tracers
ext4, jbd2: ensure panic when aborting with zero errno
kconfig: fix broken dependency in randconfig-generated .config
clk: qcom: rcg2: Don't crash if our parent can't be found; return an error
drm/amdgpu: remove 4 set but not used variable in amdgpu_atombios_get_connector_info_from_object_table
regulator: rk808: Lower log level on optional GPIOs being not available
net/wan/fsl_ucc_hdlc: reject muram offsets above 64K
PCI/IOV: Fix memory leak in pci_iov_add_virtfn()
NFC: port100: Convert cpu_to_le16(le16_to_cpu(E1) + E2) to use le16_add_cpu().
media: v4l2-device.h: Explicitly compare grp{id,mask} to zero in v4l2_device macros
reiserfs: Fix spurious unlock in reiserfs_fill_super() error handling
ALSA: usx2y: Adjust indentation in snd_usX2Y_hwdep_dsp_status
b43legacy: Fix -Wcast-function-type
ipw2x00: Fix -Wcast-function-type
iwlegacy: Fix -Wcast-function-type
rtlwifi: rtl_pci: Fix -Wcast-function-type
orinoco: avoid assertion in case of NULL pointer
ACPICA: Disassembler: create buffer fields in ACPI_PARSE_LOAD_PASS1
scsi: aic7xxx: Adjust indentation in ahc_find_syncrate
drm/mediatek: handle events when enabling/disabling crtc
ARM: dts: r8a7779: Add device node for ARM global timer
x86/vdso: Provide missing include file
PM / devfreq: rk3399_dmc: Add COMPILE_TEST and HAVE_ARM_SMCCC dependency
pinctrl: sh-pfc: sh7269: Fix CAN function GPIOs
RDMA/rxe: Fix error type of mmap_offset
ALSA: sh: Fix compile warning wrt const
tools lib api fs: Fix gcc9 stringop-truncation compilation error
usbip: Fix unsafe unaligned pointer usage
udf: Fix free space reporting for metadata and virtual partitions
soc/tegra: fuse: Correct straps' address for older Tegra124 device trees
rcu: Use WRITE_ONCE() for assignments to ->pprev for hlist_nulls
Input: edt-ft5x06 - work around first register access error
wan: ixp4xx_hss: fix compile-testing on 64-bit
ASoC: atmel: fix build error with CONFIG_SND_ATMEL_SOC_DMA=m
tty: synclinkmp: Adjust indentation in several functions
tty: synclink_gt: Adjust indentation in several functions
driver core: platform: Prevent resouce overflow from causing infinite loops
driver core: Print device when resources present in really_probe()
vme: bridges: reduce stack usage
drm/nouveau/gr/gk20a,gm200-: add terminators to method lists read from fw
drm/nouveau: Fix copy-paste error in nouveau_fence_wait_uevent_handler
drm/vmwgfx: prevent memory leak in vmw_cmdbuf_res_add
usb: musb: omap2430: Get rid of musb .set_vbus for omap2430 glue
iommu/arm-smmu-v3: Use WRITE_ONCE() when changing validity of an STE
scsi: iscsi: Don't destroy session if there are outstanding connections
arm64: fix alternatives with LLVM's integrated assembler
pwm: omap-dmtimer: Remove PWM chip in .remove before making it unfunctional
cmd64x: potential buffer overflow in cmd64x_program_timings()
ide: serverworks: potential overflow in svwks_set_pio_mode()
remoteproc: Initialize rproc_class before use
x86/decoder: Add TEST opcode to Group3-2
s390/ftrace: generate traced function stack frame
driver core: platform: fix u32 greater or equal to zero comparison
ALSA: hda - Add docking station support for Lenovo Thinkpad T420s
powerpc/sriov: Remove VF eeh_dev state when disabling SR-IOV
jbd2: switch to use jbd2_journal_abort() when failed to submit the commit record
ARM: 8951/1: Fix Kexec compilation issue.
hostap: Adjust indentation in prism2_hostapd_add_sta
iwlegacy: ensure loop counter addr does not wrap and cause an infinite loop
cifs: fix NULL dereference in match_prepath
irqchip/gic-v3: Only provision redistributors that are enabled in ACPI
drm/nouveau/disp/nv50-: prevent oops when no channel method map provided
ftrace: fpid_next() should increase position index
trigger_next should increase position index
radeon: insert 10ms sleep in dce5_crtc_load_lut
ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()
lib/scatterlist.c: adjust indentation in __sg_alloc_table
reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
bcache: explicity type cast in bset_bkey_last()
irqchip/gic-v3-its: Reference to its_invall_cmd descriptor when building INVALL
iwlwifi: mvm: Fix thermal zone registration
microblaze: Prevent the overflow of the start
brd: check and limit max_part par
help_next should increase position index
selinux: ensure we cleanup the internal AVC counters on error in avc_update()
enic: prevent waking up stopped tx queues over watchdog reset
net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
floppy: check FDC index for errors before assigning it
vt: selection, handle pending signals in paste_selection
staging: android: ashmem: Disallow ashmem memory from being remapped
staging: vt6656: fix sign of rx_dbm to bb_pre_ed_rssi.
xhci: Force Maximum Packet size for Full-speed bulk devices to valid range.
usb: uas: fix a plug & unplug racing
USB: Fix novation SourceControl XL after suspend
USB: hub: Don't record a connect-change event during reset-resume
staging: rtl8188eu: Fix potential security hole
staging: rtl8188eu: Fix potential overuse of kernel memory
x86/mce/amd: Publish the bank pointer only after setup has succeeded
x86/mce/amd: Fix kobject lifetime
tty/serial: atmel: manage shutdown in case of RS485 or ISO7816 mode
tty: serial: imx: setup the correct sg entry for tx dma
Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"
xhci: apply XHCI_PME_STUCK_QUIRK to Intel Comet Lake platforms
KVM: x86: don't notify userspace IOAPIC on edge-triggered interrupt EOI
VT_RESIZEX: get rid of field-by-field copyin
vt: vt_ioctl: fix race in VT_RESIZEX
lib/stackdepot.c: fix global out-of-bounds in stack_slabs
KVM: nVMX: Don't emulate instructions in guest mode
netfilter: xt_bpf: add overflow checks
ext4: fix a data race in EXT4_I(inode)->i_disksize
ext4: add cond_resched() to __ext4_find_entry()
ext4: fix mount failure with quota configured as module
ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
KVM: nVMX: Refactor IO bitmap checks into helper function
KVM: nVMX: Check IO instruction VM-exit conditions
KVM: apic: avoid calculating pending eoi from an uninitialized val
Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
scsi: Revert "RDMA/isert: Fix a recently introduced regression related to logout"
scsi: Revert "target: iscsi: Wait for all commands to finish before freeing a session"
usb: gadget: composite: Fix bMaxPower for SuperSpeedPlus
staging: greybus: use after free in gb_audio_manager_remove_all()
ecryptfs: replace BUG_ON with error handling code
ALSA: rawmidi: Avoid bit fields for state flags
ALSA: seq: Avoid concurrent access to queue flags
ALSA: seq: Fix concurrent access to queue current tick/time
netfilter: xt_hashlimit: limit the max size of hashtable
ata: ahci: Add shutdown to freeze hardware resources of ahci
xen: Enable interrupts when calling _cond_resched()
s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
Linux 4.9.215
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4c663321dde48cd2a324e59acb70c99f75f9344e
commit edf28f4061afe4c2d9eb1c3323d90e882c1d6800 upstream.
This reverts commit a979558448.
Commit a979558448 ("ipc,sem: remove uneeded sem_undo_list lock usage
in exit_sem()") removes a lock that is needed. This leads to a process
looping infinitely in exit_sem() and can also lead to a crash. There is
a reproducer available in [1] and with the commit reverted the issue
does not reproduce anymore.
Using the reproducer found in [1] is fairly easy to reach a point where
one of the child processes is looping infinitely in exit_sem between
for(;;) and if (semid == -1) block, while it's trying to free its last
sem_undo structure which has already been freed by freeary().
Each sem_undo struct is on two lists: one per semaphore set (list_id)
and one per process (list_proc). The list_id list tracks undos by
semaphore set, and the list_proc by process.
Undo structures are removed either by freeary() or by exit_sem(). The
freeary function is invoked when the user invokes a syscall to remove a
semaphore set. During this operation freeary() traverses the list_id
associated with the semaphore set and removes the undo structures from
both the list_id and list_proc lists.
For this case, exit_sem() is called at process exit. Each process
contains a struct sem_undo_list (referred to as "ulp") which contains
the head for the list_proc list. When the process exits, exit_sem()
traverses this list to remove each sem_undo struct. As in freeary(),
whenever a sem_undo struct is removed from list_proc, it is also removed
from the list_id list.
Removing elements from list_id is safe for both exit_sem() and freeary()
due to sem_lock(). Removing elements from list_proc is not safe;
freeary() locks &un->ulp->lock when it performs
list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
removed by commit a979558448 ("ipc,sem: remove uneeded sem_undo_list
lock usage in exit_sem()").
This can result in the following situation while executing the
reproducer [1] : Consider a child process in exit_sem() and the parent
in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).
- The list_proc for the child contains the last two undo structs A and
B (the rest have been removed either by exit_sem() or freeary()).
- The semid for A is 1 and semid for B is 2.
- exit_sem() removes A and at the same time freeary() removes B.
- Since A and B have different semid sem_lock() will acquire different
locks for each process and both can proceed.
The bug is that they remove A and B from the same list_proc at the same
time because only freeary() acquires the ulp lock. When exit_sem()
removes A it makes ulp->list_proc.next to point at B and at the same
time freeary() removes B setting B->semid=-1.
At the next iteration of for(;;) loop exit_sem() will try to remove B.
The only way to break from for(;;) is for (&un->list_proc ==
&ulp->list_proc) to be true which is not. Then exit_sem() will check if
B->semid=-1 which is and will continue looping in for(;;) until the
memory for B is reallocated and the value at B->semid is changed.
At that point, exit_sem() will crash attempting to unlink B from the
lists (this can be easily triggered by running the reproducer [1] a
second time).
To prove this scenario instrumentation was added to keep information
about each sem_undo (un) struct that is removed per process and per
semaphore set (sma).
CPU0 CPU1
[caller holds sem_lock(sma for A)] ...
freeary() exit_sem()
... ...
... sem_lock(sma for B)
spin_lock(A->ulp->lock) ...
list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc)
Undo structures A and B have different semid and sem_lock() operations
proceed. However they belong to the same list_proc list and they are
removed at the same time. This results into ulp->list_proc.next
pointing to the address of B which is already removed.
After reverting commit a979558448 ("ipc,sem: remove uneeded
sem_undo_list lock usage in exit_sem()") the issue was no longer
reproducible.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1694779
Link: http://lkml.kernel.org/r/20191211191318.11860-1-ioanna-maria.alifieraki@canonical.com
Fixes: a979558448 ("ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()")
Signed-off-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Herton R. Krzesinski <herton@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <malat@debian.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1JqvYACgkQONu9yGCS
aT78HA//TVAfvVC0NqHbLFIIgkkZ1ygQc9A7K0hhkdCvZXcwSQgaxWO3uhHOeVNz
yK2RS5GqxsNQaSnsmO66cfb6+B1YzmnRuXbZsKU9AcA+rs5rhRGZx1Ax+nMxbpeC
zQ6Yi9aATa+R5O/yt3emtxL/Ml7W82sb0kE/o2YGJBDamiOmWEAnf6ROTz9euzZ3
bm2ifHNDiN5EG49J6PgQLXGNca9YqxFmpH6bts+mdNIBmv1M46QyD4nMtke3q4fz
XDLy9RoPXvbUt9i+xRf18Z5lHMSMqdS9+rU7iOK5E3mir9slWqTNLHN6B7dObvH3
lmICvdYUo+CucVHMHlihUNtPWpw4q1Ml5f8gY9p9qNE7WNErbt8Do9hCi7RKEpwL
o1jIVoJcWdeElbTsMMOnqnypOX54JsE4NrValuuuRuC8LqqsFe6AxbOoqouQHzqW
TOuJFw6thJlYXhsZCNciNLSlyD7PbobeBuZ+6DSfeGhqmGR0/5+Aq9bbZmxsLd0p
OMsnwkiboYBhTn/IaC0jpIVwqcI+NvKQG0fwlAR0KqILwLrvbe62WMQS7OxoKxix
4fj5B1tZMHKWKMdE0GPFE0f2bfVMJMMnF+ra8ilgLLI1cuOPGwjXHpT0Nio9Uqtb
UQ2Q9R8MpxUCYkOLy/22U0cNrc29ND9dzp9mkaGpDH0PJUwVg2s=
=i+NY
-----END PGP SIGNATURE-----
Merge 4.9.188 into android-4.9-q
Changes in 4.9.188
ARM: riscpc: fix DMA
ARM: dts: rockchip: Make rk3288-veyron-minnie run at hs200
ARM: dts: rockchip: Make rk3288-veyron-mickey's emmc work again
ARM: dts: rockchip: Mark that the rk3288 timer might stop in suspend
ftrace: Enable trampoline when rec count returns back to one
kernel/module.c: Only return -EEXIST for modules that have finished loading
MIPS: lantiq: Fix bitfield masking
dmaengine: rcar-dmac: Reject zero-length slave DMA requests
fs/adfs: super: fix use-after-free bug
btrfs: fix minimum number of chunk errors for DUP
ceph: fix improper use of smp_mb__before_atomic()
ceph: return -ERANGE if virtual xattr value didn't fit in buffer
scsi: zfcp: fix GCC compiler warning emitted with -Wmaybe-uninitialized
ACPI: fix false-positive -Wuninitialized warning
be2net: Signal that the device cannot transmit during reconfiguration
x86/apic: Silence -Wtype-limits compiler warnings
x86: math-emu: Hide clang warnings for 16-bit overflow
mm/cma.c: fail if fixed declaration can't be honored
coda: add error handling for fget
coda: fix build using bare-metal toolchain
uapi linux/coda_psdev.h: move upc_req definition from uapi to kernel side headers
drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
ipc/mqueue.c: only perform resource calculation if user valid
x86/kvm: Don't call kvm_spurious_fault() from .fixup
x86, boot: Remove multiple copy of static function sanitize_boot_params()
kbuild: initialize CLANG_FLAGS correctly in the top Makefile
Btrfs: fix incremental send failure after deduplication
mmc: dw_mmc: Fix occasional hang after tuning on eMMC
gpiolib: fix incorrect IRQ requesting of an active-low lineevent
selinux: fix memory leak in policydb_init()
s390/dasd: fix endless loop after read unit address configuration
drivers/perf: arm_pmu: Fix failure path in PM notifier
xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()
IB/mlx5: Fix RSS Toeplitz setup to be aligned with the HW specification
coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
infiniband: fix race condition between infiniband mlx4, mlx5 driver and core dumping
coredump: fix race condition between collapse_huge_page() and core dumping
eeprom: at24: make spd world-readable again
Backport minimal compiler_attributes.h to support GCC 9
include/linux/module.h: copy __init/__exit attrs to init/cleanup_module
objtool: Support GCC 9 cold subfunction naming scheme
x86, mm, gup: prevent get_page() race with munmap in paravirt guest
Linux 4.9.188
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit a318f12ed8843cfac53198390c74a565c632f417 ]
Andreas Christoforou reported:
UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771
Which could be triggered by:
struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};
if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");
mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".
The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.
Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Andreas Christoforou <andreaschristofo@gmail.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl0Nx/UACgkQONu9yGCS
aT78vQ/9FibC80cVhM69HzZko3evlbEQmPZkmgjwkEqJTZ+2qczRZR6sM2Oexx9q
tLBeRU1+3cPe9Pl/SsEZVM/rTqSrxcRXnGgyG6RMq8wBYtBWyiwRFLRJ2j5zw//s
Wg0LXJhaqLV9FfP34PaPkXmRdBOec2nw5N0U1uoA/0VPwLFRfPTMn8lsfsmbtLWw
JlXMlpEYosGQJArYMadMOntnCwaPtUJtd4EfLTqb+Yc23LM0RTYhtwn+XKE8cJzj
1WMWHhTRkd+RtZIlsgw+FW1iX8pAHvrAPq0b2xOpmq5CCmErHfqjmeZ7VY6SUdAf
RqnTDFb/yNAdhbjklzB5+o/jAl/yv1nbfOGmbJrZNVxzZ3RqKfUJNDpVQPwJlgAc
frCJ2+ZuBVnJWcvbyaR35DgJlAxEZ26JVP17D8oywXnQLpemKW++bvMB6J3kL4qD
nN5bmPd5jCe89Aawo66akIVN1UOnbBlzkdiXEXEzxwvKMraYcoSnoL6AmFxzjcBp
HV/mUy0L7umBbtz1l9aXiACoGtUwZbbsfNTDeiXadbnZSSAGVYpmtf9UGhSSUrA0
dqkWos5SWXTYWSO4+4FkuIaGIZF+LPE6Esm+PxyzXmbm5rBiVnxxcxT1X4Pd5O+C
0y+wng0TSKHMPbwUyoGktuR7ec9P+/WhcyBwjS0EnMnBeBL+Bdw=
=o76k
-----END PGP SIGNATURE-----
Merge 4.9.183 into android-4.9-q
Changes in 4.9.183
rapidio: fix a NULL pointer dereference when create_workqueue() fails
fs/fat/file.c: issue flush after the writeback of FAT
sysctl: return -EINVAL if val violates minmax
ipc: prevent lockup on alloc_msg and free_msg
ARM: prevent tracing IPI_CPU_BACKTRACE
hugetlbfs: on restore reserve error path retain subpool reservation
mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE
mm/cma.c: fix crash on CMA allocation if bitmap allocation fails
mm/cma_debug.c: fix the break condition in cma_maxchunk_get()
mm/slab.c: fix an infinite loop in leaks_show()
kernel/sys.c: prctl: fix false positive in validate_prctl_map()
drivers: thermal: tsens: Don't print error message on -EPROBE_DEFER
mfd: tps65912-spi: Add missing of table registration
mfd: intel-lpss: Set the device in reset state when init
mfd: twl6040: Fix device init errors for ACCCTL register
perf/x86/intel: Allow PEBS multi-entry in watermark mode
drm/bridge: adv7511: Fix low refresh rate selection
objtool: Don't use ignore flag for fake jumps
pwm: meson: Use the spin-lock only to protect register modifications
ntp: Allow TAI-UTC offset to be set to zero
f2fs: fix to avoid panic in do_recover_data()
f2fs: fix to clear dirty inode in error path of f2fs_iget()
f2fs: fix to do sanity check on valid block count of segment
configfs: fix possible use-after-free in configfs_register_group
uml: fix a boot splat wrt use of cpu_all_mask
watchdog: imx2_wdt: Fix set_timeout for big timeout values
watchdog: fix compile time error of pretimeout governors
iommu/vt-d: Set intel_iommu_gfx_mapped correctly
ALSA: hda - Register irq handler after the chip initialization
nvmem: core: fix read buffer in place
fuse: retrieve: cap requested size to negotiated max_write
nfsd: allow fh_want_write to be called twice
x86/PCI: Fix PCI IRQ routing table memory leak
platform/chrome: cros_ec_proto: check for NULL transfer function
soc: mediatek: pwrap: Zero initialize rdata in pwrap_init_cipher
clk: rockchip: Turn on "aclk_dmac1" for suspend on rk3288
ARM: dts: imx6sx: Specify IMX6SX_CLK_IPG as "ahb" clock to SDMA
ARM: dts: imx7d: Specify IMX7D_CLK_IPG as "ipg" clock to SDMA
ARM: dts: imx6ul: Specify IMX6UL_CLK_IPG as "ipg" clock to SDMA
ARM: dts: imx6sx: Specify IMX6SX_CLK_IPG as "ipg" clock to SDMA
ARM: dts: imx6qdl: Specify IMX6QDL_CLK_IPG as "ipg" clock to SDMA
PCI: rpadlpar: Fix leaked device_node references in add/remove paths
platform/x86: intel_pmc_ipc: adding error handling
PCI: rcar: Fix a potential NULL pointer dereference
PCI: rcar: Fix 64bit MSI message address handling
video: hgafb: fix potential NULL pointer dereference
video: imsttfb: fix potential NULL pointer dereferences
PCI: xilinx: Check for __get_free_pages() failure
gpio: gpio-omap: add check for off wake capable gpios
dmaengine: idma64: Use actual device for DMA transfers
pwm: tiehrpwm: Update shadow register for disabling PWMs
ARM: dts: exynos: Always enable necessary APIO_1V8 and ABB_1V8 regulators on Arndale Octa
pwm: Fix deadlock warning when removing PWM device
ARM: exynos: Fix undefined instruction during Exynos5422 resume
Revert "Bluetooth: Align minimum encryption key size for LE and BR/EDR connections"
ALSA: seq: Cover unsubscribe_port() in list_mutex
ALSA: oxfw: allow PCM capture for Stanton SCS.1m
libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
fs/ocfs2: fix race in ocfs2_dentry_attach_lock()
signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO
ptrace: restore smp_rmb() in __ptrace_may_access()
media: v4l2-ioctl: clear fields in s_parm
i2c: acorn: fix i2c warning
bcache: fix stack corruption by PRECEDING_KEY()
cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
ASoC: cs42xx8: Add regcache mask dirty
ASoC: fsl_asrc: Fix the issue about unsupported rate
x86/uaccess, kcov: Disable stack protector
ALSA: seq: Protect in-kernel ioctl calls with mutex
ALSA: seq: Fix race of get-subscription call vs port-delete ioctls
Revert "ALSA: seq: Protect in-kernel ioctl calls with mutex"
Drivers: misc: fix out-of-bounds access in function param_set_kgdbts_var
scsi: lpfc: add check for loss of ndlp when sending RRQ
arm64/mm: Inhibit huge-vmap with ptdump
scsi: bnx2fc: fix incorrect cast to u64 on shift operation
selftests/timers: Add missing fflush(stdout) calls
usbnet: ipheth: fix racing condition
KVM: x86/pmu: do not mask the value that is written to fixed PMUs
KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION
drm/vmwgfx: integer underflow in vmw_cmd_dx_set_shader() leading to an invalid read
drm/vmwgfx: NULL pointer dereference from vmw_cmd_dx_view_define()
usb: dwc2: Fix DMA cache alignment issues
USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
USB: usb-storage: Add new ID to ums-realtek
USB: serial: pl2303: add Allied Telesis VT-Kit3
USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
USB: serial: option: add Telit 0x1260 and 0x1261 compositions
rtc: pcf8523: don't return invalid date when battery is low
ax25: fix inconsistent lock state in ax25_destroy_timer
be2net: Fix number of Rx queues used for flow hashing
ipv6: flowlabel: fl6_sock_lookup() must use atomic_inc_not_zero
lapb: fixed leak of control-blocks.
neigh: fix use-after-free read in pneigh_get_next
sunhv: Fix device naming inconsistency between sunhv_console and sunhv_reg
Revert "staging: vc04_services: prevent integer overflow in create_pagelist()"
perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints
selftests: netfilter: missing error check when setting up veth interface
mISDN: make sure device name is NUL terminated
x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor
perf/ring_buffer: Fix exposing a temporarily decreased data_head
perf/ring_buffer: Add ordering to rb->nest increment
gpio: fix gpio-adp5588 build errors
net: tulip: de4x5: Drop redundant MODULE_DEVICE_TABLE()
i2c: dev: fix potential memory leak in i2cdev_ioctl_rdwr
configfs: Fix use-after-free when accessing sd->s_dentry
perf data: Fix 'strncat may truncate' build failure with recent gcc
perf record: Fix s390 missing module symbol and warning for non-root users
ia64: fix build errors by exporting paddr_to_nid()
KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list
KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu
net: sh_eth: fix mdio access in sh_eth_close() for R-Car Gen2 and RZ/A1 SoCs
scsi: libcxgbi: add a check for NULL pointer in cxgbi_check_route()
scsi: smartpqi: properly set both the DMA mask and the coherent DMA mask
scsi: libsas: delete sas port if expander discover failed
mlxsw: spectrum: Prevent force of 56G
Abort file_remove_privs() for non-reg. files
Linux 4.9.183
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlsOO78ACgkQONu9yGCS
aT4OfBAAk3FbbltZaC/dvdJalMB1D95dL2qQcQkY45eVKpeVzmiSboEdFdqNt2oN
PtOs6tD/ao7y25CWCvYGoX3jrHUZflDv9akTDPsFyyxkbSpEZ36jTc0W6SpwZ9ez
IOtRTaUk6oXhkCr8KQRgjFQy0Ip+QRdpMucMl+LSz+HEoXgMUDqHFouznJhsjbd9
g5uVHWMvbeEWVEGlT3ch2fuO6lRoNxdO9V/DhQxviiIdGhFF+T7nZuAbH/OgXFr9
eX6UiWVGDLsjP6iSw16pVZT0Lf2RbG5oWlmGtoaj2fPf8eWeF2aJcCT3JtqTN7Bj
Bd4//X+c+GgYMZM/411+lSgPLQBhiLqPYEzubabxEm17KCi2hmiac6FN6TEVv8ZB
2I3N2APG94qExPZmYdHTgIM7SDP+vJP7hpJ63Ez2AAjgNHJVvNaLZS6ZznILtTxu
6It0PfewHgt6NWqSWutc/+gNIbvjyztYgdxRGSClYd4HwezNA8e6nrnpNoS8Pfn1
RAMg4ercWMm0hKt0ELr+pCOv8Izlo+tMWIKba/GYcX8+ab3XlTH1hWHw/gLV5ikO
HAv3CRdwr8NU3mkuNwQHJNcEXvThmsOhldH5AU1+503r7Qsjb14wI2ASF4SDYtRt
TdbolMWoVElSqgC+4AKo+d5BrC30vAmQ90bAgLF5P6/vNPSNMZc=
=nLEw
-----END PGP SIGNATURE-----
Merge 4.9.104 into android-4.9
Changes in 4.9.104
MIPS: c-r4k: Fix data corruption related to cache coherence
MIPS: ptrace: Expose FIR register through FP regset
MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
affs_lookup(): close a race with affs_remove_link()
aio: fix io_destroy(2) vs. lookup_ioctx() race
ALSA: timer: Fix pause event notification
do d_instantiate/unlock_new_inode combinations safely
mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
libata: Blacklist some Sandisk SSDs for NCQ
libata: blacklist Micron 500IT SSD with MU01 firmware
xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
IB/hfi1: Use after free race condition in send context error path
Revert "ipc/shm: Fix shmat mmap nil-page protection"
ipc/shm: fix shmat() nil address after round-down when remapping
kasan: fix memory hotplug during boot
kernel/sys.c: fix potential Spectre v1 issue
kernel/signal.c: avoid undefined behaviour in kill_something_info
KVM/VMX: Expose SSBD properly to guests
KVM: s390: vsie: fix < 8k check for the itdba
KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
firewire-ohci: work around oversized DMA reads on JMicron controllers
x86/tsc: Allow TSC calibration without PIT
NFSv4: always set NFS_LOCK_LOST when a lock is lost.
ALSA: hda - Use IS_REACHABLE() for dependency on input
kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
PCI: Add function 1 DMA alias quirk for Marvell 9128
Input: psmouse - fix Synaptics detection when protocol is disabled
i40iw: Zero-out consumer key on allocate stag for FMR
tools lib traceevent: Simplify pointer print logic and fix %pF
perf callchain: Fix attr.sample_max_stack setting
tools lib traceevent: Fix get_field_str() for dynamic strings
perf record: Fix failed memory allocation for get_cpuid_str
iommu/vt-d: Use domain instead of cache fetching
dm thin: fix documentation relative to low water mark threshold
net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
nfs: Do not convert nfs_idmap_cache_timeout to jiffies
watchdog: sp5100_tco: Fix watchdog disable bit
kconfig: Don't leak main menus during parsing
kconfig: Fix automatic menu creation mem leak
kconfig: Fix expr_free() E_NOT leak
mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
ipmi/powernv: Fix error return code in ipmi_powernv_probe()
Btrfs: set plug for fsync
btrfs: Fix out of bounds access in btrfs_search_slot
Btrfs: fix scrub to repair raid6 corruption
btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
fm10k: fix "failed to kill vid" message for VF
device property: Define type of PROPERTY_ENRTY_*() macros
jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/numa: Ensure nodes initialized for hotplug
RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
ntb_transport: Fix bug with max_mw_size parameter
gianfar: prevent integer wrapping in the rx handler
tcp_nv: fix potential integer overflow in tcpnv_acked
kvm: Map PFN-type memory regions as writable (if possible)
ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
ocfs2: return error when we attempt to access a dirty bh in jbd2
mm/mempolicy: fix the check of nodemask from user
mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
asm-generic: provide generic_pmdp_establish()
sparc64: update pmdp_invalidate() to return old pmd value
mm: thp: use down_read_trylock() in khugepaged to avoid long block
mm: pin address_space before dereferencing it while isolating an LRU page
mm/fadvise: discard partial page if endbyte is also EOF
openvswitch: Remove padding from packet before L3+ conntrack processing
IB/ipoib: Fix for potential no-carrier state
drm/nouveau/pmu/fuc: don't use movw directly anymore
netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure
x86/power: Fix swsusp_arch_resume prototype
firmware: dmi_scan: Fix handling of empty DMI strings
ACPI: processor_perflib: Do not send _PPC change notification if not ready
ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
MIPS: generic: Fix machine compatible matching
MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
xen-netfront: Fix race between device setup and open
xen/grant-table: Use put_page instead of free_page
RDS: IB: Fix null pointer issue
arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
proc: fix /proc/*/map_files lookup
cifs: silence compiler warnings showing up with gcc-8.0.0
bcache: properly set task state in bch_writeback_thread()
bcache: fix for allocator and register thread race
bcache: fix for data collapse after re-attaching an attached device
bcache: return attach error when no cache set exist
tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
bpf: fix rlimit in reuseport net selftest
vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
locking/qspinlock: Ensure node->count is updated before initialising node
irqchip/gic-v3: Ignore disabled ITS nodes
cpumask: Make for_each_cpu_wrap() available on UP as well
irqchip/gic-v3: Change pr_debug message to pr_devel
ARC: Fix malformed ARC_EMUL_UNALIGNED default
ptr_ring: prevent integer overflow when calculating size
libata: Fix compile warning with ATA_DEBUG enabled
selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m
selftests: memfd: add config fragment for fuse
ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
ARM: OMAP3: Fix prm wake interrupt for resume
ARM: OMAP1: clock: Fix debugfs_create_*() usage
ibmvnic: Free RX socket buffer in case of adapter error
iwlwifi: mvm: fix security bug in PN checking
iwlwifi: mvm: always init rs with 20mhz bandwidth rates
NFC: llcp: Limit size of SDP URI
rxrpc: Work around usercopy check
mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
mac80211: fix a possible leak of station stats
mac80211: fix calling sleeping function in atomic context
mac80211: Do not disconnect on invalid operating class
md raid10: fix NULL deference in handle_write_completed()
drm/exynos: g2d: use monotonic timestamps
drm/exynos: fix comparison to bitshift when dealing with a mask
locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
md: raid5: avoid string overflow warning
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
s390/cio: fix ccw_device_start_timeout API
s390/cio: fix return code after missing interrupt
s390/cio: clear timer when terminating driver I/O
PKCS#7: fix direct verification of SignerInfo signature
ARM: OMAP: Fix dmtimer init for omap1
smsc75xx: fix smsc75xx_set_features()
regulatory: add NUL to request alpha2
integrity/security: fix digsig.c build error with header file
locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
mac80211: drop frames with unexpected DS bits from fast-rx to slow path
arm64: fix unwind_frame() for filtered out fn for function graph tracing
macvlan: fix use-after-free in macvlan_common_newlink()
kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
fs: dcache: Use READ_ONCE when accessing i_dir_seq
md: fix a potential deadlock of raid5/raid10 reshape
md/raid1: fix NULL pointer dereference
batman-adv: fix packet checksum in receive path
batman-adv: invalidate checksum on fragment reassembly
netfilter: ebtables: convert BUG_ONs to WARN_ONs
batman-adv: Ignore invalid batadv_iv_gw during netlink send
batman-adv: Ignore invalid batadv_v_gw during netlink send
batman-adv: Fix netlink dumping of BLA claims
batman-adv: Fix netlink dumping of BLA backbones
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
clocksource/drivers/fsl_ftm_timer: Fix error return checking
ceph: fix dentry leak when failing to init debugfs
ARM: orion5x: Revert commit 4904dbda41.
qrtr: add MODULE_ALIAS macro to smd
r8152: fix tx packets accounting
virtio-gpu: fix ioctl and expose the fixed status to userspace.
dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
bcache: fix kcrashes with fio in RAID5 backend dev
ip6_tunnel: fix IFLA_MTU ignored on NEWLINK
sit: fix IFLA_MTU ignored on NEWLINK
ARM: dts: NSP: Fix amount of RAM on BCM958625HR
powerpc/boot: Fix random libfdt related build errors
gianfar: Fix Rx byte accounting for ndev stats
net/tcp/illinois: replace broken algorithm reference link
nvmet: fix PSDT field check in command format
xen/pirq: fix error path cleanup when binding MSIs
drm/sun4i: Fix dclk_set_phase
Btrfs: send, fix issuing write op when processing hole in no data mode
selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
iwlwifi: mvm: fix TX of CCMP 256
watchdog: f71808e_wdt: Fix magic close handling
watchdog: sbsa: use 32-bit read for WCV
batman-adv: Fix multicast packet loss with a single WANT_ALL_IPV4/6 flag
e1000e: Fix check_for_link return value with autoneg off
e1000e: allocate ring descriptors with dma_zalloc_coherent
ia64/err-inject: Use get_user_pages_fast()
RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
RDMA/qedr: Fix iWARP write and send with immediate
IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
fsl/fman: avoid sleeping in atomic context while adding an address
net: qcom/emac: Use proper free methods during TX
net: smsc911x: Fix unload crash when link is up
IB/core: Fix possible crash to access NULL netdev
xen: xenbus: use put_device() instead of kfree()
arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery
dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
netfilter: ebtables: fix erroneous reject of last rule
bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
workqueue: use put_device() instead of kfree()
ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
sunvnet: does not support GSO for sctp
drm/imx: move arming of the vblank event to atomic_flush
net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
batman-adv: fix header size check in batadv_dbg_arp()
batman-adv: Fix skbuff rcsum on packet reroute
vti4: Don't count header length twice on tunnel setup
vti4: Don't override MTU passed on link creation via IFLA_MTU
perf/cgroup: Fix child event counting bug
brcmfmac: Fix check for ISO3166 code
kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
RDMA/ucma: Correct option size check using optlen
RDMA/qedr: fix QP's ack timeout configuration
RDMA/qedr: Fix rc initialization on CNQ allocation failure
mm/mempolicy.c: avoid use uninitialized preferred_node
mm, thp: do not cause memcg oom for thp
selftests: ftrace: Add probe event argument syntax testcase
selftests: ftrace: Add a testcase for string type with kprobe_event
selftests: ftrace: Add a testcase for probepoint
batman-adv: fix multicast-via-unicast transmission with AP isolation
batman-adv: fix packet loss for broadcasted DHCP packets to a server
ARM: 8748/1: mm: Define vdso_start, vdso_end as array
net: qmi_wwan: add BroadMobi BM806U 2020:2033
perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
llc: properly handle dev_queue_xmit() return value
builddeb: Fix header package regarding dtc source links
mm/kmemleak.c: wait for scan completion before disabling free
net: Fix untag for vlan packets without ethernet header
net: mvneta: fix enable of all initialized RXQs
sh: fix debug trap failure to process signals before return to user
nvme: don't send keep-alives to the discovery controller
x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
swap: divide-by-zero when zero length swap file on ssd
sr: get/drop reference to device in revalidate and check_events
Force log to disk before reading the AGF during a fstrim
cpufreq: CPPC: Initialize shared perf capabilities of CPUs
dp83640: Ensure against premature access to PHY registers after reset
mm/ksm: fix interaction with THP
mm: fix races between address_space dereference and free in page_evicatable
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
btrfs: Fix possible softlock on single core machines
ocfs2/dlm: don't handle migrate lockres if already in shutdown
sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
KVM: VMX: raise internal error for exception during invalid protected mode state
fscache: Fix hanging wait on page discarded by writeback
sparc64: Make atomic_xchg() an inline function rather than a macro.
net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
btrfs: tests/qgroup: Fix wrong tree backref level
Btrfs: fix copy_items() return value when logging an inode
btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
rxrpc: Fix Tx ring annotation after initial Tx failure
rxrpc: Don't treat call aborts as conn aborts
xen/acpi: off by one in read_acpi_id()
drivers: macintosh: rack-meter: really fix bogus memsets
ACPI: acpi_pad: Fix memory leak in power saving threads
powerpc/mpic: Check if cpu_possible() in mpic_physmask()
m68k: set dma and coherent masks for platform FEC ethernets
parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
hwmon: (nct6775) Fix writing pwmX_mode
powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
powerpc/perf: Fix kernel address leak via sampling registers
tools/thermal: tmon: fix for segfault
selftests: Print the test we're running to /dev/kmsg
net/mlx5: Protect from command bit overflow
ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
cxgb4: Setup FW queues before registering netdev
ima: Fallback to the builtin hash algorithm
virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
arm: dts: socfpga: fix GIC PPI warning
cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
zorro: Set up z->dev.dma_mask for the DMA API
bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
ACPICA: Events: add a return on failure from acpi_hw_register_read
ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
cxgb4: Fix queue free path of ULD drivers
i2c: mv64xxx: Apply errata delay only in standard mode
KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
perf top: Fix top.call-graph config option reading
perf stat: Fix core dump when flag T is used
IB/core: Honor port_num while resolving GID for IB link layer
regulator: gpio: Fix some error handling paths in 'gpio_regulator_probe()'
spi: bcm-qspi: fIX some error handling paths
MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
PCI: Restore config space on runtime resume despite being unbound
ipmi_ssif: Fix kernel panic at msg_done_handler
powerpc: Add missing prototype for arch_irq_work_raise()
f2fs: fix to check extent cache in f2fs_drop_extent_tree
perf/core: Fix perf_output_read_group()
drm/panel: simple: Fix the bus format for the Ontat panel
hwmon: (pmbus/max8688) Accept negative page register values
hwmon: (pmbus/adm1275) Accept negative page register values
perf/x86/intel: Properly save/restore the PMU state in the NMI handler
cdrom: do not call check_disk_change() inside cdrom_open()
perf/x86/intel: Fix large period handling on Broadwell CPUs
perf/x86/intel: Fix event update for auto-reload
arm64: dts: qcom: Fix SPI5 config on MSM8996
soc: qcom: wcnss_ctrl: Fix increment in NV upload
gfs2: Fix fallocate chunk size
x86/devicetree: Initialize device tree before using it
x86/devicetree: Fix device IRQ settings in DT
ALSA: vmaster: Propagate slave error
dmaengine: pl330: fix a race condition in case of threaded irqs
dmaengine: rcar-dmac: Check the done lists in rcar_dmac_chan_get_residue()
enic: enable rq before updating rq descriptors
hwrng: stm32 - add reset during probe
dmaengine: qcom: bam_dma: get num-channels and num-ees from dt
net: stmmac: ensure that the device has released ownership before reading data
net: stmmac: ensure that the MSS desc is the last desc to set the own bit
cpufreq: Reorder cpufreq_online() error code path
PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
udf: Provide saner default for invalid uid / gid
ARM: dts: bcm283x: Fix probing of bcm2835-i2s
audit: return on memory error to avoid null pointer dereference
rcu: Call touch_nmi_watchdog() while printing stall warnings
pinctrl: sh-pfc: r8a7796: Fix MOD_SEL register pin assignment for SSI pins group
MIPS: Octeon: Fix logging messages with spurious periods after newlines
drm/rockchip: Respect page offset for PRIME mmap calls
x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
perf tests: Use arch__compare_symbol_names to compare symbols
perf report: Fix memory corruption in --branch-history mode --branch-history
selftests/net: fixes psock_fanout eBPF test case
netlabel: If PF_INET6, check sk_buff ip header version
regmap: Correct comparison in regmap_cached
ARM: dts: imx7d: cl-som-imx7: fix pinctrl_enet
ARM: dts: porter: Fix HDMI output routing
regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
pinctrl: msm: Use dynamic GPIO numbering
kdb: make "mdr" command repeat
Linux 4.9.104
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8f89c007b6dec16a1793cb88de88fcc02117bbbc upstream.
shmat()'s SHM_REMAP option forbids passing a nil address for; this is in
fact the very first thing we check for. Andrea reported that for
SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check,
but we need to check again if the address was rounded down to nil. As
of this patch, such cases will return -EINVAL.
Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a73ab244f0dad8fffb3291b905f73e2d3eaa7c00 upstream.
Patch series "ipc/shm: shmat() fixes around nil-page".
These patches fix two issues reported[1] a while back by Joe and Andrea
around how shmat(2) behaves with nil-page.
The first reverts a commit that it was incorrectly thought that mapping
nil-page (address=0) was a no no with MAP_FIXED. This is not the case,
with the exception of SHM_REMAP; which is address in the second patch.
I chose two patches because it is easier to backport and it explicitly
reverts bogus behaviour. Both patches ought to be in -stable and ltp
testcases need updated (the added testcase around the cve can be
modified to just test for SHM_RND|SHM_REMAP).
[1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805
This patch (of 2):
Commit 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
worked on the idea that we should not be mapping as root addr=0 and
MAP_FIXED. However, it was reported that this scenario is in fact
valid, thus making the patch both bogus and breaks userspace as well.
For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem
initialization[1].
[1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347
Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net
Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlre3foACgkQONu9yGCS
aT4ygw/7BxK85/7my26TCcF3NlNfw8x9B1AuEHcjQdUFmSFip9fbolC1VVFQyzWV
1Z6sK5LKrFU/234A4LVW+6oFVnUi8VvX5RnWNqRLj/xpPyv/L5PsG4A7VwQLZDCj
yn6OrIkQ2czv3oio9R8bz1DFxrPsxoCdrMzV1xR/tbWQJscSRSomBSNstZ63lGQP
FleS//hG21092lM0EDPQRfc9/xnwJFMKJoCwBrf6SUR7CK2ZdYp+w+HoeP5BSA49
BCDdMU6wi63gTdfGFJa76VcKuFxRRic3F9VVJwOuYPCfCpgNz7GiUaGXirNCRSxD
F0VveFMx9lOrxMCyDDEoj3lRaZ4FrhwFOF+Q0RK4/p8XH8Pr9emvt4beOeYNk3ev
dFvCI/zA3zmDj8ppB+ZnKq9GTRSuNZ5sw4q7AyFu6fm0mSMV84ZRHgCrVbJkis3i
VDnu6V+Nd9ZYP6rR6AegtQwFM3OReVVJG4ZpB17PZOebLdM+n4DsmqDCgvhFboVP
WowtlR13GIn6ER7hiHfOlwv3CtTUspRrAZi73F8aEA7gdoYzGc8/9Q1fA738WiI4
46utL18qyfjzwbwmI9dtFx1SLkrobSJ+X88PLquLYNUleoxbLx/2u4+v+gm0P4xo
yQGPjIv+kms+rbwPR3ogMKqC2GrYYJs4uHBEPJboJWzYVoPAaEs=
=yoJG
-----END PGP SIGNATURE-----
Merge 4.9.96 into android-4.9
Changes in 4.9.96
tty: make n_tty_read() always abort if hangup is in progress
ubifs: Check ubifs_wbuf_sync() return code
ubi: fastmap: Don't flush fastmap work on detach
ubi: Fix error for write access
ubi: Reject MLC NAND
fs/reiserfs/journal.c: add missing resierfs_warning() arg
resource: fix integer overflow at reallocation
ipc/shm: fix use-after-free of shm file via remap_file_pages()
mm, slab: reschedule cache_reap() on the same CPU
usb: musb: gadget: misplaced out of bounds check
usb: gadget: udc: core: update usb_ep_queue() documentation
ARM: dts: at91: at91sam9g25: fix mux-mask pinctrl property
ARM: dts: exynos: Fix IOMMU support for GScaler devices on Exynos5250
ARM: dts: at91: sama5d4: fix pinctrl compatible string
spi: Fix scatterlist elements size in spi_map_buf
xen-netfront: Fix hang on device removal
regmap: Fix reversed bounds check in regmap_raw_write()
ACPI / video: Add quirk to force acpi-video backlight on Samsung 670Z5E
ACPI / hotplug / PCI: Check presence of slot itself in get_slot_status()
USB: gadget: f_midi: fixing a possible double-free in f_midi
USB:fix USB3 devices behind USB3 hubs not resuming at hibernate thaw
usb: dwc3: pci: Properly cleanup resource
smb3: Fix root directory when server returns inode number of zero
HID: i2c-hid: fix size check and type usage
powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
powerpc/64: Fix smp_wmb barrier definition use use lwsync consistently
powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
HID: Fix hid_report_len usage
HID: core: Fix size as type u32
ASoC: ssm2602: Replace reg_default_raw with reg_default
thunderbolt: Resume control channel after hibernation image is created
irqchip/gic: Take lock when updating irq type
random: use a tighter cap in credit_entropy_bits_safe()
jbd2: if the journal is aborted then don't allow update of the log tail
ext4: don't update checksum of new initialized bitmaps
ext4: protect i_disksize update by i_data_sem in direct write path
ext4: fail ext4_iget for root directory if unallocated
RDMA/ucma: Don't allow setting RDMA_OPTION_IB_PATH without an RDMA device
RDMA/rxe: Fix an out-of-bounds read
ALSA: pcm: Fix UAF at PCM release via PCM timer access
IB/srp: Fix srp_abort()
IB/srp: Fix completion vector assignment algorithm
dmaengine: at_xdmac: fix rare residue corruption
libnvdimm, namespace: use a safe lookup for dimm device name
nfit, address-range-scrub: fix scrub in-progress reporting
um: Compile with modern headers
um: Use POSIX ucontext_t instead of struct ucontext
iommu/vt-d: Fix a potential memory leak
mmc: jz4740: Fix race condition in IRQ mask update
clk: mvebu: armada-38x: add support for 1866MHz variants
clk: mvebu: armada-38x: add support for missing clocks
clk: fix false-positive Wmaybe-uninitialized warning
clk: bcm2835: De-assert/assert PLL reset signal when appropriate
pwm: rcar: Fix a condition to prevent mismatch value setting to duty
thermal: imx: Fix race condition in imx_thermal_probe()
dt-bindings: clock: mediatek: add binding for fixed-factor clock axisel_d4
watchdog: f71808e_wdt: Fix WD_EN register read
vfio/pci: Virtualize Maximum Read Request Size
ALSA: pcm: Use ERESTARTSYS instead of EINTR in OSS emulation
ALSA: pcm: Avoid potential races between OSS ioctls and read/write
ALSA: pcm: Return -EBUSY for OSS ioctls changing busy streams
ALSA: pcm: Fix mutex unbalance in OSS emulation ioctls
ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
ext4: don't allow r/w mounts if metadata blocks overlap the superblock
drm/amdgpu: Add an ATPX quirk for hybrid laptop
drm/amdgpu: Fix always_valid bos multiple LRU insertions.
drm/amdgpu: Fix PCIe lane width calculation
drm/rockchip: Clear all interrupts before requesting the IRQ
drm/radeon: Fix PCIe lane width calculation
ALSA: line6: Use correct endpoint type for midi output
ALSA: rawmidi: Fix missing input substream checks in compat ioctls
ALSA: hda - New VIA controller suppor no-snoop path
random: fix crng_ready() test
random: crng_reseed() should lock the crng instance that it is modifying
random: add new ioctl RNDRESEEDCRNG
HID: hidraw: Fix crash on HIDIOCGFEATURE with a destroyed device
MIPS: uaccess: Add micromips clobbers to bzero invocation
MIPS: memset.S: EVA & fault support for small_memset
MIPS: memset.S: Fix return of __clear_user from Lpartial_fixup
MIPS: memset.S: Fix clobber of v1 in last_fixup
powerpc/eeh: Fix enabling bridge MMIO windows
powerpc/lib: Fix off-by-one in alternate feature patching
udf: Fix leak of UTF-16 surrogates into encoded strings
jffs2_kill_sb(): deal with failed allocations
hypfs_kill_super(): deal with failed allocations
orangefs_kill_sb(): deal with allocation failures
rpc_pipefs: fix double-dput()
Don't leak MNT_INTERNAL away from internal mounts
autofs: mount point create should honour passed in mode
mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
fanotify: fix logic of events on child
writeback: safer lock nesting
block/mq: fix potential deadlock during cpu hotplug
Linux 4.9.96
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3f05317d9889ab75c7190dcd39491d2a97921984 upstream.
syzbot reported a use-after-free of shm_file_data(file)->file->f_op in
shm_get_unmapped_area(), called via sys_remap_file_pages().
Unfortunately it couldn't generate a reproducer, but I found a bug which
I think caused it. When remap_file_pages() is passed a full System V
shared memory segment, the memory is first unmapped, then a new map is
created using the ->vm_file. Between these steps, the shm ID can be
removed and reused for a new shm segment. But, shm_mmap() only checks
whether the ID is currently valid before calling the underlying file's
->mmap(); it doesn't check whether it was reused. Thus it can use the
wrong underlying file, one that was already freed.
Fix this by making the "outer" shm file (the one that gets put in
->vm_file) hold a reference to the real shm file, and by making
__shm_open() require that the file associated with the shm ID matches
the one associated with the "outer" file.
Taking the reference to the real shm file is needed to fully solve the
problem, since otherwise sfd->file could point to a freed file, which
then could be reallocated for the reused shm ID, causing the wrong shm
segment to be mapped (and without the required permission checks).
Commit 1ac0b6dec6 ("ipc/shm: handle removed segments gracefully in
shm_mmap()") almost fixed this bug, but it didn't go far enough because
it didn't consider the case where the shm ID is reused.
The following program usually reproduces this bug:
#include <stdlib.h>
#include <sys/shm.h>
#include <sys/syscall.h>
#include <unistd.h>
int main()
{
int is_parent = (fork() != 0);
srand(getpid());
for (;;) {
int id = shmget(0xF00F, 4096, IPC_CREAT|0700);
if (is_parent) {
void *addr = shmat(id, NULL, 0);
usleep(rand() % 50);
while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0));
} else {
usleep(rand() % 50);
shmctl(id, IPC_RMID, NULL);
}
}
}
It causes the following NULL pointer dereference due to a 'struct file'
being used while it's being freed. (I couldn't actually get a KASAN
use-after-free splat like in the syzbot report. But I think it's
possible with this bug; it would just take a more extraordinary race...)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16a7c95 #189
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
RIP: 0010:d_inode include/linux/dcache.h:519 [inline]
RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724
[...]
Call Trace:
file_accessed include/linux/fs.h:2063 [inline]
shmem_mmap+0x25/0x40 mm/shmem.c:2149
call_mmap include/linux/fs.h:1789 [inline]
shm_mmap+0x34/0x80 ipc/shm.c:465
call_mmap include/linux/fs.h:1789 [inline]
mmap_region+0x309/0x5b0 mm/mmap.c:1712
do_mmap+0x294/0x4a0 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2235 [inline]
SYSC_remap_file_pages mm/mmap.c:2853 [inline]
SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769
do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ebiggers@google.com: add comment]
Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com
Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com
Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com
Fixes: c8d78c1823 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3d942ee079b917b24e2a0c5f18d35ac8ec9fee48 upstream.
If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and split
the underlying vma such that it is not huge page aligned. This will
untimately result in the following BUG:
kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E 4.15.0-10-generic #11-Ubuntu
NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E (4.15.0-10-generic)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 20040000
CFAR: c00000000036ee44 SOFTE: 1
NIP __unmap_hugepage_range+0xa4/0x760
LR __unmap_hugepage_range_final+0x28/0x50
Call Trace:
0x7115e4e00000 (unreliable)
__unmap_hugepage_range_final+0x28/0x50
unmap_single_vma+0x11c/0x190
unmap_vmas+0x94/0x140
exit_mmap+0x9c/0x1d0
mmput+0xa8/0x1d0
do_exit+0x360/0xc80
do_group_exit+0x60/0x100
SyS_exit_group+0x24/0x30
system_call+0x58/0x6c
---[ end trace ee88f958a1c62605 ]---
This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct"). A split function was
added to vm_operations_struct to determine if a mapping can be split.
This was mostly for device-dax and hugetlbfs mappings which have
specific alignment constraints.
Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops. shm_vm_ops functions will call back to the
original vm_ops if needed. Add such a split function to shm_vm_ops.
Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlpxrs0ACgkQONu9yGCS
aT7Yxw/+NmM+Yh70QOpW02RCFHCB+F9tnuQXNlLfEoDqlujMS/UNuVMx39gQXDaU
7T/JOmnVtp9WQL9RLgAegSc3ayIQELzvtKjDLSo/hzxYsOmr0WlN2CVTGT7hn9JH
IQdf8cR2r4FZ/XcxQLpSsRabwhqfeoND1TTm5LUNB1Ii05hUU6/s0k1rQguabuo5
vi0BzSh7v/URxlLyL0m4ZVqovWOASS5/qSv7wazd4i/bSqH3g7VXLNu93iyOB8ih
XXpeTjtfAwJ5kUXBWZPNazUzpQ7b56sQPtsvN6CrvTv8jKJ+FH+7S4d50Vgbu51X
YBC36yypYPXunMXB9iiLYkyb8jraKr12BRLXQyl3TlNANoYjBiT/a2XmHDMA1VbL
+ydbswbmcAvZ1fuAekVY+HIogEroWzN7FbhdUgV12nm7/4WfxpBTZW+M8Es/Stuh
2ACT9TWopbhwRFUhFT5kyDTTnK++NsshGzUXbR9qPQzhdaqe76RPfJ6uHV69MXxP
gE9o3NQ3fUieJO5nQj54atErX+sJ4987DnGoWrg+Ye9Svsq1oVw0K1e44VLBp08v
iZk2lvNjUWnkDGQOhsPEYCLq6KPjXkaqV4OZVS6tGxGEZ4QQJjbnYk+kPeKjrKIA
iP3nfaLJ4HQc2kvwEI41HEJGWyGUlhdrnDqfpxpWgGXOStGJrq0=
=RJ2h
-----END PGP SIGNATURE-----
Merge 4.9.79 into android-4.9
Changes in 4.9.79
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
orangefs: use list_for_each_entry_safe in purge_waiting_ops
orangefs: initialize op on loop restart in orangefs_devreq_read
usbip: prevent vhci_hcd driver from leaking a socket pointer address
usbip: Fix implicit fallthrough warning
usbip: Fix potential format overflow in userspace tools
can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2
Prevent timer value 0 for MWAITX
drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
drivers: base: cacheinfo: fix boot error message when acpi is enabled
mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack
hwpoison, memcg: forcibly uncharge LRU pages
cma: fix calculation of aligned offset
mm, page_alloc: fix potential false positive in __zone_watermark_ok
ipc: msg, make msgrcv work with LONG_MIN
ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
ACPICA: Namespace: fix operand cache leak
netfilter: nfnetlink_cthelper: Add missing permission checks
netfilter: xt_osf: Add missing permission checks
reiserfs: fix race in prealloc discard
reiserfs: don't preallocate blocks for extended attributes
fs/fcntl: f_setown, avoid undefined behaviour
scsi: libiscsi: fix shifting of DID_REQUEUE host byte
Revert "module: Add retpoline tag to VERMAGIC"
mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
Input: trackpoint - force 3 buttons if 0 button is reported
orangefs: fix deadlock; do not write i_size in read_iter
um: link vmlinux with -no-pie
vsyscall: Fix permissions for emulate mode with KAISER/PTI
eventpoll.h: add missing epoll event masks
dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
ipv6: fix udpv6 sendmsg crash caused by too small MTU
ipv6: ip6_make_skb() needs to clear cork.base.dst
lan78xx: Fix failure in USB Full Speed
net: igmp: fix source address check for IGMPv3 reports
net: qdisc_pkt_len_init() should be more robust
net: tcp: close sock if net namespace is exiting
pppoe: take ->needed_headroom of lower device into account on xmit
r8169: fix memory corruption on retrieval of hardware statistics.
sctp: do not allow the v4 socket to bind a v4mapped v6 address
sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
tipc: fix a memory leak in tipc_nl_node_get_link()
vmxnet3: repair memory leak
net: Allow neigh contructor functions ability to modify the primary_key
ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
ppp: unlock all_ppp_mutex before registering device
be2net: restore properly promisc mode after queues reconfiguration
ip6_gre: init dev->mtu and dev->hard_header_len correctly
gso: validate gso_type in GSO handlers
mlxsw: spectrum_router: Don't log an error on missing neighbor
tun: fix a memory leak for tfile->tx_array
flow_dissector: properly cap thoff field
perf/x86/amd/power: Do not load AMD power module on !AMD platforms
x86/microcode/intel: Extend BDW late-loading further with LLC size check
hrtimer: Reset hrtimer cpu base proper on CPU hotplug
x86: bpf_jit: small optimization in emit_bpf_tail_call()
bpf: fix bpf_tail_call() x64 JIT
bpf: introduce BPF_JIT_ALWAYS_ON config
bpf: arsh is not supported in 32 bit alu thus reject it
bpf: avoid false sharing of map refcount with max_entries
bpf: fix divides by zero
bpf: fix 32-bit divide by zero
bpf: reject stores into ctx via st and xadd
nfsd: auth: Fix gid sorting when rootsquash enabled
Linux 4.9.79
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 999898355e08ae3b92dfd0a08db706e0c6703d30 upstream.
When LONG_MIN is passed to msgrcv, one would expect to recieve any
message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is
undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN
again.
So handle this case properly by assigning LONG_MAX to *msgtyp if
LONG_MIN was specified as msgtyp to msgrcv.
This code:
long msg[] = { 100, 200 };
int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644);
msgsnd(m, &msg, sizeof(msg), 0);
msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0);
produces currently nothing:
msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538
msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(65538, ...
Except a UBSAN warning:
UBSAN: Undefined behaviour in ipc/msg.c:745:13
negation of -9223372036854775808 cannot be represented in type 'long int':
With the patch, I see what I expect:
msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0
msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0
msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16
Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAllp69gACgkQONu9yGCS
aT6WLQ/8CVbMTorw1Xb4x9vsIP7v/ozsPYDN41EbpsH19/3xxEyhR0LsN056GxA4
HPweVL7ridF6/IMzGg+RoBJsjsXR/+bxkhgS/7dcT37BKqAlUOSNV+AdVz7z+NgM
UOm+s6OjRSRz2A01jqd/k0emGWx0+3Cd2C/637RBgv+vo4D5lDTLhNTFbEM/45oL
QG0fz/Ba0C9dvrPMwMvMwLzS/Bi+h9PeUazjVgAQHCXUkeSvJYJTk8TmYVUGIJIc
E4CNaKl3qUEMph/7YujPpdXYaWR/SYnoBv0UYZovksxdmU0DegF4V3BVO8VKt0ma
K0gf8k3YRAXiEOW9D7yfffnZxbqA0Cwe4hau1gY+IVQYxwP6YaygBMHLbssTKbY1
WJ36WzQMmqiZ7YD7Yf7cHjDG+d1bl9UYACr0BE8rlWpHa1bJme6H8BdLVqnP7OM8
KpBgmRAZbQ5TBX/fCrd+PzgfS66393qMPiXzPrpTMyo4HfBAOs8FVCeyxzX+3Qwm
Qd35CH0wIcHSJ48d+3FCuxYQ654IoLIZieStnFYyXKGhFe5Cgd+6F91oJiRAvY59
BssBfl2RqnZB9l+tBMliYbCHTOzZKZv70Z9/+vw9ogNqiv7ZlTW7maobLzk7AQOf
HiCC6GpnxM2FzDSjD/vfQZM9LTSUhXw5owebA+vwCc3KZgpNrn0=
=zHg+
-----END PGP SIGNATURE-----
Merge 4.9.38 into android-4.9
Changes in 4.9.38
mqueue: fix a use-after-free in sys_mq_notify()
Add "shutdown" to "struct class".
tpm: Issue a TPM2_Shutdown for TPM2 devices.
tools include: Add a __fallthrough statement
tools string: Use __fallthrough in perf_atoll()
tools strfilter: Use __fallthrough
perf top: Use __fallthrough
perf thread_map: Correctly size buffer used with dirent->dt_name
perf intel-pt: Use __fallthrough
perf tests: Avoid possible truncation with dirent->d_name + snprintf
perf bench numa: Avoid possible truncation when using snprintf()
perf header: Fix handling of PERF_EVENT_UPDATE__SCALE
perf scripting perl: Fix compile error with some perl5 versions
perf probe: Fix to probe on gcc generated symbols for offline kernel
perf probe: Add error checks to offline probe post-processing
md: fix incorrect use of lexx_to_cpu in does_sb_need_changing
md: fix super_offset endianness in super_1_rdev_size_change
locking/rwsem-spinlock: Fix EINTR branch in __down_write_common()
staging: vt6556: vnt_start Fix missing call to vnt_key_init_table.
staging: comedi: fix clean-up of comedi_class in comedi_init()
crypto: caam - fix gfp allocation flags (part I)
crypto: rsa-pkcs1pad - use constant time memory comparison for MACs
ext4: check return value of kstrtoull correctly in reserved_clusters_store
x86/mm/pat: Don't report PAT on CPUs that don't support it
saa7134: fix warm Medion 7134 EEPROM read
Linux 4.9.38
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f991af3daabaecff34684fd51fac80319d1baad1 upstream.
The retry logic for netlink_attachskb() inside sys_mq_notify()
is nasty and vulnerable:
1) The sock refcnt is already released when retry is needed
2) The fd is controllable by user-space because we already
release the file refcnt
so we when retry but the fd has been just closed by user-space
during this small window, we end up calling netlink_detachskb()
on the error path which releases the sock again, later when
the user-space closes this socket a use-after-free could be
triggered.
Setting 'sock' to NULL here should be sufficient to fix it.
Reported-by: GeneBlue <geneblue.mail@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 95e91b831f87ac8e1f8ed50c14d709089b4e01b8 upstream.
The issue is described here, with a nice testcase:
https://bugzilla.kernel.org/show_bug.cgi?id=192931
The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and
the address rounded down to 0. For the regular mmap case, the
protection mentioned above is that the kernel gets to generate the
address -- arch_get_unmapped_area() will always check for MAP_FIXED and
return that address. So by the time we do security_mmap_addr(0) things
get funky for shmat().
The testcase itself shows that while a regular user crashes, root will
not have a problem attaching a nil-page. There are two possible fixes
to this. The first, and which this patch does, is to simply allow root
to crash as well -- this is also regular mmap behavior, ie when hacking
up the testcase and adding mmap(... |MAP_FIXED). While this approach
is the safer option, the second alternative is to ignore SHM_RND if the
rounded address is 0, thus only having MAP_SHARED flags. This makes the
behavior of shmat() identical to the mmap() case. The downside of this
is obviously user visible, but does make sense in that it maintains
semantics after the round-down wrt 0 address and mmap.
Passes shm related ltp tests.
Link: http://lkml.kernel.org/r/1486050195-18629-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Gareth Evans <gareth.evans@contextis.co.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This allows filesystems to use their mount private data to
influence the permssions they return in permission2. It has
been separated into a new call to avoid disrupting current
permission users.
Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca
Signed-off-by: Daniel Rosenberg <drosen@google.com>
When kmem accounting switched from account by default to only account if
flagged by __GFP_ACCOUNT, IPC mqueue and messages was left out.
The production use case at hand is that mqueues should be customizable
via sysctls in Docker containers in a Kubernetes cluster. This can only
be safely allowed to the users of the cluster (without the risk that
they can cause resource shortage on a node, influencing other users'
containers) if all resources they control are bounded, i.e. accounted
for.
Link: http://lkml.kernel.org/r/1476806075-1210-1-git-send-email-arozansk@redhat.com
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Reported-by: Stefan Schimanski <sttts@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Schimanski <sttts@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in
exit_sem. Apparently it's possible for the loop to take quite a long time
and it doesn't have a scheduling point in it. Since the codes is
executing under an rcu read section this may also cause rcu stalls, which
in turn block synchronize_rcu operations, which more or less de-stabilises
the whole system.
Fix this by introducing a cond_resched() at the beginning of the loop.
So this patch fixes the following:
NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119]
CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000
RIP: 0010:[<ffffffff81614bc7>] [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30
RSP: 0018:ffff881c95553e40 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d
RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4
RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88
R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0
R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0
FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0
Call Trace:
? exit_sem+0x7c/0x280
do_exit+0x338/0xb40
do_group_exit+0x43/0xd0
SyS_exit_group+0x14/0x20
entry_SYSCALL_64_fastpath+0x16/0x6e
Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Blocked tasks queued in q_senders waiting for their message to fit in the
queue are blindly awoken every time we think there's a remote chance this
might happen. This could cause numerous (and expensive -- thundering
herd-ish) bogus wakeups if the queue is still really full. Adding to the
scheduling cost/overhead, there's also the fact that we need to take the
ipc object lock and requeue ourselves in the q_senders list.
By keeping track of the blocked sender's message size, we can know
previously if the wakeup ought to occur or not. Otherwise, to maintain
the current wakeup order we just move it to the tail. This is exactly
what occurs right now if the sender needs to go back to sleep.
The case of EIDRM is left completely untouched, as we need to wakeup all
the tasks, and shouldn't be playing games in the first place.
This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context
switches (avg out of ten runs). Although these tests are really about
functionality (as opposed to performance), is does show the direct
benefits of the optimization.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the use of wake_qs in sysv msg queues are only for the receiver
tasks that are blocked on the queue. But blocked sender tasks (due to
queue size constraints) still are awoken with the ipc object lock held,
which can be a problem particularly for small sized queues and far from
gracious for -rt (just like it was for the receiver side).
The paths that actually wakeup a sender are obviously related to when we
are either getting rid of the queue or after (some) space is freed-up
after a receiver takes the msg (msgrcv). Furthermore, with the exception
of msgrcv, we can always piggy-back on expunge_all that has its own tasks
lined-up for waking. Finally, upon unlinking the message, it should be no
problem delaying the wakeups a bit until after we've released the lock.
Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves the wakeup_process() invocation so it is not done under
the ipc global lock by making use of a lockless wake_q. With this change,
the waiter is woken up once the message has been assigned and it does not
need to loop on SMP if the message points to NULL. In the signal case we
still need to check the pointer under the lock to verify the state.
This change should also avoid the introduction of preempt_disable() in -RT
which avoids a busy-loop which pools for the NULL -> !NULL change if the
waiter has a higher priority compared to the waker.
By making use of wake_qs, the logic of sysv msg queues is greatly
simplified (and very well suited as we can batch lockless wakeups),
particularly around the lockless receive algorithm.
This has been tested with Manred's pmsg-shared tool on a "AMD A10-7800
Radeon R7, 12 Compute Cores 4C+8G":
test | before | after | diff
-----------------|------------|------------|----------
pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 %
pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 %
pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 %
Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 6d07b68ce1 ("ipc/sem.c: optimize sem_lock()") introduced a
race:
sem_lock has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operations is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)
As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions. See below for more details
(or kernel bugzilla 105651).
The patch fixes that by creating one variable (complex_mode)
that tracks both reasons why parallel operations are not possible.
The patch also updates stale documentation regarding the locking.
With regards to stable kernels:
The patch is required for all kernels that include the
commit 6d07b68ce1 ("ipc/sem.c: optimize sem_lock()") (3.10?)
The alternative is to revert the patch that introduced the race.
The patch is safe for backporting, i.e. it makes no assumptions
about memory barriers in spin_unlock_wait().
Background:
Here is the race of the current implementation:
Thread A: (simple op)
- does the first "sma->complex_count == 0" test
Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
drops sem_lock, sleeps
Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test
Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count
Thread A:
- does the complex_count test
Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.
Fixes: 6d07b68ce1 ("ipc/sem.c: optimize sem_lock()")
Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com
Reported-by: <felixh@informatik.uni-bremen.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <1vier1@web.de>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
From: Andrey Vagin <avagin@openvz.org>
Each namespace has an owning user namespace and now there is not way
to discover these relationships.
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.
Why we may want to know relationships between namespaces?
One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?
One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.
There [1] was a discussion about which interface to choose to determing
relationships between namespaces.
Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.
Here is an implementaions of these ioctl-s.
$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:
fd = ioctl(ns_fd, ioctl_type);
where ioctl_type is one of the following:
NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.
NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.
In addition to generic ioctl(2) errors, the following specific ones
can occur:
EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
EPERM The requested namespace is outside of the current namespace
scope.
[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101
Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric
Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Return -EPERM if an owning user namespace is outside of a process
current user namespace.
v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.
The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Pull userns vfs updates from Eric Biederman:
"This tree contains some very long awaited work on generalizing the
user namespace support for mounting filesystems to include filesystems
with a backing store. The real world target is fuse but the goal is
to update the vfs to allow any filesystem to be supported. This
patchset is based on a lot of code review and testing to approach that
goal.
While looking at what is needed to support the fuse filesystem it
became clear that there were things like xattrs for security modules
that needed special treatment. That the resolution of those concerns
would not be fuse specific. That sorting out these general issues
made most sense at the generic level, where the right people could be
drawn into the conversation, and the issues could be solved for
everyone.
At a high level what this patchset does a couple of simple things:
- Add a user namespace owner (s_user_ns) to struct super_block.
- Teach the vfs to handle filesystem uids and gids not mapping into
to kuids and kgids and being reported as INVALID_UID and
INVALID_GID in vfs data structures.
By assigning a user namespace owner filesystems that are mounted with
only user namespace privilege can be detected. This allows security
modules and the like to know which mounts may not be trusted. This
also allows the set of uids and gids that are communicated to the
filesystem to be capped at the set of kuids and kgids that are in the
owning user namespace of the filesystem.
One of the crazier corner casees this handles is the case of inodes
whose i_uid or i_gid are not mapped into the vfs. Most of the code
simply doesn't care but it is easy to confuse the inode writeback path
so no operation that could cause an inode write-back is permitted for
such inodes (aka only reads are allowed).
This set of changes starts out by cleaning up the code paths involved
in user namespace permirted mounts. Then when things are clean enough
adds code that cleanly sets s_user_ns. Then additional restrictions
are added that are possible now that the filesystem superblock
contains owner information.
These changes should not affect anyone in practice, but there are some
parts of these restrictions that are changes in behavior.
- Andy's restriction on suid executables that does not honor the
suid bit when the path is from another mount namespace (think
/proc/[pid]/fd/) or when the filesystem was mounted by a less
privileged user.
- The replacement of the user namespace implicit setting of MNT_NODEV
with implicitly setting SB_I_NODEV on the filesystem superblock
instead.
Using SB_I_NODEV is a stronger form that happens to make this state
user invisible. The user visibility can be managed but it caused
problems when it was introduced from applications reasonably
expecting mount flags to be what they were set to.
There is a little bit of work remaining before it is safe to support
mounting filesystems with backing store in user namespaces, beyond
what is in this set of changes.
- Verifying the mounter has permission to read/write the block device
during mount.
- Teaching the integrity modules IMA and EVM to handle filesystems
mounted with only user namespace root and to reduce trust in their
security xattrs accordingly.
- Capturing the mounters credentials and using that for permission
checks in d_automount and the like. (Given that overlayfs already
does this, and we need the work in d_automount it make sense to
generalize this case).
Furthermore there are a few changes that are on the wishlist:
- Get all filesystems supporting posix acls using the generic posix
acls so that posix_acl_fix_xattr_from_user and
posix_acl_fix_xattr_to_user may be removed. [Maintainability]
- Reducing the permission checks in places such as remount to allow
the superblock owner to perform them.
- Allowing the superblock owner to chown files with unmapped uids and
gids to something that is mapped so the files may be treated
normally.
I am not considering even obvious relaxations of permission checks
until it is clear there are no more corner cases that need to be
locked down and handled generically.
Many thanks to Seth Forshee who kept this code alive, and putting up
with me rewriting substantial portions of what he did to handle more
corner cases, and for his diligent testing and reviewing of my
changes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
fs: Call d_automount with the filesystems creds
fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
evm: Translate user/group ids relative to s_user_ns when computing HMAC
dquot: For now explicitly don't support filesystems outside of init_user_ns
quota: Handle quota data stored in s_user_ns in quota_setxquota
quota: Ensure qids map to the filesystem
vfs: Don't create inodes with a uid or gid unknown to the vfs
vfs: Don't modify inodes with a uid or gid unknown to the vfs
cred: Reject inodes with invalid ids in set_create_file_as()
fs: Check for invalid i_uid in may_follow_link()
vfs: Verify acls are valid within superblock's s_user_ns.
userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
fs: Refuse uid/gid changes which don't map into s_user_ns
selinux: Add support for unprivileged mounts from user namespaces
Smack: Handle labels consistently in untrusted mounts
Smack: Add support for unprivileged mounts from user namespaces
fs: Treat foreign mounts as nosuid
fs: Limit file caps to the user namespace of the super block
userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
userns: Remove implicit MNT_NODEV fragility.
...
We are going to need to call shmem_charge() under tree_lock to get
accoutning right on collapse of small tmpfs pages into a huge one.
The problem is that tree_lock is irq-safe and lockdep is not happy, that
we take irq-unsafe lock under irq-safe[1].
Let's convert the lock to irq-safe.
[1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc
Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a shmem_get_unmapped_area method in file_operations, called at
mmap time to decide the mapping address. It could be conditional on
CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
it unconditional.
shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases it
just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back to
ask for a larger arena, within which to align the mapping suitably.
There have to be some direct calls to shmem_get_unmapped_area(), not via
the file_operations: because of the way shmem_zero_setup() is called to
create a shmem object late in the mmap sequence, when MAP_SHARED is
requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
when /proc/sys/vm/shmem_huge has been set.
Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a function may_open_dev that tests MNT_NODEV and a new
superblock flab SB_I_NODEV. Use this new function in all of the
places where MNT_NODEV was previously tested.
Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
filesystems should never support device nodes, and a simple superblock
flags makes that very hard to get wrong. With SB_I_NODEV set if any
device nodes somehow manage to show up on on a filesystem those
device nodes will be unopenable.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
do not result in executable on mqueuefs by accident.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Today what is normally called data (the mount options) is not passed
to fill_super through mount_ns.
Pass the mount options and the namespace separately to mount_ns so
that filesystems such as proc that have mount options, can use
mount_ns.
Pass the user namespace to mount_ns so that the standard permission
check that verifies the mounter has permissions over the namespace can
be performed in mount_ns instead of in each filesystems .mount method.
Thus removing the duplication between mqueuefs and proc in terms of
permission checks. The extra permission check does not currently
affect the rpc_pipefs filesystem and the nfsd filesystem as those
filesystems do not currently allow unprivileged mounts. Without
unpvileged mounts it is guaranteed that the caller has already passed
capable(CAP_SYS_ADMIN) which guarantees extra permission check will
pass.
Update rpc_pipefs and the nfsd filesystem to ensure that the network
namespace reference is always taken in fill_super and always put in kill_sb
so that the logic is simpler and so that errors originating inside of
fill_super do not cause a network namespace leak.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Allow the ipc namespace initialization code to depend on ns->user_ns
being set during initialization.
In particular this allows mq_init_ns to use ns->user_ns for permission
checks and initializating s_user_ns while the the mq filesystem is
being mounted.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Suggested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() usecase, as that was somewhat stale/obscure.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce smp_acquire__after_ctrl_dep(), this construct is not
uncommon, but the lack of this barrier is.
Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
semaphore code and the qspinlock code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
shmat and shmdt rely on mmap_sem for write. If the waiting task gets
killed by the oom killer it would block oom_reaper from asynchronous
address space reclaim and reduce the chances of timely OOM resolving.
Wait for the lock in the killable mode and return with EINTR if the task
got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As indicated by bug#112271, Linux sets the sempid value upon semctl, and
not only for semop calls. However, within semctl we only do this for
SETVAL, leaving SETALL without updating the field, and therefore rather
inconsistent behavior when compared to other Unices.
There is really no documentation regarding this and therefore users
should not make assumptions. With this patch, along with updating
semctl.2 manpages, this scenario should become less ambiguous As such,
set sempid on SETALL cmd.
Also update some in-code documentation, specifying where the sempid is
set.
Passes ltp and custom testcase where a child (fork) does SETALL to the
set.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remap_file_pages(2) emulation can reach file which represents removed
IPC ID as long as a memory segment is mapped. It breaks expectations of
IPC subsystem.
Test case (rewritten to be more human readable, originally autogenerated
by syzkaller[1]):
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#define PAGE_SIZE 4096
int main()
{
int id;
void *p;
id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0);
p = shmat(id, NULL, 0);
shmctl(id, IPC_RMID, NULL);
remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0);
return 0;
}
The patch changes shm_mmap() and code around shm_lock() to propagate
locking error back to caller of shm_mmap().
[1] http://github.com/google/syzkaller
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)
- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration