In case task_struct was locked out, we can't check its cmdline
If we do, we'll get a soft lockup
Background and top-app check is enough
Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
How to trigger gaming mode? Just open a game that is supported in games list
How to exit? Kill the game from recents, or simply back to homescreen
What does this gaming mode do?
- It limits big cluster maximum frequency to 2,0GHz, and little cluster
to values matching GPU frequencies as below:
+ 338MHz: 455MHz
+ 385MHz and above: 1053MHz
- As for the cluster freq limits, it overcomes heating issue while playing
heavy games, as well as saves battery juice
Big thanks to [kerneltoast] for the following commits on his wahoo kernel
5ac1e81d3de13e2c4554
Gaming control's idea was based on these
Signed-off-by: Diep Quynh <remilia.1505@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl50ed8ACgkQONu9yGCS
aT7A8A//WF8aq34T1LLR70ESa6fJVj2dEzCQA94qvUCPHDJfBj1x00R8OGvZfKEc
E7YPI8MPrvMmXnyxImge2UcqCJhmnis+gk1fLn+N3Yi5nfZWrxC1dnShCnCTeye1
tA+BuhyzOXHDmLqrktBcT3TCTrIUWK8lehR8m7jlkAh7wG6Z2QmgKlaiq5fRAVE9
Z0je4fw0HxdVrCM84BJiM1M6w3pa+AYUkSBfqvU+fgm8xfOWGUPZFEitx9O1PwaH
m0UOS51gM5ZSzpAMDBlXYnmUgPxUTlpwqIL77zgMFzYcpKbf2rVphFRxmacWWqQ2
MWnt6vn/eatDeJ6TFAF08APy63SUAR7BvssSYoySz8Y4m6LbL4yUfCWYkL9lhLFH
dJL6TTjUwFw1Rd7UVzHongEr24NASCS37MiR8V/3RUiDqIYtrlaJKxhhVdTKg7mm
xRKA+QBXkGUfwFJm2mL+3E3D7/Xgl4kNo22lp1YgamTgc50/UMB5JEf8Z3aqa/qh
ie/4oKnmaa+SSgOxQmvqS3lWF7RYU+axvc9K0FK5CYSXnc6Jfgtq0yDB5twslj05
HdCwNJX9KOL0NZAzQKs+FLjnhOtnRigANMG7KMm5nmS3vwsxQqIkm0b6tPhgBcd/
k4QQm8h6C5Tdi+R39fGQOgtBK0gVVJl+n3XfA3gbmunvXhSo6IQ=
=fFH1
-----END PGP SIGNATURE-----
Merge 4.9.217 into android-4.9-q
Changes in 4.9.217
NFS: Remove superfluous kmap in nfs_readdir_xdr_to_array
phy: Revert toggling reset changes.
net: phy: Avoid multiple suspends
cgroup, netclassid: periodically release file_lock on classid updating
gre: fix uninit-value in __iptunnel_pull_header
ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface
net: macsec: update SCI upon MAC address change.
net: nfc: fix bounds checking bugs on "pipe"
r8152: check disconnect status after long sleep
bnxt_en: reinitialize IRQs when MTU is modified
fib: add missing attribute validation for tun_id
nl802154: add missing attribute validation
nl802154: add missing attribute validation for dev_type
macsec: add missing attribute validation for port
net: fq: add missing attribute validation for orphan mask
team: add missing attribute validation for port ifindex
team: add missing attribute validation for array index
nfc: add missing attribute validation for SE API
nfc: add missing attribute validation for vendor subcommand
ipvlan: add cond_resched_rcu() while processing muticast backlog
ipvlan: do not add hardware address of master to its unicast filter list
ipvlan: egress mcast packets are not exceptional
ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()
ipvlan: don't deref eth hdr before checking it's set
macvlan: add cond_resched() during multicast processing
net: fec: validate the new settings in fec_enet_set_coalesce()
slip: make slhc_compress() more robust against malicious packets
bonding/alb: make sure arp header is pulled before accessing it
cgroup: memcg: net: do not associate sock with unrelated cgroup
net: phy: fix MDIO bus PM PHY resuming
virtio-blk: fix hw_queue stopped on arbitrary error
iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint
workqueue: don't use wq_select_unbound_cpu() for bound works
drm/amd/display: remove duplicated assignment to grph_obj_type
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
KVM: x86: clear stale x86_emulate_ctxt->intercept value
ARC: define __ALIGN_STR and __ALIGN symbols for ARC
efi: Fix a race and a buffer overflow while reading efivars via sysfs
iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint
iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page
nl80211: add missing attribute validation for critical protocol indication
nl80211: add missing attribute validation for beacon report scanning
nl80211: add missing attribute validation for channel switch
netfilter: cthelper: add missing attribute validation for cthelper
mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()
iommu/vt-d: Fix the wrong printing in RHSA parsing
iommu/vt-d: Ignore devices with out-of-spec domain number
ipv6: restrict IPV6_ADDRFORM operation
efi: Add a sanity check to efivar_store_raw()
batman-adv: Fix double free during fragment merge error
batman-adv: Fix transmission of final, 16th fragment
batman-adv: Initialize gw sel_class via batadv_algo
batman-adv: Fix rx packet/bytes stats on local ARP reply
batman-adv: Use default throughput value on cfg80211 error
batman-adv: Accept only filled wifi station info
batman-adv: fix TT sync flag inconsistencies
batman-adv: Avoid spurious warnings from bat_v neigh_cmp implementation
batman-adv: Always initialize fragment header priority
batman-adv: Fix check of retrieved orig_gw in batadv_v_gw_is_eligible
batman-adv: Fix lock for ogm cnt access in batadv_iv_ogm_calc_tq
batman-adv: Fix internal interface indices types
batman-adv: Avoid race in TT TVLV allocator helper
batman-adv: Fix TT sync flags for intermediate TT responses
batman-adv: prevent TT request storms by not sending inconsistent TT TLVLs
batman-adv: Fix debugfs path for renamed hardif
batman-adv: Fix debugfs path for renamed softif
batman-adv: Avoid storing non-TT-sync flags on singular entries too
batman-adv: Fix multicast TT issues with bogus ROAM flags
batman-adv: Prevent duplicated gateway_node entry
batman-adv: Fix duplicated OGMs on NETDEV_UP
batman-adv: Avoid free/alloc race when handling OGM2 buffer
batman-adv: Avoid free/alloc race when handling OGM buffer
batman-adv: Don't schedule OGM for disabled interface
batman-adv: update data pointers after skb_cow()
batman-adv: Avoid probe ELP information leak
batman-adv: Use explicit tvlv padding for ELP packets
perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag
ACPI: watchdog: Allow disabling WDAT at boot
HID: apple: Add support for recent firmware on Magic Keyboards
HID: i2c-hid: add Trekstor Surfbook E11B to descriptor override
cfg80211: check reg_rule for NULL in handle_channel_custom()
net: ks8851-ml: Fix IRQ handling and locking
mac80211: rx: avoid RCU list traversal under mutex
signal: avoid double atomic counter increments for user accounting
jbd2: fix data races at struct journal_head
ARM: 8957/1: VDSO: Match ARMv8 timer in cntvct_functional()
ARM: 8958/1: rename missed uaccess .fixup section
mm: slub: add missing TID bump in kmem_cache_alloc_bulk()
ipv4: ensure rcu_read_lock() in cipso_v4_error()
Linux 4.9.217
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia7aeed273cd7548dc8d0dfaaad8b96bedfe499b1
[ Upstream commit fda31c50292a5062332fa0343c084bd9f46604d9 ]
When queueing a signal, we increment both the users count of pending
signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
of the user struct itself (because we keep a reference to the user in
the signal structure in order to correctly account for it when freeing).
That turns out to be fairly expensive, because both of them are atomic
updates, and particularly under extreme signal handling pressure on big
machines, you can get a lot of cache contention on the user struct.
That can then cause horrid cacheline ping-pong when you do these
multiple accesses.
So change the reference counting to only pin the user for the _first_
pending signal, and to unpin it when the last pending signal is
dequeued. That means that when a user sees a lot of concurrent signal
queuing - which is the only situation when this matters - the only
atomic access needed is generally the 'sigpending' count update.
This was noticed because of a particularly odd timing artifact on a
dual-socket 96C/192T Cascade Lake platform: when you get into bad
contention, on that machine for some reason seems to be much worse when
the contention happens in the upper 32-byte half of the cacheline.
As a result, the kernel test robot will-it-scale 'signal1' benchmark had
an odd performance regression simply due to random alignment of the
'struct user_struct' (and pointed to a completely unrelated and
apparently nonsensical commit for the regression).
Avoiding the double increments (and decrements on the dequeueing side,
of course) makes for much less contention and hugely improved
performance on that will-it-scale microbenchmark.
Quoting Feng Tang:
"It makes a big difference, that the performance score is tripled! bump
from original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%"
[ The "2% gap" is the odd cacheline placement difference on that
platform: under the extreme contention case, the effect of which half
of the cacheline was hot was 5%, so with the reduced contention the
odd timing artifact is reduced too ]
It does help in the non-contended case too, but is not nearly as
noticeable.
Reported-and-tested-by: Feng Tang <feng.tang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Philip Li <philip.li@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl4xT1kACgkQONu9yGCS
aT7d9Q/9FvxHEFvYlet8Bfx/uLFsVGOKwDiZ+e7eQhYqzamdl75qLbPbPqcyfeDO
hCa87JFJCCpRlcHjSvdNk/IF9rRpA6+Wi7FijX/UjgmWNPCao6qb5vBRoH9KMgOk
rtDnIp+klLcG+xLAnTkLgAVCm7osGMVZgwKKqhCLEKdVIT38VlP9nCipehTeMmrS
1f8sqQhqXt0FpticFP5JVJXg3iH5bdqXNc9eCXuF1h7nudYEut8v5NFoF3w0EInZ
iMKnLhbfR9ScOpzaWE/Hg5tJIUwFG9l/IhlDDJJk3O0oTKzs+ZM4N8+RzZ3iFy+N
JNzspQatLKIHSbGGyGK61bu4Eq2bGdCaQMPcMmFHiDKtz7+thE7bjZOcPjzo4pjA
3JJ9ytVUAujIXKryed5E/vNfG6W/puIY6xaM1LPJPIYGa20/0gJGiWXnsXBk6NxG
EvkuC1nwYcIdkopYqM4OZz8n63ywya0pY5JifVW6q3zYU8sqwC9mkkAYGjT0IRb8
llvUyaO05nIDJkY2mgv124cxxyKTsUYPc7+c/VTW7HGyzAc7MLPOpYDbXZHW9WWT
GNGhlBJlk1UK87CRoHswuPgmx+eABXtGjmwHQJbxRmf1y7aWFNjYBKAyQCi9y/sm
XW4LT4l4JhT3xB5SFcti1iqchD2OFk65CKCFeHkm54URXb68Qjs=
=PuM5
-----END PGP SIGNATURE-----
Merge 4.9.212 into android-4.9-q
Changes in 4.9.212
xfs: Sanity check flags of Q_XQUOTARM call
powerpc/archrandom: fix arch_get_random_seed_int()
mt7601u: fix bbp version check in mt7601u_wait_bbp_ready
drm/sti: do not remove the drm_bridge that was never added
drm/virtio: fix bounds check in virtio_gpu_cmd_get_capset()
ALSA: hda: fix unused variable warning
IB/rxe: replace kvfree with vfree
ALSA: usb-audio: update quirk for B&W PX to remove microphone
staging: comedi: ni_mio_common: protect register write overflow
pwm: lpss: Release runtime-pm reference from the driver's remove callback
mlxsw: reg: QEEC: Add minimum shaper fields
pcrypt: use format specifier in kobject_add
exportfs: fix 'passing zero to ERR_PTR()' warning
drm/dp_mst: Skip validating ports during destruction, just ref
net: phy: Fix not to call phy_resume() if PHY is not attached
pinctrl: sh-pfc: r8a7740: Add missing REF125CK pin to gether_gmii group
pinctrl: sh-pfc: r8a7740: Add missing LCD0 marks to lcd0_data24_1 group
pinctrl: sh-pfc: r8a7791: Remove bogus ctrl marks from qspi_data4_b group
pinctrl: sh-pfc: r8a7791: Remove bogus marks from vin1_b_data18 group
pinctrl: sh-pfc: sh73a0: Add missing TO pin to tpu4_to3 group
pinctrl: sh-pfc: r8a7794: Remove bogus IPSR9 field
pinctrl: sh-pfc: sh7734: Add missing IPSR11 field
pinctrl: sh-pfc: sh7269: Add missing PCIOR0 field
pinctrl: sh-pfc: sh7734: Remove bogus IPSR10 value
Input: nomadik-ske-keypad - fix a loop timeout test
clk: highbank: fix refcount leak in hb_clk_init()
clk: qoriq: fix refcount leak in clockgen_init()
clk: socfpga: fix refcount leak
clk: samsung: exynos4: fix refcount leak in exynos4_get_xom()
clk: imx6q: fix refcount leak in imx6q_clocks_init()
clk: imx6sx: fix refcount leak in imx6sx_clocks_init()
clk: imx7d: fix refcount leak in imx7d_clocks_init()
clk: vf610: fix refcount leak in vf610_clocks_init()
clk: armada-370: fix refcount leak in a370_clk_init()
clk: kirkwood: fix refcount leak in kirkwood_clk_init()
clk: armada-xp: fix refcount leak in axp_clk_init()
clk: dove: fix refcount leak in dove_clk_init()
IB/usnic: Fix out of bounds index check in query pkey
RDMA/ocrdma: Fix out of bounds index check in query pkey
RDMA/qedr: Fix out of bounds index check in query pkey
arm64: dts: apq8016-sbc: Increase load on l11 for SDCARD
drm/etnaviv: NULL vs IS_ERR() buf in etnaviv_core_dump()
media: s5p-jpeg: Correct step and max values for V4L2_CID_JPEG_RESTART_INTERVAL
crypto: tgr192 - fix unaligned memory access
ASoC: imx-sgtl5000: put of nodes if finding codec fails
IB/iser: Pass the correct number of entries for dma mapped SGL
rtc: cmos: ignore bogus century byte
clk: sunxi-ng: sun8i-a23: Enable PLL-MIPI LDOs when ungating it
iwlwifi: mvm: fix A-MPDU reference assignment
tty: ipwireless: Fix potential NULL pointer dereference
crypto: crypto4xx - Fix wrong ppc4xx_trng_probe()/ppc4xx_trng_remove() arguments
ARM: dts: lpc32xx: add required clocks property to keypad device node
ARM: dts: lpc32xx: reparent keypad controller to SIC1
ARM: dts: lpc32xx: fix ARM PrimeCell LCD controller variant
ARM: dts: lpc32xx: fix ARM PrimeCell LCD controller clocks property
ARM: dts: lpc32xx: phy3250: fix SD card regulator voltage
iwlwifi: mvm: fix RSS config command
staging: most: cdev: add missing check for cdev_add failure
rtc: ds1672: fix unintended sign extension
thermal: mediatek: fix register index error
net: phy: fixed_phy: Fix fixed_phy not checking GPIO
rtc: 88pm860x: fix unintended sign extension
rtc: 88pm80x: fix unintended sign extension
rtc: pm8xxx: fix unintended sign extension
fbdev: chipsfb: remove set but not used variable 'size'
iw_cxgb4: use tos when importing the endpoint
iw_cxgb4: use tos when finding ipv6 routes
pinctrl: sh-pfc: emev2: Add missing pinmux functions
pinctrl: sh-pfc: r8a7791: Fix scifb2_data_c pin group
pinctrl: sh-pfc: r8a7792: Fix vin1_data18_b pin group
pinctrl: sh-pfc: sh73a0: Fix fsic_spdif pin groups
usb: phy: twl6030-usb: fix possible use-after-free on remove
block: don't use bio->bi_vcnt to figure out segment number
keys: Timestamp new keys
vfio_pci: Enable memory accesses before calling pci_map_rom
dmaengine: mv_xor: Use correct device for DMA API
cdc-wdm: pass return value of recover_from_urb_loss
regulator: pv88060: Fix array out-of-bounds access
regulator: pv88080: Fix array out-of-bounds access
regulator: pv88090: Fix array out-of-bounds access
net: dsa: qca8k: Enable delay for RGMII_ID mode
drm/nouveau/bios/ramcfg: fix missing parentheses when calculating RON
drm/nouveau/pmu: don't print reply values if exec is false
ASoC: qcom: Fix of-node refcount unbalance in apq8016_sbc_parse_of()
fs/nfs: Fix nfs_parse_devname to not modify it's argument
NFS: Fix a soft lockup in the delegation recovery code
clocksource/drivers/sun5i: Fail gracefully when clock rate is unavailable
clocksource/drivers/exynos_mct: Fix error path in timer resources initialization
mmc: sdhci-brcmstb: handle mmc_of_parse() errors during probe
ARM: 8847/1: pm: fix HYP/SVC mode mismatch when MCPM is used
ARM: 8848/1: virt: Align GIC version check with arm64 counterpart
regulator: wm831x-dcdc: Fix list of wm831x_dcdc_ilim from mA to uA
nios2: ksyms: Add missing symbol exports
scsi: megaraid_sas: reduce module load time
drivers/rapidio/rio_cm.c: fix potential oops in riocm_ch_listen()
xen, cpu_hotplug: Prevent an out of bounds access
net: sh_eth: fix a missing check of of_get_phy_mode
media: ivtv: update *pos correctly in ivtv_read_pos()
media: cx18: update *pos correctly in cx18_read_pos()
media: wl128x: Fix an error code in fm_download_firmware()
media: cx23885: check allocation return
regulator: tps65086: Fix tps65086_ldoa1_ranges for selector 0xB
jfs: fix bogus variable self-initialization
tipc: tipc clang warning
m68k: mac: Fix VIA timer counter accesses
ARM: OMAP2+: Fix potentially uninitialized return value for _setup_reset()
media: davinci-isif: avoid uninitialized variable use
media: tw5864: Fix possible NULL pointer dereference in tw5864_handle_frame
spi: tegra114: clear packed bit for unpacked mode
spi: tegra114: fix for unpacked mode transfers
soc/fsl/qe: Fix an error code in qe_pin_request()
spi: bcm2835aux: fix driver to not allow 65535 (=-1) cs-gpios
ehea: Fix a copy-paste err in ehea_init_port_res
scsi: qla2xxx: Unregister chrdev if module initialization fails
ARM: pxa: ssp: Fix "WARNING: invalid free of devm_ allocated data"
hwmon: (w83627hf) Use request_muxed_region for Super-IO accesses
tipc: set sysctl_tipc_rmem and named_timeout right range
powerpc: vdso: Make vdso32 installation conditional in vdso_install
ARM: dts: ls1021: Fix SGMII PCS link remaining down after PHY disconnect
media: ov2659: fix unbalanced mutex_lock/unlock
6lowpan: Off by one handling ->nexthdr
dmaengine: axi-dmac: Don't check the number of frames for alignment
ALSA: usb-audio: Handle the error from snd_usb_mixer_apply_create_quirk()
packet: in recvmsg msg_name return at least sizeof sockaddr_ll
ASoC: fix valid stream condition
usb: gadget: fsl: fix link error against usb-gadget module
IB/mlx5: Add missing XRC options to QP optional params mask
iommu/vt-d: Make kernel parameter igfx_off work with vIOMMU
net: ena: fix swapped parameters when calling ena_com_indirect_table_fill_entry
net: ena: fix: Free napi resources when ena_up() fails
net: ena: fix incorrect test of supported hash function
net: ena: fix ena_com_fill_hash_function() implementation
dmaengine: tegra210-adma: restore channel status
l2tp: Fix possible NULL pointer dereference
media: omap_vout: potential buffer overflow in vidioc_dqbuf()
media: davinci/vpbe: array underflow in vpbe_enum_outputs()
platform/x86: alienware-wmi: printing the wrong error code
netfilter: ebtables: CONFIG_COMPAT: reject trailing data after last rule
pwm: meson: Don't disable PWM when setting duty repeatedly
ARM: riscpc: fix lack of keyboard interrupts after irq conversion
kdb: do a sanity check on the cpu in kdb_per_cpu()
backlight: lm3630a: Return 0 on success in update_status functions
thermal: cpu_cooling: Actually trace CPU load in thermal_power_cpu_get_power
dmaengine: tegra210-adma: Fix crash during probe
spi: spi-fsl-spi: call spi_finalize_current_message() at the end
crypto: ccp - fix AES CFB error exposed by new test vectors
serial: stm32: fix transmit_chars when tx is stopped
misc: sgi-xp: Properly initialize buf in xpc_get_rsvd_page_pa
iommu: Use right function to get group for device
signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig
inet: frags: call inet_frags_fini() after unregister_pernet_subsys()
media: vivid: fix incorrect assignment operation when setting video mode
powerpc/cacheinfo: add cacheinfo_teardown, cacheinfo_rebuild
drm/msm/mdp5: Fix mdp5_cfg_init error return
net: netem: fix backlog accounting for corrupted GSO frames
net/af_iucv: always register net_device notifier
ASoC: ti: davinci-mcasp: Fix slot mask settings when using multiple AXRs
rtc: pcf8563: Clear event flags and disable interrupts before requesting irq
drm/msm/a3xx: remove TPL1 regs from snapshot
perf/ioctl: Add check for the sample_period value
dmaengine: hsu: Revert "set HSU_CH_MTSR to memory width"
clk: qcom: Fix -Wunused-const-variable
iommu/amd: Make iommu_disable safer
mfd: intel-lpss: Release IDA resources
rxrpc: Fix uninitialized error code in rxrpc_send_data_packet()
devres: allow const resource arguments
RDMA/hns: Fixs hw access invalid dma memory error
net: pasemi: fix an use-after-free in pasemi_mac_phy_init()
scsi: libfc: fix null pointer dereference on a null lport
libertas_tf: Use correct channel range in lbtf_geo_init
qed: reduce maximum stack frame size
usb: host: xhci-hub: fix extra endianness conversion
mic: avoid statically declaring a 'struct device'.
x86/kgbd: Use NMI_VECTOR not APIC_DM_NMI
ALSA: aoa: onyx: always initialize register read value
net/mlx5: Fix mlx5_ifc_query_lag_out_bits
cifs: fix rmmod regression in cifs.ko caused by force_sig changes
crypto: caam - free resources in case caam_rng registration failed
ext4: set error return correctly when ext4_htree_store_dirent fails
ASoC: es8328: Fix copy-paste error in es8328_right_line_controls
ASoC: cs4349: Use PM ops 'cs4349_runtime_pm'
ASoC: wm8737: Fix copy-paste error in wm8737_snd_controls
signal: Allow cifs and drbd to receive their terminating signals
ASoC: sun4i-i2s: RX and TX counter registers are swapped
dmaengine: dw: platform: Switch to acpi_dma_controller_register()
mac80211: minstrel_ht: fix per-group max throughput rate initialization
mips: avoid explicit UB in assignment of mips_io_port_base
ahci: Do not export local variable ahci_em_messages
Partially revert "kfifo: fix kfifo_alloc() and kfifo_init()"
hwmon: (lm75) Fix write operations for negative temperatures
power: supply: Init device wakeup after device_add()
x86, perf: Fix the dependency of the x86 insn decoder selftest
staging: greybus: light: fix a couple double frees
bcma: fix incorrect update of BCMA_CORE_PCI_MDIO_DATA
iio: dac: ad5380: fix incorrect assignment to val
ath9k: dynack: fix possible deadlock in ath_dynack_node_{de}init
net: sonic: return NETDEV_TX_OK if failed to map buffer
Btrfs: fix hang when loading existing inode cache off disk
hwmon: (shtc1) fix shtc1 and shtw1 id mask
net: sonic: replace dev_kfree_skb in sonic_send_packet
net/rds: Fix 'ib_evt_handler_call' element in 'rds_ib_stat_names'
iommu/amd: Wait for completion of IOTLB flush in attach_device
net: hisilicon: Fix signedness bug in hix5hd2_dev_probe()
net: broadcom/bcmsysport: Fix signedness in bcm_sysport_probe()
net: stmmac: dwmac-meson8b: Fix signedness bug in probe
of: mdio: Fix a signedness bug in of_phy_get_and_connect()
net: ethernet: stmmac: Fix signedness bug in ipq806x_gmac_of_parse()
nvme: retain split access workaround for capability reads
net: stmmac: gmac4+: Not all Unicast addresses may be available
mac80211: accept deauth frames in IBSS mode
llc: fix another potential sk_buff leak in llc_ui_sendmsg()
llc: fix sk_buff refcounting in llc_conn_state_process()
net: stmmac: fix length of PTP clock's name string
act_mirred: Fix mirred_init_module error handling
drm/msm/dsi: Implement reset correctly
dmaengine: imx-sdma: fix size check for sdma script_number
net: netem: fix error path for corrupted GSO frames
net: netem: correct the parent's backlog when corrupted packet was dropped
net: qca_spi: Move reset_count to struct qcaspi
afs: Fix large file support
media: ov6650: Fix incorrect use of JPEG colorspace
media: ov6650: Fix some format attributes not under control
media: ov6650: Fix .get_fmt() V4L2_SUBDEV_FORMAT_TRY support
MIPS: Loongson: Fix return value of loongson_hwmon_init
net: neigh: use long type to store jiffies delta
packet: fix data-race in fanout_flow_is_huge()
dmaengine: ti: edma: fix missed failure handling
drm/radeon: fix bad DMA from INTERRUPT_CNTL2
arm64: dts: juno: Fix UART frequency
IB/iser: Fix dma_nents type definition
m68k: Call timer_interrupt() with interrupts disabled
net: ethtool: Add back transceiver type
net: phy: Keep reporting transceiver type
can, slip: Protect tty->disc_data in write_wakeup and close with RCU
firestream: fix memory leaks
net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM
net, ip6_tunnel: fix namespaces move
net, ip_tunnel: fix namespaces move
net_sched: fix datalen for ematch
tcp_bbr: improve arithmetic division in bbr_update_bw()
net: usb: lan78xx: Add .ndo_features_check
gtp: make sure only SOCK_DGRAM UDP sockets are accepted
hwmon: (adt7475) Make volt2reg return same reg as reg2volt input
hwmon: (core) Simplify sysfs attribute name allocation
hwmon: Deal with errors from the thermal subsystem
hwmon: (core) Fix double-free in __hwmon_device_register()
hwmon: (core) Do not use device managed functions for memory allocations
Input: keyspan-remote - fix control-message timeouts
ARM: 8950/1: ftrace/recordmcount: filter relocation types
mmc: tegra: fix SDR50 tuning override
mmc: sdhci: fix minimum clock rate for v3 controller
Input: sur40 - fix interface sanity checks
Input: gtco - fix endpoint sanity check
Input: aiptek - fix endpoint sanity check
Input: pegasus_notetaker - fix endpoint sanity check
Input: sun4i-ts - add a check for devm_thermal_zone_of_sensor_register
hwmon: (nct7802) Fix voltage limits to wrong registers
scsi: RDMA/isert: Fix a recently introduced regression related to logout
tracing: xen: Ordered comparison of function pointers
do_last(): fetch directory ->i_mode and ->i_uid before it's too late
Documentation: Document arm64 kpti control
arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
coresight: etb10: Do not call smp_processor_id from preemptible
coresight: tmc-etf: Do not call smp_processor_id from preemptible
libertas: Fix two buffer overflows at parsing bss descriptor
bcache: silence static checker warning
scsi: iscsi: Avoid potential deadlock in iscsi_if_rx func
md: Avoid namespace collision with bitmap API
bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
netfilter: ipset: use bitmap infrastructure completely
net/x25: fix nonblocking connect
Linux 4.9.212
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2e83a05c5f119a7467a4d6984045d45d0c06b764
[ Upstream commit 33da8e7c814f77310250bb54a9db36a44c5de784 ]
My recent to change to only use force_sig for a synchronous events
wound up breaking signal reception cifs and drbd. I had overlooked
the fact that by default kthreads start out with all signals set to
SIG_IGN. So a change I thought was safe turned out to have made it
impossible for those kernel thread to catch their signals.
Reverting the work on force_sig is a bad idea because what the code
was doing was very much a misuse of force_sig. As the way force_sig
ultimately allowed the signal to happen was to change the signal
handler to SIG_DFL. Which after the first signal will allow userspace
to send signals to these kernel threads. At least for
wake_ack_receiver in drbd that does not appear actively wrong.
So correct this problem by adding allow_kernel_signal that will allow
signals whose siginfo reports they were sent by the kernel through,
but will not allow userspace generated signals, and update cifs and
drbd to call allow_kernel_signal in an appropriate place so that their
thread can receive this signal.
Fixing things this way ensures that userspace won't be able to send
signals and cause problems, that it is clear which signals the
threads are expecting to receive, and it guarantees that nothing
else in the system will be affected.
This change was partly inspired by similar cifs and drbd patches that
added allow_signal.
Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl3blr8ACgkQONu9yGCS
aT5BrQ//bhVt4Fy5NPJ/QlzpslU7XrwTYtGH0foNbUVvUeeJjiTYZj4XrRDE1VhC
dQJNYRs8wqxG1EhNV4ydpCyLZgG+iLFW0pJc+k47WL+KJAChN0gQT0lGRkZ90GUJ
1+FAM9/QJ5T5Z5T3sMfigp+nUDDoWgWrXl4RBxxbROmYu+xgOOa5agWxqzJDh3CU
bjMF9heQEbpK26NyCqpaumL/nBaj86c/Qspq9PHeHwQDF3Sv+XEDgx8S94t8Zc1K
DJojA3Nyl87FQOpdR/1UFCjge7GcdMP2xozRYGzMXqUwNCHjm6crKoM2EaGqHf2a
HrFDUKL+tC6pE0YrPnrsaQkEFauZEB4Zm9zegwuIqxEovM3o+VkjiTi4XI4VdrZT
t5VVE2tirlvNEAIN1BSbMqZnlE+c+8PpObABLhQtpmNAJP+PNXVeze600uissLbp
grAU7hVvC89uzhPTbfhUyAginAAlqftDaO/YvdPlHXYCv4bLGcF1PYyEH2WfzAPp
fpFJwyVE5tUev5WT4Xm1tAzK5EsW0aUZ69y0+RqwC1mKXTvcHQy9A0u+5ccmQTl9
KSoR2Z/y2qvzfUAB0umAsXQohLUhZYtE4cCyj1us0WDy/0DI3HiJK5rDmJElM3DQ
4PcGN4QtH2PZUogtPaHdwOnL9WS7kwFv+RKDdV3Mjw8bTZPiZg0=
=TWjb
-----END PGP SIGNATURE-----
Merge 4.9.203 into android-4.9-q
Changes in 4.9.203
ax88172a: fix information leak on short answers
slip: Fix memory leak in slip_open error path
ALSA: usb-audio: Fix missing error check at mixer resolution test
ALSA: usb-audio: not submit urb for stopped endpoint
Input: ff-memless - kill timer in destroy()
Input: synaptics-rmi4 - fix video buffer size
Input: synaptics-rmi4 - clear IRQ enables for F54
Input: synaptics-rmi4 - destroy F54 poller workqueue when removing
IB/hfi1: Ensure full Gen3 speed in a Gen4 system
ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
iommu/vt-d: Fix QI_DEV_IOTLB_PFSID and QI_DEV_EIOTLB_PFSID macros
mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()
mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup()
mmc: sdhci-of-at91: fix quirk2 overwrite
ath10k: fix kernel panic by moving pci flush after napi_disable
iio: dac: mcp4922: fix error handling in mcp4922_write_raw
ALSA: pcm: signedness bug in snd_pcm_plug_alloc()
arm64: dts: tegra210-p2180: Correct sdmmc4 vqmmc-supply
ARM: dts: at91/trivial: Fix USART1 definition for at91sam9g45
cfg80211: Avoid regulatory restore when COUNTRY_IE_IGNORE is set
ALSA: seq: Do error checks at creating system ports
ath9k: fix tx99 with monitor mode interface
gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
ASoC: dpcm: Properly initialise hw->rate_max
MIPS: BCM47XX: Enable USB power on Netgear WNDR3400v3
ARM: dts: exynos: Fix sound in Snow-rev5 Chromebook
ARM: dts: exynos: Fix regulators configuration on Peach Pi/Pit Chromebooks
i40e: use correct length for strncpy
i40e: hold the rtnl lock on clearing interrupt scheme
i40e: Prevent deleting MAC address from VF when set by PF
IB/rxe: fixes for rdma read retry
iwlwifi: mvm: avoid sending too many BARs
ARM: dts: pxa: fix power i2c base address
rtl8187: Fix warning generated when strncpy() destination length matches the sixe argument
net: lan78xx: Bail out if lan78xx_get_endpoints fails
ASoC: sgtl5000: avoid division by zero if lo_vag is zero
ARM: dts: exynos: Disable pull control for S5M8767 PMIC
ath10k: wmi: disable softirq's while calling ieee80211_rx
mips: txx9: fix iounmap related issue
ASoC: Intel: hdac_hdmi: Limit sampling rates at dai creation
of: make PowerMac cache node search conditional on CONFIG_PPC_PMAC
ARM: dts: omap3-gta04: give spi_lcd node a label so that we can overwrite in other DTS files
ARM: dts: omap3-gta04: fixes for tvout / venc
ARM: dts: omap3-gta04: tvout: enable as display1 alias
ARM: dts: omap3-gta04: fix touchscreen tsc2007
ARM: dts: omap3-gta04: make NAND partitions compatible with recent U-Boot
ARM: dts: omap3-gta04: keep vpll2 always on
dmaengine: dma-jz4780: Don't depend on MACH_JZ4780
dmaengine: dma-jz4780: Further residue status fix
ath9k: add back support for using active monitor interfaces for tx99
signal: Always ignore SIGKILL and SIGSTOP sent to the global init
signal: Properly deliver SIGILL from uprobes
signal: Properly deliver SIGSEGV from x86 uprobes
f2fs: fix memory leak of percpu counter in fill_super()
scsi: sym53c8xx: fix NULL pointer dereference panic in sym_int_sir()
ARM: imx6: register pm_power_off handler if "fsl,pmic-stby-poweroff" is set
scsi: pm80xx: Corrected dma_unmap_sg() parameter
scsi: pm80xx: Fixed system hang issue during kexec boot
kprobes: Don't call BUG_ON() if there is a kprobe in use on free list
nvmem: core: return error code instead of NULL from nvmem_device_get
media: fix: media: pci: meye: validate offset to avoid arbitrary access
media: dvb: fix compat ioctl translation
ALSA: intel8x0m: Register irq handler after register initializations
pinctrl: at91-pio4: fix has_config check in atmel_pctl_dt_subnode_to_map()
llc: avoid blocking in llc_sap_close()
ARM: dts: qcom: ipq4019: fix cpu0's qcom,saw2 reg value
powerpc/vdso: Correct call frame information
ARM: dts: socfpga: Fix I2C bus unit-address error
pinctrl: at91: don't use the same irqchip with multiple gpiochips
cxgb4: Fix endianness issue in t4_fwcache()
power: supply: ab8500_fg: silence uninitialized variable warnings
power: reset: at91-poweroff: do not procede if at91_shdwc is allocated
power: supply: max8998-charger: Fix platform data retrieval
component: fix loop condition to call unbind() if bind() fails
kernfs: Fix range checks in kernfs_get_target_path
ip_gre: fix parsing gre header in ipgre_err
ARM: dts: rockchip: Fix erroneous SPI bus dtc warnings on rk3036
ath9k: Fix a locking bug in ath9k_add_interface()
s390/qeth: invoke softirqs after napi_schedule()
PCI/ACPI: Correct error message for ASPM disabling
serial: mxs-auart: Fix potential infinite loop
powerpc/iommu: Avoid derefence before pointer check
powerpc/64s/hash: Fix stab_rr off by one initialization
powerpc/pseries: Disable CPU hotplug across migrations
RDMA/i40iw: Fix incorrect iterator type
libfdt: Ensure INT_MAX is defined in libfdt_env.h
power: supply: twl4030_charger: fix charging current out-of-bounds
power: supply: twl4030_charger: disable eoc interrupt on linear charge
net: toshiba: fix return type of ndo_start_xmit function
net: xilinx: fix return type of ndo_start_xmit function
net: broadcom: fix return type of ndo_start_xmit function
net: amd: fix return type of ndo_start_xmit function
usb: chipidea: imx: enable OTG overcurrent in case USB subsystem is already started
usb: chipidea: Fix otg event handler
mlxsw: spectrum: Init shaper for TCs 8..15
ARM: dts: am335x-evm: fix number of cpsw
f2fs: fix to recover inode's uid/gid during POR
ARM: dts: ux500: Correct SCU unit address
ARM: dts: ux500: Fix LCDA clock line muxing
ARM: dts: ste: Fix SPI controller node names
spi: pic32: Use proper enum in dmaengine_prep_slave_rg
cpufeature: avoid warning when compiling with clang
ARM: dts: marvell: Fix SPI and I2C bus warnings
bnx2x: Ignore bandwidth attention in single function mode
net: micrel: fix return type of ndo_start_xmit function
x86/CPU: Use correct macros for Cyrix calls
MIPS: kexec: Relax memory restriction
media: pci: ivtv: Fix a sleep-in-atomic-context bug in ivtv_yuv_init()
media: au0828: Fix incorrect error messages
media: davinci: Fix implicit enum conversion warning
usb: gadget: uvc: configfs: Drop leaked references to config items
usb: gadget: uvc: configfs: Prevent format changes after linking header
phy: phy-twl4030-usb: fix denied runtime access
usb: gadget: uvc: Factor out video USB request queueing
usb: gadget: uvc: Only halt video streaming endpoint in bulk mode
coresight: Fix handling of sinks
coresight: etm4x: Configure EL2 exception level when kernel is running in HYP
coresight: tmc: Fix byte-address alignment for RRP
misc: kgdbts: Fix restrict error
misc: genwqe: should return proper error value.
vfio/pci: Fix potential memory leak in vfio_msi_cap_len
vfio/pci: Mask buggy SR-IOV VF INTx support
scsi: libsas: always unregister the old device if going to discover new
ARM: dts: tegra30: fix xcvr-setup-use-fuses
ARM: tegra: apalis_t30: fix mmc1 cmd pull-up
ARM: dts: paz00: fix wakeup gpio keycode
net: smsc: fix return type of ndo_start_xmit function
EDAC: Raise the maximum number of memory controllers
ARM: dts: realview: Fix SPI controller node names
Bluetooth: L2CAP: Detect if remote is not able to use the whole MPS
crypto: s5p-sss: Fix Fix argument list alignment
crypto: fix a memory leak in rsa-kcs1pad's encryption mode
scsi: NCR5380: Clear all unissued commands on host reset
scsi: NCR5380: Use DRIVER_SENSE to indicate valid sense data
scsi: NCR5380: Check for invalid reselection target
scsi: NCR5380: Don't clear busy flag when abort fails
scsi: NCR5380: Don't call dsprintk() following reselection interrupt
scsi: NCR5380: Handle BUS FREE during reselection
arm64: dts: amd: Fix SPI bus warnings
arm64: dts: lg: Fix SPI controller node names
ARM: dts: lpc32xx: Fix SPI controller node names
usb: xhci-mtk: fix ISOC error when interval is zero
fuse: use READ_ONCE on congestion_threshold and max_background
IB/iser: Fix possible NULL deref at iser_inv_desc()
memfd: Use radix_tree_deref_slot_protected to avoid the warning.
slcan: Fix memory leak in error path
net: cdc_ncm: Signedness bug in cdc_ncm_set_dgram_size()
x86/atomic: Fix smp_mb__{before,after}_atomic()
kprobes/x86: Prohibit probing on exception masking instructions
uprobes/x86: Prohibit probing on MOV SS instruction
fbdev: Ditch fb_edid_add_monspecs
block: introduce blk_rq_is_passthrough
libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
net: ovs: fix return type of ndo_start_xmit function
net: xen-netback: fix return type of ndo_start_xmit function
ARM: dts: omap5: enable OTG role for DWC3 controller
f2fs: return correct errno in f2fs_gc
SUNRPC: Fix priority queue fairness
kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
arm64/numa: Report correct memblock range for the dummy node
ath10k: fix vdev-start timeout on error
ata: ahci_brcm: Allow using driver or DSL SoCs
ath9k: fix reporting calculated new FFT upper max
usb: gadget: udc: fotg210-udc: Fix a sleep-in-atomic-context bug in fotg210_get_status()
nl80211: Fix a GET_KEY reply attribute
dmaengine: ep93xx: Return proper enum in ep93xx_dma_chan_direction
dmaengine: timb_dma: Use proper enum in td_prep_slave_sg
mei: samples: fix a signedness bug in amt_host_if_call()
cxgb4: Use proper enum in cxgb4_dcb_handle_fw_update
cxgb4: Use proper enum in IEEE_FAUX_SYNC
powerpc/pseries: Fix DTL buffer registration
powerpc/pseries: Fix how we iterate over the DTL entries
mtd: rawnand: sh_flctl: Use proper enum for flctl_dma_fifo0_transfer
ixgbe: Fix crash with VFs and flow director on interface flap
IB/mthca: Fix error return code in __mthca_init_one()
IB/mlx4: Avoid implicit enumerated type conversion
ACPICA: Never run _REG on system_memory and system_IO
ata: ep93xx: Use proper enums for directions
media: pxa_camera: Fix check for pdev->dev.of_node
ALSA: hda/sigmatel - Disable automute for Elo VuPoint
KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR
USB: serial: cypress_m8: fix interrupt-out transfer length
mtd: physmap_of: Release resources on error
cpu/SMT: State SMT is disabled even with nosmt and without "=force"
brcmfmac: reduce timeout for action frame scan
brcmfmac: fix full timeout waiting for action frame on-channel tx
clk: samsung: Use clk_hw API for calling clk framework from clk notifiers
i2c: brcmstb: Allow enabling the driver on DSL SoCs
NFSv4.x: fix lock recovery during delegation recall
dmaengine: ioat: fix prototype of ioat_enumerate_channels
Input: st1232 - set INPUT_PROP_DIRECT property
Input: silead - try firmware reload after unsuccessful resume
x86/olpc: Fix build error with CONFIG_MFD_CS5535=m
crypto: mxs-dcp - Fix SHA null hashes and output length
crypto: mxs-dcp - Fix AES issues
ACPI / SBS: Fix rare oops when removing modules
iwlwifi: mvm: don't send keys when entering D3
fbdev: sbuslib: use checked version of put_user()
fbdev: sbuslib: integer overflow in sbusfb_ioctl_helper()
reset: Fix potential use-after-free in __of_reset_control_get()
bcache: recal cached_dev_sectors on detach
s390/kasan: avoid vdso instrumentation
proc/vmcore: Fix i386 build error of missing copy_oldmem_page_encrypted()
backlight: lm3639: Unconditionally call led_classdev_unregister
mfd: ti_am335x_tscadc: Keep ADC interface on if child is wakeup capable
printk: Give error on attempt to set log buffer length to over 2G
media: isif: fix a NULL pointer dereference bug
GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
media: cx231xx: fix potential sign-extension overflow on large shift
x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
gpio: syscon: Fix possible NULL ptr usage
spi: spidev: Fix OF tree warning logic
ARM: 8802/1: Call syscall_trace_exit even when system call skipped
orangefs: rate limit the client not running info message
hwmon: (pwm-fan) Silence error on probe deferral
hwmon: (ina3221) Fix INA3221_CONFIG_MODE macros
misc: cxl: Fix possible null pointer dereference
mac80211: minstrel: fix CCK rate group streams value
spi: rockchip: initialize dma_slave_config properly
ARM: dts: omap5: Fix dual-role mode on Super-Speed port
arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault
Linux 4.9.203
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 86989c41b5ea08776c450cb759592532314a4ed6 ]
If the first process started (aka /sbin/init) receives a SIGKILL it
will panic the system if it is delivered. Making the system unusable
and undebugable. It isn't much better if the first process started
receives SIGSTOP.
So always ignore SIGSTOP and SIGKILL sent to init.
This is done in a separate clause in sig_task_ignored as force_sig_info
can clear SIG_UNKILLABLE and this protection should work even then.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be option in this case because the task can
be reaped and destroyed before the poll returns.
2. By including the struct pid for the waitqueue means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24)
Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I02f259d2875bec46b198d580edfbb067f077084e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Improve the comments for pidfd_send_signal().
First, the comment still referred to a file descriptor for a process as a
"task file descriptor" which stems from way back at the beginning of the
discussion. Replace this with "pidfd" for consistency.
Second, the wording for the explanation of the arguments to the syscall
was a bit inconsistent, e.g. some used the past tense some used present
tense. Make the wording more consistent.
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit c732327f04a3818f35fa97d07b1d64d31b691d78)
Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I06c6bdd1dddaeb8ac75a78dd21f9cdd0dc139a4c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD. With this
patch pidfd_send_signal() becomes independent of procfs. This fullfils
the request made when we merged the pidfd_send_signal() patchset. The
pidfd_send_signal() syscall is now always available allowing for it to
be used by users without procfs mounted or even users without procfs
support compiled into the kernel.
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 2151ad1b067275730de1b38c7257478cae47d29e)
Conflicts:
kernel/sys_ni.c
(1. Replaced COND_SYSCALL with cond_syscall.)
Bug: 135608568
Test: test program using syscall(__NR_sys_pidfd_open,..) and poll()
Change-Id: I621fe6547397e0e68c560d7da60ef7715deb290c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
As stated in the original commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.
We already correctly error out right now and return EBADF if an O_PATH
fd is passed. This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.
Thus, there is no regression. It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 738a7832d21e3d911fcddab98ce260b79010b461)
Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: Id52eaadf9da371fb2d9caae4df49627760de7229
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The current sys_pidfd_send_signal() silently turns signals with explicit
SI_USER context that are sent to non-current tasks into signals with
kernel-generated siginfo.
This is unlike do_rt_sigqueueinfo(), which returns -EPERM in this case.
If a user actually wants to send a signal with kernel-provided siginfo,
they can do that with pidfd_send_signal(pidfd, sig, NULL, 0); so allowing
this case is unnecessary.
Instead of silently replacing the siginfo, just bail out with an error;
this is consistent with other interfaces and avoids special-casing behavior
based on security checks.
Fixes: 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
(cherry picked from commit 556a888a14afe27164191955618990fb3ccc9aad)
Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I493af671b82c43bff1425ee24550d2fb9aa6d961
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The kill() syscall operates on process identifiers (pid). After a process
has exited its pid can be reused by another process. If a caller sends a
signal to a reused pid it will end up signaling the wrong process. This
issue has often surfaced and there has been a push to address this problem [1].
This patch uses file descriptors (fd) from proc/<pid> as stable handles on
struct pid. Even if a pid is recycled the handle will not change. The fd
can be used to send signals to the process it refers to.
Thus, the new syscall pidfd_send_signal() is introduced to solve this
problem. Instead of pids it operates on process fds (pidfd).
/* prototype and argument /*
long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags);
/* syscall number 424 */
The syscall number was chosen to be 424 to align with Arnd's rework in his
y2038 to minimize merge conflicts (cf. [25]).
In addition to the pidfd and signal argument it takes an additional
siginfo_t and flags argument. If the siginfo_t argument is NULL then
pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it
is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo().
The flags argument is added to allow for future extensions of this syscall.
It currently needs to be passed as 0. Failing to do so will cause EINVAL.
/* pidfd_send_signal() replaces multiple pid-based syscalls */
The pidfd_send_signal() syscall currently takes on the job of
rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a
positive pid is passed to kill(2). It will however be possible to also
replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended.
/* sending signals to threads (tid) and process groups (pgid) */
Specifically, the pidfd_send_signal() syscall does currently not operate on
process groups or threads. This is left for future extensions.
In order to extend the syscall to allow sending signal to threads and
process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and
PIDFD_TYPE_TID) should be added. This implies that the flags argument will
determine what is signaled and not the file descriptor itself. Put in other
words, grouping in this api is a property of the flags argument not a
property of the file descriptor (cf. [13]). Clarification for this has been
requested by Eric (cf. [19]).
When appropriate extensions through the flags argument are added then
pidfd_send_signal() can additionally replace the part of kill(2) which
operates on process groups as well as the tgkill(2) and
rt_tgsigqueueinfo(2) syscalls.
How such an extension could be implemented has been very roughly sketched
in [14], [15], and [16]. However, this should not be taken as a commitment
to a particular implementation. There might be better ways to do it.
Right now this is intentionally left out to keep this patchset as simple as
possible (cf. [4]).
/* naming */
The syscall had various names throughout iterations of this patchset:
- procfd_signal()
- procfd_send_signal()
- taskfd_send_signal()
In the last round of reviews it was pointed out that given that if the
flags argument decides the scope of the signal instead of different types
of fds it might make sense to either settle for "procfd_" or "pidfd_" as
prefix. The community was willing to accept either (cf. [17] and [18]).
Given that one developer expressed strong preference for the "pidfd_"
prefix (cf. [13]) and with other developers less opinionated about the name
we should settle for "pidfd_" to avoid further bikeshedding.
The "_send_signal" suffix was chosen to reflect the fact that the syscall
takes on the job of multiple syscalls. It is therefore intentional that the
name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the
fomer because it might imply that pidfd_send_signal() is a replacement for
kill(2), and not the latter because it is a hassle to remember the correct
spelling - especially for non-native speakers - and because it is not
descriptive enough of what the syscall actually does. The name
"pidfd_send_signal" makes it very clear that its job is to send signals.
/* zombies */
Zombies can be signaled just as any other process. No special error will be
reported since a zombie state is an unreliable state (cf. [3]). However,
this can be added as an extension through the @flags argument if the need
ever arises.
/* cross-namespace signals */
The patch currently enforces that the signaler and signalee either are in
the same pid namespace or that the signaler's pid namespace is an ancestor
of the signalee's pid namespace. This is done for the sake of simplicity
and because it is unclear to what values certain members of struct
siginfo_t would need to be set to (cf. [5], [6]).
/* compat syscalls */
It became clear that we would like to avoid adding compat syscalls
(cf. [7]). The compat syscall handling is now done in kernel/signal.c
itself by adding __copy_siginfo_from_user_generic() which lets us avoid
compat syscalls (cf. [8]). It should be noted that the addition of
__copy_siginfo_from_user_any() is caused by a bug in the original
implementation of rt_sigqueueinfo(2) (cf. 12).
With upcoming rework for syscall handling things might improve
significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain
any additional callers.
/* testing */
This patch was tested on x64 and x86.
/* userspace usage */
An asciinema recording for the basic functionality can be found under [9].
With this patch a process can be killed via:
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
#ifdef __NR_pidfd_send_signal
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
#else
return -ENOSYS;
#endif
}
int main(int argc, char *argv[])
{
int fd, ret, saved_errno, sig;
if (argc < 3)
exit(EXIT_FAILURE);
fd = open(argv[1], O_DIRECTORY | O_CLOEXEC);
if (fd < 0) {
printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]);
exit(EXIT_FAILURE);
}
sig = atoi(argv[2]);
printf("Sending signal %d to process %s\n", sig, argv[1]);
ret = do_pidfd_send_signal(fd, sig, NULL, 0);
saved_errno = errno;
close(fd);
errno = saved_errno;
if (ret < 0) {
printf("%s - Failed to send signal %d to process %s\n",
strerror(errno), sig, argv[1]);
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
/* Q&A
* Given that it seems the same questions get asked again by people who are
* late to the party it makes sense to add a Q&A section to the commit
* message so it's hopefully easier to avoid duplicate threads.
*
* For the sake of progress please consider these arguments settled unless
* there is a new point that desperately needs to be addressed. Please make
* sure to check the links to the threads in this commit message whether
* this has not already been covered.
*/
Q-01: (Florian Weimer [20], Andrew Morton [21])
What happens when the target process has exited?
A-01: Sending the signal will fail with ESRCH (cf. [22]).
Q-02: (Andrew Morton [21])
Is the task_struct pinned by the fd?
A-02: No. A reference to struct pid is kept. struct pid - as far as I
understand - was created exactly for the reason to not require to
pin struct task_struct (cf. [22]).
Q-03: (Andrew Morton [21])
Does the entire procfs directory remain visible? Just one entry
within it?
A-03: The same thing that happens right now when you hold a file descriptor
to /proc/<pid> open (cf. [22]).
Q-04: (Andrew Morton [21])
Does the pid remain reserved?
A-04: No. This patchset guarantees a stable handle not that pids are not
recycled (cf. [22]).
Q-05: (Andrew Morton [21])
Do attempts to signal that fd return errors?
A-05: See {Q,A}-01.
Q-06: (Andrew Morton [22])
Is there a cleaner way of obtaining the fd? Another syscall perhaps.
A-06: Userspace can already trivially retrieve file descriptors from procfs
so this is something that we will need to support anyway. Hence,
there's no immediate need to add another syscalls just to make
pidfd_send_signal() not dependent on the presence of procfs. However,
adding a syscalls to get such file descriptors is planned for a
future patchset (cf. [22]).
Q-07: (Andrew Morton [21] and others)
This fd-for-a-process sounds like a handy thing and people may well
think up other uses for it in the future, probably unrelated to
signals. Are the code and the interface designed to permit such
future applications?
A-07: Yes (cf. [22]).
Q-08: (Andrew Morton [21] and others)
Now I think about it, why a new syscall? This thing is looking
rather like an ioctl?
A-08: This has been extensively discussed. It was agreed that a syscall is
preferred for a variety or reasons. Here are just a few taken from
prior threads. Syscalls are safer than ioctl()s especially when
signaling to fds. Processes are a core kernel concept so a syscall
seems more appropriate. The layout of the syscall with its four
arguments would require the addition of a custom struct for the
ioctl() thereby causing at least the same amount or even more
complexity for userspace than a simple syscall. The new syscall will
replace multiple other pid-based syscalls (see description above).
The file-descriptors-for-processes concept introduced with this
syscall will be extended with other syscalls in the future. See also
[22], [23] and various other threads already linked in here.
Q-09: (Florian Weimer [24])
What happens if you use the new interface with an O_PATH descriptor?
A-09:
pidfds opened as O_PATH fds cannot be used to send signals to a
process (cf. [2]). Signaling processes through pidfds is the
equivalent of writing to a file. Thus, this is not an operation that
operates "purely at the file descriptor level" as required by the
open(2) manpage. See also [4].
/* References */
[1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/
[2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/
[3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/
[4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/
[5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/
[6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/
[7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/
[8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/
[9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy
[11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/
[12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/
[13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/
[14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/
[15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/
[16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/
[17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/
[18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/
[19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/
[20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/
[22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/
[23]: https://lwn.net/Articles/773459/
[24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/
[25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
(cherry picked from commit 3eb39f47934f9d5a3027fe00d906a45fe3a15fad)
Conflicts:
arch/x86/entry/syscalls/syscall_32.tbl - trivial manual merge
arch/x86/entry/syscalls/syscall_64.tbl - trivial manual merge
include/linux/proc_fs.h - trivial manual merge
include/linux/syscalls.h - trivial manual merge
include/uapi/asm-generic/unistd.h - trivial manual merge
kernel/signal.c - struct kernel_siginfo does not exist in 4.9
kernel/sys_ni.c - cond_syscall is used instead of COND_SYSCALL
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
(1. manual merges because of 4.9 differences
2. change prepare_kill_siginfo() to use struct siginfo instead of
kernel_siginfo
3. exclude kill() changes to avoid struct kernel_siginfo usage
4. exclude copy_siginfo_from_user_any() to avoid struct kernel_siginfo usage
5. use copy_from_user() instead of copy_siginfo_from_user() in copy_siginfo_from_user_any()
6. replaced COND_SYSCALL with cond_syscall
7. Removed __ia32_sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_32.tbl.
8. Replaced __x64_sys_pidfd_send_signal with sys_pidfd_send_signal in arch/x86/entry/syscalls/syscall_64.tbl.)
Bug: 135608568
Test: test program using syscall(__NR_pidfd_send_signal,..) to send SIGKILL
Change-Id: I00f1c618b2e9dbafae4d4113ad4d8a1a44b6957c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
commit 98af37d624ed8c83f1953b1b6b2f6866011fc064 upstream.
In the fixes commit, removing SIGKILL from each thread signal mask and
executing "goto fatal" directly will skip the call to
"trace_signal_deliver". At this point, the delivery tracking of the
SIGKILL signal will be inaccurate.
Therefore, we need to add trace_signal_deliver before "goto fatal" after
executing sigdelset.
Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.
Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT")
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ivan Delalande <colona@arista.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cf43a757fd49442bc38f76088b70c2299eed2c2f upstream.
In the middle of do_exit() there is there is a call
"ptrace_event(PTRACE_EVENT_EXIT, code);" That call places the process
in TACKED_TRACED aka "(TASK_WAKEKILL | __TASK_TRACED)" and waits for
for the debugger to release the task or SIGKILL to be delivered.
Skipping past dequeue_signal when we know a fatal signal has already
been delivered resulted in SIGKILL remaining pending and
TIF_SIGPENDING remaining set. This in turn caused the
scheduler to not sleep in PTACE_EVENT_EXIT as it figured
a fatal signal was pending. This also caused ptrace_freeze_traced
in ptrace_check_attach to fail because it left a per thread
SIGKILL pending which is what fatal_signal_pending tests for.
This difference in signal state caused strace to report
strace: Exit of unknown pid NNNNN ignored
Therefore update the signal handling state like dequeue_signal
would when removing a per thread SIGKILL, by removing SIGKILL
from the per thread signal mask and clearing TIF_SIGPENDING.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Ivan Delalande <colona@arista.com>
Cc: stable@vger.kernel.org
Fixes: 35634ffa1751 ("signal: Always notice exiting tasks")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7146db3317c67b517258cb5e1b08af387da0618b upstream.
Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop failing
to deliver SIGHUP but always trying.
When the stack overflows delivery of SIGHUP fails and force_sigsegv is
called. Unfortunately because SIGSEGV is numerically higher than
SIGHUP next_signal tries again to deliver a SIGHUP.
From a quality of implementation standpoint attempting to deliver the
timer SIGHUP signal is wrong. We should attempt to deliver the
synchronous SIGSEGV signal we just forced.
We can make that happening in a fairly straight forward manner by
instead of just looking at the signal number we also look at the
si_code. In particular for exceptions (aka synchronous signals) the
si_code is always greater than 0.
That still has the potential to pick up a number of asynchronous
signals as in a few cases the same si_codes that are used
for synchronous signals are also used for asynchronous signals,
and SI_KERNEL is also included in the list of possible si_codes.
Still the heuristic is much better and timer signals are definitely
excluded. Which is enough to prevent all known ways for someone
sending a process signals fast enough to cause unexpected and
arguably incorrect behavior.
Cc: stable@vger.kernel.org
Fixes: a27341cd5f ("Prioritize synchronous signals over 'normal' signals")
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35634ffa1751b6efd8cf75010b509dcb0263e29b upstream.
Recently syzkaller was able to create unkillablle processes by
creating a timer that is delivered as a thread local signal on SIGHUP,
and receiving SIGHUP SA_NODEFERER. Ultimately causing a loop
failing to deliver SIGHUP but always trying.
Upon examination it turns out part of the problem is actually most of
the solution. Since 2.5 signal delivery has found all fatal signals,
marked the signal group for death, and queued SIGKILL in every threads
thread queue relying on signal->group_exit_code to preserve the
information of which was the actual fatal signal.
The conversion of all fatal signals to SIGKILL results in the
synchronous signal heuristic in next_signal kicking in and preferring
SIGHUP to SIGKILL. Which is especially problematic as all
fatal signals have already been transformed into SIGKILL.
Instead of dequeueing signals and depending upon SIGKILL to
be the first signal dequeued, first test if the signal group
has already been marked for death. This guarantees that
nothing in the signal queue can prevent a process that needs
to exit from exiting.
Cc: stable@vger.kernel.org
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Ref: ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 22839869f21ab3850fbbac9b425ccc4c0023926f upstream.
The sigaltstack(2) system call fails with -ENOMEM if the new alternative
signal stack is found to be smaller than SIGMINSTKSZ. On architectures
such as arm64, where the native value for SIGMINSTKSZ is larger than
the compat value, this can result in an unexpected error being reported
to a compat task. See, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385
This patch fixes the problem by extending do_sigaltstack to take the
minimum signal stack size as an additional parameter, allowing the
native and compat system call entry code to pass in their respective
values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
been defined by the architecture.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Steve McIntyre <steve.mcintyre@arm.com>
Tested-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[signal: Fix up cherry-pick conflicts for 22839869f21a]
Signed-off-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3597dfe01d12f570bc739da67f857fd222a3ea66 ]
Instead of playing whack-a-mole and changing SEND_SIG_PRIV to
SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init
gets signals sent by the kernel, stop allowing a pid namespace init to
ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is
only supposed to be able to ignore signals sent from itself and
children with SIG_DFL.
Fixes: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c7be96af89d4b53211862d8599b2430e8900ed92 upstream.
When running certain database workload on a high-end system with many
CPUs, it was found that spinlock contention in the sigprocmask syscalls
became a significant portion of the overall CPU cycles as shown below.
9.30% 9.30% 905387 dataserver /proc/kcore 0x7fff8163f4d2
[k] _raw_spin_lock_irq
|
---_raw_spin_lock_irq
|
|--99.34%-- __set_current_blocked
| sigprocmask
| sys_rt_sigprocmask
| system_call_fastpath
| |
| |--50.63%-- __swapcontext
| | |
| | |--99.91%-- upsleepgeneric
| |
| |--49.36%-- __setcontext
| | ktskRun
Looking further into the swapcontext function in glibc, it was found that
the function always call sigprocmask() without checking if there are
changes in the signal mask.
A check was added to the __set_current_blocked() function to avoid taking
the sighand->siglock spinlock if there is no change in the signal mask.
This will prevent unneeded spinlock contention when many threads are
trying to call sigprocmask().
With this patch applied, the spinlock contention in sigprocmask() was
gone.
Link: http://lkml.kernel.org/r/1474979209-11867-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 426915796ccaf9c2bd9bb06dc5702225957bc2e5 upstream.
complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy
the thread group, today this is wrong in many ways.
If nothing else, fatal_signal_pending() should always imply that the
whole thread group (except ->group_exit_task if it is not NULL) is
killed, this check breaks the rule.
After the previous changes we can rely on sig_task_ignored();
sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want
to kill this task and sig == SIGKILL OR it is traced and debugger can
intercept the signal.
This should hopefully fix the problem reported by Dmitry. This
test-case
static int init(void *arg)
{
for (;;)
pause();
}
int main(void)
{
char stack[16 * 1024];
for (;;) {
int pid = clone(init, stack + sizeof(stack)/2,
CLONE_NEWPID | SIGCHLD, NULL);
assert(pid > 0);
assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0);
assert(waitpid(-1, NULL, WSTOPPED) == pid);
assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0);
assert(syscall(__NR_tkill, pid, SIGKILL) == 0);
assert(pid == wait(NULL));
}
}
triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in
task_participate_group_stop(). do_signal_stop()->signal_group_exit()
checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending()
checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING.
And his should fix the minor security problem reported by Kyle,
SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the
task is the root of a pid namespace.
Link: http://lkml.kernel.org/r/20171103184246.GD21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Kyle Huey <me@kylehuey.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ac25385089f673560867eb5179228a44ade0cfc1 upstream.
Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only()
signals even if force == T. This simplifies the next change and this
matches the same check in get_signal() which will drop these signals
anyway.
Link: http://lkml.kernel.org/r/20171103184227.GC21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 628c1bcba204052d19b686b5bac149a644cdb72e upstream.
The comment in sig_ignored() says "Tracers may want to know about even
ignored signals" but SIGKILL can not be reported to debugger and it is
just wrong to return 0 in this case: SIGKILL should only kill the
SIGNAL_UNKILLABLE task if it comes from the parent ns.
Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on
sig_task_ignored().
SISGTOP coming from within the namespace is not really right too but at
least debugger can intercept it, and we can't drop it here because this
will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add
another ->ptrace check later, we will see.
Link: http://lkml.kernel.org/r/20171103184206.GB21036@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d39b3cd34e6d323720d4c61bd714f5ae202c022 ]
Since commit 00cd5c37af ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
that we don't clear SIGNAL_UNKILLABLE.
Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 57db7e4a2d92c2d3dfbca4ef8057849b2682436b upstream.
Thomas Gleixner wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
> The kernel prevents sending of siginfo with positive si_code, because
> these codes are reserved for kernel. I think we can allow a task to
> send such a siginfo to itself. This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
> id = timer_create(CLOCK_XXX, ..... signo = SIGX);
> timer_set(id, ....);
> info->si_signo = SIGX;
> info->si_code = SI_TIMER:
> info->_sifields._timer._tid = id;
> info->_sifields._timer._sys_private = 2;
> rt_[tg]sigqueueinfo(..., SIGX, info);
> sigemptyset(&sigset);
> sigaddset(&sigset, SIGX);
> rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
> info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
> do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.
This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.
The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to a si_code of SI_TIMER.
Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.
Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers. This makes it clear why the code cares about preallocated
timers.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 441398d378f29a5ad6d0fcda07918e54e4961800 upstream.
Currently SS_AUTODISARM is not supported in compatibility mode, but does
not return -EINVAL either. This makes dosemu built with -m32 on x86_64
to crash. Also the kernel's sigaltstack selftest fails if compiled with
-m32.
This patch adds the needed support.
Link: http://lkml.kernel.org/r/20170205101213.8163-2-stsp@list.ru
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
Cc: Milosz Tanski <milosz@adfin.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Waiman Long <Waiman.Long@hpe.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce new flags that defines which ABI to use on creating sigframe.
Those flags kernel will set according to sigaction syscall ABI,
which set handler for the signal being delivered.
So that will drop the dependency on TIF_IA32/TIF_X32 flags on signal deliver.
Those flags will be used only under CONFIG_COMPAT.
Similar way ARM uses sa_flags to differ in which mode deliver signal
for 26-bit applications (look at SA_THIRYTWO).
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-7-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We've converted most timeout related syscalls to hrtimers, but
sigtimedwait() did not get this treatment.
Convert it so we get a reasonable accuracy and remove the
user space exposure to the timer wheel properties.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.787164909@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use pr_<level> instead of printk(KERN_<LEVEL> ).
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sigaltstack()'s reported previous state uses a somewhat odd
convention, but the concept of flag bits is new, and we can do the
flag bits sensibly. Specifically, let's just report them directly.
This will allow saving and restoring the sigaltstack state using
sigaltstack() to work correctly.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/94b291ec9fd47741a9264851e316e158ded0b00d.1462296606.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch implements the SS_AUTODISARM flag that can be OR-ed with
SS_ONSTACK when forming ss_flags.
When this flag is set, sigaltstack will be disabled when entering
the signal handler; more precisely, after saving sas to uc_stack.
When leaving the signal handler, the sigaltstack is restored by
uc_stack.
When this flag is used, it is safe to switch from sighandler with
swapcontext(). Without this flag, the subsequent signal will corrupt
the state of the switched-away sighandler.
To detect the support of this functionality, one can do:
err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
if (err && errno == EINVAL)
unsupported();
Signed-off-by: Stas Sergeev <stsp@list.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jason Low <jason.low2@hp.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds SS_FLAG_BITS - the mask that splits sigaltstack
mode values and bit-flags. Since there is no bit-flags yet, the
mask is defined to 0. The flags are added by subsequent patches.
With every new flag, the mask should have the appropriate bit cleared.
This makes sure if some flag is tried on a kernel that doesn't
support it, the -EINVAL error will be returned, because such a
flag will be treated as an invalid mode rather than the bit-flag.
That way the existence of the particular features can be probed
at run-time.
This change was suggested by Andy Lutomirski:
https://lkml.org/lkml/2016/3/6/158
Signed-off-by: Stas Sergeev <stsp@list.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1460665206-13646-3-git-send-email-stsp@list.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
padding) of the part of the struct siginfo that is before the union, and
it is then used to calculate the needed padding (SI_PAD_SIZE) to make
the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.
Depending on the target architecture and word width it equals to either
3 or 4 times sizeof int.
Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
parisc architecture for the 64bit kernel build. It's even more
frustrating, because it can easily be checked at compile time if the
value was defined correctly.
This patch adds such a check for the correctness of
__ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
future architectures from running into the same problem.
I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
make any difference and b) it's used in the Documentation/kmemcheck.txt
example.
I ran this patch through the 0-DAY kernel test infrastructure and only
the parisc architecture triggered as expected. That means that this
patch should be OK for all major architectures.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A protection key fault is very similar to any other access error.
There must be a VMA, etc... We even want to take the same action
(SIGSEGV) that we do with a normal access fault.
However, we do need to let userspace know that something is
different. We do this the same way what we did with SEGV_BNDERR
with Memory Protection eXtensions (MPX): define a new SEGV code:
SEGV_PKUERR.
We add a siginfo field: si_pkey that reveals to userspace which
protection key was set on the PTE that we faulted on. There is
no other easy way for userspace to figure this out. They could
parse smaps but that would be a bit cruel.
We share space with in siginfo with _addr_bnd. #BR faults from
MPX are completely separate from page faults (#PF) that trigger
from protection key violations, so we never need both at the same
time.
Note that _pkey is a 64-bit value. The current hardware only
supports 4-bit protection keys. We do this because there is
_plenty_ of space in _sigfault and it is possible that future
processors would support more than 4 bits of protection keys.
The x86 code to actually fill in the siginfo is in the next
patch.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amanieu d'Antras <amanieu@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A random wakeup can get us out of sigsuspend() without TIF_SIGPENDING
being set.
Avoid that by making sure we were signaled, like sys_pause() does.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sigsuspend() is nowhere used except in signal.c itself, so we can mark it
static do not pollute the global namespace.
But this patch is more than a boring cleanup patch, it fixes a real issue
on UserModeLinux. UML has a special console driver to display ttys using
xterm, or other terminal emulators, on the host side. Vegard reported
that sometimes UML is unable to spawn a xterm and he's facing the
following warning:
WARNING: CPU: 0 PID: 908 at include/linux/thread_info.h:128 sigsuspend+0xab/0xc0()
It turned out that this warning makes absolutely no sense as the UML
xterm code calls sigsuspend() on the host side, at least it tries. But
as the kernel itself offers a sigsuspend() symbol the linker choose this
one instead of the glibc wrapper. Interestingly this code used to work
since ever but always blocked signals on the wrong side. Some recent
kernel change made the WARN_ON() trigger and uncovered the bug.
It is a wonderful example of how much works by chance on computers. :-)
Fixes: 68f3f16d9a ("new helper: sigsuspend()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_will_free_mem() is wrong in many ways, and in particular the
SIGNAL_GROUP_COREDUMP check is not reliable: a task can participate in the
coredumping without SIGNAL_GROUP_COREDUMP bit set.
change zap_threads() paths to always set SIGNAL_GROUP_COREDUMP even if
other CLONE_VM processes can't react to SIGKILL. Fortunately, at least
oom-kill case if fine; it kills all tasks sharing the same mm, so it
should also kill the process which actually dumps the core.
The change in prepare_signal() is not strictly necessary, it just ensures
that the patch does not bring another subtle behavioural change. But it
reminds us that this SIGNAL_GROUP_EXIT/COREDUMP case needs more changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle Walker <kwalker@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Stanislav Kozina <skozina@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is hardly possible to enumerate all problems with block_all_signals()
and unblock_all_signals(). Just for example,
1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
multithreaded. Another thread can dequeue the signal and force the
group stop.
2. Even is the caller is single-threaded, it will "stop" anyway. It
will not sleep, but it will spin in kernel space until SIGCONT or
SIGKILL.
And a lot more. In short, this interface doesn't work at all, at least
the last 10+ years.
Daniel said:
Yeah the only times I played around with the DRM_LOCK stuff was when
old drivers accidentally deadlocked - my impression is that the entire
DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
purging where this leaks out of the drm subsystem.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Dave Airlie <airlied@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function may copy the si_addr_lsb, si_lower and si_upper fields to
user mode when they haven't been initialized, which can leak kernel
stack data to user mode.
Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals. This is solved by
checking the value of si_signo in addition to si_code.
Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function can leak kernel stack data when the user siginfo_t has a
positive si_code value. The top 16 bits of si_code descibe which fields
in the siginfo_t union are active, but they are treated inconsistently
between copy_siginfo_from_user32, copy_siginfo_to_user32 and
copy_siginfo_to_user.
copy_siginfo_from_user32 is called from rt_sigqueueinfo and
rt_tgsigqueueinfo in which the user has full control overthe top 16 bits
of si_code.
This fixes the following information leaks:
x86: 8 bytes leaked when sending a signal from a 32-bit process to
itself. This leak grows to 16 bytes if the process uses x32.
(si_code = __SI_CHLD)
x86: 100 bytes leaked when sending a signal from a 32-bit process to
a 64-bit process. (si_code = -1)
sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
64-bit process. (si_code = any)
parsic and s390 have similar bugs, but they are not vulnerable because
rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
to a different process. These bugs are also fixed for consistency.
Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull security subsystem updates from James Morris:
"The main change in this kernel is Casey's generalized LSM stacking
work, which removes the hard-coding of Capabilities and Yama stacking,
allowing multiple arbitrary "small" LSMs to be stacked with a default
monolithic module (e.g. SELinux, Smack, AppArmor).
See
https://lwn.net/Articles/636056/
This will allow smaller, simpler LSMs to be incorporated into the
mainline kernel and arbitrarily stacked by users. Also, this is a
useful cleanup of the LSM code in its own right"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
vTPM: set virtual device before passing to ibmvtpm_reset_crq
tpm_ibmvtpm: remove unneccessary message level.
ima: update builtin policies
ima: extend "mask" policy matching support
ima: add support for new "euid" policy condition
ima: fix ima_show_template_data_ascii()
Smack: freeing an error pointer in smk_write_revoke_subj()
selinux: fix setting of security labels on NFS
selinux: Remove unused permission definitions
selinux: enable genfscon labeling for sysfs and pstore files
selinux: enable per-file labeling for debugfs files.
selinux: update netlink socket classes
signals: don't abuse __flush_signals() in selinux_bprm_committed_creds()
selinux: Print 'sclass' as string when unrecognized netlink message occurs
Smack: allow multiple labels in onlycap
Smack: fix seq operations in smackfs
ima: pass iint to ima_add_violation()
ima: wrap event related data to the new ima_event_data structure
integrity: add validity checks for 'path' parameter
...
selinux_bprm_committed_creds()->__flush_signals() is not right, we
shouldn't clear TIF_SIGPENDING unconditionally. There can be other
reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
potentially more.
Also change this code to check fatal_signal_pending() rather than
SIGNAL_GROUP_EXIT, it looks a bit better.
Now we can kill __flush_signals() before it finds another buggy user.
Note: this code looks racy, we can flush a signal which was sent after
the task SID has been updated.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
jobctl an "unsigned long". It makes sense to have the masks applied
to it match that type. This is currently just a cosmetic change, but
it will prevent the mask from being unexpectedly truncated if we ever
end up with masks with more bits.
One instance of "signr" is an int, but I left this alone because the
mask ensures that it will never overflow.
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bobby.prani@gmail.com
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: richard@nod.at
Cc: vdavydov@parallels.com
Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
a warning on the first attempt of doing it. We use WARN_ON_ONCE, which is
not informative and, what is worse, taints the kernel, making the trinity
syscall fuzzer complain false-positively from time to time.
It does not look like we need this warning at all, because the behaviour
changed quite a long time ago (2.6.39), and if an application relies on
the old API, it gets EPERM anyway and can issue a warning by itself.
So let us zap the warning in kernel.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>