We can avoid an ifdef over wq_power_efficient's declaration
by just using IS_ENABLED().
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: cocci@systeme.lip6.fr
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: mydongistiny <jaysonedson@gmail.com>
Signed-off-by: Eliminater74 <eliminater74@gmail.com>
(cherry picked from commit 6de9b44a13409c30e13a7948be108b16501db39b)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl50ed8ACgkQONu9yGCS
aT7A8A//WF8aq34T1LLR70ESa6fJVj2dEzCQA94qvUCPHDJfBj1x00R8OGvZfKEc
E7YPI8MPrvMmXnyxImge2UcqCJhmnis+gk1fLn+N3Yi5nfZWrxC1dnShCnCTeye1
tA+BuhyzOXHDmLqrktBcT3TCTrIUWK8lehR8m7jlkAh7wG6Z2QmgKlaiq5fRAVE9
Z0je4fw0HxdVrCM84BJiM1M6w3pa+AYUkSBfqvU+fgm8xfOWGUPZFEitx9O1PwaH
m0UOS51gM5ZSzpAMDBlXYnmUgPxUTlpwqIL77zgMFzYcpKbf2rVphFRxmacWWqQ2
MWnt6vn/eatDeJ6TFAF08APy63SUAR7BvssSYoySz8Y4m6LbL4yUfCWYkL9lhLFH
dJL6TTjUwFw1Rd7UVzHongEr24NASCS37MiR8V/3RUiDqIYtrlaJKxhhVdTKg7mm
xRKA+QBXkGUfwFJm2mL+3E3D7/Xgl4kNo22lp1YgamTgc50/UMB5JEf8Z3aqa/qh
ie/4oKnmaa+SSgOxQmvqS3lWF7RYU+axvc9K0FK5CYSXnc6Jfgtq0yDB5twslj05
HdCwNJX9KOL0NZAzQKs+FLjnhOtnRigANMG7KMm5nmS3vwsxQqIkm0b6tPhgBcd/
k4QQm8h6C5Tdi+R39fGQOgtBK0gVVJl+n3XfA3gbmunvXhSo6IQ=
=fFH1
-----END PGP SIGNATURE-----
Merge 4.9.217 into android-4.9-q
Changes in 4.9.217
NFS: Remove superfluous kmap in nfs_readdir_xdr_to_array
phy: Revert toggling reset changes.
net: phy: Avoid multiple suspends
cgroup, netclassid: periodically release file_lock on classid updating
gre: fix uninit-value in __iptunnel_pull_header
ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface
net: macsec: update SCI upon MAC address change.
net: nfc: fix bounds checking bugs on "pipe"
r8152: check disconnect status after long sleep
bnxt_en: reinitialize IRQs when MTU is modified
fib: add missing attribute validation for tun_id
nl802154: add missing attribute validation
nl802154: add missing attribute validation for dev_type
macsec: add missing attribute validation for port
net: fq: add missing attribute validation for orphan mask
team: add missing attribute validation for port ifindex
team: add missing attribute validation for array index
nfc: add missing attribute validation for SE API
nfc: add missing attribute validation for vendor subcommand
ipvlan: add cond_resched_rcu() while processing muticast backlog
ipvlan: do not add hardware address of master to its unicast filter list
ipvlan: egress mcast packets are not exceptional
ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()
ipvlan: don't deref eth hdr before checking it's set
macvlan: add cond_resched() during multicast processing
net: fec: validate the new settings in fec_enet_set_coalesce()
slip: make slhc_compress() more robust against malicious packets
bonding/alb: make sure arp header is pulled before accessing it
cgroup: memcg: net: do not associate sock with unrelated cgroup
net: phy: fix MDIO bus PM PHY resuming
virtio-blk: fix hw_queue stopped on arbitrary error
iommu/vt-d: quirk_ioat_snb_local_iommu: replace WARN_TAINT with pr_warn + add_taint
workqueue: don't use wq_select_unbound_cpu() for bound works
drm/amd/display: remove duplicated assignment to grph_obj_type
cifs_atomic_open(): fix double-put on late allocation failure
gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
KVM: x86: clear stale x86_emulate_ctxt->intercept value
ARC: define __ALIGN_STR and __ALIGN symbols for ARC
efi: Fix a race and a buffer overflow while reading efivars via sysfs
iommu/vt-d: dmar: replace WARN_TAINT with pr_warn + add_taint
iommu/vt-d: Fix a bug in intel_iommu_iova_to_phys() for huge page
nl80211: add missing attribute validation for critical protocol indication
nl80211: add missing attribute validation for beacon report scanning
nl80211: add missing attribute validation for channel switch
netfilter: cthelper: add missing attribute validation for cthelper
mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()
iommu/vt-d: Fix the wrong printing in RHSA parsing
iommu/vt-d: Ignore devices with out-of-spec domain number
ipv6: restrict IPV6_ADDRFORM operation
efi: Add a sanity check to efivar_store_raw()
batman-adv: Fix double free during fragment merge error
batman-adv: Fix transmission of final, 16th fragment
batman-adv: Initialize gw sel_class via batadv_algo
batman-adv: Fix rx packet/bytes stats on local ARP reply
batman-adv: Use default throughput value on cfg80211 error
batman-adv: Accept only filled wifi station info
batman-adv: fix TT sync flag inconsistencies
batman-adv: Avoid spurious warnings from bat_v neigh_cmp implementation
batman-adv: Always initialize fragment header priority
batman-adv: Fix check of retrieved orig_gw in batadv_v_gw_is_eligible
batman-adv: Fix lock for ogm cnt access in batadv_iv_ogm_calc_tq
batman-adv: Fix internal interface indices types
batman-adv: Avoid race in TT TVLV allocator helper
batman-adv: Fix TT sync flags for intermediate TT responses
batman-adv: prevent TT request storms by not sending inconsistent TT TLVLs
batman-adv: Fix debugfs path for renamed hardif
batman-adv: Fix debugfs path for renamed softif
batman-adv: Avoid storing non-TT-sync flags on singular entries too
batman-adv: Fix multicast TT issues with bogus ROAM flags
batman-adv: Prevent duplicated gateway_node entry
batman-adv: Fix duplicated OGMs on NETDEV_UP
batman-adv: Avoid free/alloc race when handling OGM2 buffer
batman-adv: Avoid free/alloc race when handling OGM buffer
batman-adv: Don't schedule OGM for disabled interface
batman-adv: update data pointers after skb_cow()
batman-adv: Avoid probe ELP information leak
batman-adv: Use explicit tvlv padding for ELP packets
perf/amd/uncore: Replace manual sampling check with CAP_NO_INTERRUPT flag
ACPI: watchdog: Allow disabling WDAT at boot
HID: apple: Add support for recent firmware on Magic Keyboards
HID: i2c-hid: add Trekstor Surfbook E11B to descriptor override
cfg80211: check reg_rule for NULL in handle_channel_custom()
net: ks8851-ml: Fix IRQ handling and locking
mac80211: rx: avoid RCU list traversal under mutex
signal: avoid double atomic counter increments for user accounting
jbd2: fix data races at struct journal_head
ARM: 8957/1: VDSO: Match ARMv8 timer in cntvct_functional()
ARM: 8958/1: rename missed uaccess .fixup section
mm: slub: add missing TID bump in kmem_cache_alloc_bulk()
ipv4: ensure rcu_read_lock() in cipso_v4_error()
Linux 4.9.217
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia7aeed273cd7548dc8d0dfaaad8b96bedfe499b1
commit aa202f1f56960c60e7befaa0f49c72b8fa11b0a8 upstream.
wq_select_unbound_cpu() is designed for unbound workqueues only, but
it's wrongly called when using a bound workqueue too.
Fixing this ensures work queued to a bound workqueue with
cpu=WORK_CPU_UNBOUND always runs on the local CPU.
Before, that would happen only if wq_unbound_cpumask happened to include
it (likely almost always the case), or was empty, or we got lucky with
forced round-robin placement. So restricting
/sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
CPUs would cause some bound work items to run unexpectedly there.
Fixes: ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Hillf Danton <hdanton@sina.com>
[dj: massage changelog]
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl396QsACgkQONu9yGCS
aT63DA//V6C58MC6I6ISv/2zcGrWOzfP4q1et4lu1eBPgUTzbnB7gdnwH2V6l3QU
Kt9V4GUI6GFsVmRGsEJZjg4kvA1HDjx6sbs3odHuvDnhGFrHstnxKOJiuZeQU4Ph
wxDEQghyeYod+2VDKNWw3gxjBgjP95eAvZMmhnQ13dwY20HZgGjoUMHBKpKpwOly
ZwV0YWsW9Zko8bTmuyrqXUG8ChSWZCWMuNKNpUDIjVj/NPcAyEAr9wkYOGf8UnGq
rK/owI0Y02qZJS9ZGxv4IDNjW7yjIGQzSYWUrY8ajSQzFqHErWSQ7tGwJ8fb5EBM
0zH2GGp8XLXXOCMMrg8nsTlNaj+wg7a5C0Vx0VtijYyIQCE39M91rltO+oVBWdjR
jCUbp//2v7XFN+eQSvwuWHrUyOmGGj9C0mDGoMC4rSHES4jYS4b4wHzn9CK8Iqy2
izvNjz93TrNj+YLLUlf6tTSUfWYuGSFNyahbhjFL7MPBp2RUJM1f5uzhBj2sQqwN
olCUvsNSZRWkR5/f15/kdEyPAhYt0i66aV3JNpEaRHZDqpUXAMTRv+AgtEPBNA5r
mORdDq9Zw+m+BGYSiuotArINRY8PSn1tP8tnhNjGE0RwnAx+S0Fs6+0AyHhEY+wQ
OBpNfRTaJep9L1Yl4Hj4ZDU6Lr5xqWLoplV2X9OIUzLleAY44+M=
=Gqb9
-----END PGP SIGNATURE-----
Merge 4.9.207 into android-4.9-q
Changes in 4.9.207
arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
usb: gadget: u_serial: add missing port entry locking
tty: serial: fsl_lpuart: use the sg count from dma_map_sg
tty: serial: msm_serial: Fix flow control
serial: pl011: Fix DMA ->flush_buffer()
serial: serial_core: Perform NULL checks for break_ctl ops
serial: ifx6x60: add missed pm_runtime_disable
autofs: fix a leak in autofs_expire_indirect()
RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
exportfs_decode_fh(): negative pinned may become positive without the parent locked
audit_get_nd(): don't unlock parent too early
NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
Input: cyttsp4_core - fix use after free bug
ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
rsxx: add missed destroy_workqueue calls in remove
net: ep93xx_eth: fix mismatch of request_mem_region in remove
serial: core: Allow processing sysrq at port unlock time
cxgb4vf: fix memleak in mac_hlist initialization
iwlwifi: mvm: Send non offchannel traffic via AP sta
ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
net/mlx5: Release resource on error flow
extcon: max8997: Fix lack of path setting in USB device mode
clk: rockchip: fix rk3188 sclk_smc gate data
clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
dlm: fix missing idr_destroy for recover_idr
MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
scsi: zfcp: drop default switch case which might paper over missing case
pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
Staging: iio: adt7316: Fix i2c data reading, set the data field
regulator: Fix return value of _set_load() stub
MIPS: OCTEON: octeon-platform: fix typing
math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
rtc: dt-binding: abx80x: fix resistance scale
ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
media: pulse8-cec: return 0 when invalidating the logical address
dmaengine: coh901318: Fix a double-lock bug
dmaengine: coh901318: Remove unused variable
usb: dwc3: don't log probe deferrals; but do log other error codes
ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
dma-mapping: fix return type of dma_set_max_seg_size()
altera-stapl: check for a null key before strcasecmp'ing it
serial: imx: fix error handling in console_setup
i2c: imx: don't print error message on probe defer
dlm: NULL check before kmem_cache_destroy is not needed
ARM: debug: enable UART1 for socfpga Cyclone5
nfsd: fix a warning in __cld_pipe_upcall()
ARM: OMAP1/2: fix SoC name printing
net/x25: fix called/calling length calculation in x25_parse_address_block
net/x25: fix null_x25_address handling
ARM: dts: mmp2: fix the gpio interrupt cell number
ARM: dts: realview-pbx: Fix duplicate regulator nodes
tcp: fix off-by-one bug on aborting window-probing socket
tcp: fix SNMP TCP timeout under-estimation
modpost: skip ELF local symbols during section mismatch check
kbuild: fix single target build for external module
mtd: fix mtd_oobavail() incoherent returned value
ARM: dts: pxa: clean up USB controller nodes
clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
ARM: dts: realview: Fix some more duplicate regulator nodes
dlm: fix invalid cluster name warning
net/mlx4_core: Fix return codes of unsupported operations
powerpc/math-emu: Update macros from GCC
MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
nfsd: Return EPERM, not EACCES, in some SETATTR cases
tty: Don't block on IO when ldisc change is pending
media: stkwebcam: Bugfix for wrong return values
mlx4: Use snprintf instead of complicated strcpy
ARM: dts: sunxi: Fix PMU compatible strings
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
fuse: verify nlink
fuse: verify attributes
ALSA: pcm: oss: Avoid potential buffer overflows
Input: goodix - add upside-down quirk for Teclast X89 tablet
coresight: etm4x: Fix input validation for sysfs.
x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
CIFS: Fix SMB2 oplock break processing
tty: vt: keyboard: reject invalid keycodes
can: slcan: Fix use-after-free Read in slcan_open
jbd2: Fix possible overflow in jbd2_log_space_left()
drm/i810: Prevent underflow in ioctl
KVM: x86: do not modify masked bits of shared MSRs
KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
crypto: ccp - fix uninitialized list head
crypto: ecdh - fix big endian bug in ECC library
crypto: user - fix memory leak in crypto_report
spi: atmel: Fix CS high support
RDMA/qib: Validate ->show()/store() callbacks before calling them
thermal: Fix deadlock in thermal thermal_zone_device_check
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
appletalk: Fix potential NULL pointer dereference in unregister_snap_client
appletalk: Set error code if register_snap_client failed
usb: gadget: configfs: Fix missing spin_lock_init()
USB: uas: honor flag to avoid CAPACITY16
USB: uas: heed CAPACITY_HEURISTICS
usb: Allow USB device to be warm reset in suspended state
staging: rtl8188eu: fix interface sanity check
staging: rtl8712: fix interface sanity check
staging: gigaset: fix general protection fault on probe
staging: gigaset: fix illegal free on probe errors
staging: gigaset: add endpoint-type sanity check
xhci: Increase STS_HALT timeout in xhci_suspend()
ARM: dts: pandora-common: define wl1251 as child node of mmc3
iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
USB: atm: ueagle-atm: add missing endpoint check
USB: idmouse: fix interface sanity checks
USB: serial: io_edgeport: fix epic endpoint lookup
USB: adutux: fix interface sanity check
usb: core: urb: fix URB structure initialization function
usb: mon: Fix a deadlock in usbmon between mmap and read
mtd: spear_smi: Fix Write Burst mode
virtio-balloon: fix managed page counts when migrating pages between zones
btrfs: check page->mapping when loading free space cache
btrfs: Remove btrfs_bio::flags member
Btrfs: send, skip backreference walking for extents with many references
btrfs: record all roots for rename exchange on a subvol
rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
rtlwifi: rtl8192de: Fix missing enable interrupt flag
lib: raid6: fix awk build warnings
ALSA: hda - Fix pending unsol events at shutdown
workqueue: Fix spurious sanity check failures in destroy_workqueue()
workqueue: Fix pwq ref leak in rescuer_thread()
ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
blk-mq: avoid sysfs buffer overflow with too many CPU cores
cgroup: pids: use atomic64_t for pids->limit
ar5523: check NULL before memcpy() in ar5523_cmd()
media: bdisp: fix memleak on release
media: radio: wl1273: fix interrupt masking on release
cpuidle: Do not unset the driver if it is there already
PM / devfreq: Lock devfreq in trans_stat_show
ACPI: OSL: only free map once in osl.c
ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
ACPI: PM: Avoid attaching ACPI PM domain to certain devices
pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
pinctrl: samsung: Fix device node refcount leaks in init code
mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
ppdev: fix PPGETTIME/PPSETTIME ioctls
powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
video/hdmi: Fix AVI bar unpack
quota: Check that quota is not dirty before release
ext2: check err when partial != NULL
quota: fix livelock in dquot_writeback_dquots
scsi: zfcp: trace channel log even for FCP command responses
usb: xhci: only set D3hot for pci device
xhci: Fix memory leak in xhci_add_in_port()
xhci: make sure interrupts are restored to correct state
iio: adis16480: Add debugfs_reg_access entry
Btrfs: fix negative subv_writers counter and data space leak after buffered write
omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
scsi: lpfc: Cap NPIV vports to 256
e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
ath10k: fix fw crash by moving chip reset after napi disabled
ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
scsi: qla2xxx: Fix DMA unmap leak
scsi: qla2xxx: Fix session lookup in qlt_abort_work()
scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
powerpc: Fix vDSO clock_getres()
reiserfs: fix extended attributes on the root directory
firmware: qcom: scm: Ensure 'a0' status code is treated as signed
mm/shmem.c: cast the type of unmap_start to u64
ext4: fix a bug in ext4_wait_for_tail_page_commit
blk-mq: make sure that line break can be printed
workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
sunrpc: fix crash when cache_head become valid before update
net/mlx5e: Fix SFF 8472 eeprom length
kernel/module.c: wakeup processes in module_wq on module unload
nvme: host: core: fix precedence of ternary operator
net: bridge: deny dev_set_mac_address() when unregistering
net: ethernet: ti: cpsw: fix extra rx interrupt
openvswitch: support asymmetric conntrack
tcp: md5: fix potential overestimation of TCP option space
tipc: fix ordering of tipc module init and exit routine
inet: protect against too small mtu values.
tcp: fix rejected syncookies due to stale timestamps
tcp: tighten acceptance of ACKs not matching a child socket
tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
Revert "regulator: Defer init completion for a while after late_initcall"
PCI: Fix Intel ACS quirk UPDCR register address
PCI/MSI: Fix incorrect MSI-X masking on resume
xtensa: fix TLB sanity checker
CIFS: Respect O_SYNC and O_DIRECT flags during reconnect
ARM: dts: s3c64xx: Fix init order of clock providers
ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume()
vfio/pci: call irq_bypass_unregister_producer() before freeing irq
dma-buf: Fix memory leak in sync_file_merge()
dm btree: increase rebalance threshold in __rebalance2()
scsi: iscsi: Fix a potential deadlock in the timeout handler
drm/radeon: fix r1xx/r2xx register checker for POT textures
xhci: fix USB3 device initiated resume race with roothub autosuspend
net: stmmac: use correct DMA buffer size in the RX descriptor
net: stmmac: don't stop NAPI processing when dropping a packet
Linux 4.9.207
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e66b39af00f426b3356b96433d620cb3367ba1ff upstream.
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration. Unfortunately, it doesn't check whether the pwq is
already on the mayday list and unconditionally gets the ref and moves
it onto the list. This doesn't corrupt the list but creates an
additional reference to the pwq. It got queued twice but will only be
removed once.
This leak later can trigger pwq refcnt warning on workqueue
destruction and prevent freeing of the workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # v3.19+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit def98c84b6cdf2eeea19ec5736e90e316df5206b upstream.
Before actually destrying a workqueue, destroy_workqueue() checks
whether it's actually idle. If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling. It unfortunately
has a couple issues.
* Mayday list queueing increments pwq's refcnts which gets detected as
busy and fails the sanity checks. However, because mayday list
queueing is asynchronous, this condition can happen without any
actual work items left in the workqueue.
* Sanity check failure leaves the sysfs interface behind too which can
lead to init failure of newer instances of the workqueue.
This patch fixes the above two by
* If a workqueue has a rescuer, disable and kill the rescuer before
sanity checks. Disabling and killing is guaranteed to flush the
existing mayday list.
* Remove sysfs interface before sanity checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I2877fec3d381b1006b8bd1261895fdfd68bd21db
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Workqueue is currently initialized in an early init call; however,
there are cases where early boot code has to be split and reordered to
come after workqueue initialization or the same code path which makes
use of workqueues is used both before workqueue initailization and
after. The latter cases have to gate workqueue usages with
keventd_up() tests, which is nasty and easy to get wrong.
Workqueue usages have become widespread and it'd be a lot more
convenient if it can be used very early from boot. This patch splits
workqueue initialization into two steps. workqueue_init_early() which
sets up the basic data structures so that workqueues can be created
and work items queued, and workqueue_init() which actually brings up
workqueues online and starts executing queued work items. The former
step can be done very early during boot once memory allocation,
cpumasks and idr are initialized. The latter right after kthreads
become available.
This allows work item queueing and canceling from very early boot
which is what most of these use cases want.
* As systemd_wq being initialized doesn't indicate that workqueue is
fully online anymore, update keventd_up() to test wq_online instead.
The follow-up patches will get rid of all its usages and the
function itself.
* Flushing doesn't make sense before workqueue is fully initialized.
The flush functions trigger WARN and return immediately before fully
online.
* Work items are never in-flight before fully online. Canceling can
always succeed by skipping the flush step.
* Some code paths can no longer assume to be called with irq enabled
as irq is disabled during early boot. Use irqsave/restore
operations instead.
v2: Watchdog init, which requires timer to be running, moved from
workqueue_init_early() to workqueue_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@mail.gmail.com
(cherry picked from commit 3347fa0928210d96aaa2bd6cd5a8391d5e630873)
Bug: 111308141
Test: modified lmkd to use PSI and tested using lmkd_unit_test
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I6348ea0969efd7142d5e61aa350904834a82d229
[ Upstream commit 537f4146c53c95aac977852b371bafb9c6755ee1 ]
Never directly free @dev after calling device_register(), even
if it returned an error! Always use put_device() to give up the
reference initialized in this function instead.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream.
Introduce a helper to retrieve the current task's work struct if it is
a workqueue worker.
This allows us to fix a long-standing deadlock in several DRM drivers
wherein the ->runtime_suspend callback waits for a specific worker to
finish and that worker in turn calls a function which waits for runtime
suspend to finish. That function is invoked from multiple call sites
and waiting for runtime suspend to finish is the correct thing to do
except if it's executing in the context of the worker.
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 62635ea8c18f0f62df4cc58379e4f1d33afd5801 upstream.
show_workqueue_state() can print out a lot of messages while being in
atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
device is slow it may end up triggering NMI hard lockup watchdog.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d ]
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream.
Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
lockdep:
[ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
[ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
[ 1270.473240] -----------------------------------------------------
[ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
[ 1270.474994]
[ 1270.474994] and this task is already holding:
[ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
[ 1270.476046] which would create a new lock dependency:
[ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
[ 1270.476949]
[ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
[ 1270.477553] (&pool->lock/1){-.-.}
...
[ 1270.488900] to a HARDIRQ-irq-unsafe lock:
[ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
...
[ 1270.494735] Possible interrupt unsafe locking scenario:
[ 1270.494735]
[ 1270.495250] CPU0 CPU1
[ 1270.495600] ---- ----
[ 1270.495947] lock(&(&lock->wait_lock)->rlock);
[ 1270.496295] local_irq_disable();
[ 1270.496753] lock(&pool->lock/1);
[ 1270.497205] lock(&(&lock->wait_lock)->rlock);
[ 1270.497744] <Interrupt>
[ 1270.497948] lock(&pool->lock/1);
, which will cause a irq inversion deadlock if the above lock scenario
happens.
The root cause of this safe -> unsafe lock order is the
mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
held.
Unlocking mutex while holding an irq spinlock was never safe and this
problem has been around forever but it never got noticed because the
only time the mutex is usually trylocked while holding irqlock making
actual failures very unlikely and lockdep annotation missed the
condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
lockdep_assert_held() fail").
Using mutex for pool->manager_arb has always been a bit of stretch.
It primarily is an mechanism to arbitrate managership between workers
which can easily be done with a pool flag. The only reason it became
a mutex is that pool destruction path wants to exclude parallel
managing operations.
This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
and make the destruction path wait for the current manager on a wait
queue.
v2: Drop unnecessary flag clearing before pool destruction as
suggested by Boqun.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0a94efb5acbb6980d7c9ab604372d93cd507e4d8 upstream.
5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1. Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.
This patch distinguishes explicit and implicit ordered setting and
overrides from attribute changes if implict.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c0338c68706be53b3dc472e4308961c36e4ece1 upstream.
The combination of WQ_UNBOUND and max_active == 1 used to imply
ordered execution. After NUMA affinity 4c16bd327c ("workqueue:
implement NUMA affinity for unbound workqueues"), this is no longer
true due to per-node worker pools.
While the right way to create an ordered workqueue is
alloc_ordered_workqueue(), the documentation has been misleading for a
long time and people do use WQ_UNBOUND and max_active == 1 for ordered
workqueues which can lead to subtle bugs which are very difficult to
trigger.
It's unlikely that we'd see noticeable performance impact by enforcing
ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's
automatically set __WQ_ORDERED for those workqueues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: Alexei Potashnik <alexei@purestorage.com>
Fixes: 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "kthread: Kthread worker API improvements"
The intention of this patchset is to make it easier to manipulate and
maintain kthreads. Especially, I want to replace all the custom main
cycles with a generic one. Also I want to make the kthreads sleep in a
consistent state in a common place when there is no work.
This patch (of 11):
A good practice is to prefix the names of functions by the name of the
subsystem.
This patch fixes the name of probe_kthread_data(). The other wrong
functions names are part of the kthread worker API and will be fixed
separately.
Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull smp hotplug updates from Thomas Gleixner:
"This is the next part of the hotplug rework.
- Convert all notifiers with a priority assigned
- Convert all CPU_STARTING/DYING notifiers
The final removal of the STARTING/DYING infrastructure will happen
when the merge window closes.
Another 700 hundred line of unpenetrable maze gone :)"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
timers/core: Correct callback order during CPU hot plug
leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
powerpc/numa: Convert to hotplug state machine
arm/perf: Fix hotplug state machine conversion
irqchip/armada: Avoid unused function warnings
ARC/time: Convert to hotplug state machine
clocksource/atlas7: Convert to hotplug state machine
clocksource/armada-370-xp: Convert to hotplug state machine
clocksource/exynos_mct: Convert to hotplug state machine
clocksource/arm_global_timer: Convert to hotplug state machine
rcu: Convert rcutree to hotplug state machine
KVM/arm/arm64/vgic-new: Convert to hotplug state machine
smp/cfd: Convert core to hotplug state machine
x86/x2apic: Convert to CPU hotplug state machine
profile: Convert to hotplug state machine
timers/core: Convert to hotplug state machine
hrtimer: Convert to hotplug state machine
x86/tboot: Convert to hotplug state machine
arm64/armv8 deprecated: Convert to hotplug state machine
hwtracing/coresight-etm4x: Convert to hotplug state machine
...
* pm-sleep:
PM / hibernate: Introduce test_resume mode for hibernation
x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
PM / hibernate: Image data protection during restoration
PM / hibernate: Add missing braces in __register_nosave_region()
PM / hibernate: Clean up comments in snapshot.c
PM / hibernate: Clean up function headers in snapshot.c
PM / hibernate: Add missing braces in hibernate_setup()
PM / hibernate: Recycle safe pages after image restoration
PM / hibernate: Simplify mark_unsafe_pages()
PM / hibernate: Do not free preallocated safe pages during image restore
PM / suspend: show workqueue state in suspend flow
PM / sleep: make PM notifiers called symmetrically
PM / sleep: Make pm_prepare_console() return void
PM / Hibernate: Don't let kasan instrument snapshot.c
* pm-tools:
PM / tools: scripts: AnalyzeSuspend v4.2
tools/turbostat: allow user to alter DESTDIR and PREFIX
Get rid of the prio ordering of the separate notifiers and use a proper state
callback pair.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If freezable workqueue aborts suspend flow, show
workqueue state for debug purpose.
Signed-off-by: Roger Lu <roger.lu@mediatek.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
With commit e9d867a67f ("sched: Allow per-cpu kernel threads to
run on online && !active"), __set_cpus_allowed_ptr() expects that only
strict per-cpu kernel threads can have affinity to an online CPU which
is not yet active.
This assumption is currently broken in the CPU_ONLINE notification
handler for the workqueues where restore_unbound_workers_cpumask()
calls set_cpus_allowed_ptr() when the first cpu in the unbound
worker's pool->attr->cpumask comes online. Since
set_cpus_allowed_ptr() is called with pool->attr->cpumask in which
only one CPU is online which is not yet active, we get the following
WARN_ON during an CPU online operation.
------------[ cut here ]------------
WARNING: CPU: 40 PID: 248 at kernel/sched/core.c:1166
__set_cpus_allowed_ptr+0x228/0x2e0
Modules linked in:
CPU: 40 PID: 248 Comm: cpuhp/40 Not tainted 4.6.0-autotest+ #4
<..snip..>
Call Trace:
[c000000f273ff920] [c00000000010493c] __set_cpus_allowed_ptr+0x2cc/0x2e0 (unreliable)
[c000000f273ffac0] [c0000000000ed4b0] workqueue_cpu_up_callback+0x2c0/0x470
[c000000f273ffb70] [c0000000000f5c58] notifier_call_chain+0x98/0x100
[c000000f273ffbc0] [c0000000000c5ed0] __cpu_notify+0x70/0xe0
[c000000f273ffc00] [c0000000000c6028] notify_online+0x38/0x50
[c000000f273ffc30] [c0000000000c5214] cpuhp_invoke_callback+0x84/0x250
[c000000f273ffc90] [c0000000000c562c] cpuhp_up_callbacks+0x5c/0x120
[c000000f273ffce0] [c0000000000c64d4] cpuhp_thread_fun+0x184/0x1c0
[c000000f273ffd20] [c0000000000fa050] smpboot_thread_fn+0x290/0x2a0
[c000000f273ffd80] [c0000000000f45b0] kthread+0x110/0x130
[c000000f273ffe30] [c000000000009570] ret_from_kernel_thread+0x5c/0x6c
---[ end trace 00f1456578b2a3b2 ]---
This patch fixes this by limiting the mask to the intersection of
the pool affinity and online CPUs.
Changelog-cribbed-from: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When activating a static object we need make sure that the object is
tracked in the object tracker. If it is a non-static object then the
activation is illegal.
In previous implementation, each subsystem need take care of this in
their fixup callbacks. Actually we can put it into debugobjects core.
Thus we can save duplicated code, and have *pure* fixup callbacks.
To achieve this, a new callback "is_static_object" is introduced to let
the type specific code decide whether a object is static or not. If
yes, we take it into object tracker, otherwise give warning and invoke
fixup callback.
This change has paassed debugobjects selftest, and I also do some test
with all debugobjects supports enabled.
At last, I have a concern about the fixups that can it change the object
which is in incorrect state on fixup? Because the 'addr' may not point
to any valid object if a non-static object is not tracked. Then Change
such object can overwrite someone's memory and cause unexpected
behaviour. For example, the timer_fixup_activate bind timer to function
stub_timer.
Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
[changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
Signed-off-by: Du, Changbin <changbin.du@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Triplett <josh@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the return type to use bool instead of int, corresponding to
change (debugobjects: make fixup functions return bool instead of int)
Signed-off-by: Du, Changbin <changbin.du@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Triplett <josh@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull workqueue fix from Tejun Heo:
"CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
DOWN_PREPARE which can trigger a WARN_ON() in workqueue.
The bug has been there for a very long time. It only triggers if CPU
down fails at a specific point and I don't think it has adverse
effects other than the warning messages. The fix is very low impact"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix rebind bound workers warning
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced by below config w/ nohz_full= all cpus:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty(unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. There is a priority set to every
notifier_blocks:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
notifier_blocks behind tick_nohz_cpu_down will not be called any
more, which leads to workers are actually not unbound. Then hotplug
state machine will fallback to undo and online cpu 0 again. Workers
will be rebound unconditionally even if they are not unbound and
trigger the warning in this progress.
This patch fix it by catching !DISASSOCIATED to avoid rebind bound
workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Pull workqueue fix from Tejun Heo:
"So, it turns out we had a silly bug in the most fundamental part of
workqueue for a very long time. AFAICS, this dates back to pre-git
era and has quite likely been there from the time workqueue was first
introduced.
A work item uses its PENDING bit to synchronize multiple queuers.
Anyone who wins the PENDING bit owns the pending state of the work
item. Whether a queuer wins or loses the race, one thing should be
guaranteed - there will soon be at least one execution of the work
item - where "after" means that the execution instance would be able
to see all the changes that the queuer has made prior to the queueing
attempt.
Unfortunately, we were missing a smp_mb() after clearing PENDING for
execution, so nothing guaranteed visibility of the changes that a
queueing loser has made, which manifested as a reproducible blk-mq
stall.
Lots of kudos to Roman for debugging the problem. The patch for
-stable is the minimal one. For v3.7, Peter is working on a patch to
make the code path slightly more efficient and less fragile"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix ghost PENDING flag while doing MQ IO
The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:
[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50
The underlying device is a null_blk, with default parameters:
queue_mode = MQ
submit_queues = 1
Verification that nullb0 has something inflight:
root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...
During debug it became clear that stalled request is always inserted in
the rq_list from the following path:
save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56
So blk_flush_plug_list() was called with from_schedule == true.
If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().
That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.
Further debugging shows the following traces from different CPUs:
CPU#0 CPU#1
---------------------------------- -------------------------------
reqeust A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
request B inserted
STORE hctx->ctx_map[1] bit marked
kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed
As a result request B pended forever.
This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.
The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull workqueue updates from Tejun Heo:
"Three trivial workqueue changes"
* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix comment for work_on_cpu()
sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()
workqueue: Replace usage of init_name with dev_set_name()
$ make tags
GEN tags
ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
Which are all the result of the DEFINE_PER_CPU pattern:
scripts/tags.sh:200: '/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
scripts/tags.sh:201: '/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
The below cures them. All except the workqueue one are within reasonable
distance of the 80 char limit. TJ do you have any preference on how to
fix the wq one, or shall we just not care its too long?
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function is processed in thread context, not in user context.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Given that wq_worker_sleeping() could only be called for a
CPU it is running on, we do not need passing a CPU ID as an
argument.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The init_name property of the device struct is sort of a hack and should
only be used for statically allocated devices. Since the device is
dynamically allocated here it is safe to use the proper way to set a
devices name by calling dev_set_name().
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node. However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.
This has always been broken but hasn't triggered often enough before
874bbfe600 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcifully assigns the local CPU for
delayed work items without explicit target CPU to fix a different
issue. This widens the window where CPU can go offline while a
delayed work item is pending causing delayed work items dispatched
with target CPU set to an already offlined CPU. The resulting
NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
NULL pool_workqueue and thus crash.
While 874bbfe600 has been reverted for a different reason making the
bug less visible again, it can still happen. Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround. The long term solution is keeping CPU
-> NODE mapping stable across CPU off/online cycles which is being
worked on.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
Workqueue used to guarantee local execution for work items queued
without explicit target CPU. The guarantee is gone now which can
break some usages in subtle ways. To flush out those cases, this
patch implements a debug feature which forces round-robin CPU
selection for all such work items.
The debug feature defaults to off and can be enabled with a kernel
parameter. The default can be flipped with a debug config option.
If you hit this commit during bisection, please refer to 041bd12e27
("Revert "workqueue: make sure delayed work run in local cpu"") for
more information and ping me.
Signed-off-by: Tejun Heo <tj@kernel.org>
WORK_CPU_UNBOUND work items queued to a bound workqueue always run
locally. This is a good thing normally, but not when the user has
asked us to keep unbound work away from certain CPUs. Round robin
these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
trumps performance.
tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty
(wq_unbound_cpumask AND cpu_online_mask). If we want that, it
should be done when config changes.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit 874bbfe600.
Workqueue used to implicity guarantee that work items queued without
explicit CPU specified are put on the local CPU. Recent changes in
timer broke the guarantee and led to vmstat breakage which was fixed
by 176bed1de5 ("vmstat: explicitly schedule per-cpu work on the CPU
we need it to run on").
vmstat is the most likely to expose the issue and it's quite possible
that there are other similar problems which are a lot more difficult
to trigger. As a preventive measure, 874bbfe600 ("workqueue: make
sure delayed work run in local cpu") was applied to restore the local
CPU guarnatee. Unfortunately, the change exposed a bug in timer code
which got fixed by 22b886dd10 ("timers: Use proper base migration in
add_timer_on()"). Due to code restructuring, the commit couldn't be
backported beyond certain point and stable kernels which only had
874bbfe600 started crashing.
The local CPU guarantee was accidental more than anything else and we
want to get rid of it anyway. As, with the vmstat case fixed,
874bbfe600 is causing more problems than it's fixing, it has been
decided to take the chance and officially break the guarantee by
reverting the commit. A debug feature will be added to force foreign
CPU assignment to expose cases relying on the guarantee and fixes for
the individual cases will be backported to stable as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 874bbfe600 ("workqueue: make sure delayed work run in local cpu")
Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
Cc: stable@vger.kernel.org
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shli@fb.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Bilik <daniel.bilik@neosystem.cz>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
fca839c00a ("workqueue: warn if memory reclaim tries to flush
!WQ_MEM_RECLAIM workqueue") implemented flush dependency warning which
triggers if a PF_MEMALLOC task or WQ_MEM_RECLAIM workqueue tries to
flush a !WQ_MEM_RECLAIM workquee.
This assumes that workqueues marked with WQ_MEM_RECLAIM sit in memory
reclaim path and making it depend on something which may need more
memory to make forward progress can lead to deadlocks. Unfortunately,
workqueues created with the legacy create*_workqueue() interface
always have WQ_MEM_RECLAIM regardless of whether they are depended
upon memory reclaim or not. These spurious WQ_MEM_RECLAIM markings
cause spurious triggering of the flush dependency checks.
WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2361 check_flush_dependency+0x138/0x144()
workqueue: WQ_MEM_RECLAIM deferwq:deferred_probe_work_func is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
...
Workqueue: deferwq deferred_probe_work_func
[<c0017acc>] (unwind_backtrace) from [<c0013134>] (show_stack+0x10/0x14)
[<c0013134>] (show_stack) from [<c0245f18>] (dump_stack+0x94/0xd4)
[<c0245f18>] (dump_stack) from [<c0026f9c>] (warn_slowpath_common+0x80/0xb0)
[<c0026f9c>] (warn_slowpath_common) from [<c0026ffc>] (warn_slowpath_fmt+0x30/0x40)
[<c0026ffc>] (warn_slowpath_fmt) from [<c00390b8>] (check_flush_dependency+0x138/0x144)
[<c00390b8>] (check_flush_dependency) from [<c0039ca0>] (flush_work+0x50/0x15c)
[<c0039ca0>] (flush_work) from [<c00c51b0>] (lru_add_drain_all+0x130/0x180)
[<c00c51b0>] (lru_add_drain_all) from [<c00f728c>] (migrate_prep+0x8/0x10)
[<c00f728c>] (migrate_prep) from [<c00bfbc4>] (alloc_contig_range+0xd8/0x338)
[<c00bfbc4>] (alloc_contig_range) from [<c00f8f18>] (cma_alloc+0xe0/0x1ac)
[<c00f8f18>] (cma_alloc) from [<c001cac4>] (__alloc_from_contiguous+0x38/0xd8)
[<c001cac4>] (__alloc_from_contiguous) from [<c001ceb4>] (__dma_alloc+0x240/0x278)
[<c001ceb4>] (__dma_alloc) from [<c001cf78>] (arm_dma_alloc+0x54/0x5c)
[<c001cf78>] (arm_dma_alloc) from [<c0355ea4>] (dmam_alloc_coherent+0xc0/0xec)
[<c0355ea4>] (dmam_alloc_coherent) from [<c039cc4c>] (ahci_port_start+0x150/0x1dc)
[<c039cc4c>] (ahci_port_start) from [<c0384734>] (ata_host_start.part.3+0xc8/0x1c8)
[<c0384734>] (ata_host_start.part.3) from [<c03898dc>] (ata_host_activate+0x50/0x148)
[<c03898dc>] (ata_host_activate) from [<c039d558>] (ahci_host_activate+0x44/0x114)
[<c039d558>] (ahci_host_activate) from [<c039f05c>] (ahci_platform_init_host+0x1d8/0x3c8)
[<c039f05c>] (ahci_platform_init_host) from [<c039e6bc>] (tegra_ahci_probe+0x448/0x4e8)
[<c039e6bc>] (tegra_ahci_probe) from [<c0347058>] (platform_drv_probe+0x50/0xac)
[<c0347058>] (platform_drv_probe) from [<c03458cc>] (driver_probe_device+0x214/0x2c0)
[<c03458cc>] (driver_probe_device) from [<c0343cc0>] (bus_for_each_drv+0x60/0x94)
[<c0343cc0>] (bus_for_each_drv) from [<c03455d8>] (__device_attach+0xb0/0x114)
[<c03455d8>] (__device_attach) from [<c0344ab8>] (bus_probe_device+0x84/0x8c)
[<c0344ab8>] (bus_probe_device) from [<c0344f48>] (deferred_probe_work_func+0x68/0x98)
[<c0344f48>] (deferred_probe_work_func) from [<c003b738>] (process_one_work+0x120/0x3f8)
[<c003b738>] (process_one_work) from [<c003ba48>] (worker_thread+0x38/0x55c)
[<c003ba48>] (worker_thread) from [<c0040f14>] (kthread+0xdc/0xf4)
[<c0040f14>] (kthread) from [<c000f778>] (ret_from_fork+0x14/0x3c)
Fix it by marking workqueues created via create*_workqueue() with
__WQ_LEGACY and disabling flush dependency checks on them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Thierry Reding <thierry.reding@gmail.com>
Link: http://lkml.kernel.org/g/20160126173843.GA11115@ulmo.nvidia.com
Fixes: fca839c00a ("workqueue: warn if memory reclaim tries to flush !WQ_MEM_RECLAIM workqueue")
If the apply_wqattrs_prepare() returns NULL, it has already cleaned up
the related resources, so it can return directly and avoid calling the
clean up function again.
This doesn't introduce any functional changes.
Signed-off-by: wanghaibin <wanghaibin.wang@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controller through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Task or work item involved in memory reclaim trying to flush a
non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
deadlock. Trigger WARN_ONCE() if such conditions are detected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Pull workqueue update from Tejun Heo:
"This pull request contains one patch to make an unbound worker pool
allocated from the NUMA node containing it if such node exists. As
unbound worker pools are node-affine by default, this makes most pools
allocated on the right node"
* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Allocate the unbound pool using local node memory
Currently, get_unbound_pool() uses kzalloc() to allocate the
worker pool. Actually, we can use the right node to do the
allocation, achieving local memory access.
This patch selects target node first, and uses kzalloc_node()
instead.
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
My system keeps crashing with below message. vmstat_update() schedules a delayed
work in current cpu and expects the work runs in the cpu.
schedule_delayed_work() is expected to make delayed work run in local cpu. The
problem is timer can be migrated with NO_HZ. __queue_work() queues work in
timer handler, which could run in a different cpu other than where the delayed
work is scheduled. The end result is the delayed work runs in different cpu.
The patch makes __queue_delayed_work records local cpu earlier. Where the timer
runs doesn't change where the work runs with the change.
[ 28.010131] ------------[ cut here ]------------
[ 28.010609] kernel BUG at ../mm/vmstat.c:1392!
[ 28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
[ 28.011860] Modules linked in:
[ 28.012245] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G W4.3.0-rc3+ #634
[ 28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[ 28.014160] Workqueue: events vmstat_update
[ 28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
[ 28.015445] RIP: 0010:[<ffffffff8115f921>] [<ffffffff8115f921>]vmstat_update+0x31/0x80
[ 28.016282] RSP: 0018:ffff8800ba42fd80 EFLAGS: 00010297
[ 28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX:0000000000000000
[ 28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI:ffffffff81f4df8d
[ 28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09:0000000000000000
[ 28.019169] R10: 0000000000000000 R11: 0000000000000121 R12:ffff8800baa9f640
[ 28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15:0000000000000000
[ 28.020071] FS: 0000000000000000(0000) GS:ffff88011a800000(0000)knlGS:0000000000000000
[ 28.020071] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4:00000000000006f0
[ 28.020071] Stack:
[ 28.020071] ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00ffffffff8106bd88
[ 28.020071] ffffffff8106bd0b 0000000000000096 0000000000000000ffffffff82f9b1e8
[ 28.020071] ffffffff829f0b10 0000000000000000 ffffffff81f18460ffff88011a81e340
[ 28.020071] Call Trace:
[ 28.020071] [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
[ 28.020071] [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
[ 28.020071] [<ffffffff8106c214>] worker_thread+0x114/0x460
[ 28.020071] [<ffffffff8106c100>] ? process_one_work+0x540/0x540
[ 28.020071] [<ffffffff81071bf8>] kthread+0xf8/0x110
[ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
[ 28.020071] [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
[ 28.020071] [<ffffffff81071b00>] ?kthread_create_on_node+0x200/0x200
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v2.6.31+
Pull workqueue updates from Tejun Heo:
"Only three trivial changes for workqueue this time - doc, MAINTAINERS
and EXPORT_SYMBOL updates"
* 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix some docbook warnings
workqueue: Make flush_workqueue() available again to non GPL modules
workqueue: add myself as a dedicated reviwer