Commit graph

23 commits

Author SHA1 Message Date
FAROVITUS
af1d3ae977 Merge 4.9.212 branch 'android-4.9-q' into tw10-android-4.9-q
Documentation/filesystems/fscrypt.rst
	arch/arm/common/Kconfig
	arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
	arch/arm64/boot/dts/amd/amd-seattle-soc.dtsi
	arch/arm64/boot/dts/arm/juno-clocks.dtsi
	arch/arm64/boot/dts/broadcom/ns2.dtsi
	arch/arm64/boot/dts/lg/lg1312.dtsi
	arch/arm64/boot/dts/lg/lg1313.dtsi
	arch/arm64/boot/dts/marvell/armada-37xx.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2180.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2597.dtsi
	arch/arm64/boot/dts/nvidia/tegra210.dtsi
	arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
	arch/arm64/boot/dts/qcom/msm8996.dtsi
	arch/arm64/configs/ranchu64_defconfig
	arch/arm64/include/asm/cpucaps.h
	arch/arm64/kernel/cpufeature.c
	arch/arm64/kernel/traps.c
	arch/arm64/mm/mmu.c
	crypto/Makefile
	crypto/ablkcipher.c
	crypto/blkcipher.c
	crypto/testmgr.h
	crypto/zstd.c
	drivers/android/binder.c
	drivers/android/binder_alloc.c
	drivers/char/random.c
	drivers/clocksource/exynos_mct.c
	drivers/dma/pl330.c
	drivers/hid/hid-sony.c
	drivers/hid/uhid.c
	drivers/hid/usbhid/hiddev.c
	drivers/i2c/i2c-core.c
	drivers/md/dm-crypt.c
	drivers/media/v4l2-core/videobuf2-v4l2.c
	drivers/mmc/host/dw_mmc.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/usb/r8152.c
	drivers/scsi/scsi_logging.c
	drivers/scsi/sd.c
	drivers/scsi/ufs/ufshcd-pci.c
	drivers/scsi/ufs/ufshcd-pltfrm.c
	drivers/staging/android/Kconfig
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion_priv.h
	drivers/staging/android/ion/ion_system_heap.c
	drivers/staging/android/lowmemorykiller.c
	drivers/tty/serial/samsung.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/host/xhci-hub.c
	drivers/video/fbdev/core/fbmon.c
	drivers/video/fbdev/core/modedb.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/keyinfo.c
	fs/ext4/ialloc.c
	fs/ext4/namei.c
	fs/ext4/xattr.c
	fs/f2fs/checkpoint.c
	fs/f2fs/data.c
	fs/f2fs/debug.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/inline.c
	fs/f2fs/inode.c
	fs/f2fs/namei.c
	fs/f2fs/node.c
	fs/f2fs/recovery.c
	fs/f2fs/segment.c
	fs/f2fs/segment.h
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/fat/dir.c
	fs/fat/fatent.c
	fs/file.c
	fs/namespace.c
	fs/pnode.c
	fs/proc/inode.c
	fs/proc/root.c
	fs/proc/task_mmu.c
	fs/sdcardfs/dentry.c
	fs/sdcardfs/derived_perm.c
	fs/sdcardfs/file.c
	fs/sdcardfs/inode.c
	fs/sdcardfs/lookup.c
	fs/sdcardfs/main.c
	fs/sdcardfs/sdcardfs.h
	fs/sdcardfs/super.c
	include/linux/blk_types.h
	include/linux/cpuhotplug.h
	include/linux/cred.h
	include/linux/fb.h
	include/linux/power_supply.h
	include/linux/sched.h
	include/linux/zstd.h
	include/trace/events/sched.h
	include/uapi/linux/android/binder.h
	init/Kconfig
	init/main.c
	kernel/bpf/hashtab.c
	kernel/cpu.c
	kernel/cred.c
	kernel/fork.c
	kernel/locking/spinlock_debug.c
	kernel/panic.c
	kernel/printk/printk.c
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/fair.c
	kernel/sched/rt.c
	kernel/sched/walt.c
	kernel/sched/walt.h
	kernel/trace/trace.c
	lib/bug.c
	lib/list_debug.c
	lib/vsprintf.c
	lib/zstd/bitstream.h
	lib/zstd/compress.c
	lib/zstd/decompress.c
	lib/zstd/fse.h
	lib/zstd/fse_compress.c
	lib/zstd/fse_decompress.c
	lib/zstd/huf_compress.c
	lib/zstd/huf_decompress.c
	lib/zstd/zstd_internal.h
	mm/debug.c
	mm/filemap.c
	mm/rmap.c
	net/core/filter.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/sysfs_net_ipv4.c
	net/ipv4/tcp_input.c
	net/ipv4/tcp_output.c
	net/ipv4/udp.c
	net/ipv6/netfilter/nf_conntrack_reasm.c
	net/netfilter/Kconfig
	net/netfilter/Makefile
	net/netfilter/xt_qtaguid.c
	net/netfilter/xt_qtaguid_internal.h
	net/xfrm/xfrm_policy.c
	net/xfrm/xfrm_state.c
	scripts/checkpatch.pl
	security/selinux/hooks.c
	sound/core/compress_offload.c
2020-02-12 12:32:38 +02:00
FAROVITUS
2b92eefa41 import G965FXXU7DTAA OSRC
*First release for Android (Q).

Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-02-04 13:50:09 +02:00
Greg Kroah-Hartman
cfe25622c3 This is the 4.9.178 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlzkK/4ACgkQONu9yGCS
 aT7MpBAAwnw0HPr67VKF/PD18KNz5e8suu9OD8lUI7z5+CVA62bAe4cx4FQNOudZ
 gR9iS2Q57k0b6pKIX1sOITkAuJCHIH7J9vL8PDIu+NeJFJiz82vg/k/3dq/QjL64
 1Z6jjcDu/i4yNF/Jv+eAMbs1faRDOdMM4uLHqpwqdoOhp6aKVkqmPe1gMSXpCF4U
 KTaypH/3f/0cMCWVSv75qPBK4eiiW5WgoS3CWtOcZAaxqpQpPdo3zKgzjxgfYCIR
 7U0LpTSp9EEFIKtOYNvAJEUb2WZRKqSXHhPaWFmfYsTL6Zl5S4V7rbQaepmrKHpV
 uzHSsPt6cXrzCnfPu0kcXzPDrQrW5zDg70ndYhuxR0mChcwazopRIVtDRHP6DXso
 g/F1tKORhz5DPYuAkz7GCdyFQNXqipz6ybx54Z3JC7C+kXO6mc0GzFmihMjrlWCg
 oIK0Va6WL7f1aYsakv1ngI2v7NqVq/EylOFHkqMZMxlrtUOWKbznz5CHg9D5oxO4
 P/cZBkOhmsvKqtV5wtKDAISXmAIQMQabgKlPa7jPayjBk9b/4EEXp2DAJJ3lnNPQ
 KfK+W6iChZho+YIyd4KWGpqu88bmDUssR11tw2xAJlnFKv2e1+tQQaCUjdp1meUM
 ewbOuBEuElmPWMmh6TACzxa1pF6cPW72C1JJ1xYoUx/yihUo6n0=
 =js4a
 -----END PGP SIGNATURE-----

Merge 4.9.178 into android-4.9-q

Changes in 4.9.178
	net: core: another layer of lists, around PF_MEMALLOC skb handling
	locking/rwsem: Prevent decrement of reader count before increment
	PCI: hv: Fix a memory leak in hv_eject_device_work()
	x86/speculation/mds: Revert CPU buffer clear on double fault exit
	x86/speculation/mds: Improve CPU buffer clear documentation
	objtool: Fix function fallthrough detection
	ARM: exynos: Fix a leaked reference by adding missing of_node_put
	power: supply: axp288_charger: Fix unchecked return value
	arm64: compat: Reduce address limit
	arm64: Clear OSDLR_EL1 on CPU boot
	sched/x86: Save [ER]FLAGS on context switch
	crypto: chacha20poly1305 - set cra_name correctly
	crypto: vmx - fix copy-paste error in CTR mode
	crypto: crct10dif-generic - fix use via crypto_shash_digest()
	crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
	ALSA: usb-audio: Fix a memory leak bug
	ALSA: hda/hdmi - Read the pin sense from register when repolling
	ALSA: hda/hdmi - Consider eld_valid when reporting jack event
	ALSA: hda/realtek - EAPD turn on later
	ASoC: max98090: Fix restore of DAPM Muxes
	ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
	mm/mincore.c: make mincore() more conservative
	ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
	mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L
	mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values
	tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
	jbd2: check superblock mapped prior to committing
	ext4: actually request zeroing of inode table after grow
	ext4: fix ext4_show_options for file systems w/o journal
	Btrfs: do not start a transaction at iterate_extent_inodes()
	bcache: fix a race between cache register and cacheset unregister
	bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
	ipmi:ssif: compare block number correctly for multi-part return messages
	crypto: gcm - Fix error return code in crypto_gcm_create_common()
	crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
	crypto: salsa20 - don't access already-freed walk.iv
	crypto: arm/aes-neonbs - don't access already-freed walk.iv
	fib_rules: fix error in backport of e9919a24d302 ("fib_rules: return 0...")
	writeback: synchronize sync(2) against cgroup writeback membership switches
	fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
	ext4: zero out the unused memory region in the extent tree block
	ext4: fix data corruption caused by overlapping unaligned and aligned IO
	ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug
	KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes
	Linux 4.9.178

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-05-21 19:03:22 +02:00
Tejun Heo
1cfaba5b49 writeback: synchronize sync(2) against cgroup writeback membership switches
commit 7fc5854f8c6efae9e7624970ab49a1eac2faefb1 upstream.

sync_inodes_sb() can race against cgwb (cgroup writeback) membership
switches and fail to writeback some inodes.  For example, if an inode
switches to another wb while sync_inodes_sb() is in progress, the new
wb might not be visible to bdi_split_work_to_wbs() at all or the inode
might jump from a wb which hasn't issued writebacks yet to one which
already has.

This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
switch path against sync_inodes_sb() so that sync_inodes_sb() is
guaranteed to see all the target wbs and inodes can't jump wbs to
escape syncing.

v2: Fixed misplaced rwsem init.  Spotted by Jiufei.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiufei Xue <xuejiufei@gmail.com>
Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 18:49:01 +02:00
Greg Kroah-Hartman
fe0eb27ac6 This is the 4.9.153 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxMHIwACgkQONu9yGCS
 aT6d1g/+JIP/Y3Q0C4VKzDu2zKs0KWuaFPX+YvDHHOBEbRyG1pxSShWGYTiD0NkL
 N7dtl9fU9SMQKZ99XDtVtbScqMveG7b4BXshJG6VKZ3U8oXx+DljOFmZZvOmibsN
 7UeQxC/kFmnQYxdBPb6kNV+HWqbwX3XLt8n+PuBJuQAnBb7RqLPJTNKJgIQTAyaJ
 cwxibcbvDWvtAOugd3zlj7v6mapTaRnNCJZF4T+0bPRKIj9QMX2uFHiQHfCHkznC
 jieynCDlVuHqHVo4KoyLyViobyGySyWvgIIlqw4kQAktrkXhpDtrf+ET+0Ve/fku
 ETmQ2NN4ibASKmKd1ELDv1hf1n2eQ8V0eROE4RQcUpsiQ/giPAO7BCO1IxD1/YAm
 dUEjIpUQ8ovPTx/voDV1c3VWhUWPUyMoETtTkSlMLb1r0IstvvWhbtOFH81G3gRM
 deBnqDXtAAQ225Zgng8jGGIAnEpJbzrIcDvVPsrXzsEqgyCuMFZXenGT9FyXUFoy
 tGis4P43ID8ovhbATZkWYs7GtirJI4Cdv7oYSTWh9WB0E8kKdk2GZz7yROhVHBy2
 RlC2GBAiFnD5n9E+KhO++Ad1ficSULSKFcG6gYKg1OeLt/JOn1E1XcJUf6YlVNJl
 0DStnLTd8CybVnQcFMiK3Y3VKecYcXCvH8vAqN8jsA52HU6BH2E=
 =Rdma
 -----END PGP SIGNATURE-----

Merge 4.9.153 into android-4.9

Changes in 4.9.153
	r8169: Add support for new Realtek Ethernet
	ipv6: Consider sk_bound_dev_if when binding a socket to a v4 mapped address
	ipv6: Take rcu_read_lock in __inet6_bind for mapped addresses
	platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey
	e1000e: allow non-monotonic SYSTIM readings
	writeback: don't decrement wb->refcnt if !wb->bdi
	serial: set suppress_bind_attrs flag only if builtin
	ALSA: oxfw: add support for APOGEE duet FireWire
	MIPS: SiByte: Enable swiotlb for SWARM, LittleSur and BigSur
	arm64: perf: set suppress_bind_attrs flag to true
	selinux: always allow mounting submounts
	rxe: IB_WR_REG_MR does not capture MR's iova field
	jffs2: Fix use of uninitialized delayed_work, lockdep breakage
	pstore/ram: Do not treat empty buffers as valid
	powerpc/xmon: Fix invocation inside lock region
	powerpc/pseries/cpuidle: Fix preempt warning
	media: firewire: Fix app_info parameter type in avc_ca{,_app}_info
	net: call sk_dst_reset when set SO_DONTROUTE
	scsi: target: use consistent left-aligned ASCII INQUIRY data
	clk: imx6q: reset exclusive gates on init
	kconfig: fix file name and line number of warn_ignored_character()
	kconfig: fix memory leak when EOF is encountered in quotation
	mmc: atmel-mci: do not assume idle after atmci_request_end
	tty/serial: do not free trasnmit buffer page under port lock
	perf intel-pt: Fix error with config term "pt=0"
	perf svghelper: Fix unchecked usage of strncpy()
	perf parse-events: Fix unchecked usage of strncpy()
	dm kcopyd: Fix bug causing workqueue stalls
	tools lib subcmd: Don't add the kernel sources to the include path
	dm snapshot: Fix excessive memory usage and workqueue stalls
	ALSA: bebob: fix model-id of unit for Apogee Ensemble
	sysfs: Disable lockdep for driver bind/unbind files
	scsi: smartpqi: correct lun reset issues
	scsi: megaraid: fix out-of-bound array accesses
	ocfs2: fix panic due to unrecovered local alloc
	mm/page-writeback.c: don't break integrity writeback on ->writepage() error
	mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
	ipmi:ssif: Fix handling of multi-part return messages
	locking/qspinlock: Pull in asm/byteorder.h to ensure correct endianness
	Linux 4.9.153

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-01-26 12:03:01 +01:00
Anders Roxell
db16eb5539 writeback: don't decrement wb->refcnt if !wb->bdi
[ Upstream commit 347a28b586802d09604a149c1a1f6de5dccbe6fa ]

This happened while running in qemu-system-aarch64, the AMBA PL011 UART
driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes because the reference count is a NULL
pointer only because wb->bdi hasn't been initialized yet.

Rework so that wb_put have an extra check if wb->bdi before decrement
wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again
in other drivers.

Fixes: 52ebea749a ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26 09:38:32 +01:00
Jens Axboe
1e6563ffd0 UPSTREAM: mm: don't cap request size based on read-ahead setting
We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level.  Turns out it is read-ahead capping
the request size, since we use 128K as the default setting.  This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting, if the
underlying device can support a 256K read in a single command.

This patch introduces a bdi hint, io_pages.  This is the soft max IO
size for the lower level, I've hooked it up to the bdev settings here.
Read-ahead is modified to issue the maximum of the user request size,
and the read-ahead max size, but capped to the max request size on the
device side.  The latter is done to avoid reading ahead too much, if the
application asks for a huge read.  With this patch, the kernel behaves
like the application expects.

Change-Id: Ibe52ffac7a6e1ac86ed0c6a59a0f7a32d651ee5f
Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2019-01-23 22:15:45 +00:00
Greg Thelen
18484eb932 writeback: safer lock nesting
commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[natechancellor: Adjust context due to lack of b93b016313b3b]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:34:18 +02:00
Dan Williams
df08c32ce3 block: fix bdi vs gendisk lifetime mismatch
The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live.  Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:

 WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

 Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
  0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
  ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
  0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
 Call Trace:
  [<ffffffff8134caec>] dump_stack+0x63/0x87
  [<ffffffff8108c351>] __warn+0xd1/0xf0
  [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
  [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
  [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
  [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
  [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
  [<ffffffff8134ff55>] kobject_add+0x75/0xd0
  [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
  [<ffffffff8148b0a5>] device_add+0x125/0x610
  [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
  [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
  [<ffffffff811b775c>] bdi_register+0x8c/0x180
  [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
  [<ffffffff813317f5>] add_disk+0x175/0x4a0

Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Tejun Heo
b817525a4a writeback: bdi_writeback iteration must not skip dying ones
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:12 -06:00
Tejun Heo
a13f35e871 writeback: don't embed root bdi_writeback_congested in bdi_writeback
52ebea749a ("writeback: make backing_dev_info host cgroup-specific
bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
(bdi_writeback's).  As the congested state needs to be per-wb and
referenced from blkcg side and multiple wbs, the patch made all
non-root cong's (bdi_writeback_congested's) reference counted and
indexed on bdi.

When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
non-root cong's; however, this can hang indefinitely because wb's can
also be referenced from blkcg_gq's which are destroyed after bdi
destruction is complete.

To fix the bug, bdi destruction will be updated to not wait for cong's
to drain, which naturally means that cong's may outlive the associated
bdi.  This is fine for non-root cong's but is problematic for the root
cong's which are embedded in their bdi's as they may end up getting
dereferenced after the containing bdi's are freed.

This patch makes root cong's behave the same as non-root cong's.  They
are no longer embedded in their bdi's but allocated separately during
bdi initialization, indexed and reference counted the same way.

* As cong handling is the same for all wb's, wb->congested
  initialization is moved into wb_init().

* When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
  bdi->wb_congested is now a pointer pointing to the root cong
  allocated during bdi init and minimal refcnting operations are
  implemented.

* The above makes root wb init paths diverge depending on
  CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().

This patch in itself shouldn't cause any consequential behavior
differences but prepares for the actual fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jon Christopherson <jon@jons.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
Tested-by: Jon Christopherson <jon@jons.org>

Added <linux/slab.h> include to backing-dev.h for kfree() definition.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-02 08:46:00 -06:00
Tejun Heo
e8a7abf5a5 writeback: disassociate inodes from dying bdi_writebacks
For the purpose of foreign inode detection, wb's (bdi_writeback's) are
identified by the associated memcg ID.  As we create a separate wb for
each memcg, this is enough to identify the active wb's; however, when
blkcg is enabled or disabled higher up in the hierarchy, the mapping
between memcg and blkcg changes which in turn creates a new wb to
service the new mapping.  The old wb is unlinked from index and
released after all references are drained.  The foreign inode
detection logic can't detect this condition because both the old and
new wb's point to the same memcg and thus never decides to move inodes
attached to the old wb to the new one.

This patch adds logic to initiate switching immediately in
wbc_attach_and_unlock_inode() if the associated wb is dying.  We can
make the usual foreign detection logic to distinguish the different
wb's mapped to the memcg but the dying wb is never gonna be in active
service again and there's no point in tracking the usage history and
reaching the switch verdict after enough data points are collected.
It's already known that the wb has to be switched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo
21c6321fbb writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
Currently, majority of cgroup writeback support including all the
above functions are implemented in include/linux/backing-dev.h and
mm/backing-dev.c; however, the portion closely related to writeback
logic implemented in include/linux/writeback.h and mm/page-writeback.c
will expand to support foreign writeback detection and correction.

This patch moves wb[_try]_get() and wb_put() to
include/linux/backing-dev-defs.h so that they can be used from
writeback.h and inode_{attach|detach}_wb() to writeback.h and
page-writeback.c.

This is pure reorganization and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:39:08 -06:00
Tejun Heo
841710aa6e writeback: implement memcg wb_domain
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportinately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg.  IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain.  memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:38:13 -06:00
Tejun Heo
cc395d7f1f writeback: implement bdi_wait_for_completion()
If the completion of a wb_writeback_work can be waited upon by setting
its ->done to a struct completion and waiting on it; however, for
cgroup writeback support, it's necessary to issue multiple work items
to multiple bdi_writebacks and wait for the completion of all.

This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it.  It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().

Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:36 -06:00
Tejun Heo
95a46c65e3 writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account
bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes.  For cgroup writeback support, it
needs to take all active wb's into account.  If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.

To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of the dirty state transition of each wb, the number of dirty wbs can
be counted in the bdi; however, bdi is already aggregating
wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
dip below 1.  bdi_has_dirty_io() can simply test whether
bdi->tot_write_bandwidth is zero or not.

While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.

bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:36 -06:00
Tejun Heo
766a9d6e60 writeback: implement backing_dev_info->tot_write_bandwidth
cgroup writeback support needs to keep track of the sum of
avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
distribute write workload.  This patch adds bdi->tot_write_bandwidth
and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
and wb_update_write_bandwidth() to adjust it as wb's gain and lose
dirty inodes and its avg_write_bandwidth gets updated.

As the update events are not synchronized with each other,
bdi->tot_write_bandwidth is an atomic_long_t.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
d6c10f1fc8 writeback: implement WB_has_dirty_io wb_state flag
Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
has any dirty inode by testing all three IO lists on each invocation
without actively keeping track.  For cgroup writeback support, a
single bdi will host multiple wb's each of which will host dirty
inodes separately and we'll need to make bdi_has_dirty_io(), which
currently only represents the root wb, aggregate has_dirty_io from all
member wb's, which requires tracking transitions in has_dirty_io state
on each wb.

This patch introduces inode_wb_list_{move|del}_locked() to consolidate
IO list operations leaving queue_io() the only other function which
directly manipulates IO lists (via move_expired_inodes()).  All three
functions are updated to call wb_io_lists_[de]populated() which keep
track of whether the wb has dirty inodes or not and record it using
the new WB_has_dirty_io flag.  inode_wb_list_moved_locked()'s return
value indicates whether the wb had no dirty inodes before.

mark_inode_dirty() is restructured so that the return value of
inode_wb_list_move_locked() can be used for deciding whether to wake
up the wb.

While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
These functions were returning 0 and 1 before.  Also, add a comment
explaining the synchronization of wb_state flags.

v2: Updated to accommodate b_dirty_time.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
ec8a6f2643 writeback: make congestion functions per bdi_writeback
Currently, all congestion functions take bdi (backing_dev_info) and
always operate on the root wb (bdi->wb) and the congestion state from
the block layer is propagated only for the root blkcg.  This patch
introduces {set|clear}_wb_congested() and wb_congested() which take a
bdi_writeback_congested and bdi_writeback respectively.  The bdi
counteparts are now wrappers invoking the wb based functions on
@bdi->wb.

While converting clear_bdi_congested() to clear_wb_congested(), the
local variable declaration order between @wqh and @bit is swapped for
cosmetic reason.

This patch just adds the new wb based functions.  The following
patches will apply them.

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
52ebea749a writeback: make backing_dev_info host cgroup-specific bdi_writebacks
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed.  Both
    detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
4aa9c692e0 bdi: separate out congested state into a separate struct
Currently, a wb's (bdi_writeback) congestion state is carried in its
->state field; however, cgroup writeback support will require multiple
wb's sharing the same congestion state.  This patch separates out
congestion state into its own struct - struct bdi_writeback_congested.
A new field wb field, wb_congested, points to its associated congested
struct.  The default wb, bdi->wb, always points to bdi->wb_congested.

While this patch adds a layer of indirection, it doesn't introduce any
behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00