Commit graph

574 commits

Author SHA1 Message Date
Tom Herbert
850e18a66f
proto_ops: Add locked held versions of sendmsg and sendpage
Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.

These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.

Change-Id: I4a8a6f5234486946ec2870ae22fa8ea561df3af0
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-25 16:54:42 +03:00
Lawrence Brakmo
a4efd0995b
bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

[Nguyễn Long: Update to match "UPSTREAM: bpf: multi program support
for cgroup+bpf"]

Change-Id: Ifbbde466367bc68333cc08f57601550944f264ec
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-25 16:54:39 +03:00
FAROVITUS
fb5f1ca08c Merge branch 'tw10-android-4.9-q-stock' into tw10-android-4.9-q (DTC5 OSRC)
Conflicts:
	drivers/input/ff-memless.c
	drivers/tty/serial/samsung.c
	sound/core/timer.c
2020-04-22 21:29:13 +03:00
FAROVITUS
e74e8cdc8d import G96xFXXU8DTC5 OSRC
Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-04-22 21:16:08 +03:00
FAROVITUS
af1d3ae977 Merge 4.9.212 branch 'android-4.9-q' into tw10-android-4.9-q
Documentation/filesystems/fscrypt.rst
	arch/arm/common/Kconfig
	arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
	arch/arm64/boot/dts/amd/amd-seattle-soc.dtsi
	arch/arm64/boot/dts/arm/juno-clocks.dtsi
	arch/arm64/boot/dts/broadcom/ns2.dtsi
	arch/arm64/boot/dts/lg/lg1312.dtsi
	arch/arm64/boot/dts/lg/lg1313.dtsi
	arch/arm64/boot/dts/marvell/armada-37xx.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2180.dtsi
	arch/arm64/boot/dts/nvidia/tegra210-p2597.dtsi
	arch/arm64/boot/dts/nvidia/tegra210.dtsi
	arch/arm64/boot/dts/qcom/apq8016-sbc.dtsi
	arch/arm64/boot/dts/qcom/msm8996.dtsi
	arch/arm64/configs/ranchu64_defconfig
	arch/arm64/include/asm/cpucaps.h
	arch/arm64/kernel/cpufeature.c
	arch/arm64/kernel/traps.c
	arch/arm64/mm/mmu.c
	crypto/Makefile
	crypto/ablkcipher.c
	crypto/blkcipher.c
	crypto/testmgr.h
	crypto/zstd.c
	drivers/android/binder.c
	drivers/android/binder_alloc.c
	drivers/char/random.c
	drivers/clocksource/exynos_mct.c
	drivers/dma/pl330.c
	drivers/hid/hid-sony.c
	drivers/hid/uhid.c
	drivers/hid/usbhid/hiddev.c
	drivers/i2c/i2c-core.c
	drivers/md/dm-crypt.c
	drivers/media/v4l2-core/videobuf2-v4l2.c
	drivers/mmc/host/dw_mmc.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/usb/r8152.c
	drivers/scsi/scsi_logging.c
	drivers/scsi/sd.c
	drivers/scsi/ufs/ufshcd-pci.c
	drivers/scsi/ufs/ufshcd-pltfrm.c
	drivers/staging/android/Kconfig
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion_priv.h
	drivers/staging/android/ion/ion_system_heap.c
	drivers/staging/android/lowmemorykiller.c
	drivers/tty/serial/samsung.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/gadget.c
	drivers/usb/host/xhci-hub.c
	drivers/video/fbdev/core/fbmon.c
	drivers/video/fbdev/core/modedb.c
	fs/crypto/fname.c
	fs/crypto/fscrypt_private.h
	fs/crypto/keyinfo.c
	fs/ext4/ialloc.c
	fs/ext4/namei.c
	fs/ext4/xattr.c
	fs/f2fs/checkpoint.c
	fs/f2fs/data.c
	fs/f2fs/debug.c
	fs/f2fs/dir.c
	fs/f2fs/f2fs.h
	fs/f2fs/file.c
	fs/f2fs/gc.c
	fs/f2fs/inline.c
	fs/f2fs/inode.c
	fs/f2fs/namei.c
	fs/f2fs/node.c
	fs/f2fs/recovery.c
	fs/f2fs/segment.c
	fs/f2fs/segment.h
	fs/f2fs/super.c
	fs/f2fs/sysfs.c
	fs/fat/dir.c
	fs/fat/fatent.c
	fs/file.c
	fs/namespace.c
	fs/pnode.c
	fs/proc/inode.c
	fs/proc/root.c
	fs/proc/task_mmu.c
	fs/sdcardfs/dentry.c
	fs/sdcardfs/derived_perm.c
	fs/sdcardfs/file.c
	fs/sdcardfs/inode.c
	fs/sdcardfs/lookup.c
	fs/sdcardfs/main.c
	fs/sdcardfs/sdcardfs.h
	fs/sdcardfs/super.c
	include/linux/blk_types.h
	include/linux/cpuhotplug.h
	include/linux/cred.h
	include/linux/fb.h
	include/linux/power_supply.h
	include/linux/sched.h
	include/linux/zstd.h
	include/trace/events/sched.h
	include/uapi/linux/android/binder.h
	init/Kconfig
	init/main.c
	kernel/bpf/hashtab.c
	kernel/cpu.c
	kernel/cred.c
	kernel/fork.c
	kernel/locking/spinlock_debug.c
	kernel/panic.c
	kernel/printk/printk.c
	kernel/sched/Makefile
	kernel/sched/core.c
	kernel/sched/fair.c
	kernel/sched/rt.c
	kernel/sched/walt.c
	kernel/sched/walt.h
	kernel/trace/trace.c
	lib/bug.c
	lib/list_debug.c
	lib/vsprintf.c
	lib/zstd/bitstream.h
	lib/zstd/compress.c
	lib/zstd/decompress.c
	lib/zstd/fse.h
	lib/zstd/fse_compress.c
	lib/zstd/fse_decompress.c
	lib/zstd/huf_compress.c
	lib/zstd/huf_decompress.c
	lib/zstd/zstd_internal.h
	mm/debug.c
	mm/filemap.c
	mm/rmap.c
	net/core/filter.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/sysfs_net_ipv4.c
	net/ipv4/tcp_input.c
	net/ipv4/tcp_output.c
	net/ipv4/udp.c
	net/ipv6/netfilter/nf_conntrack_reasm.c
	net/netfilter/Kconfig
	net/netfilter/Makefile
	net/netfilter/xt_qtaguid.c
	net/netfilter/xt_qtaguid_internal.h
	net/xfrm/xfrm_policy.c
	net/xfrm/xfrm_state.c
	scripts/checkpatch.pl
	security/selinux/hooks.c
	sound/core/compress_offload.c
2020-02-12 12:32:38 +02:00
FAROVITUS
2b92eefa41 import G965FXXU7DTAA OSRC
*First release for Android (Q).

Signed-off-by: FAROVITUS <farovitus@gmail.com>
2020-02-04 13:50:09 +02:00
Greg Kroah-Hartman
6f0b5b54fe This is the 4.9.207 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl396QsACgkQONu9yGCS
 aT63DA//V6C58MC6I6ISv/2zcGrWOzfP4q1et4lu1eBPgUTzbnB7gdnwH2V6l3QU
 Kt9V4GUI6GFsVmRGsEJZjg4kvA1HDjx6sbs3odHuvDnhGFrHstnxKOJiuZeQU4Ph
 wxDEQghyeYod+2VDKNWw3gxjBgjP95eAvZMmhnQ13dwY20HZgGjoUMHBKpKpwOly
 ZwV0YWsW9Zko8bTmuyrqXUG8ChSWZCWMuNKNpUDIjVj/NPcAyEAr9wkYOGf8UnGq
 rK/owI0Y02qZJS9ZGxv4IDNjW7yjIGQzSYWUrY8ajSQzFqHErWSQ7tGwJ8fb5EBM
 0zH2GGp8XLXXOCMMrg8nsTlNaj+wg7a5C0Vx0VtijYyIQCE39M91rltO+oVBWdjR
 jCUbp//2v7XFN+eQSvwuWHrUyOmGGj9C0mDGoMC4rSHES4jYS4b4wHzn9CK8Iqy2
 izvNjz93TrNj+YLLUlf6tTSUfWYuGSFNyahbhjFL7MPBp2RUJM1f5uzhBj2sQqwN
 olCUvsNSZRWkR5/f15/kdEyPAhYt0i66aV3JNpEaRHZDqpUXAMTRv+AgtEPBNA5r
 mORdDq9Zw+m+BGYSiuotArINRY8PSn1tP8tnhNjGE0RwnAx+S0Fs6+0AyHhEY+wQ
 OBpNfRTaJep9L1Yl4Hj4ZDU6Lr5xqWLoplV2X9OIUzLleAY44+M=
 =Gqb9
 -----END PGP SIGNATURE-----

Merge 4.9.207 into android-4.9-q

Changes in 4.9.207
	arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
	usb: gadget: u_serial: add missing port entry locking
	tty: serial: fsl_lpuart: use the sg count from dma_map_sg
	tty: serial: msm_serial: Fix flow control
	serial: pl011: Fix DMA ->flush_buffer()
	serial: serial_core: Perform NULL checks for break_ctl ops
	serial: ifx6x60: add missed pm_runtime_disable
	autofs: fix a leak in autofs_expire_indirect()
	RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
	exportfs_decode_fh(): negative pinned may become positive without the parent locked
	audit_get_nd(): don't unlock parent too early
	NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
	Input: cyttsp4_core - fix use after free bug
	ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
	rsxx: add missed destroy_workqueue calls in remove
	net: ep93xx_eth: fix mismatch of request_mem_region in remove
	serial: core: Allow processing sysrq at port unlock time
	cxgb4vf: fix memleak in mac_hlist initialization
	iwlwifi: mvm: Send non offchannel traffic via AP sta
	ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
	net/mlx5: Release resource on error flow
	extcon: max8997: Fix lack of path setting in USB device mode
	clk: rockchip: fix rk3188 sclk_smc gate data
	clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
	ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
	dlm: fix missing idr_destroy for recover_idr
	MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
	scsi: zfcp: drop default switch case which might paper over missing case
	pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
	Staging: iio: adt7316: Fix i2c data reading, set the data field
	regulator: Fix return value of _set_load() stub
	MIPS: OCTEON: octeon-platform: fix typing
	math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
	rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
	rtc: dt-binding: abx80x: fix resistance scale
	ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
	media: pulse8-cec: return 0 when invalidating the logical address
	dmaengine: coh901318: Fix a double-lock bug
	dmaengine: coh901318: Remove unused variable
	usb: dwc3: don't log probe deferrals; but do log other error codes
	ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
	dma-mapping: fix return type of dma_set_max_seg_size()
	altera-stapl: check for a null key before strcasecmp'ing it
	serial: imx: fix error handling in console_setup
	i2c: imx: don't print error message on probe defer
	dlm: NULL check before kmem_cache_destroy is not needed
	ARM: debug: enable UART1 for socfpga Cyclone5
	nfsd: fix a warning in __cld_pipe_upcall()
	ARM: OMAP1/2: fix SoC name printing
	net/x25: fix called/calling length calculation in x25_parse_address_block
	net/x25: fix null_x25_address handling
	ARM: dts: mmp2: fix the gpio interrupt cell number
	ARM: dts: realview-pbx: Fix duplicate regulator nodes
	tcp: fix off-by-one bug on aborting window-probing socket
	tcp: fix SNMP TCP timeout under-estimation
	modpost: skip ELF local symbols during section mismatch check
	kbuild: fix single target build for external module
	mtd: fix mtd_oobavail() incoherent returned value
	ARM: dts: pxa: clean up USB controller nodes
	clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
	ARM: dts: realview: Fix some more duplicate regulator nodes
	dlm: fix invalid cluster name warning
	net/mlx4_core: Fix return codes of unsupported operations
	powerpc/math-emu: Update macros from GCC
	MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
	nfsd: Return EPERM, not EACCES, in some SETATTR cases
	tty: Don't block on IO when ldisc change is pending
	media: stkwebcam: Bugfix for wrong return values
	mlx4: Use snprintf instead of complicated strcpy
	ARM: dts: sunxi: Fix PMU compatible strings
	sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
	fuse: verify nlink
	fuse: verify attributes
	ALSA: pcm: oss: Avoid potential buffer overflows
	Input: goodix - add upside-down quirk for Teclast X89 tablet
	coresight: etm4x: Fix input validation for sysfs.
	x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
	CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
	CIFS: Fix SMB2 oplock break processing
	tty: vt: keyboard: reject invalid keycodes
	can: slcan: Fix use-after-free Read in slcan_open
	jbd2: Fix possible overflow in jbd2_log_space_left()
	drm/i810: Prevent underflow in ioctl
	KVM: x86: do not modify masked bits of shared MSRs
	KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
	crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
	crypto: ccp - fix uninitialized list head
	crypto: ecdh - fix big endian bug in ECC library
	crypto: user - fix memory leak in crypto_report
	spi: atmel: Fix CS high support
	RDMA/qib: Validate ->show()/store() callbacks before calling them
	thermal: Fix deadlock in thermal thermal_zone_device_check
	KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
	appletalk: Fix potential NULL pointer dereference in unregister_snap_client
	appletalk: Set error code if register_snap_client failed
	usb: gadget: configfs: Fix missing spin_lock_init()
	USB: uas: honor flag to avoid CAPACITY16
	USB: uas: heed CAPACITY_HEURISTICS
	usb: Allow USB device to be warm reset in suspended state
	staging: rtl8188eu: fix interface sanity check
	staging: rtl8712: fix interface sanity check
	staging: gigaset: fix general protection fault on probe
	staging: gigaset: fix illegal free on probe errors
	staging: gigaset: add endpoint-type sanity check
	xhci: Increase STS_HALT timeout in xhci_suspend()
	ARM: dts: pandora-common: define wl1251 as child node of mmc3
	iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
	USB: atm: ueagle-atm: add missing endpoint check
	USB: idmouse: fix interface sanity checks
	USB: serial: io_edgeport: fix epic endpoint lookup
	USB: adutux: fix interface sanity check
	usb: core: urb: fix URB structure initialization function
	usb: mon: Fix a deadlock in usbmon between mmap and read
	mtd: spear_smi: Fix Write Burst mode
	virtio-balloon: fix managed page counts when migrating pages between zones
	btrfs: check page->mapping when loading free space cache
	btrfs: Remove btrfs_bio::flags member
	Btrfs: send, skip backreference walking for extents with many references
	btrfs: record all roots for rename exchange on a subvol
	rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
	rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
	rtlwifi: rtl8192de: Fix missing enable interrupt flag
	lib: raid6: fix awk build warnings
	ALSA: hda - Fix pending unsol events at shutdown
	workqueue: Fix spurious sanity check failures in destroy_workqueue()
	workqueue: Fix pwq ref leak in rescuer_thread()
	ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
	blk-mq: avoid sysfs buffer overflow with too many CPU cores
	cgroup: pids: use atomic64_t for pids->limit
	ar5523: check NULL before memcpy() in ar5523_cmd()
	media: bdisp: fix memleak on release
	media: radio: wl1273: fix interrupt masking on release
	cpuidle: Do not unset the driver if it is there already
	PM / devfreq: Lock devfreq in trans_stat_show
	ACPI: OSL: only free map once in osl.c
	ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
	ACPI: PM: Avoid attaching ACPI PM domain to certain devices
	pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
	pinctrl: samsung: Fix device node refcount leaks in init code
	mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
	ppdev: fix PPGETTIME/PPSETTIME ioctls
	powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
	video/hdmi: Fix AVI bar unpack
	quota: Check that quota is not dirty before release
	ext2: check err when partial != NULL
	quota: fix livelock in dquot_writeback_dquots
	scsi: zfcp: trace channel log even for FCP command responses
	usb: xhci: only set D3hot for pci device
	xhci: Fix memory leak in xhci_add_in_port()
	xhci: make sure interrupts are restored to correct state
	iio: adis16480: Add debugfs_reg_access entry
	Btrfs: fix negative subv_writers counter and data space leak after buffered write
	omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
	scsi: lpfc: Cap NPIV vports to 256
	e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
	x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
	x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
	ath10k: fix fw crash by moving chip reset after napi disabled
	ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
	pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
	scsi: qla2xxx: Fix DMA unmap leak
	scsi: qla2xxx: Fix session lookup in qlt_abort_work()
	scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
	scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
	powerpc: Fix vDSO clock_getres()
	reiserfs: fix extended attributes on the root directory
	firmware: qcom: scm: Ensure 'a0' status code is treated as signed
	mm/shmem.c: cast the type of unmap_start to u64
	ext4: fix a bug in ext4_wait_for_tail_page_commit
	blk-mq: make sure that line break can be printed
	workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
	sunrpc: fix crash when cache_head become valid before update
	net/mlx5e: Fix SFF 8472 eeprom length
	kernel/module.c: wakeup processes in module_wq on module unload
	nvme: host: core: fix precedence of ternary operator
	net: bridge: deny dev_set_mac_address() when unregistering
	net: ethernet: ti: cpsw: fix extra rx interrupt
	openvswitch: support asymmetric conntrack
	tcp: md5: fix potential overestimation of TCP option space
	tipc: fix ordering of tipc module init and exit routine
	inet: protect against too small mtu values.
	tcp: fix rejected syncookies due to stale timestamps
	tcp: tighten acceptance of ACKs not matching a child socket
	tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
	Revert "regulator: Defer init completion for a while after late_initcall"
	PCI: Fix Intel ACS quirk UPDCR register address
	PCI/MSI: Fix incorrect MSI-X masking on resume
	xtensa: fix TLB sanity checker
	CIFS: Respect O_SYNC and O_DIRECT flags during reconnect
	ARM: dts: s3c64xx: Fix init order of clock providers
	ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume()
	vfio/pci: call irq_bypass_unregister_producer() before freeing irq
	dma-buf: Fix memory leak in sync_file_merge()
	dm btree: increase rebalance threshold in __rebalance2()
	scsi: iscsi: Fix a potential deadlock in the timeout handler
	drm/radeon: fix r1xx/r2xx register checker for POT textures
	xhci: fix USB3 device initiated resume race with roothub autosuspend
	net: stmmac: use correct DMA buffer size in the RX descriptor
	net: stmmac: don't stop NAPI processing when dropping a packet
	Linux 4.9.207

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-12-21 11:28:16 +01:00
Guillaume Nault
e7a3b025b8 tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
[ Upstream commit 721c8dafad26ccfa90ff659ee19755e3377b829d ]

Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.

Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:42:27 +01:00
Guillaume Nault
0c8cd7f6bb tcp: tighten acceptance of ACKs not matching a child socket
[ Upstream commit cb44a08f8647fd2e8db5cc9ac27cd8355fa392d8 ]

When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.

That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.

Spurious detection of recent overflows lead to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.

Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' becomes in the future from the point of view of
time_after32().

However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.

In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.

Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate for
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:42:26 +01:00
Guillaume Nault
3b4a534f2a tcp: fix rejected syncookies due to stale timestamps
[ Upstream commit 04d26e7b159a396372646a480f4caa166d1b6720 ]

If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.

Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.

For example, let's consider the following scenario on a system
with HZ=1000:

  * The synflood timestamp is 0, either because that's the timestamp
    of the last synflood or, more commonly, because we're working with
    a freshly created socket.

  * We receive a new SYN, which triggers synflood protection. Let's say
    that this happens when jiffies == 2147484649 (that is,
    'synflood timestamp' + HZ + 2^31 + 1).

  * Then tcp_synq_overflow() doesn't update the synflood timestamp,
    because time_after32(2147484649, 1000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 1000: the value of 'last_overflow' + HZ.

  * A bit later, we receive the ACK completing the 3WHS. But
    cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
    says that we're not under synflood. That's because
    time_after32(2147484649, 120000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.

    Of course, in reality jiffies would have increased a bit, but this
    condition will last for the next 119 seconds, which is far enough
    to accommodate for jiffie's growth.

Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.

Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.

Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.

For 64 bits architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
cca9bab1b72c ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32 bits architectures.

Fixes: cca9bab1b72c ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21 10:42:25 +01:00
Greg Kroah-Hartman
3fcf271139 This is the 4.9.191 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1yFqoACgkQONu9yGCS
 aT6Qhg/9EhDdNA67JIe4Wxns9YRb/yqAKKYGhvGQsUNud1fEYmnIeYOfkwSjPDQv
 OCt2gwgPfw1K49r96+zXsw4vywrCBAZShe2OnR7g63Toz9NHGQu1LJy6FyvzBubl
 vZ2WQBkY8T/viUfPMQ+xwJcuUsx2mWZ+VqgEzmgrqXxFCEPaxCerXqxgLAN3UhjI
 KGqylDKaKlHpNDZYc24GvO8rNoehRZS7FSrKOfjFpDld1bLjHq0ZMNp8ChBp/W99
 K+jp3MIeenjdoEiH4K2Hdf8zRRxxOVJf/hcbT9Hi8TKG2HsgAKAMuFz1X6IJtnYZ
 02bhYkvPoa+sKVS1wt9tscFzIiwBfAP7CyAROn9nv3A2V3A8ZfXIY/tMWh1yRS+E
 y6eVwH6gG9IW7AbjkJmZAlcgxCkYlag6WoTjc1F/s+XdB+fLyQIi8Tls85mKtzwi
 GXygbZIC7pKPlcVIfnEK4cMEm9FAZlxqThTXTckzSasteFsxTbQxy4AeflMY7MQE
 hn8c32gU4+zXhebF/2vQaEwVar+DkzRNwJUjRN5uLbISwJ1hrEvGWR3mhde2RIZg
 JiwGiJ5LmdegSemuLMXBvtOiZ6yTvfoT5p4Wqd5LLhtclrMvDOQ1ye4lBPOHGKFc
 gTuC+Hvo50THpNovzVjsJX2tf92z1FtF0jvH8fMEaQ3uPsLwGrA=
 =U6+Q
 -----END PGP SIGNATURE-----

Merge 4.9.191 into android-4.9-q

Changes in 4.9.191
	HID: Add 044f:b320 ThrustMaster, Inc. 2 in 1 DT
	MIPS: kernel: only use i8253 clocksource with periodic clockevent
	netfilter: ebtables: fix a memory leak bug in compat
	ASoC: dapm: Fix handling of custom_stop_condition on DAPM graph walks
	bonding: Force slave speed check after link state recovery for 802.3ad
	can: dev: call netif_carrier_off() in register_candev()
	st21nfca_connectivity_event_received: null check the allocation
	st_nci_hci_connectivity_event_received: null check the allocation
	ASoC: ti: davinci-mcasp: Correct slot_width posed constraint
	net: usb: qmi_wwan: Add the BroadMobi BM818 card
	isdn: mISDN: hfcsusb: Fix possible null-pointer dereferences in start_isoc_chain()
	isdn: hfcsusb: Fix mISDN driver crash caused by transfer buffer on the stack
	perf bench numa: Fix cpu0 binding
	can: sja1000: force the string buffer NULL-terminated
	can: peak_usb: force the string buffer NULL-terminated
	NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
	HID: input: fix a4tech horizontal wheel custom usage
	net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
	net: hisilicon: make hip04_tx_reclaim non-reentrant
	net: hisilicon: fix hip04-xmit never return TX_BUSY
	net: hisilicon: Fix dma_map_single failed on arm64
	libata: add SG safety checks in SFF pio transfers
	x86/lib/cpu: Address missing prototypes warning
	drm/vmwgfx: fix memory leak when too many retries have occurred
	perf pmu-events: Fix missing "cpu_clk_unhalted.core" event
	selftests: kvm: Adding config fragments
	HID: wacom: correct misreported EKR ring values
	HID: wacom: Correct distance scale for 2nd-gen Intuos devices
	Revert "dm bufio: fix deadlock with loop device"
	gpiolib: never report open-drain/source lines as 'input' to user-space
	userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
	x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
	x86/apic: Handle missing global clockevent gracefully
	x86/boot: Save fields explicitly, zero out everything else
	x86/boot: Fix boot regression caused by bootparam sanitizing
	dm btree: fix order of block initialization in btree_split_beneath
	dm space map metadata: fix missing store of apply_bops() return value
	dm table: fix invalid memory accesses with too high sector number
	genirq: Properly pair kobject_del() with kobject_add()
	mm, page_owner: handle THP splits correctly
	mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
	xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
	Revert "perf test 6: Fix missing kvm module load for s390"
	x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
	dmaengine: ste_dma40: fix unneeded variable warning
	iommu/dma: Handle SG length overflow better
	usb: gadget: composite: Clear "suspended" on reset/disconnect
	xen/blkback: fix memory leaks
	i2c: emev2: avoid race when unregistering slave client
	usb: host: fotg2: restart hcd after port reset
	tools: hv: fix KVP and VSS daemons exit code
	watchdog: bcm2835_wdt: Fix module autoload
	scsi: ufs: Fix RX_TERMINATION_FORCE_ENABLE define value
	tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue
	ALSA: usb-audio: Fix a stack buffer overflow bug in check_input_term
	ALSA: usb-audio: Fix an OOB bug in parse_audio_mixer_unit
	tcp: make sure EPOLLOUT wont be missed
	ALSA: line6: Fix memory leak at line6_init_pcm() error path
	ALSA: seq: Fix potential concurrent access to the deleted pool
	KVM: x86: Don't update RIP or do single-step on faulting emulation
	x86/apic: Do not initialize LDR and DFR for bigsmp
	x86/apic: Include the LDR when clearing out APIC registers
	mm/zsmalloc.c: fix race condition in zs_destroy_pool
	usb-storage: Add new JMS567 revision to unusual_devs
	USB: cdc-wdm: fix race between write and disconnect due to flag abuse
	usb: chipidea: udc: don't do hardware access if gadget has stopped
	usb: host: ohci: fix a race condition between shutdown and irq
	usb: host: xhci: rcar: Fix typo in compatible string matching
	USB: storage: ums-realtek: Update module parameter description for auto_delink_en
	USB: storage: ums-realtek: Whitelist auto-delink support
	ptrace,x86: Make user_64bit_mode() available to 32-bit builds
	uprobes/x86: Fix detection of 32-bit user mode
	mmc: sdhci-of-at91: add quirk for broken HS200
	mmc: core: Fix init of SD cards reporting an invalid VDD range
	stm class: Fix a double free of stm_source_device
	VMCI: Release resource if the work is already queued
	Revert "cfg80211: fix processing world regdomain when non modular"
	mac80211: fix possible sta leak
	KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is long
	KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI
	i2c: piix4: Fix port selection for AMD Family 16h Model 30h
	x86/ptrace: fix up botched merge of spectrev1 fix
	mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
	Linux 4.9.191

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-09-06 12:16:22 +02:00
Tim Froidcoeur
669a50559a tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue
Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
triggers following stack trace:

[25244.848046] kernel BUG at ./include/linux/skbuff.h:1406!
[25244.859335] RIP: 0010:skb_queue_prev+0x9/0xc
[25244.888167] Call Trace:
[25244.889182]  <IRQ>
[25244.890001]  tcp_fragment+0x9c/0x2cf
[25244.891295]  tcp_write_xmit+0x68f/0x988
[25244.892732]  __tcp_push_pending_frames+0x3b/0xa0
[25244.894347]  tcp_data_snd_check+0x2a/0xc8
[25244.895775]  tcp_rcv_established+0x2a8/0x30d
[25244.897282]  tcp_v4_do_rcv+0xb2/0x158
[25244.898666]  tcp_v4_rcv+0x692/0x956
[25244.899959]  ip_local_deliver_finish+0xeb/0x169
[25244.901547]  __netif_receive_skb_core+0x51c/0x582
[25244.903193]  ? inet_gro_receive+0x239/0x247
[25244.904756]  netif_receive_skb_internal+0xab/0xc6
[25244.906395]  napi_gro_receive+0x8a/0xc0
[25244.907760]  receive_buf+0x9a1/0x9cd
[25244.909160]  ? load_balance+0x17a/0x7b7
[25244.910536]  ? vring_unmap_one+0x18/0x61
[25244.911932]  ? detach_buf+0x60/0xfa
[25244.913234]  virtnet_poll+0x128/0x1e1
[25244.914607]  net_rx_action+0x12a/0x2b1
[25244.915953]  __do_softirq+0x11c/0x26b
[25244.917269]  ? handle_irq_event+0x44/0x56
[25244.918695]  irq_exit+0x61/0xa0
[25244.919947]  do_IRQ+0x9d/0xbb
[25244.921065]  common_interrupt+0x85/0x85
[25244.922479]  </IRQ>

tcp_rtx_queue_tail() (called by tcp_fragment()) can call
tcp_write_queue_prev() on the first packet in the queue, which will trigger
the BUG in tcp_write_queue_prev(), because there is no previous packet.

This happens when the retransmit queue is empty, for example in case of a
zero window.

Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") was not a
simple cherry-pick of the original one from master (b617158dc096)
because there is a specific TCP rtx queue only since v4.15. For more
details, please see the commit message of b617158dc096 ("tcp: be more
careful in tcp_fragment()").

The BUG() is hit due to the specific code added to versions older than
v4.15. The comment in skb_queue_prev() (include/linux/skbuff.h:1406),
just before the BUG_ON() somehow suggests to add a check before using
it, what Tim did.

In master, this code path causing the issue will not be taken because
the implementation of tcp_rtx_queue_tail() is different:

    tcp_fragment() → tcp_rtx_queue_tail() → tcp_write_queue_prev() →
skb_queue_prev() → BUG_ON()

Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-09-06 10:19:44 +02:00
Greg Kroah-Hartman
94c1fcc8e4 This is the 4.9.189 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1P7H4ACgkQONu9yGCS
 aT4ajw//URV78175uHcZ6XC5kU4FSznr3tBbUKSmxL2SWRm8looA1iFLcItm3WTg
 jbV/9qETeGllRGuaLNLgG4AzSENh02KWjvlNk/WqpKOanjB+N/K0jnSZ3t2xqysh
 UgRN1sjYiX/kYbfbKyqXev+KrEsFDHqjrkVzd4vJEbhdCm6KOSQI835IbAcrST/w
 8NKTRinvU+OjHTIqTyjoh6h+qmJnc2HykZGGmBolZE2NmaqCagOo+eGKwd89YOWr
 JOV1DWoTRLrVRlR47ZoQ3zE62dpCgPJJpaJMlQkJRJmqLntu3rzvvQrgNSrcFT/2
 cUcnle14SXeMQb00HLBxEN1zhYkLj2QsEBd1MwASKpLjdCtwhxLAKZ7It3dNbAJ7
 N3kkjAg+4PHwHrWIvlfhwl85hrrz0QuFTRMWT6d6r/msyYhjxptTFCgiRq6pX29u
 lo9BqRSKUwwGrZlKuI7wfrafH+r6ujxzgMlAd0/ScgmrRhoWByX3luIaUlIEk5iF
 0y9KxXbQmlJCw/jOtTiw5VYlK8mvLQdpMEx5cuvK7tXuN5A2Cqc5ltjBIMo6ZBlK
 prRUJu8/GoKjWGBXYQmk6zJo8QHAh/sO6MyQXlLCRq8hHEscRX0sWnBxH6hAhsmc
 1DFb+YVdYxWZk63D3aHncNbs6GXHR4MB+Yoll8w7alH0O7jDP4I=
 =F4Ha
 -----END PGP SIGNATURE-----

Merge 4.9.189 into android-4.9-q

Changes in 4.9.189
	scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
	ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD SOM-LV
	ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD torpedo
	ARM: dts: logicpd-som-lv: Fix Audio Mute
	arm64: cpufeature: Fix CTR_EL0 field definitions
	arm64: cpufeature: Fix feature comparison for CTR_EL0.{CWG,ERG}
	tcp: be more careful in tcp_fragment()
	HID: wacom: fix bit shift for Cintiq Companion 2
	HID: Add quirk for HP X1200 PIXART OEM mouse
	RDMA: Directly cast the sockaddr union to sockaddr
	IB: directly cast the sockaddr union to aockaddr
	objtool: Add machine_real_restart() to the noreturn list
	objtool: Add rewind_stack_do_exit() to the noreturn list
	libceph: use kbasename() and kill ceph_file_part()
	atm: iphase: Fix Spectre v1 vulnerability
	net: bridge: delete local fdb on device init failure
	net: bridge: mcast: don't delete permanent entries when fast leave is enabled
	net: fix ifindex collision during namespace removal
	net/mlx5: Use reversed order when unregister devices
	net: sched: Fix a possible null-pointer dereference in dequeue_func()
	tipc: compat: allow tipc commands without arguments
	compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
	ip6_tunnel: fix possible use-after-free on xmit
	ife: error out when nla attributes are empty
	bnx2x: Disable multi-cos feature.
	block: blk_init_allocated_queue() set q->fq as NULL in the fail case
	spi: bcm2835: Fix 3-wire mode if DMA is enabled
	x86: cpufeatures: Sort feature word 7
	x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
	x86/speculation: Enable Spectre v1 swapgs mitigations
	x86/entry/64: Use JMP instead of JMPQ
	x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS
	Linux 4.9.189

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-08-11 15:30:44 +02:00
Eric Dumazet
06fac08201 tcp: be more careful in tcp_fragment()
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]

Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds the some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-11 12:22:14 +02:00
Greg Kroah-Hartman
0eb90dd8f7 This is the 4.9.187 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl1GilkACgkQONu9yGCS
 aT6ibw//QcoRk+6jjWIzW2R83tvcSpDREIYKF5xDPULqmaUKv7qcbGqdCT4OI0Vf
 CySao6jvmN+1V+5OlQHHw1qmithHF/Y4yEVi+eL0CcMgB6qA2tQxeA0ffFu0Fzxe
 oKrAIDANAJ2FKymqbIzTJU8AChNuy+Rc+C60L9O0ZhBHykLYvO7/VepTxdO2aMGs
 a9tdZyXnOAr/ZgKx+uN1F8Rx2ZorNfFmP2rDxEZ/B8pVrXsOjMAcxWRsZBv+w9mc
 zWaMPEL1vIR/kG35L18l2EFwZ++uIGvfGR2HxhNRUlTqN4m/3trATIc0eRn5PyAC
 qBlVdcwUeJKavBqK3cgHvK9CzWQdSMxNefk9A0H66ZGpfSuNCiFjw54AXEiRzx3t
 OvzUvDjfa36jsKW7yiYXdLbEa52gT14UDzmSIzlAYQLBplEadHikJ0FxCFzAWpFt
 C13gG1e4G0ZKEHH8wZMuRdanHbN87c/0Sm4rgTLMwCO6I1PeILcUwIq9El9M4RVL
 MmHSXLKduaHbPJWx+Qopl6NxjqTe/VR8paT6q1QnU00/4kP15aIVE08vZSiGSNP+
 Gp6xbv12skAx1m1zP7K82oGdwnCVCJUqlC9wafcjsaxcoCVBWvIgPkyMWAzUeFzc
 Ub2yIS9iulUeYOxPXBZKWCqgwpl7kovGztf4gwX2Kuy50mXSZQg=
 =ozfF
 -----END PGP SIGNATURE-----

Merge 4.9.187 into android-4.9-q

Changes in 4.9.187
	MIPS: ath79: fix ar933x uart parity mode
	MIPS: fix build on non-linux hosts
	arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly
	dmaengine: imx-sdma: fix use-after-free on probe error path
	ath10k: Do not send probe response template for mesh
	ath9k: Check for errors when reading SREV register
	ath6kl: add some bounds checking
	ath: DFS JP domain W56 fixed pulse type 3 RADAR detection
	batman-adv: fix for leaked TVLV handler.
	media: dvb: usb: fix use after free in dvb_usb_device_exit
	crypto: talitos - fix skcipher failure due to wrong output IV
	media: marvell-ccic: fix DMA s/g desc number calculation
	media: vpss: fix a potential NULL pointer dereference
	media: media_device_enum_links32: clean a reserved field
	net: stmmac: dwmac1000: Clear unused address entries
	net: stmmac: dwmac4/5: Clear unused address entries
	signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig
	af_key: fix leaks in key_pol_get_resp and dump_sp.
	xfrm: Fix xfrm sel prefix length validation
	media: mc-device.c: don't memset __user pointer contents
	media: staging: media: davinci_vpfe: - Fix for memory leak if decoder initialization fails.
	net: phy: Check against net_device being NULL
	crypto: talitos - properly handle split ICV.
	crypto: talitos - Align SEC1 accesses to 32 bits boundaries.
	tua6100: Avoid build warnings.
	locking/lockdep: Fix merging of hlocks with non-zero references
	media: wl128x: Fix some error handling in fm_v4l2_init_video_device()
	cpupower : frequency-set -r option misses the last cpu in related cpu list
	net: fec: Do not use netdev messages too early
	net: axienet: Fix race condition causing TX hang
	s390/qdio: handle PENDING state for QEBSM devices
	perf cs-etm: Properly set the value of 'old' and 'head' in snapshot mode
	perf test 6: Fix missing kvm module load for s390
	gpio: omap: fix lack of irqstatus_raw0 for OMAP4
	gpio: omap: ensure irq is enabled before wakeup
	regmap: fix bulk writes on paged registers
	bpf: silence warning messages in core
	rcu: Force inlining of rcu_read_lock()
	blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration
	xfrm: fix sa selector validation
	perf evsel: Make perf_evsel__name() accept a NULL argument
	vhost_net: disable zerocopy by default
	ipoib: correcly show a VF hardware address
	EDAC/sysfs: Fix memory leak when creating a csrow object
	ipsec: select crypto ciphers for xfrm_algo
	media: i2c: fix warning same module names
	ntp: Limit TAI-UTC offset
	timer_list: Guard procfs specific code
	acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
	media: coda: fix mpeg2 sequence number handling
	media: coda: increment sequence offset for the last returned frame
	mt7601u: do not schedule rx_tasklet when the device has been disconnected
	x86/build: Add 'set -e' to mkcapflags.sh to delete broken capflags.c
	mt7601u: fix possible memory leak when the device is disconnected
	ath10k: fix PCIE device wake up failed
	perf tools: Increase MAX_NR_CPUS and MAX_CACHES
	libata: don't request sense data on !ZAC ATA devices
	clocksource/drivers/exynos_mct: Increase priority over ARM arch timer
	rslib: Fix decoding of shortened codes
	rslib: Fix handling of of caller provided syndrome
	ixgbe: Check DDM existence in transceiver before access
	crypto: asymmetric_keys - select CRYPTO_HASH where needed
	EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec
	bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()
	iwlwifi: mvm: Drop large non sta frames
	net: usb: asix: init MAC address buffers
	gpiolib: Fix references to gpiod_[gs]et_*value_cansleep() variants
	Bluetooth: hci_bcsp: Fix memory leak in rx_skb
	Bluetooth: 6lowpan: search for destination address in all peers
	Bluetooth: Check state in l2cap_disconnect_rsp
	Bluetooth: validate BLE connection interval updates
	gtp: fix Illegal context switch in RCU read-side critical section.
	gtp: fix use-after-free in gtp_newlink()
	xen: let alloc_xenballooned_pages() fail if not enough memory free
	scsi: NCR5380: Reduce goto statements in NCR5380_select()
	scsi: NCR5380: Always re-enable reselection interrupt
	scsi: mac_scsi: Increase PIO/PDMA transfer length threshold
	crypto: ghash - fix unaligned memory access in ghash_setkey()
	crypto: arm64/sha1-ce - correct digest for empty data in finup
	crypto: arm64/sha2-ce - correct digest for empty data in finup
	crypto: chacha20poly1305 - fix atomic sleep when using async algorithm
	crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
	Input: gtco - bounds check collection indent level
	regulator: s2mps11: Fix buck7 and buck8 wrong voltages
	arm64: tegra: Update Jetson TX1 GPU regulator timings
	iwlwifi: pcie: don't service an interrupt that was masked
	tracing/snapshot: Resize spare buffer if size changed
	NFSv4: Handle the special Linux file open access mode
	lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
	ALSA: seq: Break too long mutex context in the write loop
	ALSA: hda/realtek: apply ALC891 headset fixup to one Dell machine
	media: v4l2: Test type instead of cfg->type in v4l2_ctrl_new_custom()
	media: coda: Remove unbalanced and unneeded mutex unlock
	KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
	arm64: tegra: Fix AGIC register range
	fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.
	drm/nouveau/i2c: Enable i2c pads & busses during preinit
	padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
	9p/virtio: Add cleanup path in p9_virtio_init
	PCI: Do not poll for PME if the device is in D3cold
	Btrfs: add missing inode version, ctime and mtime updates when punching hole
	libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
	take floppy compat ioctls to sodding floppy.c
	floppy: fix div-by-zero in setup_format_params
	floppy: fix out-of-bounds read in next_valid_format
	floppy: fix invalid pointer dereference in drive_name
	floppy: fix out-of-bounds read in copy_buffer
	coda: pass the host file in vma->vm_file on mmap
	gpu: ipu-v3: ipu-ic: Fix saturation bit offset in TPMEM
	crypto: ccp - Validate the the error value used to index error messages
	PCI: hv: Delete the device earlier from hbus->children for hot-remove
	PCI: hv: Fix a use-after-free bug in hv_eject_device_work()
	crypto: caam - limit output IV to CBC to work around CTR mode DMA issue
	um: Allow building and running on older hosts
	um: Fix FP register size for XSTATE/XSAVE
	parisc: Ensure userspace privilege for ptraced processes in regset functions
	parisc: Fix kernel panic due invalid values in IAOQ0 or IAOQ1
	powerpc/32s: fix suspend/resume when IBATs 4-7 are used
	powerpc/watchpoint: Restore NV GPRs while returning from exception
	eCryptfs: fix a couple type promotion bugs
	intel_th: msu: Fix single mode with disabled IOMMU
	Bluetooth: Add SMP workaround Microsoft Surface Precision Mouse bug
	usb: Handle USB3 remote wakeup for LPM enabled devices correctly
	dm bufio: fix deadlock with loop device
	compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
	compiler.h: Add read_word_at_a_time() function.
	lib/strscpy: Shut up KASAN false-positives in strscpy()
	ext4: allow directory holes
	bnx2x: Prevent load reordering in tx completion processing
	bnx2x: Prevent ptp_task to be rescheduled indefinitely
	caif-hsi: fix possible deadlock in cfhsi_exit_module()
	igmp: fix memory leak in igmpv3_del_delrec()
	ipv4: don't set IPv6 only flags to IPv4 addresses
	net: bcmgenet: use promisc for unsupported filters
	net: dsa: mv88e6xxx: wait after reset deactivation
	net: neigh: fix multiple neigh timer scheduling
	net: openvswitch: fix csum updates for MPLS actions
	nfc: fix potential illegal memory access
	rxrpc: Fix send on a connected, but unbound socket
	sky2: Disable MSI on ASUS P6T
	vrf: make sure skb->data contains ip header to make routing
	macsec: fix use-after-free of skb during RX
	macsec: fix checksumming after decryption
	netrom: fix a memory leak in nr_rx_frame()
	netrom: hold sock when setting skb->destructor
	bonding: validate ip header before check IPPROTO_IGMP
	tcp: Reset bytes_acked and bytes_received when disconnecting
	net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
	net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
	net: bridge: stp: don't cache eth dest pointer before skb pull
	perf/x86/amd/uncore: Rename 'L2' to 'LLC'
	perf/x86/amd/uncore: Get correct number of cores sharing last level cache
	perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
	NFSv4: Fix open create exclusive when the server reboots
	nfsd: increase DRC cache limit
	nfsd: give out fewer session slots as limit approaches
	nfsd: fix performance-limiting session calculation
	nfsd: Fix overflow causing non-working mounts on 1 TB machines
	drm/panel: simple: Fix panel_simple_dsi_probe
	usb: core: hub: Disable hub-initiated U1/U2
	tty: max310x: Fix invalid baudrate divisors calculator
	pinctrl: rockchip: fix leaked of_node references
	tty: serial: cpm_uart - fix init when SMC is relocated
	drm/bridge: tc358767: read display_props in get_modes()
	drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz
	memstick: Fix error cleanup path of memstick_init
	tty/serial: digicolor: Fix digicolor-usart already registered warning
	tty: serial: msm_serial: avoid system lockup condition
	serial: 8250: Fix TX interrupt handling condition
	drm/virtio: Add memory barriers for capset cache.
	phy: renesas: rcar-gen2: Fix memory leak at error paths
	drm/rockchip: Properly adjust to a true clock in adjusted_mode
	tty: serial_core: Set port active bit in uart_port_activate
	usb: gadget: Zero ffs_io_data
	powerpc/pci/of: Fix OF flags parsing for 64bit BARs
	PCI: sysfs: Ignore lockdep for remove attribute
	kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS
	PCI: xilinx-nwl: Fix Multi MSI data programming
	iio: iio-utils: Fix possible incorrect mask calculation
	recordmcount: Fix spurious mcount entries on powerpc
	mfd: core: Set fwnode for created devices
	mfd: arizona: Fix undefined behavior
	mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk
	um: Silence lockdep complaint about mmap_sem
	powerpc/4xx/uic: clear pending interrupt after irq type/pol change
	RDMA/i40iw: Set queue pair state when being queried
	serial: sh-sci: Terminate TX DMA during buffer flushing
	serial: sh-sci: Fix TX DMA buffer flushing and workqueue races
	kallsyms: exclude kasan local symbols on s390
	perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning
	RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
	powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
	f2fs: avoid out-of-range memory access
	mailbox: handle failed named mailbox channel request
	powerpc/eeh: Handle hugepages in ioremap space
	sh: prevent warnings when using iounmap
	mm/kmemleak.c: fix check for softirq context
	9p: pass the correct prototype to read_cache_page
	mm/mmu_notifier: use hlist_add_head_rcu()
	locking/lockdep: Fix lock used or unused stats error
	locking/lockdep: Hide unused 'class' variable
	usb: wusbcore: fix unbalanced get/put cluster_id
	usb: pci-quirks: Correct AMD PLL quirk detection
	x86/sysfb_efi: Add quirks for some devices with swapped width and height
	x86/speculation/mds: Apply more accurate check on hypervisor platform
	hpet: Fix division by zero in hpet_time_div()
	ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1
	ALSA: hda - Add a conexant codec entry to let mute led work
	powerpc/tm: Fix oops on sigreturn on systems without TM
	access: avoid the RCU grace period for the temporary subjective credentials
	ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt
	tcp: reset sk_send_head in tcp_write_queue_purge
	arm64: dts: marvell: Fix A37xx UART0 register size
	i2c: qup: fixed releasing dma without flush operation completion
	arm64: compat: Provide definition for COMPAT_SIGMINSTKSZ
	ISDN: hfcsusb: checking idx of ep configuration
	media: au0828: fix null dereference in error path
	media: cpia2_usb: first wake up, then free in disconnect
	media: radio-raremono: change devm_k*alloc to k*alloc
	Bluetooth: hci_uart: check for missing tty operations
	sched/fair: Don't free p->numa_faults with concurrent readers
	drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
	ceph: hold i_ceph_lock when removing caps for freeing inode
	Linux 4.9.187

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-08-04 09:50:32 +02:00
Soheil Hassas Yeganeh
704533394e tcp: reset sk_send_head in tcp_write_queue_purge
[ Upstream commit dbbf2d1e4077bab0c65ece2765d3fc69cf7d610f ]

tcp_write_queue_purge clears all the SKBs in the write queue
but does not reset the sk_send_head. As a result, we can have
a NULL pointer dereference anywhere that we use tcp_send_head
instead of the tcp_write_queue_tail.

For example, after a27fd7a8ed38 (tcp: purge write queue upon RST),
we can purge the write queue on RST. Prior to
75c119afe14f (tcp: implement rb-tree based retransmit queue),
tcp_push will only check tcp_send_head and then accesses
tcp_write_queue_tail to send the actual SKB. As a result, it will
dereference a NULL pointer.

This has been reported twice for 4.14 where we don't have
75c119afe14f:

By Timofey Titovets:

[  422.081094] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000038
[  422.081254] IP: tcp_push+0x42/0x110
[  422.081314] PGD 0 P4D 0
[  422.081364] Oops: 0002 [#1] SMP PTI

By Yongjian Xu:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: tcp_push+0x48/0x120
PGD 80000007ff77b067 P4D 80000007ff77b067 PUD 7fd989067 PMD 0
Oops: 0002 [#18] SMP PTI
Modules linked in: tcp_diag inet_diag tcp_bbr sch_fq iTCO_wdt
iTCO_vendor_support pcspkr ixgbe mdio i2c_i801 lpc_ich joydev input_leds shpchp
e1000e igb dca ptp pps_core hwmon mei_me mei ipmi_si ipmi_msghandler sg ses
scsi_transport_sas enclosure ext4 jbd2 mbcache sd_mod ahci libahci megaraid_sas
wmi ast ttm dm_mirror dm_region_hash dm_log dm_mod dax
CPU: 6 PID: 14156 Comm: [ET_NET 6] Tainted: G D 4.14.26-1.el6.x86_64 #1
Hardware name: LENOVO ThinkServer RD440 /ThinkServer RD440, BIOS A0TS80A
09/22/2014
task: ffff8807d78d8140 task.stack: ffffc9000e944000
RIP: 0010:tcp_push+0x48/0x120
RSP: 0018:ffffc9000e947a88 EFLAGS: 00010246
RAX: 00000000000005b4 RBX: ffff880f7cce9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff8807d00f5000
RBP: ffffc9000e947aa8 R08: 0000000000001c84 R09: 0000000000000000
R10: ffff8807d00f5158 R11: 0000000000000000 R12: ffff8807d00f5000
R13: 0000000000000020 R14: 00000000000256d4 R15: 0000000000000000
FS: 00007f5916de9700(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 00000007f8226004 CR4: 00000000001606e0
Call Trace:
tcp_sendmsg_locked+0x33d/0xe50
tcp_sendmsg+0x37/0x60
inet_sendmsg+0x39/0xc0
sock_sendmsg+0x49/0x60
sock_write_iter+0xb6/0x100
do_iter_readv_writev+0xec/0x130
? rw_verify_area+0x49/0xb0
do_iter_write+0x97/0xd0
vfs_writev+0x7e/0xe0
? __wake_up_common_lock+0x80/0xa0
? __fget_light+0x2c/0x70
? __do_page_fault+0x1e7/0x530
do_writev+0x60/0xf0
? inet_shutdown+0xac/0x110
SyS_writev+0x10/0x20
do_syscall_64+0x6f/0x140
? prepare_exit_to_usermode+0x8b/0xa0
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x3135ce0c57
RSP: 002b:00007f5916de4b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000003135ce0c57
RDX: 0000000000000002 RSI: 00007f5916de4b90 RDI: 000000000000606f
RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f5916de8c38
R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000464cc
R13: 00007f5916de8c30 R14: 00007f58d8bef080 R15: 0000000000000002
Code: 48 8b 97 60 01 00 00 4c 8d 97 58 01 00 00 41 b9 00 00 00 00 41 89 f3 4c 39
d2 49 0f 44 d1 41 81 e3 00 80 00 00 0f 85 b0 00 00 00 <80> 4a 38 08 44 8b 8f 74
06 00 00 44 89 8f 7c 06 00 00 83 e6 01
RIP: tcp_push+0x48/0x120 RSP: ffffc9000e947a88
CR2: 0000000000000038
---[ end trace 8d545c2e93515549 ]---

There is other scenario which found in stable 4.4:
Allocated:
 [<ffffffff82f380a6>] __alloc_skb+0xe6/0x600 net/core/skbuff.c:218
 [<ffffffff832466c3>] alloc_skb_fclone include/linux/skbuff.h:856 [inline]
 [<ffffffff832466c3>] sk_stream_alloc_skb+0xa3/0x5d0 net/ipv4/tcp.c:833
 [<ffffffff83249164>] tcp_sendmsg+0xd34/0x2b00 net/ipv4/tcp.c:1178
 [<ffffffff83300ef3>] inet_sendmsg+0x203/0x4d0 net/ipv4/af_inet.c:755
Freed:
 [<ffffffff82f372fd>] __kfree_skb+0x1d/0x20 net/core/skbuff.c:676
 [<ffffffff83288834>] sk_wmem_free_skb include/net/sock.h:1447 [inline]
 [<ffffffff83288834>] tcp_write_queue_purge include/net/tcp.h:1460 [inline]
 [<ffffffff83288834>] tcp_connect_init net/ipv4/tcp_output.c:3122 [inline]
 [<ffffffff83288834>] tcp_connect+0xb24/0x30c0 net/ipv4/tcp_output.c:3261
 [<ffffffff8329b991>] tcp_v4_connect+0xf31/0x1890 net/ipv4/tcp_ipv4.c:246

BUG: KASAN: use-after-free in tcp_skb_pcount include/net/tcp.h:796 [inline]
BUG: KASAN: use-after-free in tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
BUG: KASAN: use-after-free in tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
 [<ffffffff81515cd5>] kasan_report.cold.7+0x175/0x2f7 mm/kasan/report.c:408
 [<ffffffff814f9784>] __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:427
 [<ffffffff83286582>] tcp_skb_pcount include/net/tcp.h:796 [inline]
 [<ffffffff83286582>] tcp_init_tso_segs net/ipv4/tcp_output.c:1619 [inline]
 [<ffffffff83286582>] tcp_write_xmit+0x3fc2/0x4cb0 net/ipv4/tcp_output.c:2056
 [<ffffffff83287a40>] __tcp_push_pending_frames+0xa0/0x290 net/ipv4/tcp_output.c:2307

stable 4.4 and stable 4.9 don't have the commit abb4a8b870b5 ("tcp: purge write queue upon RST")
which is referred in dbbf2d1e4077,
in tcp_connect_init, it calls tcp_write_queue_purge, and does not reset sk_send_head, then UAF.

stable 4.14 have the commit abb4a8b870b5 ("tcp: purge write queue upon RST"),
in tcp_reset, it calls tcp_write_queue_purge(sk), and does not reset sk_send_head, then UAF.

So this patch can be used to fix stable 4.4 and 4.9.

Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST)
Reported-by: Timofey Titovets <nefelim4ag@gmail.com>
Reported-by: Yongjian Xu <yongjianchn@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Yongjian Xu <yongjianchn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-04 09:33:43 +02:00
Greg Kroah-Hartman
8b83e43efb This is the 4.9.182 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl0H050ACgkQONu9yGCS
 aT7N9hAAnlw6AuUf9wT2Uij1z13pfJWqlNv8kyybDoy9UcGTee9gL/hAs7K6hGgF
 g9x4FD3cZgheJRWrOcieh1IQZHGt/7ObFsbuLKvd1T1kM205DitfY3gDd+cUHfdx
 I4oIhv2nSy9nPfTfrL13yaHhQIt4a+mMeTF6x3+TShjpwszDBlFEYAxFnEyHXA85
 CZRpiq/PDJBb154rGIaaK1uBCi5SstnVzCm2aw5o0Vowy5D7CCM1lOYmFFlI0I8q
 iuLDC+cp2EOu+VJOpwejZEkvqoCp+d5DIl+ZnxhIL/AuGqJ75JuhDjhvZR9seymj
 kDNHUath7ITBwVRxtDdXpR8w0DdURJEcmP+4aod5XE5uo2wYuDvCw1M6AHKyS8rA
 ZSMpuPtVoOm/0UMy4AZ98uZ60vRGyv9OFDyvZ+8CPm+bhQdKw81nmcBW6193jFQr
 FFSeES5ihJfDQJ65hcHIXesQYfMROXChWx1igIoYtjW6lSjVZJypKxAPq0aTfr7W
 F6O1oWSi5bcVlyEw8ghWWgfm5+TARCHihMSsv/dkudMVyl6cM7bu1qbhoiKG/jQB
 co4MsaWfqtb9uFBPqyJXW0MSa5w+s4RD+EydxuAimjfm6D59WgP0L+6Zno1VtJP5
 9YoBbjrezb8jXcgDiYqxysBDC3uvIMT+MbqKXMF8hEXIImITJCM=
 =BFdC
 -----END PGP SIGNATURE-----

Merge 4.9.182 into android-4.9-q

Changes in 4.9.182
	tcp: reduce tcp_fastretrans_alert() verbosity
	tcp: limit payload size of sacked skbs
	tcp: tcp_fragment() should apply sane memory limits
	tcp: add tcp_min_snd_mss sysctl
	tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
	Linux 4.9.182

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-06-17 20:29:43 +02:00
Eric Dumazet
cc1b58ccb7 tcp: limit payload size of sacked skbs
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.

Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Backport notes, provided by Joao Martins <joao.m.martins@oracle.com>

v4.15 or since commit 737ff314563 ("tcp: use sequence distance to
detect reordering") had switched from the packet-based FACK tracking and
switched to sequence-based.

v4.14 and older still have the old logic and hence on
tcp_skb_shift_data() needs to retain its original logic and have
@fack_count in sync. In other words, we keep the increment of pcount with
tcp_skb_pcount(skb) to later used that to update fack_count. To make it
more explicit we track the new skb that gets incremented to pcount in
@next_pcount, and we get to avoid the constant invocation of
tcp_skb_pcount(skb) all together.

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:53:32 +02:00
Greg Kroah-Hartman
fd5657a6c7 This is the 4.9.160 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxw/ucACgkQONu9yGCS
 aT4vaRAAwEs6ppuH93NYZ1v1w2CA7PZLx7MAR3aiEr8rWQ5BnkmowPHvn0kC3VTs
 643XN+iKfaq1SuWfdOs7+ACu1QGBfVsdyQhgIKwspAe7C6v284AF0NfsifnuN8/q
 0eAgzFFkfngBgoh3oGLaeB0oPia3lSB2zG6hC2cyjeiEYEDcvJUA/ZHl9X1zFsJQ
 J9Ikicn1b2gz6/N5VKqrBokCXcrz184Yz8yRrC0rK8VFq0N9N3VZA2NyWmb9/Iqp
 Szj//Rh5LyjgrNSJHk0blNqB/5OdS7VsFl6LXuvE7NmUSLJJ0ou/BGLjw9R6TcOv
 XFIvuMDw0D/dm/icKprG1LuVYfOomoNu82YMz8K96ymt7BS/SAELHFktzvK1s104
 ITS2IvBhpqSPp86dx1vkmo4NEyKUSrff1sLIssjpd9xQMt1+SVP7O7kn02GgRCXz
 T8PITSV2IQhHeNeBZVD8W4cLsrqn3sXFWDAVhmIw4J0VK6ghEGfaIBiwquRtNaz/
 EsXSFKFs2hV++G8+f6vwQpHGyVSopGrgvvEEdqpWLcgjnYt1NhpfNxbEOBkfXXSd
 U0NN1EYs9ade9fVcXrZze9Z8QVF6s4Rdf5unQs64iCp7FvowqzwshJuOoJTz1MB/
 ugCFieeAZXwO7tlLoMiUG+j/k0BNdhNWPx8o7sfQf2teTWNfWtU=
 =XYhR
 -----END PGP SIGNATURE-----

Merge 4.9.160 into android-4.9

Changes in 4.9.160
	net: fix IPv6 prefix route residue
	vsock: cope with memory allocation failure at socket creation time
	hwmon: (lm80) Fix missing unlock on error in set_fan_div()
	net: Fix for_each_netdev_feature on Big endian
	net: phy: xgmiitorgmii: Support generic PHY status read
	net: stmmac: handle endianness in dwmac4_get_timestamp
	sky2: Increase D3 delay again
	vhost: correctly check the return value of translate_desc() in log_used()
	net: Add header for usage of fls64()
	tcp: tcp_v4_err() should be more careful
	net: Do not allocate page fragments that are not skb aligned
	tcp: clear icsk_backoff in tcp_write_queue_purge()
	vxlan: test dev->flags & IFF_UP before calling netif_rx()
	net: stmmac: Fix a race in EEE enable callback
	net: ipv4: use a dedicated counter for icmp_v4 redirect packets
	btrfs: Remove false alert when fiemap range is smaller than on-disk extent
	net/x25: do not hold the cpu too long in x25_new_lci()
	mISDN: fix a race in dev_expire_timer()
	ax25: fix possible use-after-free
	Linux 4.9.160

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-23 10:06:00 +01:00
Eric Dumazet
1f52cfe301 tcp: clear icsk_backoff in tcp_write_queue_purge()
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ]

soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.

Current logic should have prevented this :

  if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
      !icsk->icsk_backoff || fastopen)
      break;

Problem is the write queue might have been purged
and icsk_backoff has not been cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23 09:05:59 +01:00
Greg Kroah-Hartman
6a1b592354 This is the 4.9.124 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlt/6CsACgkQONu9yGCS
 aT7KOw//boywfZ/3pWvs5bp6MsNBrYxa9mzvv7MDQFj10nibTdwY+XFf7O370j3q
 Udt3A3V4a57KnploVglMSv4452LETDhMFdlQlxVTMcWAISiZoviVemDTi6GoRCwI
 PJg30COP3kE4u51O5i4sGDrff9mRiUPH2XR4JP27CY1aYbjE1IYoOTfB7jUSZhNn
 nzghkYLuxIpXc3GlNiWwXSxjRoDz9hw/0jaY4oZC9CRGi/Z/YBqvXLvj9dMO/2R2
 0R1ftiI/CEpuQ10XZ9z3h2WlgiWz9Pj183qBMwCUVR0IDVRo+A6lN1Zz7X+eG0no
 +CsDAhwfxu2BmDZ4eTDCgcKTk1LDmSVAo3cOoIdW+UkNbeoz99gFivy+jhEyKwQA
 KYfMciX31qmrbVfqGT9mZTsDgrYnLCqOmRnhIoAbA3Tb/4gzO6pQXt08eGK4l+LI
 u2Yv4L3BiNtZlRPBZYaibtnAAWOPZEuMIf/cRxXJ4//AcWsLirS1Dy3Z70z7J6ew
 jkxtDf2F34YYrj1QGRAKfELiZJOJYX7Ig81y80M4tBD2sjRfb29z7QEK5XbDrIID
 OJ8Kh57YrdnkGBOFbG8Yvfy4QV1js8VsReasK1pgad8iL8suBYPTPQwmUztGWXz5
 AQsU+aQxNYicTtsRIfEN4Tr1Nb/2IqfShC+lwFpcagMvjzjDT/Q=
 =GH5x
 -----END PGP SIGNATURE-----

Merge 4.9.124 into android-4.9

Changes in 4.9.124
	x86/entry/64: Remove %ebx handling from error_entry/exit
	ARC: Explicitly add -mmedium-calls to CFLAGS
	usb: dwc3: of-simple: fix use-after-free on remove
	netfilter: ipv6: nf_defrag: reduce struct net memory waste
	selftests: pstore: return Kselftest Skip code for skipped tests
	selftests: static_keys: return Kselftest Skip code for skipped tests
	selftests: user: return Kselftest Skip code for skipped tests
	selftests: zram: return Kselftest Skip code for skipped tests
	selftests: sync: add config fragment for testing sync framework
	ARM: dts: NSP: Fix i2c controller interrupt type
	ARM: dts: NSP: Fix PCIe controllers interrupt types
	ARM: dts: Cygnus: Fix I2C controller interrupt type
	ARM: dts: Cygnus: Fix PCIe controller interrupt type
	arm64: dts: ns2: Fix I2C controller interrupt type
	drm: mali-dp: Enable Global SE interrupts mask for DP500
	IB/rxe: Fix missing completion for mem_reg work requests
	libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()
	usb: dwc2: fix isoc split in transfer with no data
	usb: gadget: composite: fix delayed_status race condition when set_interface
	usb: gadget: dwc2: fix memory leak in gadget_init()
	xen: add error handling for xenbus_printf
	scsi: xen-scsifront: add error handling for xenbus_printf
	xen/scsiback: add error handling for xenbus_printf
	arm64: make secondary_start_kernel() notrace
	qed: Add sanity check for SIMD fastpath handler.
	enic: initialize enic->rfs_h.lock in enic_probe
	net: hamradio: use eth_broadcast_addr
	net: propagate dev_get_valid_name return code
	net: stmmac: socfpga: add additional ocp reset line for Stratix10
	nvmet: reset keep alive timer in controller enable
	ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
	net: davinci_emac: match the mdio device against its compatible if possible
	KVM: arm/arm64: Drop resource size check for GICV window
	locking/lockdep: Do not record IRQ state within lockdep code
	ipv6: mcast: fix unsolicited report interval after receiving querys
	Smack: Mark inode instant in smack_task_to_inode
	batman-adv: Fix bat_ogm_iv best gw refcnt after netlink dump
	batman-adv: Fix bat_v best gw refcnt after netlink dump
	cxgb4: when disabling dcb set txq dcb priority to 0
	iio: pressure: bmp280: fix relative humidity unit
	brcmfmac: stop watchdog before detach and free everything
	ARM: dts: am437x: make edt-ft5x06 a wakeup source
	ALSA: seq: Fix UBSAN warning at SNDRV_SEQ_IOCTL_QUERY_NEXT_CLIENT ioctl
	usb: xhci: remove the code build warning
	usb: xhci: increase CRS timeout value
	NFC: pn533: Fix wrong GFP flag usage
	perf test session topology: Fix test on s390
	perf report powerpc: Fix crash if callchain is empty
	perf bench: Fix numa report output code
	netfilter: nf_log: fix uninit read in nf_log_proc_dostring
	ceph: fix dentry leak in splice_dentry()
	selftests/x86/sigreturn/64: Fix spurious failures on AMD CPUs
	selftests/x86/sigreturn: Do minor cleanups
	ARM: dts: da850: Fix interrups property for gpio
	dmaengine: pl330: report BURST residue granularity
	dmaengine: k3dma: Off by one in k3_of_dma_simple_xlate()
	md/raid10: fix that replacement cannot complete recovery after reassemble
	nl80211: relax ht operation checks for mesh
	drm/exynos: gsc: Fix support for NV16/61, YUV420/YVU420 and YUV422 modes
	drm/exynos: decon5433: Fix per-plane global alpha for XRGB modes
	drm/exynos: decon5433: Fix WINCONx reset value
	bpf, s390: fix potential memleak when later bpf_jit_prog fails
	PCI: xilinx: Add missing of_node_put()
	PCI: xilinx-nwl: Add missing of_node_put()
	bnx2x: Fix receiving tx-timeout in error or recovery state.
	acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value
	m68k: fix "bad page state" oops on ColdFire boot
	objtool: Support GCC 8 '-fnoreorder-functions'
	ipvlan: call dev_change_flags when ipvlan mode is reset
	HID: wacom: Correct touch maximum XY of 2nd-gen Intuos
	ARM: imx_v6_v7_defconfig: Select ULPI support
	ARM: imx_v4_v5_defconfig: Select ULPI support
	tracing: Use __printf markup to silence compiler
	kasan: fix shadow_size calculation error in kasan_module_alloc
	smsc75xx: Add workaround for gigabit link up hardware errata.
	samples/bpf: add missing <linux/if_vlan.h>
	samples/bpf: Check the error of write() and read()
	ieee802154: 6lowpan: set IFLA_LINK
	netfilter: x_tables: set module owner for icmp(6) matches
	ipv6: make ipv6_renew_options() interrupt/kernel safe
	net: qrtr: Broadcast messages only from control port
	sh_eth: fix invalid context bug while calling auto-negotiation by ethtool
	sh_eth: fix invalid context bug while changing link options by ethtool
	ravb: fix invalid context bug while calling auto-negotiation by ethtool
	ravb: fix invalid context bug while changing link options by ethtool
	ARM: pxa: irq: fix handling of ICMR registers in suspend/resume
	net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used
	ieee802154: at86rf230: switch from BUG_ON() to WARN_ON() on problem
	ieee802154: at86rf230: use __func__ macro for debug messages
	ieee802154: fakelb: switch from BUG_ON() to WARN_ON() on problem
	drm/armada: fix colorkey mode property
	netfilter: nf_conntrack: Fix possible possible crash on module loading.
	ARC: Improve cmpxchg syscall implementation
	bnxt_en: Always set output parameters in bnxt_get_max_rings().
	bnxt_en: Fix for system hang if request_irq fails
	perf llvm-utils: Remove bashism from kernel include fetch script
	nfit: fix unchecked dereference in acpi_nfit_ctl
	RDMA/mlx5: Fix memory leak in mlx5_ib_create_srq() error path
	ARM: 8780/1: ftrace: Only set kernel memory back to read-only after boot
	ARM: DRA7/OMAP5: Enable ACTLR[0] (Enable invalidates of BTB) for secondary cores
	ARM: dts: am3517.dtsi: Disable reference to OMAP3 OTG controller
	ixgbe: Be more careful when modifying MAC filters
	tools: build: Use HOSTLDFLAGS with fixdep
	packet: reset network header if packet shorter than ll reserved space
	qlogic: check kstrtoul() for errors
	tcp: remove DELAYED ACK events in DCTCP
	pinctrl: nsp: off by ones in nsp_pinmux_enable()
	pinctrl: nsp: Fix potential NULL dereference
	drm/nouveau/gem: off by one bugs in nouveau_gem_pushbuf_reloc_apply()
	net/ethernet/freescale/fman: fix cross-build error
	net: usb: rtl8150: demote allmulti message to dev_dbg()
	PCI: OF: Fix I/O space page leak
	PCI: versatile: Fix I/O space page leak
	net: qca_spi: Avoid packet drop during initial sync
	net: qca_spi: Make sure the QCA7000 reset is triggered
	net: qca_spi: Fix log level if probe fails
	tcp: identify cryptic messages as TCP seq # bugs
	KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
	ext4: fix spectre gadget in ext4_mb_regular_allocator()
	parisc: Remove ordered stores from syscall.S
	xfrm_user: prevent leaking 2 bytes of kernel memory
	netfilter: conntrack: dccp: treat SYNC/SYNCACK as invalid if no prior state
	packet: refine ring v3 block size test to hold one frame
	parisc: Remove unnecessary barriers from spinlock.h
	PCI: hotplug: Don't leak pci_slot on registration failure
	PCI: Skip MPS logic for Virtual Functions (VFs)
	PCI: pciehp: Fix use-after-free on unplug
	PCI: pciehp: Fix unprotected list iteration in IRQ handler
	i2c: imx: Fix race condition in dma read
	reiserfs: fix broken xattr handling (heap corruption, bad retval)
	Linux 4.9.124

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-08-24 13:23:30 +02:00
Yuchung Cheng
d4efb85f5f tcp: remove DELAYED ACK events in DCTCP
[ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]

After fixing the way DCTCP tracking delayed ACKs, the delayed-ACK
related callbacks are no longer needed

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-24 13:12:39 +02:00
Greg Kroah-Hartman
47b77b8d01 This is the 4.9.118 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltoWckACgkQONu9yGCS
 aT4rdxAAyB4LLh5ylp8b2wEbpSWpOIRGfb1Y78VLf8T3TPsCo/46pTgOPVwpGpeJ
 O9QDBcPBEwqJVJEYW0Hf5PBj/JhVGw9uQ4JM+6Tuy1BoZmlfxmUgQz2NotSAAxUD
 b5ymy5LnMOoM+GX2IsPILsz0h54NGTlQtdjH2C6dUYx/u8uWzUwgW1eXPdc+m++7
 OSWSQ276jZs0oAYgsS5r0GBpe5C+G72dRVDD0uRKTNQEsmSdCOTX6BzaxBzll4yQ
 gaZTQre0Sgmv6cyl0rJ6JqdyNECN1i+aw3oSU75Zr+1cfaRPh+8APtN0PW6HUV47
 WO08k1/0L5HA/EOU6YI4QwNcQS8yv+H0avmsDwnXc8a2NgKpLFlV+LjAQA2jDnTJ
 CWFkLFyfkFtYM/W1Xglyo7OyA1o1BmoZVzjiPECRtW2RqVfl9hORqH4gMtxoHxy2
 maE0he/FcVp6iu9hoas2g7V7T/O6UF2ipYWG/+WZBuZY3SjojNth/MKuQ7E+qLY5
 UDBMx9CCAjYqAKN4A+aMCAfociV5vTAeQLbwc1ffa4JtqX88nDQxAp7SBP8beEWc
 CQsnCvksTdqebeDN0DWcRbSs1abjjeZcoWiifdwGVwwiE5D1RgLZxrABaNEX4XJ6
 lQNUYzMuT8D9MzEoDn0TB5mLgIvxdA5gQzwWMV30h5f3fXax1ro=
 =qE4w
 -----END PGP SIGNATURE-----

Merge 4.9.118 into android-4.9

Changes in 4.9.118
	ipv4: remove BUG_ON() from fib_compute_spec_dst
	net: ena: Fix use of uninitialized DMA address bits field
	net: fix amd-xgbe flow-control issue
	net: lan78xx: fix rx handling before first packet is send
	net: mdio-mux: bcm-iproc: fix wrong getter and setter pair
	NET: stmmac: align DMA stuff to largest cache line length
	tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
	xen-netfront: wait xenbus state change when load module manually
	tcp: do not force quickack when receiving out-of-order packets
	tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
	tcp: do not aggressively quick ack after ECN events
	tcp: refactor tcp_ecn_check_ce to remove sk type cast
	tcp: add one more quick ack after after ECN events
	pinctrl: intel: Read back TX buffer state
	sched/wait: Remove the lockless swait_active() check in swake_up*()
	bonding: avoid lockdep confusion in bond_get_stats()
	inet: frag: enforce memory limits earlier
	ipv4: frags: handle possible skb truesize change
	net: dsa: Do not suspend/resume closed slave_dev
	netlink: Fix spectre v1 gadget in netlink_create()
	net: stmmac: Fix WoL for PCI-based setups
	squashfs: more metadata hardening
	squashfs: more metadata hardenings
	can: ems_usb: Fix memory leak on ems_usb_disconnect()
	net: socket: fix potential spectre v1 gadget in socketcall
	virtio_balloon: fix another race between migration and ballooning
	kvm: x86: vmx: fix vpid leak
	crypto: padlock-aes - Fix Nano workaround data corruption
	drm/vc4: Reset ->{x, y}_scaling[1] when dealing with uniplanar formats
	scsi: sg: fix minor memory leak in error path
	Linux 4.9.118

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-08-06 19:09:58 +02:00
Eric Dumazet
90cf17d665 tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
[ Upstream commit 9a9c9b51e54618861420093ae6e9b50a961914c5 ]

We want to add finer control of the number of ACK packets sent after
ECN events.

This patch is not changing current behavior, it only enables following
change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06 16:23:02 +02:00
Greg Kroah-Hartman
52be322125 This is the 4.9.116 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAltcA9oACgkQONu9yGCS
 aT4NZg/7Bbf10v5kf18jZeolQkj1GyVv7+tI1fAlM3tT5BGEqAZDUdp383YegQih
 YRV5Q7fsdPxqyoXAfQosKdjViBowWglzWJE2YKZHRIOkOBSC0mlhfhNiqwp9owlQ
 /JHGwfPhYaJt9Oyuc/OZ3iq/KNe8gm29OuFnQd8pKp8mFakpyiEVcLSeqHUjGQ9P
 BBM0H9+F/16iOOVcOqQvbG7rza9AjPXeTLGcMf63Nah6qLSvuH3il/v42N5XXOuJ
 iXozco9ifh3BxC/vP3sHrt+BCUeUsNbLUdZO1gZIpybd1byJAbQSPkN8v9jgNZbG
 j7xMfMecsUNVsPpv8i8f7Zbh7PDYx+XGk6ufArmYItmp3X65gO+rrxbme+pSvKib
 g8x0952+u+ddnyEPH/DcypTI/WU2qeAfXk4HEbeeYiZZxOUmF76XNn55YZW8xpqj
 jJi9CaXHiXQpje2a8KGMR3b37T3f5fntOn4rIWT/isaqbqms8j/3b9AYf9yEEGZ1
 b05787d6ybHQrMVi9nTXKrRAQqlnKpZZWdsOPvrrV9jO5TnYyDy2RB9/19SEpkdj
 kD6lsMlL//o6TRFDIdph9Kg1sm2rFnkT78Hc/RZJ5t27+CM2YfvrLr1+k4G15QqG
 N2h+0naYkA6dc052i0kbL0cQGXngeoBeINAKOcXyom99p/rFaKA=
 =gXo7
 -----END PGP SIGNATURE-----

Merge 4.9.116 into android-4.9

Changes in 4.9.116
	MIPS: ath79: fix register address in ath79_ddr_wb_flush()
	MIPS: Fix off-by-one in pci_resource_to_user()
	ip: hash fragments consistently
	ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
	net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
	net: skb_segment() should not return NULL
	net/mlx5: Adjust clock overflow work period
	net/mlx5e: Don't allow aRFS for encapsulated packets
	net/mlx5e: Fix quota counting in aRFS expire flow
	multicast: do not restore deleted record source filter mode to new one
	net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv
	rtnetlink: add rtnl_link_state check in rtnl_configure_link
	tcp: fix dctcp delayed ACK schedule
	tcp: helpers to send special DCTCP ack
	tcp: do not cancel delay-AcK on DCTCP special ACK
	tcp: do not delay ACK in DCTCP upon CE status change
	tcp: free batches of packets in tcp_prune_ofo_queue()
	tcp: avoid collapses in tcp_prune_queue() if possible
	tcp: detect malicious patterns in tcp_collapse_ofo_queue()
	tcp: call tcp_drop() from tcp_data_queue_ofo()
	usb: cdc_acm: Add quirk for Castles VEGA3000
	usb: core: handle hub C_PORT_OVER_CURRENT condition
	usb: gadget: f_fs: Only return delayed status when len is 0
	driver core: Partially revert "driver core: correct device's shutdown order"
	can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK
	can: xilinx_can: fix power management handling
	can: xilinx_can: fix recovery from error states not being propagated
	can: xilinx_can: fix device dropping off bus on RX overrun
	can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
	can: xilinx_can: fix incorrect clear of non-processed interrupts
	can: xilinx_can: fix RX overflow interrupt not being enabled
	turn off -Wattribute-alias
	exec: avoid gcc-8 warning for get_task_comm
	Linux 4.9.116

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-07-31 21:27:31 +02:00
Yuchung Cheng
8736711f4e tcp: do not delay ACK in DCTCP upon CE status change
[ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]

Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:

""" When receiving packets, the CE codepoint MUST be processed as follows:

   1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
       true and send an immediate ACK.

   2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
       to false and send an immediate ACK.
"""

Previously DCTCP implementation may continue to delay the ACK. This
patch fixes that to implement the RFC by forcing an immediate ACK.

Tested with this packetdrill script provided by Larry Brakmo

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
   +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257

+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001

+0.500 < F. 9501:9501(0) ack 4 win 257

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28 07:49:12 +02:00
Yuchung Cheng
57ec8824b1 tcp: do not cancel delay-AcK on DCTCP special ACK
[ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]

Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).

Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.

The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28 07:49:12 +02:00
Greg Kroah-Hartman
960923fdc2 This is the 4.9.89 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlqzZsQACgkQONu9yGCS
 aT6GpRAA0smp5JfDXHUaQ0/6Syjo0bppqAdXZAkOeRYGEsZC1YG27rmB1YTk7R3X
 6pwLRXDFP8264A7Pks4+fbE2CUv6Rt1qS4V6t3CVInzYVshVfThnZkl+cFXYWqTg
 P9q7oW7M/nXp73kUJIu2Q+d3NTWozbjHAffwXzaM3XschGEA9FghiDoZToemG7gV
 DmKstfUlnLb5M6ljXQla44UQWpPCFL9U5EAXHsEphz5nR7H5fTOvmXd38z2Pxmu8
 F+wmVeqS3VzJA4otefClQMH78ZsEkImCJwGx6B2q5KIcVxJRprc0EC/Mxb3oSozR
 3W7oxuOhGXV84oo9hY9LVett20ZAyDqcEY7RXaUdaaxC0XnZujgIwiHEUfWH5o5S
 7o/kM5DRO7nKDqt9u+O8nXZU9dKPUAV5kPmMWBUlgGslKD9z3yat9z71h4nANd2Y
 uQRlXO3CPrBIxTgLqdDRcAtStJoRBjDZxm1X1IYZwG5DjX6oLLYFVjE2rdD6OrOv
 5Yi4NQ4qzVt3CcePyactJJQLMGf9/PI9zMXOsefKK6KwApud7zjxl+HVvGaFkDXt
 ONkcHW1F9H7AnExaAyIfZiuWlDoOii58XIIAme/xKStCx4jRJD99d6o4fJomwTUq
 Hpqe/6XgwmubKsRSn57e45RqtmrtXTAPlcGWDifd//PB1udoHFc=
 =jQUN
 -----END PGP SIGNATURE-----

Merge 4.9.89 into android-4.9

Changes in 4.9.89
	blkcg: fix double free of new_blkg in blkcg_init_queue
	Input: tsc2007 - check for presence and power down tsc2007 during probe
	perf stat: Issue a HW watchdog disable hint
	staging: speakup: Replace BUG_ON() with WARN_ON().
	staging: wilc1000: add check for kmalloc allocation failure.
	HID: reject input outside logical range only if null state is set
	drm: qxl: Don't alloc fbdev if emulation is not supported
	ARM: dts: r8a7791: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7792: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7793: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7794: Remove unit-address and reg from integrated cache
	arm64: dts: r8a7796: Remove unit-address and reg from integrated cache
	drm/sun4i: Fix up error path cleanup for master bind function
	drm/sun4i: Set drm_crtc.port to the underlying TCON's output port node
	ath10k: fix a warning during channel switch with multiple vaps
	drm/sun4i: Fix TCON clock and regmap initialization sequence
	PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
	selinux: check for address length in selinux_socket_bind()
	x86/mm: Make mmap(MAP_32BIT) work correctly
	perf sort: Fix segfault with basic block 'cycles' sort dimension
	x86/mce: Handle broadcasted MCE gracefully with kexec
	eventpoll.h: fix epoll event masks
	i40e: Acquire NVM lock before reads on all devices
	i40e: fix ethtool to get EEPROM data from X722 interface
	perf tools: Make perf_event__synthesize_mmap_events() scale
	ARM: brcmstb: Enable ZONE_DMA for non 64-bit capable peripherals
	drivers: net: xgene: Fix hardware checksum setting
	drivers: net: phy: xgene: Fix mdio write
	drivers: net: xgene: Fix wrong logical operation
	drivers: net: xgene: Fix Rx checksum validation logic
	drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
	ath10k: disallow DFS simulation if DFS channel is not enabled
	ath10k: fix fetching channel during potential radar detection
	usb: misc: lvs: fix race condition in disconnect handling
	ARM: bcm2835: Enable missing CMA settings for VC4 driver
	net: ethernet: bgmac: Allow MAC address to be specified in DTB
	netem: apply correct delay when rate throttling
	x86/mce: Init some CPU features early
	omapfb: dss: Handle return errors in dss_init_ports()
	perf probe: Fix concat_probe_trace_events
	perf probe: Return errno when not hitting any event
	HID: clamp input to logical range if no null state
	net/8021q: create device with all possible features in wanted_features
	ARM: dts: Adjust moxart IRQ controller and flags
	qed: Always publish VF link from leading hwfn
	s390/topology: fix typo in early topology code
	zd1211rw: fix NULL-deref at probe
	batman-adv: handle race condition for claims between gateways
	of: fix of_device_get_modalias returned length when truncating buffers
	solo6x10: release vb2 buffers in solo_stop_streaming()
	x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up
	scsi: fnic: Fix for "Number of Active IOs" in fnicstats becoming negative
	scsi: ipr: Fix missed EH wakeup
	media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
	timers, sched_clock: Update timeout for clock wrap
	sysrq: Reset the watchdog timers while displaying high-resolution timers
	Input: qt1070 - add OF device ID table
	sched: act_csum: don't mangle TCP and UDP GSO packets
	PCI: hv: Properly handle PCI bus remove
	PCI: hv: Lock PCI bus on device eject
	ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
	spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
	tcp: sysctl: Fix a race to avoid unexpected 0 window from space
	dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
	usb: dwc3: make sure UX_EXIT_PX is cleared
	ARM: dts: bcm2835: add index to the ethernet alias
	perf annotate: Fix a bug following symbolic link of a build-id file
	perf buildid: Do not assume that readlink() returns a null terminated string
	i40e/i40evf: Fix use after free in Rx cleanup path
	scsi: be2iscsi: Check tag in beiscsi_mccq_compl_wait
	driver: (adm1275) set the m,b and R coefficients correctly for power
	bonding: make speed, duplex setting consistent with link state
	mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
	ALSA: firewire-lib: add a quirk of packet without valid EOH in CIP format
	ARM: dts: r8a7794: Add DU1 clock to device tree
	ARM: dts: r8a7794: Correct clock of DU1
	ARM: dts: silk: Correct clock of DU1
	blk-throttle: make sure expire time isn't too big
	regulator: core: Limit propagation of parent voltage count and list
	perf trace: Handle unpaired raw_syscalls:sys_exit event
	f2fs: relax node version check for victim data in gc
	drm/ttm: never add BO that failed to validate to the LRU list
	bonding: refine bond_fold_stats() wrap detection
	PCI: Apply Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices
	powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
	braille-console: Fix value returned by _braille_console_setup
	drm/vmwgfx: Fixes to vmwgfx_fb
	vxlan: vxlan dev should inherit lowerdev's gso_max_size
	NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
	NFC: nfcmrvl: double free on error path
	NFC: pn533: change order of free_irq and dev unregistration
	ARM: dts: r7s72100: fix ethernet clock parent
	ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
	ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
	ARM: dts: r8a7793: Correct parent of SSI[0-9] clocks
	powerpc: Avoid taking a data miss on every userspace instruction miss
	net: hns: Correct HNS RSS key set function
	net/faraday: Add missing include of of.h
	qed: Fix TM block ILT allocation
	rtmutex: Fix PI chain order integrity
	printk: Correctly handle preemption in console_unlock()
	drm: rcar-du: Handle event when disabling CRTCs
	ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
	reiserfs: Make cancel_old_flush() reliable
	ASoC: rt5677: Add OF device ID table
	IB/hfi1: Check for QSFP presence before attempting reads
	ALSA: firewire-digi00x: add support for console models of Digi00x series
	ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
	fm10k: correctly check if interface is removed
	EDAC, altera: Fix peripheral warnings for Cyclone5
	scsi: ses: don't get power status of SES device slot on probe
	qed: Correct MSI-x for storage
	apparmor: Make path_max parameter readonly
	iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
	kvm/svm: Setup MCG_CAP on AMD properly
	kvm: nVMX: Disallow userspace-injected exceptions in guest mode
	video: ARM CLCD: fix dma allocation size
	drm/radeon: Fail fb creation from imported dma-bufs.
	drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
	drm/rockchip: vop: Enable pm domain before vop_initial
	i40e: only register client on iWarp-capable devices
	coresight: Fixes coresight DT parse to get correct output port ID.
	lkdtm: turn off kcov for lkdtm_rodata_do_nothing:
	tty: amba-pl011: Fix spurious TX interrupts
	serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
	MIPS: BPF: Quit clobbering callee saved registers in JIT code.
	MIPS: BPF: Fix multiple problems in JIT skb access helpers.
	MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
	MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
	v4l: vsp1: Prevent multiple streamon race commencing pipeline early
	v4l: vsp1: Register pipe with output WPF
	regulator: isl9305: fix array size
	md/raid6: Fix anomily when recovering a single device in RAID6.
	md.c:didn't unlock the mddev before return EINVAL in array_size_store
	powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()
	usb: dwc2: Make sure we disconnect the gadget state
	usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
	perf evsel: Return exact sub event which failed with EPERM for wildcards
	iwlwifi: mvm: fix RX SKB header size and align it properly
	drivers/perf: arm_pmu: handle no platform_device
	perf inject: Copy events when reordering events in pipe mode
	net: fec: add phy-reset-gpios PROBE_DEFER check
	perf session: Don't rely on evlist in pipe mode
	vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check
	vfio/spapr_tce: Check kzalloc() return when preregistering memory
	scsi: sg: check for valid direction before starting the request
	scsi: sg: close race condition in sg_remove_sfp_usercontext()
	ALSA: hda: Add Geminilake id to SKL_PLUS
	kprobes/x86: Fix kprobe-booster not to boost far call instructions
	kprobes/x86: Set kprobes pages read-only
	pwm: tegra: Increase precision in PWM rate calculation
	clk: qcom: msm8996: Fix the vfe1 powerdomain name
	Bluetooth: Avoid bt_accept_unlink() double unlinking
	Bluetooth: 6lowpan: fix delay work init in add_peer_chan()
	mac80211_hwsim: use per-interface power level
	ath10k: fix compile time sanity check for CE4 buffer size
	wil6210: fix protection against connections during reset
	wil6210: fix memory access violation in wil_memcpy_from/toio_32
	perf stat: Fix bug in handling events in error state
	mwifiex: Fix invalid port issue
	drm/edid: set ELD connector type in drm_edid_to_eld()
	video/hdmi: Allow "empty" HDMI infoframes
	HID: elo: clear BTN_LEFT mapping
	iwlwifi: mvm: rs: don't override the rate history in the search cycle
	clk: meson: gxbb: fix wrong clock for SARADC/SANA
	ARM: dts: exynos: Correct Trats2 panel reset line
	sched: Stop switched_to_rt() from sending IPIs to offline CPUs
	sched: Stop resched_cpu() from sending IPIs to offline CPUs
	test_firmware: fix setting old custom fw path back on exit
	net: ieee802154: adf7242: Fix bug if defined DEBUG
	net: xfrm: allow clearing socket xfrm policies.
	mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
	net: thunderx: Set max queue count taking XDP_TX into account
	ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
	ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
	mtd: nand: ifc: update bufnum mask for ver >= 2.0.0
	userns: Don't fail follow_automount based on s_user_ns
	leds: pm8058: Silence pointer to integer size warning
	power: supply: ab8500_charger: Fix an error handling path
	power: supply: ab8500_charger: Bail out in case of error in 'ab8500_charger_init_hw_registers()'
	ath10k: update tdls teardown state to target
	scsi: ses: don't ask for diagnostic pages repeatedly during probe
	pwm: stmpe: Fix wrong register offset for hwpwm=2 case
	clk: qcom: msm8916: fix mnd_width for codec_digcodec
	mwifiex: cfg80211: do not change virtual interface during scan processing
	ath10k: fix invalid STS_CAP_OFFSET_MASK
	tools/usbip: fixes build with musl libc toolchain
	spi: sun6i: disable/unprepare clocks on remove
	bnxt_en: Don't print "Link speed -1 no longer supported" messages.
	scsi: core: scsi_get_device_flags_keyed(): Always return device flags
	scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
	scsi: dh: add new rdac devices
	media: vsp1: Prevent suspending and resuming DRM pipelines
	media: cpia2: Fix a couple off by one bugs
	veth: set peer GSO values
	drm/amdkfd: Fix memory leaks in kfd topology
	powerpc/modules: Don't try to restore r2 after a sibling call
	agp/intel: Flush all chipset writes after updating the GGTT
	mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
	mac80211: remove BUG() when interface type is invalid
	ASoC: nuc900: Fix a loop timeout test
	ipvlan: add L2 check for packets arriving via virtual devices
	rcutorture/configinit: Fix build directory error message
	locking/locktorture: Fix num reader/writer corner cases
	ima: relax requiring a file signature for new files with zero length
	net: hns: Some checkpatch.pl script & warning fixes
	x86/boot/32: Fix UP boot on Quark and possibly other platforms
	x86/cpufeatures: Add Intel PCONFIG cpufeature
	selftests/x86/entry_from_vm86: Exit with 1 if we fail
	selftests/x86: Add tests for User-Mode Instruction Prevention
	selftests/x86: Add tests for the STR and SLDT instructions
	selftests/x86/entry_from_vm86: Add test cases for POPF
	x86/vm86/32: Fix POPF emulation
	x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
	x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
	x86/mm: Fix vmalloc_fault to use pXd_large
	parisc: Handle case where flush_cache_range is called with no context
	ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
	ALSA: hda - Revert power_save option default value
	ALSA: seq: Fix possible UAF in snd_seq_check_queue()
	ALSA: seq: Clear client entry before deleting else at closing
	drm/amdgpu: fix prime teardown order
	drm/amdgpu/dce: Don't turn off DP sink when disconnected
	fs: Teach path_connected to handle nfs filesystems with multiple roots.
	lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
	fs/aio: Add explicit RCU grace period when freeing kioctx
	fs/aio: Use RCU accessors for kioctx_table->table[]
	irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
	scsi: sg: fix SG_DXFER_FROM_DEV transfers
	scsi: sg: fix static checker warning in sg_is_valid_dxfer
	scsi: sg: only check for dxfer_len greater than 256M
	btrfs: alloc_chunk: fix DUP stripe size handling
	btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
	scsi: qla2xxx: Fix extraneous ref on sp's after adapter break
	USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
	usb: dwc3: Fix GDBGFIFOSPACE_TYPE values
	usb: gadget: bdc: 64-bit pointer capability check
	Linux 4.9.89

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-22 09:54:47 +01:00
Gao Feng
694f219a60 tcp: sysctl: Fix a race to avoid unexpected 0 window from space
[ Upstream commit c48367427a39ea0b85c7cf018fe4256627abfd9e ]

Because sysctl_tcp_adv_win_scale could be changed any time, so there
is one race in tcp_win_from_space.
For example,
1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)

As a result, tcp_win_from_space returns 0. It is unexpected.

Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
register firstly, then use the register directly, it would be ok.
But we could not depend on the compiler behavior.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22 09:17:43 +01:00
Greg Kroah-Hartman
9e5dd8ed9b This is the 4.9.74 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlpL3vYACgkQONu9yGCS
 aT4/aQ//VcOiG5at4QV8aEyPkmlL69jspa5yWwbbz0p3vddZrGHb72aT+lfcPgOk
 tZpHR4zKOwPPu6ROgVMnyGQTks/I5ZwCqxVjuapgiXJy34QIh6JLjldKl03Bfoz8
 2Q+5u1eia1R+2pLhEQLPp5siH3pbgqjMiC/jr2UtleC1QKiwOgmFApoXsl35OxUW
 VkTjdqTcllxa5cFmEgb53xzH/zm0XemVe6xNH4Y+KMUmow/GcynPdxjZxkBpgl4t
 HEhPR1UP708JF+LHv4FA35HujtxtK9y1UVpZmroW4Y6tW/lwwYAgvIWOC0EYv1i0
 Uin5NfvG2BVkSmct19qn7IKfuVffCRb+dxvKP9I1wembqZSo68QC8rjs/dYbw+VU
 SOGZM/Nd4m3yseM1QjQHc97GSvxtDwqzRBFp5c43HXrmn1ha8By9kqWa25JsiwJo
 GHWmDzTWw9gW7Jp/1EVY7VGO3FNSJHy87ZIHoEnAXxzCtf7BZL2Z+myQhhPIiCXg
 9jtcGkhvFVkvwjJQAAxPRr8kcNQaGZbTqh/ZhYzahl/HNAX6Ez/af30wJkmand2V
 geps+QgWIzy6dgIiWr9YnJgqRbMeaHE8Ncn/Ch/8Lp30tkFnpUEOrdqFOscDtVgk
 yzKFl57pnoxVdGjqejiF10v904sG1uHU8h87jtmnhxpgUz/TTLo=
 =RrJV
 -----END PGP SIGNATURE-----

Merge 4.9.74 into android-4.9

Changes in 4.9.74
	sync objtool's copy of x86-opcode-map.txt
	tracing: Remove extra zeroing out of the ring buffer page
	tracing: Fix possible double free on failure of allocating trace buffer
	tracing: Fix crash when it fails to alloc ring buffer
	ring-buffer: Mask out the info bits when returning buffer page length
	iw_cxgb4: Only validate the MSN for successful completions
	ASoC: wm_adsp: Fix validation of firmware and coeff lengths
	ASoC: da7218: fix fix child-node lookup
	ASoC: fsl_ssi: AC'97 ops need regmap, clock and cleaning up on failure
	ASoC: twl4030: fix child-node lookup
	ASoC: tlv320aic31xx: Fix GPIO1 register definition
	ALSA: hda: Drop useless WARN_ON()
	ALSA: hda - fix headset mic detection issue on a Dell machine
	x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
	x86/mm: Remove flush_tlb() and flush_tlb_current_task()
	x86/mm: Make flush_tlb_mm_range() more predictable
	x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
	x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
	x86/mm: Disable PCID on 32-bit kernels
	x86/mm: Add the 'nopcid' boot option to turn off PCID
	x86/mm: Enable CR4.PCIDE on supported systems
	x86/mm/64: Fix reboot interaction with CR4.PCIDE
	kbuild: add '-fno-stack-check' to kernel build options
	ipv4: igmp: guard against silly MTU values
	ipv6: mcast: better catch silly mtu values
	net: fec: unmap the xmit buffer that are not transferred by DMA
	net: igmp: Use correct source address on IGMPv3 reports
	netlink: Add netns check on taps
	net: qmi_wwan: add Sierra EM7565 1199:9091
	net: reevalulate autoflowlabel setting after sysctl setting
	ptr_ring: add barriers
	RDS: Check cmsg_len before dereferencing CMSG_DATA
	tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
	tcp md5sig: Use skb's saddr when replying to an incoming segment
	tg3: Fix rx hang on MTU change with 5717/5719
	net: ipv4: fix for a race condition in raw_sendmsg
	net: mvmdio: disable/unprepare clocks in EPROBE_DEFER case
	sctp: Replace use of sockets_allocated with specified macro.
	adding missing rcu_read_unlock in ipxip6_rcv
	ipv4: Fix use-after-free when flushing FIB tables
	net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks
	net: fec: Allow reception of frames bigger than 1522 bytes
	net: Fix double free and memory corruption in get_net_ns_by_id()
	net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround
	sock: free skb in skb_complete_tx_timestamp on error
	tcp: invalidate rate samples during SACK reneging
	net/mlx5: Fix rate limit packet pacing naming and struct
	net/mlx5e: Fix features check of IPv6 traffic
	net/mlx5e: Fix possible deadlock of VXLAN lock
	net/mlx5e: Add refcount to VXLAN structure
	net/mlx5e: Prevent possible races in VXLAN control flow
	net/mlx5: Fix error flow in CREATE_QP command
	s390/qeth: apply takeover changes when mode is toggled
	s390/qeth: don't apply takeover changes to RXIP
	s390/qeth: lock IP table while applying takeover changes
	s390/qeth: update takeover IPs after configuration change
	usbip: fix usbip bind writing random string after command in match_busid
	usbip: prevent leaking socket pointer address in messages
	usbip: stub: stop printing kernel pointer addresses in messages
	usbip: vhci: stop printing kernel pointer addresses in messages
	USB: serial: ftdi_sio: add id for Airbus DS P8GR
	USB: serial: qcserial: add Sierra Wireless EM7565
	USB: serial: option: add support for Telit ME910 PID 0x1101
	USB: serial: option: adding support for YUGA CLM920-NC5
	usb: Add device quirk for Logitech HD Pro Webcam C925e
	usb: add RESET_RESUME for ELSA MicroLink 56K
	USB: Fix off by one in type-specific length check of BOS SSP capability
	usb: xhci: Add XHCI_TRUST_TX_LENGTH for Renesas uPD720201
	timers: Use deferrable base independent of base::nohz_active
	timers: Invoke timer_start_debug() where it makes sense
	timers: Reinitialize per cpu bases on hotplug
	nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
	x86/smpboot: Remove stale TLB flush invocations
	n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)
	tty: fix tty_ldisc_receive_buf() documentation
	mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP
	Linux 4.9.74

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-01-02 20:45:15 +01:00
Yousuk Seung
e74fe7268e tcp: invalidate rate samples during SACK reneging
[ Upstream commit d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292 ]

Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.

< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000

In above example the rate sample taken after the last ack will count
7001-38001 as delivered while the actual delivery rate likely could
be much lower i.e. 7001-8001.

This patch adds a new field tcp_sock.sack_reneg and marks it when we
declare SACK reneging and entering TCP_CA_Loss, and unmarks it after
the last rate sample was taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.

Fixes: b9f64820fb ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02 20:35:13 +01:00
Greg Kroah-Hartman
44a3afcce1 This is the 4.9.63 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloQCeEACgkQONu9yGCS
 aT5NSg/+KKaM27NOw+QU41S27e7EEk2ToFZVInD4YMVM37WDP3Dhy/6qGKqd7QEd
 pYjcxdXhVi+vIyozY/QjXGNhTTOao5AtgGTdw1l2lag2VritbAqplgr0hPRLoj4M
 9BEYveO2u+ooNJ6vieyW7TIVqGh05X4F43/Ng1I3iAbmvMcyg8LcqauYMaNa37jj
 PP9XWWbZ87GCLqNM3Cy/V7uR/xFjj7N7/N6//547QRTgqnB31EytUXEwxtvgS7Z8
 HxhVYk7gTzZMgpN6TUo0AKnD9iOxzR18kC0PooUz2nphS92Zad2rakhxUVJASUXv
 DpY5LSyiN/F6fVp68ObAx8Cw31Uavjyvy/TJju1Kg9Mrt1fN/MBsEH0HSI7PrxyQ
 7Q2Se+A8LZqeYW4P1AvHjei7Z10AL64YcXwrsAkeouh74WWKrVoEeoYVYDF+FdRy
 87jNJE6+W589g+hLI0fX1Q07luEfToRfvZQTk0pdxLTxt5HrCgSoX6q6yOQ4ofMn
 mTfRmNSaeiEaNDgl/f9ZqH3ViOFsINJ+0zgCMmFv4p8yyl2grj63ELdHKkjqTHCN
 oPH3ZCeCFV+uvunA8geHLzToMDfZOsPQ4BbewV3TEFG4rSVxXQcoTJWIYaxHYaih
 5X/JIgFZyaDnUFJQyEtcj8MyVnz1oraw73ghdOMuvHxkYXe+Xr4=
 =DcCO
 -----END PGP SIGNATURE-----

Merge 4.9.63 into android-4.9

Changes in 4.9.63
	gso: fix payload length when gso_size is zero
	tun/tap: sanitize TUNSETSNDBUF input
	ipv6: addrconf: increment ifp refcount before ipv6_del_addr()
	netlink: do not set cb_running if dump's start() errs
	net: call cgroup_sk_alloc() earlier in sk_clone_lock()
	tcp: fix tcp_mtu_probe() vs highest_sack
	l2tp: check ps->sock before running pppol2tp_session_ioctl()
	tun: call dev_get_valid_name() before register_netdevice()
	sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
	tcp/dccp: fix ireq->opt races
	packet: avoid panic in packet_getsockopt()
	soreuseport: fix initialization race
	ipv6: flowlabel: do not leave opt->tot_len with garbage
	sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
	tcp/dccp: fix lockdep splat in inet_csk_route_req()
	tcp/dccp: fix other lockdep splats accessing ireq_opt
	net/unix: don't show information about sockets from other namespaces
	tap: double-free in error path in tap_open()
	ipip: only increase err_count for some certain type icmp in ipip_err
	ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
	ip6_gre: update dst pmtu if dev mtu has been updated by toobig in __gre6_xmit
	tun: allow positive return values on dev_get_valid_name() call
	sctp: reset owner sk for data chunks on out queues when migrating a sock
	net_sched: avoid matching qdisc with zero handle
	ppp: fix race in ppp device destruction
	mac80211: accept key reinstall without changing anything
	mac80211: use constant time comparison with keys
	mac80211: don't compare TKIP TX MIC key in reinstall prevention
	usb: usbtest: fix NULL pointer dereference
	Input: ims-psu - check if CDC union descriptor is sane
	ALSA: seq: Cancel pending autoload work at unbinding device
	Revert "ARM: dts: imx53-qsb-common: fix FEC pinmux config"
	netfilter: nat: avoid use of nf_conn_nat extension
	netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
	security/keys: add CONFIG_KEYS_COMPAT to Kconfig
	brcmfmac: remove setting IBSS mode when stopping AP
	target/iscsi: Fix iSCSI task reassignment handling
	qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
	misc: panel: properly restore atomic counter on error path
	Linux 4.9.63

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-11-18 17:25:57 +01:00
Eric Dumazet
e12c42c552 tcp: fix tcp_mtu_probe() vs highest_sack
[ Upstream commit 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d ]

Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.

If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.

Bad things would then happen, including infinite loops.

This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.

Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.

Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-18 11:22:21 +01:00
Wei Wang
d55e63001f BACKPORT: net/tcp-fastopen: Add new API support
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences are not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created w/o changing other socket calls sequence:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.

  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.

  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.

  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.

The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.

Backport of upstream commit 19f6d3f3c842 ("net/tcp-fastopen: Add new API
support")

Bug: 63449462
Test: Tests in https://android-review.googlesource.com/535357/ pass
Change-Id: Icc181febd74e3117c2fc835d7ed935e107b5815e
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 19f6d3f3c8422d65b5e3d2162e30ef07c6e21ea2)
2017-11-13 15:18:49 +08:00
Wei Wang
b9fde3e805 UPSTREAM: net/tcp-fastopen: refactor cookie check logic
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called else where in later changes.

Bug: 63449462
Test: Tests in https://android-review.googlesource.com/535357/ pass
Change-Id: I14b0fadd8f97569f773a2e2f15f0b4e8dca48402
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 065263f40f0972d5f1cd294bb0242bd5aa5f06b2)
2017-11-13 15:17:40 +08:00
JP Abgrall
7220b97d37 ANDROID: tcp: add a sysctl to config the tcp_default_init_rwnd
The default initial rwnd is hardcoded to 10.

Now we allow it to be controlled via
  /proc/sys/net/ipv4/tcp_default_init_rwnd
which limits the values from 3 to 100

This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.

See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf

Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>

Conflicts:
	net/ipv4/sysctl_net_ipv4.c
	net/ipv4/tcp_input.c
2017-01-19 13:32:34 -08:00
Eric Dumazet
ac6e780070 tcp: take care of truncations done by sk_filter()
With syzkaller help, Marco Grassi found a bug in TCP stack,
crashing in tcp_collapse()

Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.

We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq

Many thanks to syzkaller team and Marco for giving us a reproducer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-13 12:30:02 -05:00
David Ahern
da96786e26 net: tcp: check skb is non-NULL for exact match on lookups
Andrey reported the following error report while running the syzkaller
fuzzer:

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800398c4480 task.stack: ffff88003b468000
RIP: 0010:[<ffffffff83091106>]  [<     inline     >]
inet_exact_dif_match include/net/tcp.h:808
RIP: 0010:[<ffffffff83091106>]  [<ffffffff83091106>]
__inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
RSP: 0018:ffff88003b46f270  EFLAGS: 00010202
RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
FS:  00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
Stack:
 0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
 424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
 ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
Call Trace:
 [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
 [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
net/ipv4/ip_input.c:216
...

MD5 has a code path that calls __inet_lookup_listener with a null skb,
so inet{6}_exact_dif_match needs to check skb against null before pulling
the flag.

Fixes: a04a480d43 ("net: Require exact match for TCP socket lookups if
       dif is l3mdev")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:05:44 -04:00
David Ahern
a04a480d43 net: Require exact match for TCP socket lookups if dif is l3mdev
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (ie., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.

skb's through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.

The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.

Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-17 10:17:05 -04:00
Yuchung Cheng
c0402760f5 tcp: new CC hook to set sending rate with rate_sample in any CA state
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Yuchung Cheng
77bfc174c3 tcp: allow congestion control to expand send buffer differently
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
Neal Cardwell
1b3878ca15 tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:

1) Export tcp_tso_autosize() so that CC modules can use it.

2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Neal Cardwell
ed6e7268b9 tcp: allow congestion control module to request TSO skb segment count
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.

This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Soheil Hassas Yeganeh
d7722e8570 tcp: track application-limited rate samples
This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.

Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.

This logic marks the flow app-limited on a write if *all* of the
following are true:

  1) There is less than 1 MSS of unsent data in the write queue
     available to transmit.

  2) There is no packet in the sender's queues (e.g. in fq or the NIC
     tx queue).

  3) The connection is not limited by cwnd.

  4) There are no lost packets to retransmit.

The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.

When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.

The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.

We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Yuchung Cheng
b9f64820fb tcp: track data delivery rate for a TCP connection
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
	       delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK

  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and and t1 are
times at which packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:

    bw = delivered / ack_phase_rtt

to the following:

    bw = delivered / max(send_phase_rtt, ack_phase_rtt)

In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:00 -04:00
Neal Cardwell
6403389211 tcp: use windowed min filter library for TCP min_rtt estimation
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:22:59 -04:00
Yaogong Wang
9f5afeae51 tcp: use an RB tree for ooo receive queue
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-08 17:25:58 -07:00
David S. Miller
6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Eric Dumazet
c9c3321257 tcp: add tcp_add_backlog()
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.

While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.

Cause of the drops is the skb->truesize overestimation caused by :

- drivers allocating ~2048 (or more) bytes as a fragment to hold an
  Ethernet frame.

- various pskb_may_pull() calls bringing the headers into skb->head
  might have pulled all the frame content, but skb->truesize could
  not be lowered, as the stack has no idea of each fragment truesize.

The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.

Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:20:24 -04:00
Tom Herbert
3203558589 tcp: Set read_sock and peek_len proto_ops
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00