Add new proto_ops sendmsg_locked and sendpage_locked that can be
called when the socket lock is already held. Correspondingly, add
kernel_sendmsg_locked and kernel_sendpage_locked as front end
functions.
These functions will be used in zero proxy so that we can take
the socket lock in a ULP sendmsg/sendpage and then directly call the
backend transport proto_ops functions.
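The locked/unlocked split described above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `toy_sock`, `toy_sendmsg`, `toy_sendmsg_locked` and `toy_ulp_sendmsg` are hypothetical names, and the integer `locked` flag merely stands in for the real socket lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket; `locked` models the socket lock. */
struct toy_sock {
	int locked;
	size_t bytes_queued;
};

static void toy_lock(struct toy_sock *sk)
{
	assert(!sk->locked);	/* taking the lock twice would deadlock */
	sk->locked = 1;
}

static void toy_unlock(struct toy_sock *sk)
{
	assert(sk->locked);
	sk->locked = 0;
}

/* Locked variant: the caller must already hold the lock. */
size_t toy_sendmsg_locked(struct toy_sock *sk, size_t len)
{
	assert(sk->locked);
	sk->bytes_queued += len;
	return len;
}

/* Regular variant: takes the lock itself, then reuses the locked variant. */
size_t toy_sendmsg(struct toy_sock *sk, size_t len)
{
	size_t ret;

	toy_lock(sk);
	ret = toy_sendmsg_locked(sk, len);
	toy_unlock(sk);
	return ret;
}

/* ULP-style caller: holds the lock across its own work and the send. */
size_t toy_ulp_sendmsg(struct toy_sock *sk, size_t len)
{
	size_t ret;

	toy_lock(sk);
	/* ... ULP bookkeeping would happen here, under the lock ... */
	ret = toy_sendmsg_locked(sk, len);
	toy_unlock(sk);
	return ret;
}
```

Calling `toy_sendmsg()` from inside `toy_ulp_sendmsg()` would trip the assertion in `toy_lock()`, which is the situation the locked variants are meant to avoid.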
Change-Id: I4a8a6f5234486946ec2870ae22fa8ea561df3af0
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connection parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.
Although there are already three mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.
Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
only once during the connection's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code. For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result, it has an "op" field to specify the
type of operation requested.
The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.
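A minimal user-space sketch of the datacenter check described above. Everything specific here is an assumption made for illustration: the idea that the first two 32-bit words of the address identify the datacenter, the 10 ms value, and the names `same_datacenter` and `pick_syn_rto_ms`. A real program would read the addresses from the context struct passed to it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: the first DC_PREFIX_WORDS 32-bit words name the datacenter. */
#define DC_PREFIX_WORDS		2
/* Hypothetical SYN RTO for intra-datacenter connections, in milliseconds. */
#define INTRA_DC_SYN_RTO_MS	10

/*
 * Returns 1 when both addresses share the datacenter prefix, else 0.
 * A pure equality test works on network-byte-order words unchanged.
 */
int same_datacenter(const uint32_t remote_ip6[4], const uint32_t local_ip6[4])
{
	int i;

	assert(remote_ip6 && local_ip6);
	for (i = 0; i < DC_PREFIX_WORDS; i++) {
		if (remote_ip6[i] != local_ip6[i])
			return 0;
	}
	return 1;
}

/* Returns the SYN RTO to use, or -1 to keep the system default. */
int pick_syn_rto_ms(const uint32_t remote_ip6[4], const uint32_t local_ip6[4])
{
	if (same_datacenter(remote_ip6, local_ip6))
		return INTRA_DC_SYN_RTO_MS;
	return -1;
}
```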
This patch only contains the framework to support the new BPF program
type; following patches add the functionality to set various connection
parameters.
This patch defines a new BPF program type: BPF_PROG_TYPE_SOCK_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.
Two new corresponding structs (one for the kernel, one for the user/BPF
program):
/* kernel version */
struct bpf_sock_ops_kern {
	struct sock *sk;
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
};
/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
	__u32 op;
	union {
		__u32 reply;
		__u32 replylong[4];
	};
	__u32 family;
	__u32 remote_ip4;	/* In network byte order */
	__u32 local_ip4;	/* In network byte order */
	__u32 remote_ip6[4];	/* In network byte order */
	__u32 local_ip6[4];	/* In network byte order */
	__u32 remote_port;	/* In network byte order */
	__u32 local_port;	/* In host byte order */
};
Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.
The reply fields of the bpf_sock_ops struct are there in case a bpf
program needs to return a value larger than an integer.
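The two kinds of ops can be sketched in plain C as follows. This is a user-space illustration only: the op codes, the cut-down `toy_sock_ops` context and the functions `toy_prog` and `toy_call` are hypothetical, since this patch provides just the framework and does not define concrete ops.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical op codes; the patch itself only adds the framework. */
enum toy_op {
	TOY_OP_GET_SYN_RTO = 1,		/* first type: return value is used */
	TOY_OP_CONN_ESTABLISHED = 2,	/* second type: return value ignored */
};

/* Cut-down, hypothetical view of the context handed to the program. */
struct toy_sock_ops {
	uint32_t op;
	uint32_t remote_port;
	uint32_t sndbuf;	/* state that a second-type op may change */
};

/* Stand-in for the BPF program: dispatches on the requested op. */
int toy_prog(struct toy_sock_ops *ctx)
{
	assert(ctx);
	switch (ctx->op) {
	case TOY_OP_GET_SYN_RTO:
		return 10;		/* value consumed by the caller */
	case TOY_OP_CONN_ESTABLISHED:
		ctx->sndbuf = 1u << 20;	/* state change instead of a reply */
		return 0;
	default:
		return -1;		/* operation not supported */
	}
}

/* Caller side for a first-type op: fall back when the reply is negative. */
int toy_call(struct toy_sock_ops *ctx, uint32_t op, int fallback)
{
	int ret;

	ctx->op = op;
	ret = toy_prog(ctx);
	return ret < 0 ? fallback : ret;
}
```

For a second-type op the caller would invoke `toy_prog()` directly and discard the result, relying only on the side effect (here, the changed `sndbuf`).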
[Nguyễn Long: Update to match "UPSTREAM: bpf: multi program support
for cgroup+bpf"]
Change-Id: Ifbbde466367bc68333cc08f57601550944f264ec
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 4.9.207 into android-4.9-q
Changes in 4.9.207
arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
usb: gadget: u_serial: add missing port entry locking
tty: serial: fsl_lpuart: use the sg count from dma_map_sg
tty: serial: msm_serial: Fix flow control
serial: pl011: Fix DMA ->flush_buffer()
serial: serial_core: Perform NULL checks for break_ctl ops
serial: ifx6x60: add missed pm_runtime_disable
autofs: fix a leak in autofs_expire_indirect()
RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
exportfs_decode_fh(): negative pinned may become positive without the parent locked
audit_get_nd(): don't unlock parent too early
NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
Input: cyttsp4_core - fix use after free bug
ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
rsxx: add missed destroy_workqueue calls in remove
net: ep93xx_eth: fix mismatch of request_mem_region in remove
serial: core: Allow processing sysrq at port unlock time
cxgb4vf: fix memleak in mac_hlist initialization
iwlwifi: mvm: Send non offchannel traffic via AP sta
ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
net/mlx5: Release resource on error flow
extcon: max8997: Fix lack of path setting in USB device mode
clk: rockchip: fix rk3188 sclk_smc gate data
clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
dlm: fix missing idr_destroy for recover_idr
MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
scsi: zfcp: drop default switch case which might paper over missing case
pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
Staging: iio: adt7316: Fix i2c data reading, set the data field
regulator: Fix return value of _set_load() stub
MIPS: OCTEON: octeon-platform: fix typing
math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
rtc: dt-binding: abx80x: fix resistance scale
ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
media: pulse8-cec: return 0 when invalidating the logical address
dmaengine: coh901318: Fix a double-lock bug
dmaengine: coh901318: Remove unused variable
usb: dwc3: don't log probe deferrals; but do log other error codes
ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
dma-mapping: fix return type of dma_set_max_seg_size()
altera-stapl: check for a null key before strcasecmp'ing it
serial: imx: fix error handling in console_setup
i2c: imx: don't print error message on probe defer
dlm: NULL check before kmem_cache_destroy is not needed
ARM: debug: enable UART1 for socfpga Cyclone5
nfsd: fix a warning in __cld_pipe_upcall()
ARM: OMAP1/2: fix SoC name printing
net/x25: fix called/calling length calculation in x25_parse_address_block
net/x25: fix null_x25_address handling
ARM: dts: mmp2: fix the gpio interrupt cell number
ARM: dts: realview-pbx: Fix duplicate regulator nodes
tcp: fix off-by-one bug on aborting window-probing socket
tcp: fix SNMP TCP timeout under-estimation
modpost: skip ELF local symbols during section mismatch check
kbuild: fix single target build for external module
mtd: fix mtd_oobavail() incoherent returned value
ARM: dts: pxa: clean up USB controller nodes
clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
ARM: dts: realview: Fix some more duplicate regulator nodes
dlm: fix invalid cluster name warning
net/mlx4_core: Fix return codes of unsupported operations
powerpc/math-emu: Update macros from GCC
MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
nfsd: Return EPERM, not EACCES, in some SETATTR cases
tty: Don't block on IO when ldisc change is pending
media: stkwebcam: Bugfix for wrong return values
mlx4: Use snprintf instead of complicated strcpy
ARM: dts: sunxi: Fix PMU compatible strings
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
fuse: verify nlink
fuse: verify attributes
ALSA: pcm: oss: Avoid potential buffer overflows
Input: goodix - add upside-down quirk for Teclast X89 tablet
coresight: etm4x: Fix input validation for sysfs.
x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
CIFS: Fix SMB2 oplock break processing
tty: vt: keyboard: reject invalid keycodes
can: slcan: Fix use-after-free Read in slcan_open
jbd2: Fix possible overflow in jbd2_log_space_left()
drm/i810: Prevent underflow in ioctl
KVM: x86: do not modify masked bits of shared MSRs
KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
crypto: ccp - fix uninitialized list head
crypto: ecdh - fix big endian bug in ECC library
crypto: user - fix memory leak in crypto_report
spi: atmel: Fix CS high support
RDMA/qib: Validate ->show()/store() callbacks before calling them
thermal: Fix deadlock in thermal thermal_zone_device_check
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
appletalk: Fix potential NULL pointer dereference in unregister_snap_client
appletalk: Set error code if register_snap_client failed
usb: gadget: configfs: Fix missing spin_lock_init()
USB: uas: honor flag to avoid CAPACITY16
USB: uas: heed CAPACITY_HEURISTICS
usb: Allow USB device to be warm reset in suspended state
staging: rtl8188eu: fix interface sanity check
staging: rtl8712: fix interface sanity check
staging: gigaset: fix general protection fault on probe
staging: gigaset: fix illegal free on probe errors
staging: gigaset: add endpoint-type sanity check
xhci: Increase STS_HALT timeout in xhci_suspend()
ARM: dts: pandora-common: define wl1251 as child node of mmc3
iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
USB: atm: ueagle-atm: add missing endpoint check
USB: idmouse: fix interface sanity checks
USB: serial: io_edgeport: fix epic endpoint lookup
USB: adutux: fix interface sanity check
usb: core: urb: fix URB structure initialization function
usb: mon: Fix a deadlock in usbmon between mmap and read
mtd: spear_smi: Fix Write Burst mode
virtio-balloon: fix managed page counts when migrating pages between zones
btrfs: check page->mapping when loading free space cache
btrfs: Remove btrfs_bio::flags member
Btrfs: send, skip backreference walking for extents with many references
btrfs: record all roots for rename exchange on a subvol
rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
rtlwifi: rtl8192de: Fix missing enable interrupt flag
lib: raid6: fix awk build warnings
ALSA: hda - Fix pending unsol events at shutdown
workqueue: Fix spurious sanity check failures in destroy_workqueue()
workqueue: Fix pwq ref leak in rescuer_thread()
ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
blk-mq: avoid sysfs buffer overflow with too many CPU cores
cgroup: pids: use atomic64_t for pids->limit
ar5523: check NULL before memcpy() in ar5523_cmd()
media: bdisp: fix memleak on release
media: radio: wl1273: fix interrupt masking on release
cpuidle: Do not unset the driver if it is there already
PM / devfreq: Lock devfreq in trans_stat_show
ACPI: OSL: only free map once in osl.c
ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
ACPI: PM: Avoid attaching ACPI PM domain to certain devices
pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
pinctrl: samsung: Fix device node refcount leaks in init code
mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
ppdev: fix PPGETTIME/PPSETTIME ioctls
powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
video/hdmi: Fix AVI bar unpack
quota: Check that quota is not dirty before release
ext2: check err when partial != NULL
quota: fix livelock in dquot_writeback_dquots
scsi: zfcp: trace channel log even for FCP command responses
usb: xhci: only set D3hot for pci device
xhci: Fix memory leak in xhci_add_in_port()
xhci: make sure interrupts are restored to correct state
iio: adis16480: Add debugfs_reg_access entry
Btrfs: fix negative subv_writers counter and data space leak after buffered write
omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
scsi: lpfc: Cap NPIV vports to 256
e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
ath10k: fix fw crash by moving chip reset after napi disabled
ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
scsi: qla2xxx: Fix DMA unmap leak
scsi: qla2xxx: Fix session lookup in qlt_abort_work()
scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
powerpc: Fix vDSO clock_getres()
reiserfs: fix extended attributes on the root directory
firmware: qcom: scm: Ensure 'a0' status code is treated as signed
mm/shmem.c: cast the type of unmap_start to u64
ext4: fix a bug in ext4_wait_for_tail_page_commit
blk-mq: make sure that line break can be printed
workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
sunrpc: fix crash when cache_head become valid before update
net/mlx5e: Fix SFF 8472 eeprom length
kernel/module.c: wakeup processes in module_wq on module unload
nvme: host: core: fix precedence of ternary operator
net: bridge: deny dev_set_mac_address() when unregistering
net: ethernet: ti: cpsw: fix extra rx interrupt
openvswitch: support asymmetric conntrack
tcp: md5: fix potential overestimation of TCP option space
tipc: fix ordering of tipc module init and exit routine
inet: protect against too small mtu values.
tcp: fix rejected syncookies due to stale timestamps
tcp: tighten acceptance of ACKs not matching a child socket
tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
Revert "regulator: Defer init completion for a while after late_initcall"
PCI: Fix Intel ACS quirk UPDCR register address
PCI/MSI: Fix incorrect MSI-X masking on resume
xtensa: fix TLB sanity checker
CIFS: Respect O_SYNC and O_DIRECT flags during reconnect
ARM: dts: s3c64xx: Fix init order of clock providers
ARM: tegra: Fix FLOW_CTLR_HALT register clobbering by tegra_resume()
vfio/pci: call irq_bypass_unregister_producer() before freeing irq
dma-buf: Fix memory leak in sync_file_merge()
dm btree: increase rebalance threshold in __rebalance2()
scsi: iscsi: Fix a potential deadlock in the timeout handler
drm/radeon: fix r1xx/r2xx register checker for POT textures
xhci: fix USB3 device initiated resume race with roothub autosuspend
net: stmmac: use correct DMA buffer size in the RX descriptor
net: stmmac: don't stop NAPI processing when dropping a packet
Linux 4.9.207
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 721c8dafad26ccfa90ff659ee19755e3377b829d ]
Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.
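As an illustration of the access pattern, here is a user-space sketch. The two macros are simplified stand-ins for the kernel's READ_ONCE() and WRITE_ONCE() (they rely on the GCC/Clang `__typeof__` extension), and `toy_rx_opt` with its two accessors is a hypothetical miniature of the real field, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified user-space stand-ins for the kernel macros. The volatile
 * access makes the compiler emit the load or store as written, so it
 * cannot merge, repeat or cache the access.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical miniature of the field that holds the synflood timestamp. */
struct toy_rx_opt {
	uint32_t ts_recent_stamp;
};

/* Writer side: record the time of the latest synflood. */
void toy_record_overflow(struct toy_rx_opt *opt, uint32_t now)
{
	assert(opt);
	WRITE_ONCE(opt->ts_recent_stamp, now);
}

/* Reader side: take one snapshot; callers then use only that snapshot. */
uint32_t toy_last_overflow(const struct toy_rx_opt *opt)
{
	assert(opt);
	return READ_ONCE(opt->ts_recent_stamp);
}
```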
Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cb44a08f8647fd2e8db5cc9ac27cd8355fa392d8 ]
When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.
That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
Spurious detection of recent overflows leads to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet should be dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.
Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' ends up in the future from the point of view of
time_after32().
However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.
In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.
Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.
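The interval test can be tried out in user space. The sketch below is not the kernel code: the comparison macros are written in the style of the kernel's 32-bit jiffies helpers, HZ=1000 is assumed, TCP_SYNCOOKIE_VALID is taken as 120 * HZ (matching the 120000 figure used elsewhere in this log for HZ=1000), and the `old_`/`new_` function names are made up for the comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick rate and validity window for an HZ=1000 system. */
#define HZ			1000u
#define TCP_SYNCOOKIE_VALID	(120u * HZ)

/* 32-bit wrap-safe comparisons, written like the kernel's jiffies helpers. */
#define time_after32(a, b)	((int32_t)((uint32_t)(b) - (uint32_t)(a)) < 0)
#define time_between32(t, l, h)	\
	((uint32_t)(h) - (uint32_t)(l) >= (uint32_t)(t) - (uint32_t)(l))

/* Before: a very old timestamp can look like it lies in the future. */
int old_no_recent_overflow(uint32_t now, uint32_t last_overflow)
{
	return time_after32(now, last_overflow + TCP_SYNCOOKIE_VALID);
}

/* After: only [last_overflow - HZ, last_overflow + VALID] counts as recent. */
int new_no_recent_overflow(uint32_t now, uint32_t last_overflow)
{
	return !time_between32(now, last_overflow - HZ,
			       last_overflow + TCP_SYNCOOKIE_VALID);
}

void syncookie_window_examples(void)
{
	/* Stale timestamp 0, jiffies more than 2^31 past the window. */
	uint32_t stale_now = 2147483648u + 200000u;

	assert(old_no_recent_overflow(stale_now, 0) == 0);	/* spurious */
	assert(new_no_recent_overflow(stale_now, 0) == 1);

	/* Timestamp bumped to 5010 just after jiffies was read as 5000. */
	assert(new_no_recent_overflow(5000, 5010) == 0);
}
```

The last assertion is the concurrent-update case: without the extra HZ of slack on the lower bound, `now` would fall just outside the interval and a valid syncookie could be dropped.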
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 04d26e7b159a396372646a480f4caa166d1b6720 ]
If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.
Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.
For example, let's consider the following scenario on a system
with HZ=1000:
* The synflood timestamp is 0, either because that's the timestamp
of the last synflood or, more commonly, because we're working with
a freshly created socket.
* We receive a new SYN, which triggers synflood protection. Let's say
that this happens when jiffies == 2147484649 (that is,
'synflood timestamp' + HZ + 2^31 + 1).
* Then tcp_synq_overflow() doesn't update the synflood timestamp,
because time_after32(2147484649, 1000) returns false.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 1000: the value of 'last_overflow' + HZ.
* A bit later, we receive the ACK completing the 3WHS. But
cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
says that we're not under synflood. That's because
time_after32(2147484649, 120000) returns true.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
Of course, in reality jiffies would have increased a bit, but this
condition will last for the next 119 seconds, which is long enough
to accommodate jiffies' growth.
Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.
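The change in the update condition can be checked with the numbers from the scenario above. This is a user-space sketch, not the kernel function: HZ=1000 is assumed as in the example, the macros are written in the style of the kernel's 32-bit jiffies helpers, and `old_should_update`/`new_should_update` are hypothetical names for the two versions of the test.

```c
#include <assert.h>
#include <stdint.h>

#define HZ			1000u	/* assumed tick rate, as in the example */

/* 32-bit wrap-safe comparisons, written like the kernel's jiffies helpers. */
#define time_after32(a, b)	((int32_t)((uint32_t)(b) - (uint32_t)(a)) < 0)
#define time_between32(t, l, h)	\
	((uint32_t)(h) - (uint32_t)(l) >= (uint32_t)(t) - (uint32_t)(l))

/* Before: refresh only if 'now' compares as later than last_overflow + HZ. */
int old_should_update(uint32_t now, uint32_t last_overflow)
{
	return time_after32(now, last_overflow + HZ);
}

/* After: refresh whenever 'now' is outside [last_overflow, last_overflow + HZ]. */
int new_should_update(uint32_t now, uint32_t last_overflow)
{
	return !time_between32(now, last_overflow, last_overflow + HZ);
}

void synflood_update_examples(void)
{
	uint32_t now = 2147484649u;	/* 0 + HZ + 2^31 + 1 */

	assert(old_should_update(now, 0) == 0);	/* stale stamp is kept */
	assert(new_should_update(now, 0) == 1);	/* stale stamp is refreshed */

	/* Within the same second neither version rewrites the stamp. */
	assert(old_should_update(500, 0) == 0);
	assert(new_should_update(500, 0) == 0);
}
```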
Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.
Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.
For 64-bit architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to a 32-bit integer by commit
cca9bab1b72c ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32-bit architectures.
Fixes: cca9bab1b72c ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.191 into android-4.9-q
Changes in 4.9.191
HID: Add 044f:b320 ThrustMaster, Inc. 2 in 1 DT
MIPS: kernel: only use i8253 clocksource with periodic clockevent
netfilter: ebtables: fix a memory leak bug in compat
ASoC: dapm: Fix handling of custom_stop_condition on DAPM graph walks
bonding: Force slave speed check after link state recovery for 802.3ad
can: dev: call netif_carrier_off() in register_candev()
st21nfca_connectivity_event_received: null check the allocation
st_nci_hci_connectivity_event_received: null check the allocation
ASoC: ti: davinci-mcasp: Correct slot_width posed constraint
net: usb: qmi_wwan: Add the BroadMobi BM818 card
isdn: mISDN: hfcsusb: Fix possible null-pointer dereferences in start_isoc_chain()
isdn: hfcsusb: Fix mISDN driver crash caused by transfer buffer on the stack
perf bench numa: Fix cpu0 binding
can: sja1000: force the string buffer NULL-terminated
can: peak_usb: force the string buffer NULL-terminated
NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
HID: input: fix a4tech horizontal wheel custom usage
net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
net: hisilicon: make hip04_tx_reclaim non-reentrant
net: hisilicon: fix hip04-xmit never return TX_BUSY
net: hisilicon: Fix dma_map_single failed on arm64
libata: add SG safety checks in SFF pio transfers
x86/lib/cpu: Address missing prototypes warning
drm/vmwgfx: fix memory leak when too many retries have occurred
perf pmu-events: Fix missing "cpu_clk_unhalted.core" event
selftests: kvm: Adding config fragments
HID: wacom: correct misreported EKR ring values
HID: wacom: Correct distance scale for 2nd-gen Intuos devices
Revert "dm bufio: fix deadlock with loop device"
gpiolib: never report open-drain/source lines as 'input' to user-space
userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386
x86/apic: Handle missing global clockevent gracefully
x86/boot: Save fields explicitly, zero out everything else
x86/boot: Fix boot regression caused by bootparam sanitizing
dm btree: fix order of block initialization in btree_split_beneath
dm space map metadata: fix missing store of apply_bops() return value
dm table: fix invalid memory accesses with too high sector number
genirq: Properly pair kobject_del() with kobject_add()
mm, page_owner: handle THP splits correctly
mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT
Revert "perf test 6: Fix missing kvm module load for s390"
x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h
dmaengine: ste_dma40: fix unneeded variable warning
iommu/dma: Handle SG length overflow better
usb: gadget: composite: Clear "suspended" on reset/disconnect
xen/blkback: fix memory leaks
i2c: emev2: avoid race when unregistering slave client
usb: host: fotg2: restart hcd after port reset
tools: hv: fix KVP and VSS daemons exit code
watchdog: bcm2835_wdt: Fix module autoload
scsi: ufs: Fix RX_TERMINATION_FORCE_ENABLE define value
tcp: fix tcp_rtx_queue_tail in case of empty retransmit queue
ALSA: usb-audio: Fix a stack buffer overflow bug in check_input_term
ALSA: usb-audio: Fix an OOB bug in parse_audio_mixer_unit
tcp: make sure EPOLLOUT wont be missed
ALSA: line6: Fix memory leak at line6_init_pcm() error path
ALSA: seq: Fix potential concurrent access to the deleted pool
KVM: x86: Don't update RIP or do single-step on faulting emulation
x86/apic: Do not initialize LDR and DFR for bigsmp
x86/apic: Include the LDR when clearing out APIC registers
mm/zsmalloc.c: fix race condition in zs_destroy_pool
usb-storage: Add new JMS567 revision to unusual_devs
USB: cdc-wdm: fix race between write and disconnect due to flag abuse
usb: chipidea: udc: don't do hardware access if gadget has stopped
usb: host: ohci: fix a race condition between shutdown and irq
usb: host: xhci: rcar: Fix typo in compatible string matching
USB: storage: ums-realtek: Update module parameter description for auto_delink_en
USB: storage: ums-realtek: Whitelist auto-delink support
ptrace,x86: Make user_64bit_mode() available to 32-bit builds
uprobes/x86: Fix detection of 32-bit user mode
mmc: sdhci-of-at91: add quirk for broken HS200
mmc: core: Fix init of SD cards reporting an invalid VDD range
stm class: Fix a double free of stm_source_device
VMCI: Release resource if the work is already queued
Revert "cfg80211: fix processing world regdomain when non modular"
mac80211: fix possible sta leak
KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is long
KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI
i2c: piix4: Fix port selection for AMD Family 16h Model 30h
x86/ptrace: fix up botched merge of spectrev1 fix
mm/zsmalloc.c: fix build when CONFIG_COMPACTION=n
Linux 4.9.191
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
triggers the following stack trace:
[25244.848046] kernel BUG at ./include/linux/skbuff.h:1406!
[25244.859335] RIP: 0010:skb_queue_prev+0x9/0xc
[25244.888167] Call Trace:
[25244.889182] <IRQ>
[25244.890001] tcp_fragment+0x9c/0x2cf
[25244.891295] tcp_write_xmit+0x68f/0x988
[25244.892732] __tcp_push_pending_frames+0x3b/0xa0
[25244.894347] tcp_data_snd_check+0x2a/0xc8
[25244.895775] tcp_rcv_established+0x2a8/0x30d
[25244.897282] tcp_v4_do_rcv+0xb2/0x158
[25244.898666] tcp_v4_rcv+0x692/0x956
[25244.899959] ip_local_deliver_finish+0xeb/0x169
[25244.901547] __netif_receive_skb_core+0x51c/0x582
[25244.903193] ? inet_gro_receive+0x239/0x247
[25244.904756] netif_receive_skb_internal+0xab/0xc6
[25244.906395] napi_gro_receive+0x8a/0xc0
[25244.907760] receive_buf+0x9a1/0x9cd
[25244.909160] ? load_balance+0x17a/0x7b7
[25244.910536] ? vring_unmap_one+0x18/0x61
[25244.911932] ? detach_buf+0x60/0xfa
[25244.913234] virtnet_poll+0x128/0x1e1
[25244.914607] net_rx_action+0x12a/0x2b1
[25244.915953] __do_softirq+0x11c/0x26b
[25244.917269] ? handle_irq_event+0x44/0x56
[25244.918695] irq_exit+0x61/0xa0
[25244.919947] do_IRQ+0x9d/0xbb
[25244.921065] common_interrupt+0x85/0x85
[25244.922479] </IRQ>
tcp_rtx_queue_tail() (called by tcp_fragment()) can call
tcp_write_queue_prev() on the first packet in the queue, which will trigger
the BUG in tcp_write_queue_prev(), because there is no previous packet.
This happens when the retransmit queue is empty, for example in case of a
zero window.
Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") was not a
simple cherry-pick of the original one from master (b617158dc096)
because there is a specific TCP rtx queue only since v4.15. For more
details, please see the commit message of b617158dc096 ("tcp: be more
careful in tcp_fragment()").
The BUG() is hit due to the specific code added to versions older than
v4.15. The comment in skb_queue_prev() (include/linux/skbuff.h:1406),
just before the BUG_ON(), suggests adding a check before using it,
which is what Tim did.
In master, this code path causing the issue will not be taken because
the implementation of tcp_rtx_queue_tail() is different:
tcp_fragment() → tcp_rtx_queue_tail() → tcp_write_queue_prev() →
skb_queue_prev() → BUG_ON()
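The failure mode and the kind of guard described above can be modelled in user space. The sketch below is a deliberately simplified, hypothetical model (an array with a `send_head` index instead of the kernel's linked sk_buff queue); `toy_write_queue`, `toy_queue_prev` and `toy_rtx_queue_tail` are illustrative names and not the patched kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a pre-4.15 write queue: pkt[0..len-1] holds every
 * queued packet, and send_head is the index of the first packet not yet
 * sent (send_head == len means everything has been sent). Packets before
 * send_head form the retransmit part of the queue.
 */
struct toy_write_queue {
	int pkt[8];
	size_t len;
	size_t send_head;
};

/* Mirrors the BUG_ON(): asking for the predecessor of the head is fatal. */
static const int *toy_queue_prev(const struct toy_write_queue *q, size_t idx)
{
	assert(idx > 0);
	return &q->pkt[idx - 1];
}

/* Last sent packet, or NULL when nothing has been sent yet. */
const int *toy_rtx_queue_tail(const struct toy_write_queue *q)
{
	/* Empty retransmit part: send_head is the very first packet. */
	if (q->send_head == 0)
		return NULL;

	return toy_queue_prev(q, q->send_head);
}
```

Without the early return, a queue whose first packet is still unsent would reach `toy_queue_prev()` with index 0 and hit the assertion, which corresponds to the BUG in the trace.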
Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()")
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Merge 4.9.189 into android-4.9-q
Changes in 4.9.189
scsi: fcoe: Embed fc_rport_priv in fcoe_rport structure
ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD SOM-LV
ARM: dts: Add pinmuxing for i2c2 and i2c3 for LogicPD torpedo
ARM: dts: logicpd-som-lv: Fix Audio Mute
arm64: cpufeature: Fix CTR_EL0 field definitions
arm64: cpufeature: Fix feature comparison for CTR_EL0.{CWG,ERG}
tcp: be more careful in tcp_fragment()
HID: wacom: fix bit shift for Cintiq Companion 2
HID: Add quirk for HP X1200 PIXART OEM mouse
RDMA: Directly cast the sockaddr union to sockaddr
IB: directly cast the sockaddr union to aockaddr
objtool: Add machine_real_restart() to the noreturn list
objtool: Add rewind_stack_do_exit() to the noreturn list
libceph: use kbasename() and kill ceph_file_part()
atm: iphase: Fix Spectre v1 vulnerability
net: bridge: delete local fdb on device init failure
net: bridge: mcast: don't delete permanent entries when fast leave is enabled
net: fix ifindex collision during namespace removal
net/mlx5: Use reversed order when unregister devices
net: sched: Fix a possible null-pointer dereference in dequeue_func()
tipc: compat: allow tipc commands without arguments
compat_ioctl: pppoe: fix PPPOEIOCSFWD handling
ip6_tunnel: fix possible use-after-free on xmit
ife: error out when nla attributes are empty
bnx2x: Disable multi-cos feature.
block: blk_init_allocated_queue() set q->fq as NULL in the fail case
spi: bcm2835: Fix 3-wire mode if DMA is enabled
x86: cpufeatures: Sort feature word 7
x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
x86/speculation: Enable Spectre v1 swapgs mitigations
x86/entry/64: Use JMP instead of JMPQ
x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS
Linux 4.9.189
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit b617158dc096709d8600c53b6052144d12b89fab ]
Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.
We should allow these flows to make progress.
This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.
It also adds some room, because tcp_sendmsg() and tcp_sendpage()
might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15.
Note for < 4.15 backports :
tcp_rtx_queue_tail() will probably look like :
static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}
Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Merge 4.9.187 into android-4.9-q
Changes in 4.9.187
MIPS: ath79: fix ar933x uart parity mode
MIPS: fix build on non-linux hosts
arm64/efi: Mark __efistub_stext_offset as an absolute symbol explicitly
dmaengine: imx-sdma: fix use-after-free on probe error path
ath10k: Do not send probe response template for mesh
ath9k: Check for errors when reading SREV register
ath6kl: add some bounds checking
ath: DFS JP domain W56 fixed pulse type 3 RADAR detection
batman-adv: fix for leaked TVLV handler.
media: dvb: usb: fix use after free in dvb_usb_device_exit
crypto: talitos - fix skcipher failure due to wrong output IV
media: marvell-ccic: fix DMA s/g desc number calculation
media: vpss: fix a potential NULL pointer dereference
media: media_device_enum_links32: clean a reserved field
net: stmmac: dwmac1000: Clear unused address entries
net: stmmac: dwmac4/5: Clear unused address entries
signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig
af_key: fix leaks in key_pol_get_resp and dump_sp.
xfrm: Fix xfrm sel prefix length validation
media: mc-device.c: don't memset __user pointer contents
media: staging: media: davinci_vpfe: - Fix for memory leak if decoder initialization fails.
net: phy: Check against net_device being NULL
crypto: talitos - properly handle split ICV.
crypto: talitos - Align SEC1 accesses to 32 bits boundaries.
tua6100: Avoid build warnings.
locking/lockdep: Fix merging of hlocks with non-zero references
media: wl128x: Fix some error handling in fm_v4l2_init_video_device()
cpupower : frequency-set -r option misses the last cpu in related cpu list
net: fec: Do not use netdev messages too early
net: axienet: Fix race condition causing TX hang
s390/qdio: handle PENDING state for QEBSM devices
perf cs-etm: Properly set the value of 'old' and 'head' in snapshot mode
perf test 6: Fix missing kvm module load for s390
gpio: omap: fix lack of irqstatus_raw0 for OMAP4
gpio: omap: ensure irq is enabled before wakeup
regmap: fix bulk writes on paged registers
bpf: silence warning messages in core
rcu: Force inlining of rcu_read_lock()
blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration
xfrm: fix sa selector validation
perf evsel: Make perf_evsel__name() accept a NULL argument
vhost_net: disable zerocopy by default
ipoib: correcly show a VF hardware address
EDAC/sysfs: Fix memory leak when creating a csrow object
ipsec: select crypto ciphers for xfrm_algo
media: i2c: fix warning same module names
ntp: Limit TAI-UTC offset
timer_list: Guard procfs specific code
acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
media: coda: fix mpeg2 sequence number handling
media: coda: increment sequence offset for the last returned frame
mt7601u: do not schedule rx_tasklet when the device has been disconnected
x86/build: Add 'set -e' to mkcapflags.sh to delete broken capflags.c
mt7601u: fix possible memory leak when the device is disconnected
ath10k: fix PCIE device wake up failed
perf tools: Increase MAX_NR_CPUS and MAX_CACHES
libata: don't request sense data on !ZAC ATA devices
clocksource/drivers/exynos_mct: Increase priority over ARM arch timer
rslib: Fix decoding of shortened codes
rslib: Fix handling of of caller provided syndrome
ixgbe: Check DDM existence in transceiver before access
crypto: asymmetric_keys - select CRYPTO_HASH where needed
EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec
bcache: check c->gc_thread by IS_ERR_OR_NULL in cache_set_flush()
iwlwifi: mvm: Drop large non sta frames
net: usb: asix: init MAC address buffers
gpiolib: Fix references to gpiod_[gs]et_*value_cansleep() variants
Bluetooth: hci_bcsp: Fix memory leak in rx_skb
Bluetooth: 6lowpan: search for destination address in all peers
Bluetooth: Check state in l2cap_disconnect_rsp
Bluetooth: validate BLE connection interval updates
gtp: fix Illegal context switch in RCU read-side critical section.
gtp: fix use-after-free in gtp_newlink()
xen: let alloc_xenballooned_pages() fail if not enough memory free
scsi: NCR5380: Reduce goto statements in NCR5380_select()
scsi: NCR5380: Always re-enable reselection interrupt
scsi: mac_scsi: Increase PIO/PDMA transfer length threshold
crypto: ghash - fix unaligned memory access in ghash_setkey()
crypto: arm64/sha1-ce - correct digest for empty data in finup
crypto: arm64/sha2-ce - correct digest for empty data in finup
crypto: chacha20poly1305 - fix atomic sleep when using async algorithm
crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
Input: gtco - bounds check collection indent level
regulator: s2mps11: Fix buck7 and buck8 wrong voltages
arm64: tegra: Update Jetson TX1 GPU regulator timings
iwlwifi: pcie: don't service an interrupt that was masked
tracing/snapshot: Resize spare buffer if size changed
NFSv4: Handle the special Linux file open access mode
lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
ALSA: seq: Break too long mutex context in the write loop
ALSA: hda/realtek: apply ALC891 headset fixup to one Dell machine
media: v4l2: Test type instead of cfg->type in v4l2_ctrl_new_custom()
media: coda: Remove unbalanced and unneeded mutex unlock
KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
arm64: tegra: Fix AGIC register range
fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes.
drm/nouveau/i2c: Enable i2c pads & busses during preinit
padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
9p/virtio: Add cleanup path in p9_virtio_init
PCI: Do not poll for PME if the device is in D3cold
Btrfs: add missing inode version, ctime and mtime updates when punching hole
libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
take floppy compat ioctls to sodding floppy.c
floppy: fix div-by-zero in setup_format_params
floppy: fix out-of-bounds read in next_valid_format
floppy: fix invalid pointer dereference in drive_name
floppy: fix out-of-bounds read in copy_buffer
coda: pass the host file in vma->vm_file on mmap
gpu: ipu-v3: ipu-ic: Fix saturation bit offset in TPMEM
crypto: ccp - Validate the the error value used to index error messages
PCI: hv: Delete the device earlier from hbus->children for hot-remove
PCI: hv: Fix a use-after-free bug in hv_eject_device_work()
crypto: caam - limit output IV to CBC to work around CTR mode DMA issue
um: Allow building and running on older hosts
um: Fix FP register size for XSTATE/XSAVE
parisc: Ensure userspace privilege for ptraced processes in regset functions
parisc: Fix kernel panic due invalid values in IAOQ0 or IAOQ1
powerpc/32s: fix suspend/resume when IBATs 4-7 are used
powerpc/watchpoint: Restore NV GPRs while returning from exception
eCryptfs: fix a couple type promotion bugs
intel_th: msu: Fix single mode with disabled IOMMU
Bluetooth: Add SMP workaround Microsoft Surface Precision Mouse bug
usb: Handle USB3 remote wakeup for LPM enabled devices correctly
dm bufio: fix deadlock with loop device
compiler.h, kasan: Avoid duplicating __read_once_size_nocheck()
compiler.h: Add read_word_at_a_time() function.
lib/strscpy: Shut up KASAN false-positives in strscpy()
ext4: allow directory holes
bnx2x: Prevent load reordering in tx completion processing
bnx2x: Prevent ptp_task to be rescheduled indefinitely
caif-hsi: fix possible deadlock in cfhsi_exit_module()
igmp: fix memory leak in igmpv3_del_delrec()
ipv4: don't set IPv6 only flags to IPv4 addresses
net: bcmgenet: use promisc for unsupported filters
net: dsa: mv88e6xxx: wait after reset deactivation
net: neigh: fix multiple neigh timer scheduling
net: openvswitch: fix csum updates for MPLS actions
nfc: fix potential illegal memory access
rxrpc: Fix send on a connected, but unbound socket
sky2: Disable MSI on ASUS P6T
vrf: make sure skb->data contains ip header to make routing
macsec: fix use-after-free of skb during RX
macsec: fix checksumming after decryption
netrom: fix a memory leak in nr_rx_frame()
netrom: hold sock when setting skb->destructor
bonding: validate ip header before check IPPROTO_IGMP
tcp: Reset bytes_acked and bytes_received when disconnecting
net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
net: bridge: stp: don't cache eth dest pointer before skb pull
perf/x86/amd/uncore: Rename 'L2' to 'LLC'
perf/x86/amd/uncore: Get correct number of cores sharing last level cache
perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
NFSv4: Fix open create exclusive when the server reboots
nfsd: increase DRC cache limit
nfsd: give out fewer session slots as limit approaches
nfsd: fix performance-limiting session calculation
nfsd: Fix overflow causing non-working mounts on 1 TB machines
drm/panel: simple: Fix panel_simple_dsi_probe
usb: core: hub: Disable hub-initiated U1/U2
tty: max310x: Fix invalid baudrate divisors calculator
pinctrl: rockchip: fix leaked of_node references
tty: serial: cpm_uart - fix init when SMC is relocated
drm/bridge: tc358767: read display_props in get_modes()
drm/bridge: sii902x: pixel clock unit is 10kHz instead of 1kHz
memstick: Fix error cleanup path of memstick_init
tty/serial: digicolor: Fix digicolor-usart already registered warning
tty: serial: msm_serial: avoid system lockup condition
serial: 8250: Fix TX interrupt handling condition
drm/virtio: Add memory barriers for capset cache.
phy: renesas: rcar-gen2: Fix memory leak at error paths
drm/rockchip: Properly adjust to a true clock in adjusted_mode
tty: serial_core: Set port active bit in uart_port_activate
usb: gadget: Zero ffs_io_data
powerpc/pci/of: Fix OF flags parsing for 64bit BARs
PCI: sysfs: Ignore lockdep for remove attribute
kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS
PCI: xilinx-nwl: Fix Multi MSI data programming
iio: iio-utils: Fix possible incorrect mask calculation
recordmcount: Fix spurious mcount entries on powerpc
mfd: core: Set fwnode for created devices
mfd: arizona: Fix undefined behavior
mfd: hi655x-pmic: Fix missing return value check for devm_regmap_init_mmio_clk
um: Silence lockdep complaint about mmap_sem
powerpc/4xx/uic: clear pending interrupt after irq type/pol change
RDMA/i40iw: Set queue pair state when being queried
serial: sh-sci: Terminate TX DMA during buffer flushing
serial: sh-sci: Fix TX DMA buffer flushing and workqueue races
kallsyms: exclude kasan local symbols on s390
perf test mmap-thread-lookup: Initialize variable to suppress memory sanitizer warning
RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h
f2fs: avoid out-of-range memory access
mailbox: handle failed named mailbox channel request
powerpc/eeh: Handle hugepages in ioremap space
sh: prevent warnings when using iounmap
mm/kmemleak.c: fix check for softirq context
9p: pass the correct prototype to read_cache_page
mm/mmu_notifier: use hlist_add_head_rcu()
locking/lockdep: Fix lock used or unused stats error
locking/lockdep: Hide unused 'class' variable
usb: wusbcore: fix unbalanced get/put cluster_id
usb: pci-quirks: Correct AMD PLL quirk detection
x86/sysfb_efi: Add quirks for some devices with swapped width and height
x86/speculation/mds: Apply more accurate check on hypervisor platform
hpet: Fix division by zero in hpet_time_div()
ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1
ALSA: hda - Add a conexant codec entry to let mute led work
powerpc/tm: Fix oops on sigreturn on systems without TM
access: avoid the RCU grace period for the temporary subjective credentials
ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt
tcp: reset sk_send_head in tcp_write_queue_purge
arm64: dts: marvell: Fix A37xx UART0 register size
i2c: qup: fixed releasing dma without flush operation completion
arm64: compat: Provide definition for COMPAT_SIGMINSTKSZ
ISDN: hfcsusb: checking idx of ep configuration
media: au0828: fix null dereference in error path
media: cpia2_usb: first wake up, then free in disconnect
media: radio-raremono: change devm_k*alloc to k*alloc
Bluetooth: hci_uart: check for missing tty operations
sched/fair: Don't free p->numa_faults with concurrent readers
drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
ceph: hold i_ceph_lock when removing caps for freeing inode
Linux 4.9.187
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 3b4929f65b0d8249f19a50245cd88ed1a2f78cff upstream.
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :
BUG_ON(tcp_skb_pcount(skb) < pcount);
This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48
An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.
This means that the 16bit width of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.
Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.
CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
Backport notes, provided by Joao Martins <joao.m.martins@oracle.com>
Since v4.15, i.e. since commit 737ff314563 ("tcp: use sequence distance to
detect reordering"), the code has switched from packet-based FACK tracking
to sequence-based tracking.
v4.14 and older still have the old logic, so tcp_skb_shift_data() needs to
retain its original logic and keep @fack_count in sync. In other words, we
keep incrementing pcount by tcp_skb_pcount(skb) so that it can later be used
to update fack_count. To make this more explicit, we track the count of the
new skb that gets added to pcount in @next_pcount, which also avoids
calling tcp_skb_pcount(skb) repeatedly.
Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.160 into android-4.9
Changes in 4.9.160
net: fix IPv6 prefix route residue
vsock: cope with memory allocation failure at socket creation time
hwmon: (lm80) Fix missing unlock on error in set_fan_div()
net: Fix for_each_netdev_feature on Big endian
net: phy: xgmiitorgmii: Support generic PHY status read
net: stmmac: handle endianness in dwmac4_get_timestamp
sky2: Increase D3 delay again
vhost: correctly check the return value of translate_desc() in log_used()
net: Add header for usage of fls64()
tcp: tcp_v4_err() should be more careful
net: Do not allocate page fragments that are not skb aligned
tcp: clear icsk_backoff in tcp_write_queue_purge()
vxlan: test dev->flags & IFF_UP before calling netif_rx()
net: stmmac: Fix a race in EEE enable callback
net: ipv4: use a dedicated counter for icmp_v4 redirect packets
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
net/x25: do not hold the cpu too long in x25_new_lci()
mISDN: fix a race in dev_expire_timer()
ax25: fix possible use-after-free
Linux 4.9.160
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ]
soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.
Current logic should have prevented this :
	if (seq != tp->snd_una || !icsk->icsk_retransmits ||
	    !icsk->icsk_backoff || fastopen)
		break;
Problem is the write queue might have been purged
and icsk_backoff has not been cleared.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.124 into android-4.9
Changes in 4.9.124
x86/entry/64: Remove %ebx handling from error_entry/exit
ARC: Explicitly add -mmedium-calls to CFLAGS
usb: dwc3: of-simple: fix use-after-free on remove
netfilter: ipv6: nf_defrag: reduce struct net memory waste
selftests: pstore: return Kselftest Skip code for skipped tests
selftests: static_keys: return Kselftest Skip code for skipped tests
selftests: user: return Kselftest Skip code for skipped tests
selftests: zram: return Kselftest Skip code for skipped tests
selftests: sync: add config fragment for testing sync framework
ARM: dts: NSP: Fix i2c controller interrupt type
ARM: dts: NSP: Fix PCIe controllers interrupt types
ARM: dts: Cygnus: Fix I2C controller interrupt type
ARM: dts: Cygnus: Fix PCIe controller interrupt type
arm64: dts: ns2: Fix I2C controller interrupt type
drm: mali-dp: Enable Global SE interrupts mask for DP500
IB/rxe: Fix missing completion for mem_reg work requests
libahci: Fix possible Spectre-v1 pmp indexing in ahci_led_store()
usb: dwc2: fix isoc split in transfer with no data
usb: gadget: composite: fix delayed_status race condition when set_interface
usb: gadget: dwc2: fix memory leak in gadget_init()
xen: add error handling for xenbus_printf
scsi: xen-scsifront: add error handling for xenbus_printf
xen/scsiback: add error handling for xenbus_printf
arm64: make secondary_start_kernel() notrace
qed: Add sanity check for SIMD fastpath handler.
enic: initialize enic->rfs_h.lock in enic_probe
net: hamradio: use eth_broadcast_addr
net: propagate dev_get_valid_name return code
net: stmmac: socfpga: add additional ocp reset line for Stratix10
nvmet: reset keep alive timer in controller enable
ARC: Enable machine_desc->init_per_cpu for !CONFIG_SMP
net: davinci_emac: match the mdio device against its compatible if possible
KVM: arm/arm64: Drop resource size check for GICV window
locking/lockdep: Do not record IRQ state within lockdep code
ipv6: mcast: fix unsolicited report interval after receiving querys
Smack: Mark inode instant in smack_task_to_inode
batman-adv: Fix bat_ogm_iv best gw refcnt after netlink dump
batman-adv: Fix bat_v best gw refcnt after netlink dump
cxgb4: when disabling dcb set txq dcb priority to 0
iio: pressure: bmp280: fix relative humidity unit
brcmfmac: stop watchdog before detach and free everything
ARM: dts: am437x: make edt-ft5x06 a wakeup source
ALSA: seq: Fix UBSAN warning at SNDRV_SEQ_IOCTL_QUERY_NEXT_CLIENT ioctl
usb: xhci: remove the code build warning
usb: xhci: increase CRS timeout value
NFC: pn533: Fix wrong GFP flag usage
perf test session topology: Fix test on s390
perf report powerpc: Fix crash if callchain is empty
perf bench: Fix numa report output code
netfilter: nf_log: fix uninit read in nf_log_proc_dostring
ceph: fix dentry leak in splice_dentry()
selftests/x86/sigreturn/64: Fix spurious failures on AMD CPUs
selftests/x86/sigreturn: Do minor cleanups
ARM: dts: da850: Fix interrups property for gpio
dmaengine: pl330: report BURST residue granularity
dmaengine: k3dma: Off by one in k3_of_dma_simple_xlate()
md/raid10: fix that replacement cannot complete recovery after reassemble
nl80211: relax ht operation checks for mesh
drm/exynos: gsc: Fix support for NV16/61, YUV420/YVU420 and YUV422 modes
drm/exynos: decon5433: Fix per-plane global alpha for XRGB modes
drm/exynos: decon5433: Fix WINCONx reset value
bpf, s390: fix potential memleak when later bpf_jit_prog fails
PCI: xilinx: Add missing of_node_put()
PCI: xilinx-nwl: Add missing of_node_put()
bnx2x: Fix receiving tx-timeout in error or recovery state.
acpi/nfit: fix cmd_rc for acpi_nfit_ctl to always return a value
m68k: fix "bad page state" oops on ColdFire boot
objtool: Support GCC 8 '-fnoreorder-functions'
ipvlan: call dev_change_flags when ipvlan mode is reset
HID: wacom: Correct touch maximum XY of 2nd-gen Intuos
ARM: imx_v6_v7_defconfig: Select ULPI support
ARM: imx_v4_v5_defconfig: Select ULPI support
tracing: Use __printf markup to silence compiler
kasan: fix shadow_size calculation error in kasan_module_alloc
smsc75xx: Add workaround for gigabit link up hardware errata.
samples/bpf: add missing <linux/if_vlan.h>
samples/bpf: Check the error of write() and read()
ieee802154: 6lowpan: set IFLA_LINK
netfilter: x_tables: set module owner for icmp(6) matches
ipv6: make ipv6_renew_options() interrupt/kernel safe
net: qrtr: Broadcast messages only from control port
sh_eth: fix invalid context bug while calling auto-negotiation by ethtool
sh_eth: fix invalid context bug while changing link options by ethtool
ravb: fix invalid context bug while calling auto-negotiation by ethtool
ravb: fix invalid context bug while changing link options by ethtool
ARM: pxa: irq: fix handling of ICMR registers in suspend/resume
net/sched: act_tunnel_key: fix NULL dereference when 'goto chain' is used
ieee802154: at86rf230: switch from BUG_ON() to WARN_ON() on problem
ieee802154: at86rf230: use __func__ macro for debug messages
ieee802154: fakelb: switch from BUG_ON() to WARN_ON() on problem
drm/armada: fix colorkey mode property
netfilter: nf_conntrack: Fix possible possible crash on module loading.
ARC: Improve cmpxchg syscall implementation
bnxt_en: Always set output parameters in bnxt_get_max_rings().
bnxt_en: Fix for system hang if request_irq fails
perf llvm-utils: Remove bashism from kernel include fetch script
nfit: fix unchecked dereference in acpi_nfit_ctl
RDMA/mlx5: Fix memory leak in mlx5_ib_create_srq() error path
ARM: 8780/1: ftrace: Only set kernel memory back to read-only after boot
ARM: DRA7/OMAP5: Enable ACTLR[0] (Enable invalidates of BTB) for secondary cores
ARM: dts: am3517.dtsi: Disable reference to OMAP3 OTG controller
ixgbe: Be more careful when modifying MAC filters
tools: build: Use HOSTLDFLAGS with fixdep
packet: reset network header if packet shorter than ll reserved space
qlogic: check kstrtoul() for errors
tcp: remove DELAYED ACK events in DCTCP
pinctrl: nsp: off by ones in nsp_pinmux_enable()
pinctrl: nsp: Fix potential NULL dereference
drm/nouveau/gem: off by one bugs in nouveau_gem_pushbuf_reloc_apply()
net/ethernet/freescale/fman: fix cross-build error
net: usb: rtl8150: demote allmulti message to dev_dbg()
PCI: OF: Fix I/O space page leak
PCI: versatile: Fix I/O space page leak
net: qca_spi: Avoid packet drop during initial sync
net: qca_spi: Make sure the QCA7000 reset is triggered
net: qca_spi: Fix log level if probe fails
tcp: identify cryptic messages as TCP seq # bugs
KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
ext4: fix spectre gadget in ext4_mb_regular_allocator()
parisc: Remove ordered stores from syscall.S
xfrm_user: prevent leaking 2 bytes of kernel memory
netfilter: conntrack: dccp: treat SYNC/SYNCACK as invalid if no prior state
packet: refine ring v3 block size test to hold one frame
parisc: Remove unnecessary barriers from spinlock.h
PCI: hotplug: Don't leak pci_slot on registration failure
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: pciehp: Fix use-after-free on unplug
PCI: pciehp: Fix unprotected list iteration in IRQ handler
i2c: imx: Fix race condition in dma read
reiserfs: fix broken xattr handling (heap corruption, bad retval)
Linux 4.9.124
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]
After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
related callbacks are no longer needed.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.118 into android-4.9
Changes in 4.9.118
ipv4: remove BUG_ON() from fib_compute_spec_dst
net: ena: Fix use of uninitialized DMA address bits field
net: fix amd-xgbe flow-control issue
net: lan78xx: fix rx handling before first packet is send
net: mdio-mux: bcm-iproc: fix wrong getter and setter pair
NET: stmmac: align DMA stuff to largest cache line length
tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
xen-netfront: wait xenbus state change when load module manually
tcp: do not force quickack when receiving out-of-order packets
tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
tcp: do not aggressively quick ack after ECN events
tcp: refactor tcp_ecn_check_ce to remove sk type cast
tcp: add one more quick ack after after ECN events
pinctrl: intel: Read back TX buffer state
sched/wait: Remove the lockless swait_active() check in swake_up*()
bonding: avoid lockdep confusion in bond_get_stats()
inet: frag: enforce memory limits earlier
ipv4: frags: handle possible skb truesize change
net: dsa: Do not suspend/resume closed slave_dev
netlink: Fix spectre v1 gadget in netlink_create()
net: stmmac: Fix WoL for PCI-based setups
squashfs: more metadata hardening
squashfs: more metadata hardenings
can: ems_usb: Fix memory leak on ems_usb_disconnect()
net: socket: fix potential spectre v1 gadget in socketcall
virtio_balloon: fix another race between migration and ballooning
kvm: x86: vmx: fix vpid leak
crypto: padlock-aes - Fix Nano workaround data corruption
drm/vc4: Reset ->{x, y}_scaling[1] when dealing with uniplanar formats
scsi: sg: fix minor memory leak in error path
Linux 4.9.118
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 9a9c9b51e54618861420093ae6e9b50a961914c5 ]
We want to add finer control of the number of ACK packets sent after
ECN events.
This patch does not change current behavior; it only enables the
following change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.116 into android-4.9
Changes in 4.9.116
MIPS: ath79: fix register address in ath79_ddr_wb_flush()
MIPS: Fix off-by-one in pci_resource_to_user()
ip: hash fragments consistently
ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
net: skb_segment() should not return NULL
net/mlx5: Adjust clock overflow work period
net/mlx5e: Don't allow aRFS for encapsulated packets
net/mlx5e: Fix quota counting in aRFS expire flow
multicast: do not restore deleted record source filter mode to new one
net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv
rtnetlink: add rtnl_link_state check in rtnl_configure_link
tcp: fix dctcp delayed ACK schedule
tcp: helpers to send special DCTCP ack
tcp: do not cancel delay-AcK on DCTCP special ACK
tcp: do not delay ACK in DCTCP upon CE status change
tcp: free batches of packets in tcp_prune_ofo_queue()
tcp: avoid collapses in tcp_prune_queue() if possible
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
tcp: call tcp_drop() from tcp_data_queue_ofo()
usb: cdc_acm: Add quirk for Castles VEGA3000
usb: core: handle hub C_PORT_OVER_CURRENT condition
usb: gadget: f_fs: Only return delayed status when len is 0
driver core: Partially revert "driver core: correct device's shutdown order"
can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK
can: xilinx_can: fix power management handling
can: xilinx_can: fix recovery from error states not being propagated
can: xilinx_can: fix device dropping off bus on RX overrun
can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
can: xilinx_can: fix incorrect clear of non-processed interrupts
can: xilinx_can: fix RX overflow interrupt not being enabled
turn off -Wattribute-alias
exec: avoid gcc-8 warning for get_task_comm
Linux 4.9.116
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit a0496ef2c23b3b180902dd185d0d63ccbc624cf8 ]
Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
has to be sent immediately so the sender can respond quickly:
""" When receiving packets, the CE codepoint MUST be processed as follows:
1. If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
true and send an immediate ACK.
2. If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
to false and send an immediate ACK.
"""
Previously, the DCTCP implementation could continue to delay the ACK. This
patch fixes that and implements the RFC by forcing an immediate ACK.
Tested with this packetdrill script provided by Larry Brakmo:
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001
0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
+0.005 < [ce] . 2001:3001(1000) ack 2 win 257
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
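The two RFC rules quoted above reduce to "ACK immediately whenever the CE state flips". A minimal userspace sketch of that rule follows; the struct and function names are hypothetical and are not the kernel's DCTCP code:

```c
#include <stdbool.h>

/* Hypothetical receiver state; 'ce' plays the role of DCTCP.CE in RFC 8257. */
struct dctcp_rx {
    bool ce; /* CE state of the last packet processed */
};

/*
 * Process the CE codepoint of an arriving packet.
 * Returns true when an immediate ACK is required, i.e. when the CE
 * state changed (rules 1 and 2 of RFC 8257, Section 3.2).
 */
bool dctcp_rx_packet(struct dctcp_rx *rx, bool ce_set)
{
    if (ce_set != rx->ce) {
        rx->ce = ce_set;
        return true;  /* CE status changed: do not delay the ACK */
    }
    return false;     /* unchanged: normal delayed-ACK rules apply */
}
```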
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 27cde44a259c380a3c09066fc4b42de7dde9b1ad ]
Currently, when a DCTCP receiver delays an ACK and receives a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking the previous and latest sequences
respectively (for ECN accounting).
Previously, sending the first ACK could cancel the delayed ACK timer
(tcp_event_ack_sent). This could subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowledges
the latest sequence, which is not true for the first special ACK.
The fix is to drop that assumption in tcp_send_ack() and check the
actual ack sequence before cancelling the delayed ACK. Further, it is
safer to pass the ack sequence number as a local variable into the
tcp_send_ack() routine, instead of intercepting tp->rcv_nxt, to avoid
future bugs like this.
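The idea of the fix can be sketched in isolation as follows. This is a simplified model with hypothetical names (rx_sock, send_ack), not the kernel's tcp_send_ack(): the ACK number is passed in explicitly, and the delayed ACK is cancelled only when that number covers the latest sequence.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced socket state for illustration only. */
struct rx_sock {
    uint32_t rcv_nxt;        /* next sequence number expected */
    bool     delack_pending; /* a delayed ACK is scheduled */
};

/*
 * Send an ACK carrying ack_seq. The delayed ACK is cancelled only if
 * this ACK acknowledges everything received so far; a "special" ACK
 * for an older sequence leaves the pending delayed ACK in place.
 */
void send_ack(struct rx_sock *sk, uint32_t ack_seq)
{
    /* ... build and transmit the ACK segment here ... */
    if (ack_seq == sk->rcv_nxt)
        sk->delack_pending = false;
}
```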
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlqzZsQACgkQONu9yGCS
aT6GpRAA0smp5JfDXHUaQ0/6Syjo0bppqAdXZAkOeRYGEsZC1YG27rmB1YTk7R3X
6pwLRXDFP8264A7Pks4+fbE2CUv6Rt1qS4V6t3CVInzYVshVfThnZkl+cFXYWqTg
P9q7oW7M/nXp73kUJIu2Q+d3NTWozbjHAffwXzaM3XschGEA9FghiDoZToemG7gV
DmKstfUlnLb5M6ljXQla44UQWpPCFL9U5EAXHsEphz5nR7H5fTOvmXd38z2Pxmu8
F+wmVeqS3VzJA4otefClQMH78ZsEkImCJwGx6B2q5KIcVxJRprc0EC/Mxb3oSozR
3W7oxuOhGXV84oo9hY9LVett20ZAyDqcEY7RXaUdaaxC0XnZujgIwiHEUfWH5o5S
7o/kM5DRO7nKDqt9u+O8nXZU9dKPUAV5kPmMWBUlgGslKD9z3yat9z71h4nANd2Y
uQRlXO3CPrBIxTgLqdDRcAtStJoRBjDZxm1X1IYZwG5DjX6oLLYFVjE2rdD6OrOv
5Yi4NQ4qzVt3CcePyactJJQLMGf9/PI9zMXOsefKK6KwApud7zjxl+HVvGaFkDXt
ONkcHW1F9H7AnExaAyIfZiuWlDoOii58XIIAme/xKStCx4jRJD99d6o4fJomwTUq
Hpqe/6XgwmubKsRSn57e45RqtmrtXTAPlcGWDifd//PB1udoHFc=
=jQUN
-----END PGP SIGNATURE-----
Merge 4.9.89 into android-4.9
Changes in 4.9.89
blkcg: fix double free of new_blkg in blkcg_init_queue
Input: tsc2007 - check for presence and power down tsc2007 during probe
perf stat: Issue a HW watchdog disable hint
staging: speakup: Replace BUG_ON() with WARN_ON().
staging: wilc1000: add check for kmalloc allocation failure.
HID: reject input outside logical range only if null state is set
drm: qxl: Don't alloc fbdev if emulation is not supported
ARM: dts: r8a7791: Remove unit-address and reg from integrated cache
ARM: dts: r8a7792: Remove unit-address and reg from integrated cache
ARM: dts: r8a7793: Remove unit-address and reg from integrated cache
ARM: dts: r8a7794: Remove unit-address and reg from integrated cache
arm64: dts: r8a7796: Remove unit-address and reg from integrated cache
drm/sun4i: Fix up error path cleanup for master bind function
drm/sun4i: Set drm_crtc.port to the underlying TCON's output port node
ath10k: fix a warning during channel switch with multiple vaps
drm/sun4i: Fix TCON clock and regmap initialization sequence
PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
selinux: check for address length in selinux_socket_bind()
x86/mm: Make mmap(MAP_32BIT) work correctly
perf sort: Fix segfault with basic block 'cycles' sort dimension
x86/mce: Handle broadcasted MCE gracefully with kexec
eventpoll.h: fix epoll event masks
i40e: Acquire NVM lock before reads on all devices
i40e: fix ethtool to get EEPROM data from X722 interface
perf tools: Make perf_event__synthesize_mmap_events() scale
ARM: brcmstb: Enable ZONE_DMA for non 64-bit capable peripherals
drivers: net: xgene: Fix hardware checksum setting
drivers: net: phy: xgene: Fix mdio write
drivers: net: xgene: Fix wrong logical operation
drivers: net: xgene: Fix Rx checksum validation logic
drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
ath10k: disallow DFS simulation if DFS channel is not enabled
ath10k: fix fetching channel during potential radar detection
usb: misc: lvs: fix race condition in disconnect handling
ARM: bcm2835: Enable missing CMA settings for VC4 driver
net: ethernet: bgmac: Allow MAC address to be specified in DTB
netem: apply correct delay when rate throttling
x86/mce: Init some CPU features early
omapfb: dss: Handle return errors in dss_init_ports()
perf probe: Fix concat_probe_trace_events
perf probe: Return errno when not hitting any event
HID: clamp input to logical range if no null state
net/8021q: create device with all possible features in wanted_features
ARM: dts: Adjust moxart IRQ controller and flags
qed: Always publish VF link from leading hwfn
s390/topology: fix typo in early topology code
zd1211rw: fix NULL-deref at probe
batman-adv: handle race condition for claims between gateways
of: fix of_device_get_modalias returned length when truncating buffers
solo6x10: release vb2 buffers in solo_stop_streaming()
x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up
scsi: fnic: Fix for "Number of Active IOs" in fnicstats becoming negative
scsi: ipr: Fix missed EH wakeup
media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
timers, sched_clock: Update timeout for clock wrap
sysrq: Reset the watchdog timers while displaying high-resolution timers
Input: qt1070 - add OF device ID table
sched: act_csum: don't mangle TCP and UDP GSO packets
PCI: hv: Properly handle PCI bus remove
PCI: hv: Lock PCI bus on device eject
ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
tcp: sysctl: Fix a race to avoid unexpected 0 window from space
dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
usb: dwc3: make sure UX_EXIT_PX is cleared
ARM: dts: bcm2835: add index to the ethernet alias
perf annotate: Fix a bug following symbolic link of a build-id file
perf buildid: Do not assume that readlink() returns a null terminated string
i40e/i40evf: Fix use after free in Rx cleanup path
scsi: be2iscsi: Check tag in beiscsi_mccq_compl_wait
driver: (adm1275) set the m,b and R coefficients correctly for power
bonding: make speed, duplex setting consistent with link state
mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
ALSA: firewire-lib: add a quirk of packet without valid EOH in CIP format
ARM: dts: r8a7794: Add DU1 clock to device tree
ARM: dts: r8a7794: Correct clock of DU1
ARM: dts: silk: Correct clock of DU1
blk-throttle: make sure expire time isn't too big
regulator: core: Limit propagation of parent voltage count and list
perf trace: Handle unpaired raw_syscalls:sys_exit event
f2fs: relax node version check for victim data in gc
drm/ttm: never add BO that failed to validate to the LRU list
bonding: refine bond_fold_stats() wrap detection
PCI: Apply Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices
powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
braille-console: Fix value returned by _braille_console_setup
drm/vmwgfx: Fixes to vmwgfx_fb
vxlan: vxlan dev should inherit lowerdev's gso_max_size
NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
NFC: nfcmrvl: double free on error path
NFC: pn533: change order of free_irq and dev unregistration
ARM: dts: r7s72100: fix ethernet clock parent
ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
ARM: dts: r8a7793: Correct parent of SSI[0-9] clocks
powerpc: Avoid taking a data miss on every userspace instruction miss
net: hns: Correct HNS RSS key set function
net/faraday: Add missing include of of.h
qed: Fix TM block ILT allocation
rtmutex: Fix PI chain order integrity
printk: Correctly handle preemption in console_unlock()
drm: rcar-du: Handle event when disabling CRTCs
ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
reiserfs: Make cancel_old_flush() reliable
ASoC: rt5677: Add OF device ID table
IB/hfi1: Check for QSFP presence before attempting reads
ALSA: firewire-digi00x: add support for console models of Digi00x series
ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
fm10k: correctly check if interface is removed
EDAC, altera: Fix peripheral warnings for Cyclone5
scsi: ses: don't get power status of SES device slot on probe
qed: Correct MSI-x for storage
apparmor: Make path_max parameter readonly
iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
kvm/svm: Setup MCG_CAP on AMD properly
kvm: nVMX: Disallow userspace-injected exceptions in guest mode
video: ARM CLCD: fix dma allocation size
drm/radeon: Fail fb creation from imported dma-bufs.
drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
drm/rockchip: vop: Enable pm domain before vop_initial
i40e: only register client on iWarp-capable devices
coresight: Fixes coresight DT parse to get correct output port ID.
lkdtm: turn off kcov for lkdtm_rodata_do_nothing:
tty: amba-pl011: Fix spurious TX interrupts
serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
MIPS: BPF: Quit clobbering callee saved registers in JIT code.
MIPS: BPF: Fix multiple problems in JIT skb access helpers.
MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
v4l: vsp1: Prevent multiple streamon race commencing pipeline early
v4l: vsp1: Register pipe with output WPF
regulator: isl9305: fix array size
md/raid6: Fix anomily when recovering a single device in RAID6.
md.c:didn't unlock the mddev before return EINVAL in array_size_store
powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()
usb: dwc2: Make sure we disconnect the gadget state
usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
perf evsel: Return exact sub event which failed with EPERM for wildcards
iwlwifi: mvm: fix RX SKB header size and align it properly
drivers/perf: arm_pmu: handle no platform_device
perf inject: Copy events when reordering events in pipe mode
net: fec: add phy-reset-gpios PROBE_DEFER check
perf session: Don't rely on evlist in pipe mode
vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check
vfio/spapr_tce: Check kzalloc() return when preregistering memory
scsi: sg: check for valid direction before starting the request
scsi: sg: close race condition in sg_remove_sfp_usercontext()
ALSA: hda: Add Geminilake id to SKL_PLUS
kprobes/x86: Fix kprobe-booster not to boost far call instructions
kprobes/x86: Set kprobes pages read-only
pwm: tegra: Increase precision in PWM rate calculation
clk: qcom: msm8996: Fix the vfe1 powerdomain name
Bluetooth: Avoid bt_accept_unlink() double unlinking
Bluetooth: 6lowpan: fix delay work init in add_peer_chan()
mac80211_hwsim: use per-interface power level
ath10k: fix compile time sanity check for CE4 buffer size
wil6210: fix protection against connections during reset
wil6210: fix memory access violation in wil_memcpy_from/toio_32
perf stat: Fix bug in handling events in error state
mwifiex: Fix invalid port issue
drm/edid: set ELD connector type in drm_edid_to_eld()
video/hdmi: Allow "empty" HDMI infoframes
HID: elo: clear BTN_LEFT mapping
iwlwifi: mvm: rs: don't override the rate history in the search cycle
clk: meson: gxbb: fix wrong clock for SARADC/SANA
ARM: dts: exynos: Correct Trats2 panel reset line
sched: Stop switched_to_rt() from sending IPIs to offline CPUs
sched: Stop resched_cpu() from sending IPIs to offline CPUs
test_firmware: fix setting old custom fw path back on exit
net: ieee802154: adf7242: Fix bug if defined DEBUG
net: xfrm: allow clearing socket xfrm policies.
mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
net: thunderx: Set max queue count taking XDP_TX into account
ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
mtd: nand: ifc: update bufnum mask for ver >= 2.0.0
userns: Don't fail follow_automount based on s_user_ns
leds: pm8058: Silence pointer to integer size warning
power: supply: ab8500_charger: Fix an error handling path
power: supply: ab8500_charger: Bail out in case of error in 'ab8500_charger_init_hw_registers()'
ath10k: update tdls teardown state to target
scsi: ses: don't ask for diagnostic pages repeatedly during probe
pwm: stmpe: Fix wrong register offset for hwpwm=2 case
clk: qcom: msm8916: fix mnd_width for codec_digcodec
mwifiex: cfg80211: do not change virtual interface during scan processing
ath10k: fix invalid STS_CAP_OFFSET_MASK
tools/usbip: fixes build with musl libc toolchain
spi: sun6i: disable/unprepare clocks on remove
bnxt_en: Don't print "Link speed -1 no longer supported" messages.
scsi: core: scsi_get_device_flags_keyed(): Always return device flags
scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
scsi: dh: add new rdac devices
media: vsp1: Prevent suspending and resuming DRM pipelines
media: cpia2: Fix a couple off by one bugs
veth: set peer GSO values
drm/amdkfd: Fix memory leaks in kfd topology
powerpc/modules: Don't try to restore r2 after a sibling call
agp/intel: Flush all chipset writes after updating the GGTT
mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
mac80211: remove BUG() when interface type is invalid
ASoC: nuc900: Fix a loop timeout test
ipvlan: add L2 check for packets arriving via virtual devices
rcutorture/configinit: Fix build directory error message
locking/locktorture: Fix num reader/writer corner cases
ima: relax requiring a file signature for new files with zero length
net: hns: Some checkpatch.pl script & warning fixes
x86/boot/32: Fix UP boot on Quark and possibly other platforms
x86/cpufeatures: Add Intel PCONFIG cpufeature
selftests/x86/entry_from_vm86: Exit with 1 if we fail
selftests/x86: Add tests for User-Mode Instruction Prevention
selftests/x86: Add tests for the STR and SLDT instructions
selftests/x86/entry_from_vm86: Add test cases for POPF
x86/vm86/32: Fix POPF emulation
x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
x86/mm: Fix vmalloc_fault to use pXd_large
parisc: Handle case where flush_cache_range is called with no context
ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
ALSA: hda - Revert power_save option default value
ALSA: seq: Fix possible UAF in snd_seq_check_queue()
ALSA: seq: Clear client entry before deleting else at closing
drm/amdgpu: fix prime teardown order
drm/amdgpu/dce: Don't turn off DP sink when disconnected
fs: Teach path_connected to handle nfs filesystems with multiple roots.
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
fs/aio: Add explicit RCU grace period when freeing kioctx
fs/aio: Use RCU accessors for kioctx_table->table[]
irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
scsi: sg: fix SG_DXFER_FROM_DEV transfers
scsi: sg: fix static checker warning in sg_is_valid_dxfer
scsi: sg: only check for dxfer_len greater than 256M
btrfs: alloc_chunk: fix DUP stripe size handling
btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
scsi: qla2xxx: Fix extraneous ref on sp's after adapter break
USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
usb: dwc3: Fix GDBGFIFOSPACE_TYPE values
usb: gadget: bdc: 64-bit pointer capability check
Linux 4.9.89
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit c48367427a39ea0b85c7cf018fe4256627abfd9e ]
Because sysctl_tcp_adv_win_scale can be changed at any time, there
is a race in tcp_win_from_space.
For example,
1. sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
2. space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is positive now)
As a result, tcp_win_from_space returns 0, which is unexpected.
Certainly, if the compiler loads sysctl_tcp_adv_win_scale into a
register first and then uses the register directly, it would be ok.
But we cannot depend on compiler behavior.
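A common way to close this kind of race is to read the tunable once into a local variable and use only that snapshot. The sketch below is a userspace illustration under that assumption; adv_win_scale is a stand-in for the sysctl, and the volatile qualifier is used here only to force exactly one load:

```c
/* Stand-in for sysctl_tcp_adv_win_scale; may be changed concurrently. */
volatile int adv_win_scale = 1;

/*
 * Compute the window from the available space using a single snapshot
 * of the tunable, so the sign test and the shift always agree.
 */
int win_from_space(int space)
{
    int scale = adv_win_scale; /* read exactly once */

    return scale <= 0 ? space >> (-scale)
                      : space - (space >> scale);
}
```

Without the snapshot, the sign test and the shift could see different values, and a shift by a negative count is undefined behavior in C.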
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlpL3vYACgkQONu9yGCS
aT4/aQ//VcOiG5at4QV8aEyPkmlL69jspa5yWwbbz0p3vddZrGHb72aT+lfcPgOk
tZpHR4zKOwPPu6ROgVMnyGQTks/I5ZwCqxVjuapgiXJy34QIh6JLjldKl03Bfoz8
2Q+5u1eia1R+2pLhEQLPp5siH3pbgqjMiC/jr2UtleC1QKiwOgmFApoXsl35OxUW
VkTjdqTcllxa5cFmEgb53xzH/zm0XemVe6xNH4Y+KMUmow/GcynPdxjZxkBpgl4t
HEhPR1UP708JF+LHv4FA35HujtxtK9y1UVpZmroW4Y6tW/lwwYAgvIWOC0EYv1i0
Uin5NfvG2BVkSmct19qn7IKfuVffCRb+dxvKP9I1wembqZSo68QC8rjs/dYbw+VU
SOGZM/Nd4m3yseM1QjQHc97GSvxtDwqzRBFp5c43HXrmn1ha8By9kqWa25JsiwJo
GHWmDzTWw9gW7Jp/1EVY7VGO3FNSJHy87ZIHoEnAXxzCtf7BZL2Z+myQhhPIiCXg
9jtcGkhvFVkvwjJQAAxPRr8kcNQaGZbTqh/ZhYzahl/HNAX6Ez/af30wJkmand2V
geps+QgWIzy6dgIiWr9YnJgqRbMeaHE8Ncn/Ch/8Lp30tkFnpUEOrdqFOscDtVgk
yzKFl57pnoxVdGjqejiF10v904sG1uHU8h87jtmnhxpgUz/TTLo=
=RrJV
-----END PGP SIGNATURE-----
Merge 4.9.74 into android-4.9
Changes in 4.9.74
sync objtool's copy of x86-opcode-map.txt
tracing: Remove extra zeroing out of the ring buffer page
tracing: Fix possible double free on failure of allocating trace buffer
tracing: Fix crash when it fails to alloc ring buffer
ring-buffer: Mask out the info bits when returning buffer page length
iw_cxgb4: Only validate the MSN for successful completions
ASoC: wm_adsp: Fix validation of firmware and coeff lengths
ASoC: da7218: fix fix child-node lookup
ASoC: fsl_ssi: AC'97 ops need regmap, clock and cleaning up on failure
ASoC: twl4030: fix child-node lookup
ASoC: tlv320aic31xx: Fix GPIO1 register definition
ALSA: hda: Drop useless WARN_ON()
ALSA: hda - fix headset mic detection issue on a Dell machine
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
x86/mm: Remove flush_tlb() and flush_tlb_current_task()
x86/mm: Make flush_tlb_mm_range() more predictable
x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
x86/mm: Disable PCID on 32-bit kernels
x86/mm: Add the 'nopcid' boot option to turn off PCID
x86/mm: Enable CR4.PCIDE on supported systems
x86/mm/64: Fix reboot interaction with CR4.PCIDE
kbuild: add '-fno-stack-check' to kernel build options
ipv4: igmp: guard against silly MTU values
ipv6: mcast: better catch silly mtu values
net: fec: unmap the xmit buffer that are not transferred by DMA
net: igmp: Use correct source address on IGMPv3 reports
netlink: Add netns check on taps
net: qmi_wwan: add Sierra EM7565 1199:9091
net: reevalulate autoflowlabel setting after sysctl setting
ptr_ring: add barriers
RDS: Check cmsg_len before dereferencing CMSG_DATA
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
tcp md5sig: Use skb's saddr when replying to an incoming segment
tg3: Fix rx hang on MTU change with 5717/5719
net: ipv4: fix for a race condition in raw_sendmsg
net: mvmdio: disable/unprepare clocks in EPROBE_DEFER case
sctp: Replace use of sockets_allocated with specified macro.
adding missing rcu_read_unlock in ipxip6_rcv
ipv4: Fix use-after-free when flushing FIB tables
net: bridge: fix early call to br_stp_change_bridge_id and plug newlink leaks
net: fec: Allow reception of frames bigger than 1522 bytes
net: Fix double free and memory corruption in get_net_ns_by_id()
net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround
sock: free skb in skb_complete_tx_timestamp on error
tcp: invalidate rate samples during SACK reneging
net/mlx5: Fix rate limit packet pacing naming and struct
net/mlx5e: Fix features check of IPv6 traffic
net/mlx5e: Fix possible deadlock of VXLAN lock
net/mlx5e: Add refcount to VXLAN structure
net/mlx5e: Prevent possible races in VXLAN control flow
net/mlx5: Fix error flow in CREATE_QP command
s390/qeth: apply takeover changes when mode is toggled
s390/qeth: don't apply takeover changes to RXIP
s390/qeth: lock IP table while applying takeover changes
s390/qeth: update takeover IPs after configuration change
usbip: fix usbip bind writing random string after command in match_busid
usbip: prevent leaking socket pointer address in messages
usbip: stub: stop printing kernel pointer addresses in messages
usbip: vhci: stop printing kernel pointer addresses in messages
USB: serial: ftdi_sio: add id for Airbus DS P8GR
USB: serial: qcserial: add Sierra Wireless EM7565
USB: serial: option: add support for Telit ME910 PID 0x1101
USB: serial: option: adding support for YUGA CLM920-NC5
usb: Add device quirk for Logitech HD Pro Webcam C925e
usb: add RESET_RESUME for ELSA MicroLink 56K
USB: Fix off by one in type-specific length check of BOS SSP capability
usb: xhci: Add XHCI_TRUST_TX_LENGTH for Renesas uPD720201
timers: Use deferrable base independent of base::nohz_active
timers: Invoke timer_start_debug() where it makes sense
timers: Reinitialize per cpu bases on hotplug
nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()
x86/smpboot: Remove stale TLB flush invocations
n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)
tty: fix tty_ldisc_receive_buf() documentation
mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP
Linux 4.9.74
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit d4761754b4fb2ef8d9a1e9d121c4bec84e1fe292 ]
Mark tcp_sock during a SACK reneging event and invalidate rate samples
while marked. Such rate samples may overestimate bw by including packets
that were SACKed before reneging.
< ack 6001 win 10000 sack 7001:38001
< ack 7001 win 0 sack 8001:38001 // Reneg detected
> seq 7001:8001 // RTO, SACK cleared.
< ack 38001 win 10000
In the above example, the rate sample taken after the last ack will count
7001-38001 as delivered, while the actual delivery rate could likely
be much lower, i.e. 7001-8001.
This patch adds a new field tcp_sock.is_sack_reneg and marks it when we
declare SACK reneging and enter TCP_CA_Loss, and unmarks it after
the last rate sample is taken before moving back to TCP_CA_Open. This
patch also invalidates rate samples taken while tcp_sock.is_sack_reneg
is set.
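A reduced model of the invalidation step might look like the sketch below. The types and the rate_gen() helper are hypothetical, and using a negative value to mark an invalid sample is an assumption of this sketch:

```c
#include <stdbool.h>

/* Hypothetical, reduced types for illustration only. */
struct rate_sample {
    long delivered;   /* bytes counted as delivered, or -1 if invalid */
    long interval_us; /* sampling interval, or -1 if invalid */
};

struct conn {
    bool is_sack_reneg; /* set while recovering from SACK reneging */
};

/* Fill in a rate sample, discarding it while SACK reneging is flagged. */
void rate_gen(const struct conn *c, struct rate_sample *rs,
              long delivered, long interval_us)
{
    if (c->is_sack_reneg) {
        /* Previously SACKed data would inflate the estimate. */
        rs->delivered = -1;
        rs->interval_us = -1;
        return;
    }
    rs->delivered = delivered;
    rs->interval_us = interval_us;
}
```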
Fixes: b9f64820fb ("tcp: track data delivery rate for a TCP connection")
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAloQCeEACgkQONu9yGCS
aT5NSg/+KKaM27NOw+QU41S27e7EEk2ToFZVInD4YMVM37WDP3Dhy/6qGKqd7QEd
pYjcxdXhVi+vIyozY/QjXGNhTTOao5AtgGTdw1l2lag2VritbAqplgr0hPRLoj4M
9BEYveO2u+ooNJ6vieyW7TIVqGh05X4F43/Ng1I3iAbmvMcyg8LcqauYMaNa37jj
PP9XWWbZ87GCLqNM3Cy/V7uR/xFjj7N7/N6//547QRTgqnB31EytUXEwxtvgS7Z8
HxhVYk7gTzZMgpN6TUo0AKnD9iOxzR18kC0PooUz2nphS92Zad2rakhxUVJASUXv
DpY5LSyiN/F6fVp68ObAx8Cw31Uavjyvy/TJju1Kg9Mrt1fN/MBsEH0HSI7PrxyQ
7Q2Se+A8LZqeYW4P1AvHjei7Z10AL64YcXwrsAkeouh74WWKrVoEeoYVYDF+FdRy
87jNJE6+W589g+hLI0fX1Q07luEfToRfvZQTk0pdxLTxt5HrCgSoX6q6yOQ4ofMn
mTfRmNSaeiEaNDgl/f9ZqH3ViOFsINJ+0zgCMmFv4p8yyl2grj63ELdHKkjqTHCN
oPH3ZCeCFV+uvunA8geHLzToMDfZOsPQ4BbewV3TEFG4rSVxXQcoTJWIYaxHYaih
5X/JIgFZyaDnUFJQyEtcj8MyVnz1oraw73ghdOMuvHxkYXe+Xr4=
=DcCO
-----END PGP SIGNATURE-----
Merge 4.9.63 into android-4.9
Changes in 4.9.63
gso: fix payload length when gso_size is zero
tun/tap: sanitize TUNSETSNDBUF input
ipv6: addrconf: increment ifp refcount before ipv6_del_addr()
netlink: do not set cb_running if dump's start() errs
net: call cgroup_sk_alloc() earlier in sk_clone_lock()
tcp: fix tcp_mtu_probe() vs highest_sack
l2tp: check ps->sock before running pppol2tp_session_ioctl()
tun: call dev_get_valid_name() before register_netdevice()
sctp: add the missing sock_owned_by_user check in sctp_icmp_redirect
tcp/dccp: fix ireq->opt races
packet: avoid panic in packet_getsockopt()
soreuseport: fix initialization race
ipv6: flowlabel: do not leave opt->tot_len with garbage
sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
tcp/dccp: fix lockdep splat in inet_csk_route_req()
tcp/dccp: fix other lockdep splats accessing ireq_opt
net/unix: don't show information about sockets from other namespaces
tap: double-free in error path in tap_open()
ipip: only increase err_count for some certain type icmp in ipip_err
ip6_gre: only increase err_count for some certain type icmpv6 in ip6gre_err
ip6_gre: update dst pmtu if dev mtu has been updated by toobig in __gre6_xmit
tun: allow positive return values on dev_get_valid_name() call
sctp: reset owner sk for data chunks on out queues when migrating a sock
net_sched: avoid matching qdisc with zero handle
ppp: fix race in ppp device destruction
mac80211: accept key reinstall without changing anything
mac80211: use constant time comparison with keys
mac80211: don't compare TKIP TX MIC key in reinstall prevention
usb: usbtest: fix NULL pointer dereference
Input: ims-psu - check if CDC union descriptor is sane
ALSA: seq: Cancel pending autoload work at unbinding device
Revert "ARM: dts: imx53-qsb-common: fix FEC pinmux config"
netfilter: nat: avoid use of nf_conn_nat extension
netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
security/keys: add CONFIG_KEYS_COMPAT to Kconfig
brcmfmac: remove setting IBSS mode when stopping AP
target/iscsi: Fix iSCSI task reassignment handling
qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
misc: panel: properly restore atomic counter on error path
Linux 4.9.63
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d ]
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.
If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.
Bad things would then happen, including infinite loops.
This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.
Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack unconditionally:
keeping a stale pointer to a freed skb is a recipe
for disaster.
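The stale-pointer hazard can be shown with a small stand-alone model. The names below (buf, queue, highest_sack_replace) are hypothetical and only mirror the idea of repointing a cached pointer before the old buffer is freed:

```c
#include <stddef.h>

/* Hypothetical stand-ins for an skb and the per-socket state. */
struct buf {
    int id;
};

struct queue {
    struct buf *highest_sack; /* cached pointer into the write queue */
};

/*
 * Call before freeing 'old' after its payload was copied into
 * 'replacement': if the cached pointer refers to 'old', move it to
 * 'replacement' so it never dangles.
 */
void highest_sack_replace(struct queue *q, struct buf *old,
                          struct buf *replacement)
{
    if (q->highest_sack == old)
        q->highest_sack = replacement;
}
```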
Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications that want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences is not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created, without changing the rest of the socket call sequence:
  s = socket()
    create a new socket
  setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
    newly introduced sockopt
    If set, new functionality described below will be used.
    Return ENOTSUPP if TFO is not supported or not enabled in the
    kernel.
  connect()
    With cookie present, return 0 immediately.
    With no cookie, initiate 3WHS with TFO cookie-request option and
    return -1 with errno = EINPROGRESS.
  write()/sendmsg()
    With cookie present, send out SYN with data and return the number of
    bytes buffered.
    With no cookie, and 3WHS not yet completed, return -1 with errno =
    EINPROGRESS.
    No MSG_FASTOPEN flag is needed.
  read()
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
    write() is not called yet.
    Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
    established but no msg is received yet.
    Return number of bytes read if socket is established and there is
    msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
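As a rough illustration of the "one new setsockopt() call", here is a minimal client-side sketch. The helper name tfo_socket() is hypothetical, and the fallback definition of TCP_FASTOPEN_CONNECT is only there for older userspace headers. If the option is rejected, the socket is left usable for a regular handshake.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_FASTOPEN_CONNECT
#define TCP_FASTOPEN_CONNECT 30 /* value from the Linux uapi tcp.h */
#endif

/* Create a TCP socket and try to opt in to Fast Open on connect().
 * *tfo_on reports whether the option was accepted; on failure the
 * socket can still be used with a regular three-way handshake.
 */
static int tfo_socket(int *tfo_on)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    assert(tfo_on != NULL);
    *tfo_on = 0;
    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
                   &one, sizeof(one)) == 0)
        *tfo_on = 1;
    return fd;
}
```

After this call the application keeps its existing connect() and write() sequence unchanged.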
Backport of upstream commit 19f6d3f3c842 ("net/tcp-fastopen: Add new API
support")
Bug: 63449462
Test: Tests in https://android-review.googlesource.com/535357/ pass
Change-Id: Icc181febd74e3117c2fc835d7ed935e107b5815e
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 19f6d3f3c8422d65b5e3d2162e30ef07c6e21ea2)
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called elsewhere in later changes.
Bug: 63449462
Test: Tests in https://android-review.googlesource.com/535357/ pass
Change-Id: I14b0fadd8f97569f773a2e2f15f0b4e8dca48402
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 065263f40f0972d5f1cd294bb0242bd5aa5f06b2)
The default initial rwnd is hardcoded to 10.
Now we allow it to be controlled via
/proc/sys/net/ipv4/tcp_default_init_rwnd,
which limits the values to the range 3 to 100.
This is somewhat needed because ipv6 routes are
autoconfigured by the kernel.
See "An Argument for Increasing TCP's Initial Congestion Window"
in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54
Signed-off-by: JP Abgrall <jpa@google.com>
Conflicts:
net/ipv4/sysctl_net_ipv4.c
net/ipv4/tcp_input.c
With syzkaller help, Marco Grassi found a bug in the TCP stack,
crashing in tcp_collapse().
Root cause is that sk_filter() can truncate the incoming skb,
but TCP stack was not really expecting this to happen.
It probably was expecting a simple DROP or ACCEPT behavior.
We first need to make sure no part of TCP header could be removed.
Then we need to adjust TCP_SKB_CB(skb)->end_seq.
Many thanks to syzkaller team and Marco for giving us a reproducer.
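The two steps above (protect the TCP header, then shrink end_seq) can be sketched in a userspace model. The struct below is a hypothetical simplification, not struct sk_buff, and filter_trim() is an illustrative name rather than the function added by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the fields involved; not struct sk_buff. */
struct seg {
    uint32_t len;      /* bytes in the segment, TCP header included */
    uint32_t hdr_len;  /* TCP header length in bytes (doff * 4) */
    uint32_t seq;      /* first sequence number */
    uint32_t end_seq;  /* sequence number after the last byte */
};

/* Apply a filter verdict that wants to keep only keep_len bytes.
 * The TCP header is never cut, and end_seq shrinks by the number
 * of payload bytes that were removed.
 */
static void filter_trim(struct seg *s, uint32_t keep_len)
{
    uint32_t before = s->len;

    assert(s->len >= s->hdr_len);
    if (keep_len < s->hdr_len)
        keep_len = s->hdr_len;   /* cap: the header must survive */
    if (keep_len < s->len)
        s->len = keep_len;
    s->end_seq -= before - s->len;
}
```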
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marco Grassi <marco.gra@gmail.com>
Reported-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, socket lookups for l3mdev (vrf) use cases can match a socket
that is bound to a port but not a device (i.e., a global socket). If the
sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
based on the main table even though the packet came in from an L3 domain.
The end result is that the connection does not establish, creating
confusion for users since the service is running and a socket shows in
ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
skb came through an interface enslaved to an l3mdev device and the
tcp_l3mdev_accept is not set.
skbs that pass through an l3mdev interface are marked by setting a flag in
inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
inet_skb_parm struct is moved in the cb per commit 971f10eca1, so the
match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
move is done after the socket lookup, so IP6CB is used.
The flags field in inet_skb_parm struct needs to be increased to add
another flag. There is currently a 1-byte hole following the flags,
so it can be expanded to u16 without increasing the size of the struct.
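The matching rule can be illustrated with a simplified model. The struct, field names, and dev_match() are hypothetical; the sketch assumes the lookup already knows the ingress device index and whether the skb carried the l3mdev flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified inputs to a socket lookup. */
struct lookup {
    int  dif;            /* ingress device index of the skb */
    bool via_l3mdev;     /* skb flag: arrived through an enslaved interface */
    bool l3mdev_accept;  /* models the tcp_l3mdev_accept sysctl */
};

/* Return true when a socket may receive this skb.
 * bound_dev_if == 0 models a global (unbound) socket.
 */
static bool dev_match(const struct lookup *l, int bound_dev_if)
{
    bool exact_dif = l->via_l3mdev && !l->l3mdev_accept;

    assert(l->dif > 0);
    if (bound_dev_if || exact_dif)
        return bound_dev_if == l->dif;
    return true;
}
```

With the flag set and the sysctl off, a global socket no longer matches, which is the behavior change the fix introduces.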
Fixes: 193125dbd8 ("net: Introduce VRF device driver")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.
This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.
We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.
With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
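A minimal userspace sketch of the dispatch described above follows. All types and names are hypothetical simplifications; the default cwnd and pacing computations are placeholders, not the stack's real formulas.

```c
#include <assert.h>
#include <stddef.h>

struct conn;

/* Hypothetical, trimmed-down congestion control ops table. */
struct ca_ops {
    /* Optional: if set, takes over cwnd and pacing decisions. */
    void (*cong_control)(struct conn *c, unsigned int delivered);
};

struct conn {
    const struct ca_ops *ops;
    unsigned int cwnd;          /* packets */
    unsigned long pacing_rate;  /* bytes per second */
};

/* Placeholder for the stack's default pacing computation. */
static void default_pacing_update(struct conn *c)
{
    c->pacing_rate = 1000UL * c->cwnd;
}

/* Called once per ACK, after sequence, SACK and loss state are updated. */
static void ack_cong_control(struct conn *c, unsigned int delivered)
{
    assert(c->ops != NULL);
    if (c->ops->cong_control) {
        c->ops->cong_control(c, delivered);
        return;  /* the module owns cwnd and pacing rate */
    }
    c->cwnd += delivered;  /* stand-in for the default cwnd logic */
    default_pacing_update(c);
}

/* Toy module that sets both knobs itself. */
static void toy_cong_control(struct conn *c, unsigned int delivered)
{
    c->cwnd = 2 * delivered + 4;
    c->pacing_rate = 123456;
}
```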
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.
For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.
This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specify this callback, TCP continues to
use the previous default of 2.
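The fallback to the default multiplier can be sketched as follows. The ops struct, the callback signature, and the simplified size formula (multiplier times cwnd times a per-packet byte cost) are assumptions made for illustration; the kernel's actual computation has more inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table with an optional send buffer multiplier. */
struct ca_ops {
    unsigned int (*sndbuf_expand)(void);
};

/* Target send buffer size in bytes for a given cwnd. */
static unsigned int sndbuf_target(const struct ca_ops *ops,
                                  unsigned int cwnd_pkts,
                                  unsigned int per_pkt_bytes)
{
    unsigned int mult;

    assert(ops != NULL);
    mult = ops->sndbuf_expand ? ops->sndbuf_expand() : 2;
    return mult * cwnd_pkts * per_pkt_bytes;
}

/* Example module callback asking for three times cwnd. */
static unsigned int triple_expand(void)
{
    return 3;
}
```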
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:
1) Export tcp_tso_autosize() so that CC modules can use it.
2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.
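To show how a caller-supplied floor interacts with auto-sizing, here is a userspace sketch. It assumes the sizing rule is roughly one millisecond of data at the current pacing rate, capped by the device limits; the parameter names are illustrative and the function is not the kernel's tcp_tso_autosize().

```c
#include <assert.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* pacing_rate is in bytes per second; shifting right by 10 divides
 * by 1024, which approximates the bytes sent in one millisecond.
 */
static uint32_t tso_autosize(uint32_t pacing_rate, uint32_t mss,
                             uint32_t gso_max_bytes, uint32_t gso_max_segs,
                             uint32_t min_tso_segs)
{
    uint32_t bytes, segs;

    assert(mss > 0);
    bytes = min_u32(pacing_rate >> 10, gso_max_bytes);
    segs = max_u32(bytes / mss, min_tso_segs);  /* caller's floor */
    return min_u32(segs, gso_max_segs);         /* device ceiling */
}
```

On a slow path the result is governed entirely by min_tso_segs, so a module that passes a smaller floor gets smaller TSO skbs.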
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.
This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.
Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.
This logic marks the flow app-limited on a write if *all* of the
following are true:
1) There is less than 1 MSS of unsent data in the write queue
   available to transmit.
2) There is no packet in the sender's queues (e.g. in fq or the NIC
   tx queue).
3) The connection is not limited by cwnd.
4) There are no lost packets to retransmit.
The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.
When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.
The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.
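The marking and clearing logic can be modeled in userspace as below. The struct is a hypothetical simplification of the connection state, and the choice of marker value (delivered plus packets in flight, kept non-zero) is an assumption consistent with "until we get an ACK for the next packet we transmit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct conn {
    uint32_t unsent_bytes;  /* data in the write queue not yet sent */
    uint32_t mss;
    uint32_t queued_below;  /* packets held in the qdisc or NIC tx queue */
    uint32_t in_flight;     /* packets sent but not yet delivered */
    uint32_t cwnd;
    uint32_t lost_out;      /* packets marked lost */
    uint32_t retrans_out;   /* packets retransmitted */
    uint32_t delivered;     /* total packets delivered so far */
    uint32_t app_limited;   /* 0, or delivered count that ends the bubble */
};

/* Mark the flow app-limited only if all four conditions hold. */
static void check_app_limited(struct conn *c)
{
    assert(c->mss > 0);
    if (c->unsent_bytes < c->mss &&       /* 1) under 1 MSS to send */
        c->queued_below == 0 &&           /* 2) lower queues are empty */
        c->in_flight < c->cwnd &&         /* 3) not cwnd-limited */
        c->lost_out <= c->retrans_out) {  /* 4) nothing lost left to resend */
        uint32_t mark = c->delivered + c->in_flight;

        c->app_limited = mark ? mark : 1; /* keep the marker non-zero */
    }
}

/* Value stamped on each outgoing packet. */
static bool tx_is_app_limited(const struct conn *c)
{
    return c->app_limited != 0;
}

/* On ACK: clear the marker once delivered has moved past it. */
static void on_delivered(struct conn *c, uint32_t newly_delivered)
{
    c->delivered += newly_delivered;
    if (c->app_limited &&
        (int32_t)(c->delivered - c->app_limited) > 0)
        c->app_limited = 0;
}
```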
We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.
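The bit budget can be checked at compile time. The struct layout below is a hypothetical illustration of sharing one 32-bit word between a 30-bit byte count and a 1-bit flag; the macro names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_FIELD_MAX 65535u  /* largest value of the 16-bit window field */
#define MAX_WSCALE_SHIFT 14u     /* largest window scale shift */

/* Hypothetical layout: byte count and flag packed into 32 bits. */
struct tx_info {
    unsigned int in_flight:30;
    unsigned int is_app_limited:1;
    unsigned int unused:1;
};

/* Largest receive window the protocol can advertise, in bytes. */
static uint32_t max_scaled_window(void)
{
    return WINDOW_FIELD_MAX << MAX_WSCALE_SHIFT;
}

/* The largest window must fit in the 30-bit field. */
static_assert((WINDOW_FIELD_MAX << MAX_WSCALE_SHIFT) < (1u << 30),
              "scaled window fits in 30 bits");
```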
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.
Key state:
  tp->delivered: Tracks the total number of data packets (original or not)
                 delivered so far. This is an already-existing field.
  tp->delivered_mstamp: the last time tp->delivered was updated.
Algorithm:
  A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
    d1: the current tp->delivered after processing the ACK
    t1: the current time after processing the ACK
    d0: the prior tp->delivered when the acked skb was transmitted
    t0: the prior tp->delivered_mstamp when the acked skb was transmitted
When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().
When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).
Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.
At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and t1 are
times at which packets were marked delivered.
If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:
    bw = delivered / ack_phase_rtt
to the following:
    bw = delivered / max(send_phase_rtt, ack_phase_rtt)
In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.
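The per-ACK computation, including the max() filter, can be sketched in userspace. The struct and function names are hypothetical; times are in microseconds and the rate is expressed in packets per second for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Snapshot taken when the skb is transmitted. */
struct tx_snapshot {
    uint32_t delivered;      /* d0 */
    uint64_t delivered_us;   /* t0 */
    uint64_t send_phase_us;  /* duration of the send phase of the window */
};

struct rate_sample {
    uint32_t delivered;      /* d1 - d0, in packets */
    uint64_t interval_us;    /* max(send phase, ack phase) */
};

/* Build a rate sample when the skb is ACKed at time t1 with count d1. */
static struct rate_sample rate_gen(const struct tx_snapshot *s,
                                   uint32_t d1, uint64_t t1_us)
{
    struct rate_sample rs;
    uint64_t ack_phase_us;

    assert(t1_us >= s->delivered_us);
    ack_phase_us = t1_us - s->delivered_us;
    rs.delivered = d1 - s->delivered;
    rs.interval_us = s->send_phase_us > ack_phase_us ?
                     s->send_phase_us : ack_phase_us;
    return rs;
}

/* Delivery rate in packets per second; 0 if the interval is empty. */
static uint64_t rate_pps(const struct rate_sample *rs)
{
    if (rs->interval_us == 0)
        return 0;
    return (uint64_t)rs->delivered * 1000000u / rs->interval_us;
}
```

In the first test case below, compressed ACKs make the ack phase only 5 ms while the send phase took 20 ms; using the longer interval yields 500 packets per second instead of an inflated 2000.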
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.
This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering reaching the 2 Gbytes limit.
Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.
Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.
However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.
This patch converts it to a RB tree, to get bounded latencies.
Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.
Tested with a network dropping between 1 and 10% of packets, with good
success (about a 30% increase in throughput in stress tests).
Next step would be to also use an RB tree for the write queue at sender
side ;)
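To illustrate why a tree keyed on sequence number avoids the linear scan, here is a small userspace sketch. It uses a plain, unbalanced binary search tree for brevity, so it does not give the bounded depth of a red-black tree; the names are hypothetical and no kernel rbtree API is used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
    uint32_t seq;
    struct node *left, *right;
};

/* Wraparound-safe "a comes before b" for 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Insert a segment by descending from the root instead of
 * scanning every queued segment. Returns the (possibly new) root.
 */
static struct node *ofo_insert(struct node *root, uint32_t seq)
{
    struct node **link = &root;
    struct node *n;

    while (*link) {
        if (seq == (*link)->seq)
            return root;  /* duplicate segment: nothing to add */
        link = seq_before(seq, (*link)->seq) ?
               &(*link)->left : &(*link)->right;
    }
    n = calloc(1, sizeof(*n));
    assert(n != NULL);
    n->seq = seq;
    *link = n;
    return root;
}

/* In-order walk: stores sequence numbers in ascending order
 * starting at out[pos] and returns the next free position.
 */
static size_t ofo_walk(const struct node *n, uint32_t *out, size_t pos)
{
    if (!n)
        return pos;
    pos = ofo_walk(n->left, out, pos);
    out[pos++] = n->seq;
    return ofo_walk(n->right, out, pos);
}
```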
Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.
While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.
The cause of the drops is the skb->truesize overestimation caused by:
- drivers allocating ~2048 (or more) bytes as a fragment to hold an
Ethernet frame.
- various pskb_may_pull() calls bringing the headers into skb->head
might have pulled all the frame content, but skb->truesize could
not be lowered, as the stack has no idea of each fragment truesize.
The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.
Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>