A single BPF_OBJ_GET_INFO_BY_FD cmd is used to obtain the info
for both bpf_prog and bpf_map. The kernel can figure out whether
the fd is associated with a bpf_prog or a bpf_map.
The suggested struct bpf_prog_info and struct bpf_map_info are
not meant to be complete lists; completeness is not the goal of
this patch. New fields can be added in future patches.
The focus of this patch is to create the interface,
BPF_OBJ_GET_INFO_BY_FD cmd for exposing the bpf_prog's and
bpf_map's info.
The obj's info, which will be extended (and get bigger) over time, is
separated from the bpf_attr to avoid bloating the bpf_attr.
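A minimal userspace sketch of the new command (the wrapper name here is
made up; attr.info is the union member this interface adds):

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* illustrative wrapper, works for both prog and map fds */
  static int obj_get_info_by_fd(int bpf_fd, void *info, __u32 *info_len)
  {
          union bpf_attr attr;
          int err;

          memset(&attr, 0, sizeof(attr));
          attr.info.bpf_fd = bpf_fd;
          attr.info.info_len = *info_len;
          attr.info.info = (__u64)(unsigned long)info;

          err = syscall(__NR_bpf, BPF_OBJ_GET_INFO_BY_FD,
                        &attr, sizeof(attr));
          if (!err)
                  *info_len = attr.info.info_len; /* size actually filled */
          return err;
  }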
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the bpf syscall, we are relying on the compiler to properly zero out
the bpf_attr union that we copy userspace data into. Unfortunately that
doesn't always work properly: padding and other oddities might not be
correctly zeroed, and in some tests odd things have been found when the
stack is pre-initialized to other values.
Fix this by explicitly memsetting the structure to 0 before using it.
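In essence (a sketch of the fix in the bpf(2) entry point):

  union bpf_attr attr;

  /* zero everything, including padding, before copying user data in */
  memset(&attr, 0, sizeof(attr));
  if (copy_from_user(&attr, uattr, size) != 0)
          return -EFAULT;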
Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Alexander Potapenko <glider@google.com>
Reported-by: Alistair Delva <adelva@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: 1235490
Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
(cherry picked from commit 8096f229421f7b22433775e928d506f0342e5907)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2dc28cd45024da5cc6861ff4a9b25fae389cc6d8
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.
The difference between the three possible flags for the BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
that cgroup program gets run in addition to the program in this cgroup.
NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.
Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first run first).
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.
To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached,
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
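A condensed sketch of the idea (the full structure and run macro are in
the patch itself; the array is assumed non-NULL here, an empty array
holding a single NULL entry):

  struct bpf_prog_array {
          struct rcu_head rcu;
          struct bpf_prog *progs[0];      /* NULL-terminated */
  };

  /* run all attached programs; the overall result stays 1 only if
   * every program returned 1
   */
  static u32 run_prog_array(struct bpf_prog_array __rcu *array,
                            struct sk_buff *skb)
  {
          struct bpf_prog **prog;
          u32 ret = 1;

          rcu_read_lock();
          prog = rcu_dereference(array)->progs;
          for (; *prog; prog++)
                  ret &= BPF_PROG_RUN(*prog, skb);
          rcu_read_unlock();
          return ret;
  }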
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0)
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 121213201
Bug: 138317270
Test: build & boot cuttlefish
Change-Id: I06b71c850b9f3e052b106abab7a4a3add012a3f8
commit 8fe45924387be6b5c1be59a7eb330790c61d5d10 upstream.
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.
This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
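The resulting iteration pattern from userspace looks roughly like this
(bpf_map_get_next_key() stands for the usual BPF_MAP_GET_NEXT_KEY
syscall wrapper; a map with u64 keys is assumed):

  __u64 key, next_key;
  void *prev = NULL;      /* NULL: fetch the first key */

  while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
          /* ... look up / process next_key ... */
          key = next_key;
          prev = &key;
  }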
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chenbo Feng <fengc@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.91 into android-4.9
Changes in 4.9.91
MIPS: ralink: Remove ralink_halt()
iio: st_pressure: st_accel: pass correct platform data to init
ALSA: usb-audio: Fix parsing descriptor of UAC2 processing unit
ALSA: aloop: Sync stale timer before release
ALSA: aloop: Fix access to not-yet-ready substream via cable
ALSA: hda/realtek - Always immediately update mute LED with pin VREF
mmc: dw_mmc: fix falling from idmac to PIO mode when dw_mci_reset occurs
PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
ahci: Add PCI-id for the Highpoint Rocketraid 644L card
clk: bcm2835: Fix ana->maskX definitions
clk: bcm2835: Protect sections updating shared registers
clk: sunxi-ng: a31: Fix CLK_OUT_* clock ops
Bluetooth: btusb: Fix quirk for Atheros 1525/QCA6174
libata: fix length validation of ATAPI-relayed SCSI commands
libata: remove WARN() for DMA or PIO command without data
libata: don't try to pass through NCQ commands to non-NCQ devices
libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
libata: disable LPM for Crucial BX100 SSD 500GB drive
libata: Enable queued TRIM for Samsung SSD 860
libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
nfsd: remove blocked locks on client teardown
mm/vmalloc: add interfaces to free unmapped page table
x86/mm: implement free pmd/pte page interfaces
mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
mm/thp: do not wait for lock_page() in deferred_split_scan()
mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
drm/vmwgfx: Fix a destoy-while-held mutex problem.
drm/radeon: Don't turn off DP sink when disconnected
drm: udl: Properly check framebuffer mmap offsets
acpi, numa: fix pxm to online numa node associations
ACPI / watchdog: Fix off-by-one error at resource assignment
libnvdimm, {btt, blk}: do integrity setup before add_disk()
brcmfmac: fix P2P_DEVICE ethernet address generation
rtlwifi: rtl8723be: Fix loss of signal
tracing: probeevent: Fix to support minus offset from symbol
mtdchar: fix usage of mtd_ooblayout_ecc()
mtd: nand: fsl_ifc: Fix nand waitfunc return value
mtd: nand: fsl_ifc: Fix eccstat array overflow for IFC ver >= 2.0.0
mtd: nand: fsl_ifc: Read ECCSTAT0 and ECCSTAT1 registers for IFC 2.0
staging: ncpfs: memory corruption in ncp_read_kernel()
can: ifi: Repair the error handling
can: ifi: Check core revision upon probe
can: cc770: Fix stalls on rt-linux, remove redundant IRQ ack
can: cc770: Fix queue stall & dropped RTR reply
can: cc770: Fix use after free in cc770_tx_interrupt()
tty: vt: fix up tabstops properly
selftests/x86/ptrace_syscall: Fix for yet more glibc interference
kvm/x86: fix icebp instruction handling
x86/build/64: Force the linker to use 2MB page size
x86/boot/64: Verify alignment of the LOAD segment
x86/entry/64: Don't use IST entry for #BP stack
perf/x86/intel/uncore: Fix Skylake UPI event format
perf stat: Fix CVS output format for non-supported counters
perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
perf/x86/intel/uncore: Fix multi-domain PCI CHA enumeration bug on Skylake servers
iio: ABI: Fix name of timestamp sysfs file
staging: lustre: ptlrpc: kfree used instead of kvfree
selftests, x86, protection_keys: fix wrong offset in siginfo
selftests/x86/protection_keys: Fix syscall NR redefinition warnings
signal/testing: Don't look for __SI_FAULT in userspace
x86/pkeys/selftests: Rename 'si_pkey' to 'siginfo_pkey'
selftests: x86: sysret_ss_attrs doesn't build on a PIE build
kbuild: disable clang's default use of -fmerge-all-constants
bpf: skip unnecessary capability check
bpf, x64: increase number of passes
Linux 4.9.91
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 0fa4fe85f4724fff89b09741c437cbee9cf8b008 upstream.
The current check in the BPF syscall does a capability check for
CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This
code path triggers unnecessary security hooks on capability checking
and causes false alarms about an unprivileged process trying to get
CAP_SYS_ADMIN access. This can be resolved by simply switching the
order of the two checks; CAP_SYS_ADMIN is not required anyway if the
unprivileged bpf syscall is allowed.
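The change, in essence:

  /* check the sysctl first so that capable() and the security hook
   * behind it only run when the result can matter
   */
  if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
          return -EPERM;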
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 4.9.77 into android-4.9
Changes in 4.9.77
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
mac80211: Add RX flag to indicate ICV stripped
ath10k: rebuild crypto header in rx data frames
KVM: Fix stack-out-of-bounds read in write_mmio
can: gs_usb: fix return value of the "set_bittiming" callback
IB/srpt: Disable RDMA access by the initiator
MIPS: Validate PR_SET_FP_MODE prctl(2) requests against the ABI of the task
MIPS: Factor out NT_PRFPREG regset access helpers
MIPS: Guard against any partial write attempt with PTRACE_SETREGSET
MIPS: Consistently handle buffer counter with PTRACE_SETREGSET
MIPS: Fix an FCSR access API regression with NT_PRFPREG and MSA
MIPS: Also verify sizeof `elf_fpreg_t' with PTRACE_SETREGSET
MIPS: Disallow outsized PTRACE_SETREGSET NT_PRFPREG regset accesses
kvm: vmx: Scrub hardware GPRs at VM-exit
platform/x86: wmi: Call acpi_wmi_init() later
x86/acpi: Handle SCI interrupts above legacy space gracefully
ALSA: pcm: Remove incorrect snd_BUG_ON() usages
ALSA: pcm: Add missing error checks in OSS emulation plugin builder
ALSA: pcm: Abort properly at pending signal in OSS read/write loops
ALSA: pcm: Allow aborting mutex lock at OSS read/write loops
ALSA: aloop: Release cable upon open error path
ALSA: aloop: Fix inconsistent format due to incomplete rule
ALSA: aloop: Fix racy hw constraints adjustment
x86/acpi: Reduce code duplication in mp_override_legacy_irq()
zswap: don't param_set_charp while holding spinlock
lan78xx: use skb_cow_head() to deal with cloned skbs
sr9700: use skb_cow_head() to deal with cloned skbs
smsc75xx: use skb_cow_head() to deal with cloned skbs
cx82310_eth: use skb_cow_head() to deal with cloned skbs
xhci: Fix ring leak in failure path of xhci_alloc_virt_device()
8021q: fix a memory leak for VLAN 0 device
ip6_tunnel: disable dst caching if tunnel is dual-stack
net: core: fix module type in sock_diag_bind
RDS: Heap OOB write in rds_message_alloc_sgs()
RDS: null pointer dereference in rds_atomic_free_op
sh_eth: fix TSU resource handling
sh_eth: fix SH7757 GEther initialization
net: stmmac: enable EEE in MII, GMII or RGMII only
ipv6: fix possible mem leaks in ipv6_make_skb()
ethtool: do not print warning for applications using legacy API
mlxsw: spectrum_router: Fix NULL pointer deref
net/sched: Fix update of lastuse in act modules implementing stats_update
crypto: algapi - fix NULL dereference in crypto_remove_spawns()
rbd: set max_segments to USHRT_MAX
x86/microcode/intel: Extend BDW late-loading with a revision check
KVM: x86: Add memory barrier on vmcs field lookup
drm/vmwgfx: Potential off by one in vmw_view_add()
kaiser: Set _PAGE_NX only if supported
iscsi-target: Make TASK_REASSIGN use proper se_cmd->cmd_kref
target: Avoid early CMD_T_PRE_EXECUTE failures during ABORT_TASK
bpf: move fixup_bpf_calls() function
bpf: refactor fixup_bpf_calls()
bpf: prevent out-of-bounds speculation
bpf, array: fix overflow in max_entries and undefined behavior in index_mask
USB: serial: cp210x: add IDs for LifeScan OneTouch Verio IQ
USB: serial: cp210x: add new device ID ELV ALC 8xxx
usb: misc: usb3503: make sure reset is low for at least 100us
USB: fix usbmon BUG trigger
usbip: remove kernel addresses from usb device and urb debug msgs
usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input
usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer
staging: android: ashmem: fix a race condition in ASHMEM_SET_SIZE ioctl
Bluetooth: Prevent stack info leak from the EFS element.
uas: ignore UAS for Norelsys NS1068(X) chips
e1000e: Fix e1000_check_for_copper_link_ich8lan return value.
x86/Documentation: Add PTI description
x86/cpu: Factor out application of forced CPU caps
x86/cpufeatures: Make CPU bugs sticky
x86/cpufeatures: Add X86_BUG_CPU_INSECURE
x86/pti: Rename BUG_CPU_INSECURE to BUG_CPU_MELTDOWN
x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
x86/cpu: Merge bugs.c and bugs_64.c
sysfs/cpu: Add vulnerability folder
x86/cpu: Implement CPU vulnerabilites sysfs functions
x86/cpu/AMD: Make LFENCE a serializing instruction
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
sysfs/cpu: Fix typos in vulnerability documentation
x86/alternatives: Fix optimize_nops() checking
x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
x86/mm/32: Move setup_clear_cpu_cap(X86_FEATURE_PCID) earlier
objtool, modules: Discard objtool annotation sections for modules
objtool: Detect jumps to retpoline thunks
objtool: Allow alternatives to be ignored
x86/asm: Use register variable to get stack pointer value
x86/retpoline: Add initial retpoline support
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline: Fill return stack buffer on vmexit
selftests/x86: Add test_vsyscall
x86/retpoline: Remove compile time warning
objtool: Fix retpoline support for pre-ORC objtool
x86/pti/efi: broken conversion from efi to kernel page table
Linux 4.9.77
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e245c5c6a5656e4d61aa7bb08e9694fd6e5b2b9d upstream.
No functional change.
Move fixup_bpf_calls() to verifier.c;
it's being refactored in the next patch.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jiri Slaby <jslaby@suse.cz>
[backported to 4.9 - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce a bpf object related check when sending and receiving files
through a unix domain socket as well as binder. It checks whether the
receiving process has the privilege to read/write the bpf map or use
the bpf program. This check is necessary because bpf maps and programs
use an anonymous inode as their shared inode, so the normal way of
checking files and sockets passed between processes cannot work
properly for eBPF objects. This check only takes effect when
CONFIG_BPF_SYSCALL is enabled.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-pick from net-next: f66e448cfda021b0bcd884f26709796fe19c7cc1)
Bug: 30950746
Change-Id: I5b2cf4ccb4eab7eda91ddd7091d6aa3e7ed9f2cd
Introduce several LSM hooks for the syscalls that will allow
userspace to access eBPF objects such as eBPF programs and eBPF maps.
The security check is aimed at enforcing per-object security
protection for eBPF objects, so only processes with the right
privileges can read/write a specific map or use a specific eBPF
program. Besides that, a general security hook is added before the
multiplexer of the bpf syscall to check the cmd and the attributes
used for the command. The actual security module can decide which
commands need to be checked and how the cmd should be checked.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added the LIST_HEAD_INIT call for security hooks; it no longer exists
in upstream code.
(cherry-pick from net-next: afdb09c720b62b8090584c11151d856df330e57d)
Bug: 30950746
Change-Id: Ieb3ac74392f531735fc7c949b83346a5f587a77b
Introduce map read/write flags to the eBPF syscalls that return a
map fd. The flags are used to set up the file mode when constructing a
new file descriptor for bpf maps. To not break backward compatibility,
f_flags is set to O_RDWR if the flag passed by the syscall is 0;
otherwise it should be O_RDONLY or O_WRONLY. When userspace wants to
modify or read the map content, the kernel checks the file mode to see
whether the operation is allowed.
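Sketch of the flag-to-file-mode translation described above:

  /* map_flags -> f_flags; 0 keeps the old O_RDWR behaviour */
  int bpf_get_file_flag(int flags)
  {
          if ((flags & BPF_F_RDONLY) && (flags & BPF_F_WRONLY))
                  return -EINVAL;
          if (flags & BPF_F_RDONLY)
                  return O_RDONLY;
          if (flags & BPF_F_WRONLY)
                  return O_WRONLY;
          return O_RDWR;
  }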
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Deleted the file mode configuration code for unsupported map types and
removed the file mode check from helper functions that do not exist in
this tree.
(cherry-pick from net-next: 6e71b04a82248ccf13a94b85cbc674a9fefe53f5)
Bug: 30950746
Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b
Merge 4.9.36 into android-4.9
Changes in 4.9.36
ipv6: release dst on error in ip6_dst_lookup_tail
net: don't call strlen on non-terminated string in dev_set_alias()
decnet: dn_rtmsg: Improve input length sanitization in dnrmg_receive_user_skb
net: Zero ifla_vf_info in rtnl_fill_vfinfo()
net: vrf: Make add_fib_rules per network namespace flag
af_unix: Add sockaddr length checks before accessing sa_family in bind and connect handlers
Fix an intermittent pr_emerg warning about lo becoming free.
sctp: disable BH in sctp_for_each_endpoint
net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
net: tipc: Fix a sleep-in-atomic bug in tipc_msg_reverse
net/mlx5e: Added BW check for DIM decision mechanism
net/mlx5e: Fix wrong indications in DIM due to counter wraparound
proc: snmp6: Use correct type in memset
igmp: acquire pmc lock for ip_mc_clear_src()
igmp: add a missing spin_lock_init()
ipv6: fix calling in6_ifa_hold incorrectly for dad work
sctp: return next obj by passing pos + 1 into sctp_transport_get_idx
net/mlx5e: Avoid doing a cleanup call if the profile doesn't have it
net/mlx5: Wait for FW readiness before initializing command interface
net/mlx5e: Fix timestamping capabilities reporting
decnet: always not take dst->__refcnt when inserting dst into hash table
net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
sfc: provide dummy definitions of vswitch functions
ipv6: Do not leak throw route references
rtnetlink: add IFLA_GROUP to ifla_policy
netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
netfilter: synproxy: fix conntrackd interaction
NFSv4: fix a reference leak caused WARNING messages
NFSv4.x/callback: Create the callback service through svc_create_pooled
xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
drm/ast: Handle configuration without P2A bridge
mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
MIPS: head: Reorder instructions missing a delay slot
MIPS: Avoid accidental raw backtrace
MIPS: pm-cps: Drop manual cache-line alignment of ready_count
MIPS: Fix IRQ tracing & lockdep when rescheduling
ALSA: hda - Fix endless loop of codec configure
ALSA: hda - set input_path bitmap to zero after moving it to new place
NFSv4.1: Fix a race in nfs4_proc_layoutget
gpiolib: fix filtering out unwanted events
drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
dm thin: do not queue freed thin mapping for next stage processing
x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds()
usb: gadget: f_fs: Fix possibe deadlock
l2tp: fix race in l2tp_recv_common()
l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
l2tp: fix duplicate session creation
l2tp: hold session while sending creation notifications
l2tp: take a reference on sessions used in genetlink handlers
mm: numa: avoid waiting on freed migrated pages
sparc64: Handle PIO & MEM non-resumable errors.
sparc64: Zero pages on allocation for mondo and error queues.
net: ethtool: add support for 2500BaseT and 5000BaseT link modes
net: phy: add an option to disable EEE advertisement
dt-bindings: net: add EEE capability constants
net: phy: fix sign type error in genphy_config_eee_advert
net: phy: use boolean dt properties for eee broken modes
dt: bindings: net: use boolean dt properties for eee broken modes
ARM64: dts: meson-gxbb-odroidc2: fix GbE tx link breakage
xen/blkback: don't free be structure too early
KVM: x86: fix fixing of hypercalls
scsi: sd: Fix wrong DPOFUA disable in sd_read_cache_type
stmmac: add missing of_node_put
scsi: lpfc: Set elsiocb contexts to NULL after freeing it
qla2xxx: Terminate exchange if corrupted
qla2xxx: Fix erroneous invalid handle message
drm/amdgpu: fix program vce instance logic error.
drm/amdgpu: add support for new hainan variants
net: phy: dp83848: add DP83620 PHY support
perf/x86/intel: Handle exclusive threadid correctly on CPU hotplug
net: korina: Fix NAPI versus resources freeing
powerpc/eeh: Enable IO path on permanent error
net: ethtool: Initialize buffer when querying device channel settings
xen-netback: fix memory leaks on XenBus disconnect
xen-netback: protect resource cleaning on XenBus disconnect
bnxt_en: Fix "uninitialized variable" bug in TPA code path.
bpf: don't trigger OOM killer under pressure with map alloc
objtool: Fix IRET's opcode
gianfar: Do not reuse pages from emergency reserve
Btrfs: Fix deadlock between direct IO and fast fsync
Btrfs: fix truncate down when no_holes feature is enabled
virtio_console: fix a crash in config_work_handler
swiotlb-xen: update dev_addr after swapping pages
xen-netfront: Fix Rx stall during network stress and OOM
scsi: virtio_scsi: Reject commands when virtqueue is broken
iwlwifi: fix kernel crash when unregistering thermal zone
platform/x86: ideapad-laptop: handle ACPI event 1
amd-xgbe: Check xgbe_init() return code
net: dsa: Check return value of phy_connect_direct()
drm/amdgpu: check ring being ready before using
vfio/spapr: fail tce_iommu_attach_group() when iommu_data is null
mlxsw: spectrum_router: Correctly reallocate adjacency entries
virtio_net: fix PAGE_SIZE > 64k
ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()
vxlan: do not age static remote mac entries
ibmveth: Add a proper check for the availability of the checksum features
kernel/panic.c: add missing \n
Documentation: devicetree: change the mediatek ethernet compatible string
drm/etnaviv: trick drm_mm into giving out a low IOVA
perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
pinctrl: intel: Set pin direction properly
net: phy: marvell: fix Marvell 88E1512 used in SGMII mode
mac80211: recalculate min channel width on VHT opmode changes
perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
HID: i2c-hid: Add sleep between POWER ON and RESET
scsi: lpfc: avoid double free of resource identifiers
spi: davinci: use dma_mapping_error()
arm64: assembler: make adr_l work in modules under KASLR
net: thunderx: acpi: fix LMAC initialization
drm/radeon/si: load special ucode for certain MC configs
drm/amd/powerplay: fix vce cg logic error on CZ/St.
drm/amd/powerplay: refine vce dpm update code on Cz.
pmem: return EIO on read_pmem() failure
mac80211: initialize SMPS field in HT capabilities
x86/tsc: Add the Intel Denverton Processor to native_calibrate_tsc()
x86/mpx: Use compatible types in comparison to fix sparse error
perf/core: Fix sys_perf_event_open() vs. hotplug
perf/x86: Reject non sampling events with precise_ip
aio: fix lock dep warning
coredump: Ensure proper size of sparse core files
swiotlb: ensure that page-sized mappings are page-aligned
s390/ctl_reg: make __ctl_load a full memory barrier
usb: dwc2: gadget: Fix GUSBCFG.USBTRDTIM value
be2net: fix status check in be_cmd_pmac_add()
be2net: don't delete MAC on close on unprivileged BE3 VFs
be2net: fix MAC addr setting on privileged BE3 VFs
perf probe: Fix to show correct locations for events on modules
net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
tipc: allocate user memory with GFP_KERNEL flag
perf probe: Fix to probe on gcc generated functions in modules
net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
sctp: check af before verify address in sctp_addr_id2transport
ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
ravb: Fix use-after-free on `ifconfig eth0 down`
mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings
xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
xfrm: NULL dereference on allocation failure
xfrm: Oops on error in pfkey_msg2xfrm_state()
netfilter: use skb_to_full_sk in ip_route_me_harder
watchdog: bcm281xx: Fix use of uninitialized spinlock.
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
spi: When no dma_chan map buffers with spi_master's parent
spi: fix device-node leaks
regulator: tps65086: Fix expected switch DT node names
regulator: tps65086: Fix DT node referencing in of_parse_cb
ARM: OMAP2+: omap_device: Sync omap_device and pm_runtime after probe defer
ARM: dts: OMAP3: Fix MFG ID EEPROM
ARM64/ACPI: Fix BAD_MADT_GICC_ENTRY() macro implementation
ARM: 8685/1: ensure memblock-limit is pmd-aligned
tools arch: Sync arch/x86/lib/memcpy_64.S with the kernel
x86/boot/KASLR: Fix kexec crash due to 'virt_addr' calculation bug
x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
x86/mm: Fix flush_tlb_page() on Xen
ocfs2: o2hb: revert hb threshold to keep compatible
iommu/vt-d: Don't over-free page table directories
iommu: Handle default domain attach failure
iommu/dma: Don't reserve PCI I/O windows
iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
iommu/amd: Fix interrupt remapping when disable guest_mode
cpufreq: s3c2416: double free on driver init error path
clk: scpi: don't add cpufreq device if the scpi dvfs node is disabled
objtool: Fix another GCC jump table detection issue
infiniband: hns: avoid gcc-7.0.1 warning for uninitialized data
brcmfmac: avoid writing channel out of allocated array
i2c: brcmstb: Fix START and STOP conditions
mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program
arm64: fix NULL dereference in have_cpu_die()
KVM: x86: fix emulation of RSM and IRET instructions
KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
KVM: x86: zero base3 of unusable segments
KVM: nVMX: Fix exception injection
Linux 4.9.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ]
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.
Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.
Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.
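The resulting helpers look roughly like:

  void *bpf_map_area_alloc(size_t size)
  {
          /* __GFP_NORETRY keeps the OOM killer out of the picture;
           * we really just want the allocation to fail instead
           */
          const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
          void *area;

          if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                  area = kmalloc(size, GFP_USER | flags);
                  if (area != NULL)
                          return area;
          }
          return __vmalloc(size, GFP_KERNEL | flags | __GFP_HIGHMEM,
                           PAGE_KERNEL);
  }

  void bpf_map_area_free(void *area)
  {
          kvfree(area);
  }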
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, loading a cgroup skb eBPF program requires the
CAP_SYS_ADMIN capability, while attaching the program to a cgroup only
requires the user to have the CAP_NET_ADMIN privilege. We can skip the
capability check when loading the program, just as for socket filter
programs, to make the capability requirement consistent.
Change since v1:
Change the code style in order to be compliant with checkpatch.pl
preference
(url: http://patchwork.ozlabs.org/patch/769460/)
Signed-off-by: Chenbo Feng <fengc@google.com>
Bug: 30950746
Change-Id: Ibe51235127d6f9349b8f563ad31effc061b278ed
If the BPF_F_ALLOW_OVERRIDE flag is used in the BPF_PROG_ATTACH
command for the given cgroup, a descendant cgroup will be able to
override the effective bpf program that was inherited from this
cgroup. By default the flag is not passed, therefore override is
disallowed.
Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X
2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.
3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.
In the future this behavior may be extended with a chain of
non-overridable programs.
Also fix the bug where detaching from a cgroup with nothing attached
did not return an error. Return ENOENT in such a case.
Add several testcases and adjust libbpf.
Fixes: 3007098494be ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28
("cgroup: add support for eBPF programs")
[AmitP: Refactored original patch for android-4.9 where libbpf sources
are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock,
and test_cgrp2_sock2 sample tests do not exist.]
(cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03)
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cherry-pick from commit f4324551489e8781d838f941b7aee4208e52e8bf
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.
On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.
When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.
If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.
For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.
The API is guarded by CAP_NET_ADMIN.
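Sketch of an attach from userspace (cgroup path and error handling are
illustrative):

  union bpf_attr attr;
  int cg_fd = open("/sys/fs/cgroup/foo", O_DIRECTORY | O_RDONLY);

  memset(&attr, 0, sizeof(attr));
  attr.target_fd = cg_fd;
  attr.attach_bpf_fd = prog_fd;           /* from BPF_PROG_LOAD */
  attr.attach_type = BPF_CGROUP_INET_INGRESS;

  if (syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr)))
          perror("BPF_PROG_ATTACH");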
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug: 30950746
Change-Id: Iab156859332166835d51e1e6f64e5cb8b81870f2
In map_create(), we first find and create the map, then once that
succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means that the user_struct allocation is
leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
Make the label names in the fix consistent with bpf_prog_load().
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.
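The helper itself is a one-liner (sketch):

  /* take i references at once; pairs with i later bpf_prog_put()s */
  struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
  {
          atomic_add(i, &prog->aux->refcnt);
          return prog;
  }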
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
To update an element, the caller is expected to obtain a cgroup2 backed
fd by open(cgroup2_dir) and then update the array with that fd.
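Sketch of the userspace side (path and wrapper are illustrative):

  int cg_fd = open("/sys/fs/cgroup/foo", O_DIRECTORY | O_RDONLY);
  __u32 idx = 0;

  /* store the cgroup fd; the kernel resolves and holds the cgroup */
  if (bpf_map_update_elem(array_fd, &idx, &cg_fd, BPF_ANY))
          perror("cgroup array update");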
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather unlikely when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jann Horn reported the following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:
- Set up a process with threads T1, T2 and T3
- Let T1 set up a socket filter F1 that invokes another filter F2
through a BPF map [tail call]
- Let T1 trigger the socket filter via a unix domain socket write,
don't wait for completion
- Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
- Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
- Let T3 close the file descriptor for F2, dropping the reference
count of F2 to 2
- At this point, T1 should have looked up F2 from the map, but not
finished executing it
- Let T3 remove F2 from the BPF map, dropping the reference count of
F2 to 1
- Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
via schedule_work()
- At this point, the BPF program could be freed
- BPF execution is still running in a freed BPF program
While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().
That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb5607035 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() looks good to me.
Since we don't know whether, at the time of attaching the program, we're
already part of a tail call map, we need to use the RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch extends map_fd_get_ptr() callback that is used by fd array
maps, so that struct file pointer from the related map can be passed
in. It's safe to remove map_update_elem() callback for the two maps since
this is only allowed from syscall side, but not from eBPF programs for these
two map types. Like in per-cpu map case, bpf_fd_array_map_update_elem()
needs to be called directly here due to the extra argument.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a release callback for maps that is invoked when the last
reference to its struct file is gone and the struct file about
to be released by vfs. The handler will be used by fd array maps.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
On a system with >32Gbyte of physical memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.
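Sketch of the limit on the prog side (maps are handled the same way):

  #define BPF_MAX_REFCNT 32768

  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
  {
          if (atomic_inc_return(&prog->aux->refcnt) > BPF_MAX_REFCNT) {
                  atomic_sub(1, &prog->aux->refcnt);
                  return ERR_PTR(-EBUSY);
          }
          return prog;
  }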
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like
tc can check for them when loading objects from a pinned entry, e.g.
if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the
pinned object, it can bail out. Follow-up to 6c90598174 ("bpf:
pre-allocate hash map elements"), so that tc can still support this
with v4.6.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.
Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.
The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikely race between user space map_lookup() and
program-side recycling is possible:

cpu0                                    cpu1
----                                    ----
user does lookup(stackidX)
starts copying ips into buffer
                                        delete(stackidX)
                                        calls bpf_get_stackid()
                                        which recycles the element and
                                        overwrites with new stack trace
To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
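Condensed sketch of that lookup path (declarations elided):

  bucket = xchg(&smap->buckets[id], NULL);
  if (!bucket)
          return -ENOENT;

  memcpy(value, bucket->data, trace_len); /* consistent snapshot */

  old_bucket = xchg(&smap->buckets[id], bucket);
  if (old_bucket) /* a new trace landed in the slot during the copy */
          pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);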
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.
Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.
Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.
The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work
At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.
Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.
While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.
Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.
Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.
Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.
Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:
1 cpu:
pcpu_ida 2.1M
pcpu_ida nolock 2.3M
bt 2.4M
kmalloc 1.8M
hlist+spinlock 2.3M
pcpu_freelist 2.6M
4 cpu:
pcpu_ida 1.5M
pcpu_ida nolock 1.8M
bt w/smp_align 1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock 0.2M
pcpu_freelist 2.0M
8 cpu:
pcpu_ida 0.7M
bt w/smp_align 0.8M
kmalloc 0.4M
pcpu_freelist 1.5M
32 cpu:
kmalloc 0.13M
pcpu_freelist 0.49M
pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.
bt is a variant of block/blk-mq-tag.c simplified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist
hlist+spinlock is the simplest free list with single spinlock.
As expected it has very bad scaling in SMP.
kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.
Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.
Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i < nr_cpus; i++)
          value += values[i];
The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/renesas/ravb_main.c
kernel/bpf/syscall.c
net/ipv4/ipmr.c
All three conflicts were cases of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
[<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
[<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
[<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
[<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
[< inline >] alloc_pages include/linux/gfp.h:451
[<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
[<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
[<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
[< inline >] kmalloc_large include/linux/slab.h:390
[<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
[< inline >] kmalloc include/linux/slab.h:463
[< inline >] map_update_elem kernel/bpf/syscall.c:288
[< inline >] SYSC_bpf kernel/bpf/syscall.c:744
To avoid a never-succeeding kmalloc() with order >= MAX_ORDER, check that
elem->value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
so keep those kmalloc-s as-is.
Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.
Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.
Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.
Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.
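Sketch of the resulting refcounting:

  /* uref: additionally count a "user" reference (fd, pinned node) */
  void bpf_map_inc(struct bpf_map *map, bool uref)
  {
          atomic_inc(&map->refcnt);
          if (uref)
                  atomic_inc(&map->usercnt);
  }

  static void bpf_map_put_uref(struct bpf_map *map)
  {
          /* flush prog array contents only when the last user is gone */
          if (atomic_dec_and_test(&map->usercnt) &&
              map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
                  bpf_fd_array_map_clear(map);
  }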
Joint work with Alexei.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a handler for show_fdinfo() to be used by the anon-inodes
backend for eBPF maps, and dump the map specification there. Not
only useful for admins, but also it provides a minimal way to
compare specs from ELF vs pinned object.
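Sketch of the handler:

  static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp)
  {
          const struct bpf_map *map = filp->private_data;

          seq_printf(m,
                     "map_type:\t%u\n"
                     "key_size:\t%u\n"
                     "value_size:\t%u\n"
                     "max_entries:\t%u\n",
                     map->map_type,
                     map->key_size,
                     map->value_size,
                     map->max_entries);
  }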
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds support for "persistent" eBPF maps/programs. The term
"persistent" is to be understood that maps/programs have a facility
that lets them survive process termination. This is desired by various
eBPF subsystem users.
Just to name one example: tc classifier/action. Whenever tc parses
the ELF object, extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.
The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (programs f.e. for tail calls) among various
processes.
We've been brainstorming on how we could tackle this issue and various
approaches have been tried out so far, which can be read up further in
the below reference.
The architecture we eventually ended up with is a minimal file system
that can hold map/prog objects. The file system is a per mount namespace
singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
mounts within a given namespace will point to the same instance. The
file system allows for creating a user-defined directory structure.
The objects for maps/progs are created/fetched through bpf(2) with
two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
along with a pathname is being passed to bpf(2) that in turn creates
(we call it eBPF object pinning) the file system nodes. Only the pathname
is being passed to bpf(2) for getting a new BPF file descriptor to an
existing node. The user can use that to access maps and progs later on,
through bpf(2). Removal of file system nodes is being managed through
normal VFS functions such as unlink(2), etc. The file system code is
kept to a very minimum and can be further extended later on.
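Sketch of the userspace flow (the path is made up):

  union bpf_attr attr;
  int fd;

  memset(&attr, 0, sizeof(attr));
  attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_map";
  attr.bpf_fd = map_fd;
  syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));

  /* later, possibly from a different process: */
  memset(&attr, 0, sizeof(attr));
  attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_map";
  fd = syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));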
The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU but also applications
can inspect the meta data after calling BPF_OBJ_GET.
Big thanks also to Alexei and Hannes who significantly contributed
in the design discussion that eventually let us end up with this
architecture here.
Reference: https://lkml.org/lkml/2015/10/15/925
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have duplicated cleanup code in the bpf_prog_put() and
bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
not worth it to make it a common helper called by both, but with
the recent addition of resource charging, we could have avoided
the fix in commit ac00737f4e ("bpf: Need to call bpf_prog_uncharge_memlock
from bpf_prog_put") had we had only a single, common path.
We can simplify it further by assigning aux->prog only once during
allocation time.
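One way the single path can look, sketched from the description above
(not the literal diff):
static void __bpf_prog_put_rcu(struct rcu_head *rcu)
{
        struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);

        /* one common release path for both put variants */
        free_used_maps(aux);
        bpf_prog_uncharge_memlock(aux->prog);
        bpf_prog_free(aux->prog);
}

void bpf_prog_put(struct bpf_prog *prog)
{
        if (atomic_dec_and_test(&prog->aux->refcnt))
                call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
}
With aux->prog assigned once at allocation time, the RCU callback can
reach the prog through aux alone.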
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that we have them a bit
more consistent:
- __bpf_map_get() and __bpf_prog_get() that both work on the fd
struct, check whether the descriptor is eBPF and return the
pointer to the map/prog stored in the private data.
Also, we can return f.file->private_data directly; the function
signature is documentation enough already.
- bpf_map_get() and bpf_prog_get() that both work on the u32 user fd,
call their respective __bpf_map_get()/__bpf_prog_get() variants,
and take a reference (see the sketch below).
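A minimal sketch of the map-side pair (the prog-side variants mirror
it); close to the shape described above, though not guaranteed to
match the kernel line for line:
static struct bpf_map *__bpf_map_get(struct fd f)
{
        if (!f.file)
                return ERR_PTR(-EBADF);
        if (f.file->f_op != &bpf_map_fops) {
                /* not an eBPF map descriptor */
                fdput(f);
                return ERR_PTR(-EINVAL);
        }

        return f.file->private_data;
}

struct bpf_map *bpf_map_get(u32 ufd)
{
        struct fd f = fdget(ufd);
        struct bpf_map *map;

        map = __bpf_map_get(f);
        if (IS_ERR(map))
                return map;

        /* take a reference before dropping the fd struct */
        atomic_inc(&map->refcnt);
        fdput(f);

        return map;
}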
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're going to use anon_inode_getfd() invocations in more than just
the current places, make a helper function for both, so that we only need
to pass a map/prog pointer to the helper itself in order to get an fd. The
new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().
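A sketch of what the map-side helper plausibly reduces to (name and
flags follow the existing anon-inode usage; treat details as
illustrative):
int bpf_map_new_fd(struct bpf_map *map)
{
        /* wrap the map pointer into an anonymous inode and return an fd */
        return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
                                O_RDWR | O_CLOEXEC);
}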
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, bpf_prog_uncharge_memlock is only called from __prog_put_rcu
in the bpf_prog_release path. We need to call this from bpf_prog_put
also to get correct accounting.
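A hedged sketch of where the missing call lands, simplified from the
code of that era:
void bpf_prog_put(struct bpf_prog *prog)
{
        if (atomic_dec_and_test(&prog->aux->refcnt)) {
                free_used_maps(prog->aux);
                bpf_prog_uncharge_memlock(prog);        /* previously missing */
                bpf_prog_free(prog);
        }
}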
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since eBPF programs and maps use kernel memory, consider it 'locked'
memory from the user accounting point of view and charge it against the
RLIMIT_MEMLOCK limit.
This limit is typically set to 64 Kbytes by distros, so almost all
bpf+tracing programs would need to increase it, since they use maps
and the kernel charges the maximum map size upfront.
For example, a hash map of 1024 elements will be charged as 64 Kbytes.
It's inconvenient for current users and changes current behavior for
root, but is probably worth doing to be consistent between root and
non-root.
Similar accounting logic is done by mmap of perf_event.
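For completeness, a small userspace snippet of the kind loaders
typically need once this charging is in place (a sketch;
RLIM_INFINITY is just one possible choice):
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

        /* raise the locked-memory limit before creating large maps,
         * since the kernel charges the maximum map size upfront */
        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
                perror("setrlimit(RLIMIT_MEMLOCK)");
                return 1;
        }
        return 0;
}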
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let unprivileged users load and execute eBPF programs,
teach the verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
(except R10+Imm which is used to compute stack addresses)
- comparison of pointers
(except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass the skb pointer to LD_ABS insns, etc.,
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only unprivileged programs of type BPF_PROG_TYPE_SOCKET_FILTER are
allowed, so that socket filters (tcpdump), af_packet (QUIC
acceleration) and the future kcm can use it.
Tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers,
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
                *value += skb->len;
        return 0;
}
but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
                *value += (u64) skb;    /* stores a kernel pointer in the map */
        return 0;
}
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0) but can be set to true (1). Once true,
bpf programs and maps cannot be accessed from unprivileged processes,
and the toggle cannot be set back to false.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While recently arguing in a seccomp discussion that raw prandom_u32()
access shouldn't be exposed to unprivileged user space, I forgot the
fact that the SKF_AD_RANDOM extension actually already does it for some
time in cBPF via commit 4cd3675ebf ("filter: added BPF random opcode").
Since prandom_u32() is being used in a lot of critical networking code,
let's be more conservative and split their states. Furthermore, consolidate
the eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
bpf_get_prandom_u32() was only accessible to privileged users, but
should that change one day, we also don't want to leak raw sequences
through things like eBPF maps.
One thought was also to have separate per-bpf_prog states, but for ABI
reasons this is not easily possible, i.e. the program code currently
cannot access bpf_prog itself, and copying the rnd_state to/from the
stack scratch space whenever a program uses the PRNG does not really
seem worth the trouble and seems too hacky. If needed, taus113 could in
such cases be implemented within eBPF, using a map entry to keep the
state space, or get_random_bytes() could become a second helper in cases
where performance is not critical.
Both sides can trigger a one-time late init via prandom_init_once() on
the shared state. Performance-wise, there should even be a tiny gain
as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
inside the BPF core since kernels could have a NET-less config as well.
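A sketch of how the split state and helper plausibly fit together
(per-CPU variable name and exact shape are illustrative):
static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);

void bpf_user_rnd_init_once(void)
{
        /* one-time late init of the BPF-private PRNG state */
        prandom_init_once(&bpf_user_rnd_state);
}

u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
{
        struct rnd_state *state = &get_cpu_var(bpf_user_rnd_state);
        u32 res = prandom_u32_state(state);

        put_cpu_var(bpf_user_rnd_state);
        return res;
}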
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using routing realms as part of the classifier is quite useful; realms
can be viewed as a tag for one or multiple routing entries (think of
an analogy to the net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.
Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).
If a realm is set, the handler returns the non-zero realm. User space
can set the full 32-bit realm for the dst.
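A hypothetical classifier sketch using the helper stub pattern from
the kernel samples; BPF_FUNC_get_route_realm is the helper id this
patch exposes, the stub declaration itself is our assumption:
#include <linux/bpf.h>

static unsigned int (*bpf_get_route_realm)(void *skb) =
        (void *) BPF_FUNC_get_route_realm;

__attribute__((section("classifier"), used))
int cls_realm(struct __sk_buff *skb)
{
        /* non-zero dst realm, or 0 if none is set; returned here
         * directly as the classid to cls_bpf */
        return bpf_get_route_realm(skb);
}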
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we need to add further flags to the bpf_prog structure, let's migrate
both bools to a bitfield representation. The size of the base structure
(excluding insns) remains unchanged at 40 bytes.
Also add annotations for kmemcheck, so that it doesn't throw false
positives. Even if gcc were to generate suboptimal code, the bitfield is
not accessed in performance-critical paths.
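A sketch of the resulting layout (fields abridged; the kmemcheck
annotation macros follow the kernel of that time, the exact field list
is illustrative):
struct bpf_prog {
        u16                     pages;          /* number of allocated pages */
        kmemcheck_bitfield_begin(meta);
        u16                     jited:1,        /* is our filter JIT'ed? */
                                gpl_compatible:1; /* is filter GPL compatible? */
        kmemcheck_bitfield_end(meta);
        /* ... insns etc. unchanged ... */
};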
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We may already have gotten a proper fd struct through fdget(), so
whenever we return at the end of a map operation, we need to call
fdput(). However, each map operation from the syscall side first probes
CHECK_ATTR() to verify that unused fields in the bpf_attr union are
zero.
In case of malformed input, we return with an error, but the lookup of
the map_fd has already been performed at that time, so we return
without a corresponding fdput(). Fix it by performing the fdget()
only right before bpf_map_get(). The fdget() invocation on maps in
the verifier is not affected.
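A simplified sketch of the reordering, using map lookup as the example
(error handling and the actual lookup elided; not the literal diff):
static int map_lookup_elem(union bpf_attr *attr)
{
        int ufd = attr->map_fd;
        struct bpf_map *map;
        struct fd f;
        int err;

        /* malformed attr? bail out before any fdget() */
        if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
                return -EINVAL;

        /* take the file reference only right before it is used */
        f = fdget(ufd);
        map = bpf_map_get(f);
        if (IS_ERR(map))
                return PTR_ERR(map);

        /* ... key/value copy and the actual lookup elided ... */
        err = 0;

        fdput(f);
        return err;
}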
Fixes: db20fd2b01 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>