forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
missing panel/dsi/gpu/mdss in mainlined yoshino device tree #2
Comments
konradybcio
pushed a commit
that referenced
this issue
Sep 22, 2022
Petr Machata says: ==================== mlxsw: Adjust QOS tests for Spectrum-4 testing Amit writes: Quality Of Service tests create congestion and verify the switch behavior. To create congestion, they need to have more traffic than the port can handle, so some of them force 1Gbps speed. The tests assume that 1Gbps speed is supported. Spectrum-4 ASIC will not support this speed in all ports, so to be able to run QOS tests there, some adjustments are required. Patch set overview: Patch #1 adjusts qos_ets_strict, qos_mc_aware and sch_ets tests. Patch #2 adjusts RED tests. Patch #3 extends devlink_lib to support querying maximum pool size. Patch #4 adds a test which can be used instead of qos_burst and do not assume that 1Gbps speed is supported. Patch #5 removes qos_burst test. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 22, 2022
…itical-section' Ido Schimmel says: ==================== ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section Patch #1 fixes a bug in ipmr code. Patch #2 adds corresponding test cases. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 22, 2022
…org/drm/drm-intel into drm-next drm/i915 feature pull #2 for v6.1: Features and functionality: - More Meteorlake platform enabling (Radhakrishna, Imre, Madhumitha) - Allow seamless M/N changes on eDP panels that support it (Ville) - Switch DSC debugfs from output bpp to input bpc (Swati) Refactoring and cleanups: - Clocking and DPLL refactoring and cleanups to support seamless M/N (Ville) - Plenty of VBT definition and parsing updates and cleanups (Ville) - Extract SKL watermark code to a separate file, and clean up (Ville) - Clean up IPC interfaces and debugfs (Jani) - Continue moving display data under drm_i915_private display sub-struct (Jani) - Display quirk handling refactoring and abstractions (Jani) - Stop using implicit dev_priv in gmbus registers (Jani) - BUG_ON() removals and conversions to drm_WARN_ON() and BUILD_BUG_ON() (Jani) - Use drm_dp_phy_name() for logging (Jani) - Use REG_BIT() macros for CDCLK registers (Stan) - Move display and media IP versions to runtime info (Radhakrishna) Fixes: - Fix DP MST suspend to avoid use-after-free (Andrzej) - Fix HPD suspend to avoid use-after-free for fbdev (Andrzej) - Fix various PSR issues regarding selective update and damage clips (Jouni) - Fix runtime pm wakerefs for driver remove and release (Mitul Golani) - Fix conditions for filtering fixed modes for panels (Ville) - Fix TV encoder clock computation (Ville) - Fix dvo mode_valid hook return type (Nathan Huckleberry) Merges: - Backmerge drm-next to sync the DP MST atomic changes (Jani) Signed-off-by: Dave Airlie <[email protected]> From: Jani Nikula <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
@rhjdvsgsgks May you try a current |
konradybcio
pushed a commit
that referenced
this issue
Oct 26, 2022
With lockdeps enabled, we get the following warning: ====================================================== WARNING: possible circular locking dependency detected ------------------------------------------------------ kworker/u12:1/53 is trying to acquire lock: ffff80000adce220 (coresight_mutex){+.+.}-{4:4}, at: coresight_set_assoc_ectdev_mutex+0x3c/0x5c but task is already holding lock: ffff80000add1f60 (ect_mutex){+.+.}-{4:4}, at: cti_probe+0x318/0x394 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (ect_mutex){+.+.}-{4:4}: __mutex_lock_common+0xd8/0xe60 mutex_lock_nested+0x44/0x50 cti_add_assoc_to_csdev+0x4c/0x184 coresight_register+0x2f0/0x314 tmc_probe+0x33c/0x414 -> #0 (coresight_mutex){+.+.}-{4:4}: __lock_acquire+0x1a20/0x32d0 lock_acquire+0x160/0x308 __mutex_lock_common+0xd8/0xe60 mutex_lock_nested+0x44/0x50 coresight_set_assoc_ectdev_mutex+0x3c/0x5c cti_update_conn_xrefs+0x6c/0xf8 cti_probe+0x33c/0x394 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(ect_mutex); lock(coresight_mutex); lock(ect_mutex); lock(coresight_mutex); *** DEADLOCK *** 4 locks held by kworker/u12:1/53: #0: ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1fc/0x63c #1: (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x228/0x63c #2: (&dev->mutex){....}-{4:4}, at: __device_attach+0x48/0x1a8 #3: (ect_mutex){+.+.}-{4:4}, at: cti_probe+0x318/0x394 To fix the same, call cti_add_assoc_to_csdev without the holding coresight_mutex and confine the locking while setting the associated ect / cti device using coresight_set_assoc_ectdev_mutex(). Fixes: 177af82 ("coresight: cti: Enable CTI associated with devices") Cc: Mathieu Poirier <[email protected]> Cc: Suzuki K Poulose <[email protected]> Cc: Mike Leach <[email protected]> Cc: Leo Yan <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Reviewed-by: Mike Leach <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
konradybcio
pushed a commit
that referenced
this issue
Oct 26, 2022
…locator' Hou Tao says: ==================== From: Hou Tao <[email protected]> Hi, The patchset aims to fix one problem of bpf memory allocator destruction when there is PREEMPT_RT kernel or kernel with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or mips). The root cause is that there may be busy refill_work when the allocator is destroying and it may incur oops or other problems as shown in patch #1. Patch #1 fixes the problem by waiting for the completion of irq work during destroying and patch #2 is just a clean-up patch based on patch #1. Please see individual patches for more details. Comments are always welcome. Change Log: v2: * patch 1: fix typos and add notes about the overhead of irq_work_sync() * patch 1 & 2: add Acked-by tags from [email protected] v1: https://lore.kernel.org/bpf/[email protected]/T/#t ==================== Signed-off-by: Alexei Starovoitov <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Oct 26, 2022
…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.1, take #2 - Fix a bug preventing restoring an ITS containing mappings for very large and very sparse device topology - Work around a relocation handling error when compiling the nVHE object with profile optimisation
konradybcio
pushed a commit
that referenced
this issue
Oct 26, 2022
Merge series from Andy Shevchenko <[email protected]>: Currently the SPI PXA2xx devices on Intel platforms can be instantiated via the following paths: 1) as ACPI LPSS device on Haswell, Bay Trail and Cherry Trail; 2) as ACPI LPSS device on the Sky Lake and newer; 3) as PCI LPSS device on Haswell, Bay Trail and Cherry Trail; 4) as PCI LPSS device on the Sky Lake and newer; 5) as PCI device via ID table. Each of these cases provides some platform related data differently, i.e.: 1) via drivers/acpi/acpi_lpss.c and drivers/spi/spi-pxa2xx.c 2) via drivers/mfd/intel-lpss-acpi.c 3) via drivers/spi/spi-pxa2xx-pci.c 4) via drivers/mfd/intel-lpss-pci.c and drivers/spi/spi-pxa2xx.c 5) via drivers/spi/spi-pxa2xx-pci.c This approach has two downsides: a) there is no data propagated in the case #2 because we can't have two or more drivers to match the same ACPI ID and hence some cases are still not supported (Sky Lake and newer ACPI enabled LPSS); b) the data is duplicated over two drivers in the cases #1 & #4 and, besides to be a bloatware, it is error prone (e.g. Lakefield has a wrong data right now due to missed PCI entry in the spi-pxa2xx.c). This series fixes the downsides, and enables previously unsupported cases.
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
esw_attr is only allocated if namespace is fdb. BUG: KASAN: slab-out-of-bounds in parse_tc_actions+0xdc6/0x10e0 [mlx5_core] Write of size 4 at addr ffff88815f185b04 by task tc/2135 CPU: 5 PID: 2135 Comm: tc Not tainted 6.1.0-rc2+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x7d print_report+0x170/0x471 ? parse_tc_actions+0xdc6/0x10e0 [mlx5_core] kasan_report+0xbc/0xf0 ? parse_tc_actions+0xdc6/0x10e0 [mlx5_core] parse_tc_actions+0xdc6/0x10e0 [mlx5_core] Fixes: 94d6517 ("net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed") Signed-off-by: Roi Dayan <[email protected]> Reviewed-by: Maor Dickman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
…e-during-reload' Jiri Pirko says: ==================== net: devlink: move netdev notifier block to dest namespace during reload Patch #1 is just a dependency of patch #2, which is the actual fix. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
Resolve forward declarations that don't take part in type graphs comparisons if declaration name is unambiguous. Example: CU #1: struct foo; // standalone forward declaration struct foo *some_global; CU #2: struct foo { int x; }; struct foo *another_global; The `struct foo` from CU #1 is not a part of any definition that is compared against another definition while `btf_dedup_struct_types` processes structural types. The the BTF after `btf_dedup_struct_types` the BTF looks as follows: [1] STRUCT 'foo' size=4 vlen=1 ... [2] INT 'int' size=4 ... [3] PTR '(anon)' type_id=1 [4] FWD 'foo' fwd_kind=struct [5] PTR '(anon)' type_id=4 This commit adds a new pass `btf_dedup_resolve_fwds`, that maps such forward declarations to structs or unions with identical name in case if the name is not ambiguous. The pass is positioned before `btf_dedup_ref_types` so that types [3] and [5] could be merged as a same type after [1] and [4] are merged. The final result for the example above looks as follows: [1] STRUCT 'foo' size=4 vlen=1 'x' type_id=2 bits_offset=0 [2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED [3] PTR '(anon)' type_id=1 For defconfig kernel with BTF enabled this removes 63 forward declarations. Examples of removed declarations: `pt_regs`, `in6_addr`. The running time of `btf__dedup` function is increased by about 3%. Signed-off-by: Eduard Zingerman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Alan Maguire <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
Eduard Zingerman says: ==================== The patch-set is consists of the following parts: - An update for libbpf's hashmap interface from void* -> void* to a polymorphic one, allowing to use both long and void* keys and values w/o additional casts. Interface functions are hidden behind auxiliary macro that add casts as necessary. This simplifies many use cases in libbpf as hashmaps there are mostly integer to integer and required awkward looking incantations like `(void *)(long)off` previously. Also includes updates for perf as it copies map implementation from libbpf. - A change to `lib/bpf/btf.c:btf__dedup` that adds a new pass named "Resolve unambiguous forward declaration". This pass builds a hashmap `name_off -> uniquely named struct or union` and uses it to replace FWD types by structs or unions. This is necessary for corner cases when FWD is not used as a part of some struct or union definition de-duplicated by `btf_dedup_struct_types`. The goal of the patch-set is to resolve forward declarations that don't take part in type graphs comparisons if declaration name is unambiguous. Example: CU #1: struct foo; // standalone forward declaration struct foo *some_global; CU #2: struct foo { int x; }; struct foo *another_global; Currently the de-duplicated BTF for this example looks as follows: [1] STRUCT 'foo' size=4 vlen=1 ... [2] INT 'int' size=4 ... [3] PTR '(anon)' type_id=1 [4] FWD 'foo' fwd_kind=struct [5] PTR '(anon)' type_id=4 The goal of this patch-set is to simplify it as follows: [1] STRUCT 'foo' size=4 vlen=1 'x' type_id=2 bits_offset=0 [2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED [3] PTR '(anon)' type_id=1 For defconfig kernel with BTF enabled this removes 63 forward declarations. For allmodconfig kernel with BTF enabled this removes ~5K out of ~21K forward declarations in ko objects. This unlocks some additional de-duplication in ko objects, but impact is tiny: ~13K less BTF ids out of ~2M. Changelog: v3 -> v4 Changes suggested by Andrii: - hashmap interface rework to allow use of integer and pointer keys an values w/o casts as suggested in [1]. v2 -> v3 Changes suggested by Andrii: - perf's util/hashtable.{c,h} are synchronized with libbpf implementation, perf's source code updated accordingly; - changes to libbpf, bpf selftests and perf are combined in a single patch to simplify bisecting; - hashtable interface updated to be long -> long instead of uintptr_t -> uintptr_t; - btf_dedup_resolve_fwds updated to correctly use IS_ERR / PTR_ERR macro; - test cases for btf_dedup_resolve_fwds are updated for better clarity. v1 -> v2 - Style fixes in btf_dedup_resolve_fwd and btf_dedup_resolve_fwds as suggested by Alan. [v1] https://lore.kernel.org/bpf/[email protected]/ [v2] https://lore.kernel.org/bpf/[email protected]/ [v3] https://lore.kernel.org/bpf/[email protected]/ [1] https://lore.kernel.org/bpf/CAEf4BzZ8KFneEJxFAaNCCFPGqp20hSpS2aCj76uRk3-qZUH5xg@mail.gmail.com/ ==================== Signed-off-by: Andrii Nakryiko <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
Yang Yingliang says: ==================== stmmac: dwmac-loongson: fixes three leaks patch #2 fixes missing pci_disable_device() in the error path in probe() patch #1 and pach #3 fix missing pci_disable_msi() and of_node_put() in error and remove() path. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: [email protected] CC: [email protected] # 4.19+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 14, 2022
Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: [email protected] CC: [email protected] # 4.19+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Nov 15, 2022
CQE compression feature improves performance by reducing PCI bandwidth bottleneck on CQEs write. Enhanced CQE compression introduced in ConnectX-6 and it aims to reduce CPU utilization of SW side packets decompression by eliminating the need to rewrite ownership bit, which is likely to cost a cache-miss, is replaced by validity byte handled solely by HW. Another advantage of the enhanced feature is that session packets are available to SW as soon as a single CQE slot is filled, instead of waiting for session to close, this improves packet latency from NIC to host. Performance: Following are tested scenarios and reults comparing basic and enahnced CQE compression. setup: IXIA 100GbE connected directly to port 0 and port 1 of ConnectX-6 Dx 100GbE dual port. Case #1 RX only, single flow goes to single queue: IRQ rate reduced by ~ 30%, CPU utilization improved by 2%. Case #2 IP forwarding from port 1 to port 0 single flow goes to single queue: Avg latency improved from 60us to 21us, frame loss improved from 0.5% to 0.0%. Case #3 IP forwarding from port 1 to port 0 Max Throughput IXIA sends 100%, 8192 UDP flows, goes to 24 queues: Enhanced is equal or slightly better than basic. Testing the basic compression feature with this patch shows there is no perfrormance degradation of the basic compression feature. Signed-off-by: Ofer Levi <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Nov 15, 2022
Patch series "rapidio: fix three possible memory leaks". This patchset fixes three name leaks in error handling. - patch #1 fixes two name leaks while rio_add_device() fails. - patch #2 fixes a name leak while rio_register_mport() fails. This patch (of 2): If rio_add_device() returns error, the name allocated by dev_set_name() need be freed. It should use put_device() to give up the reference in the error path, so that the name can be freed in kobject_cleanup(), and the 'rdev' can be freed in rio_release_dev(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: e8de370 ("rapidio: add mport char device driver") Fixes: 1fa5ae8 ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Yang Yingliang <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Matt Porter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Nov 18, 2022
Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: [email protected] CC: [email protected] # 4.19+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Nov 18, 2022
netfslib has a number of places in which it performs iteration of an xarray whilst being under the RCU read lock. It *should* call xas_retry() as the first thing inside of the loop and do "continue" if it returns true in case the xarray walker passed out a special value indicating that the walk needs to be redone from the root[*]. Fix this by adding the missing retry checks. [*] I wonder if this should be done inside xas_find(), xas_next_node() and suchlike, but I'm told that's not an simple change to effect. This can cause an oops like that below. Note the faulting address - this is an internal value (|0x2) returned from xarray. BUG: kernel NULL pointer dereference, address: 0000000000000402 ... RIP: 0010:netfs_rreq_unlock+0xef/0x380 [netfs] ... Call Trace: netfs_rreq_assess+0xa6/0x240 [netfs] netfs_readpage+0x173/0x3b0 [netfs] ? init_wait_var_entry+0x50/0x50 filemap_read_page+0x33/0xf0 filemap_get_pages+0x2f2/0x3f0 filemap_read+0xaa/0x320 ? do_filp_open+0xb2/0x150 ? rmqueue+0x3be/0xe10 ceph_read_iter+0x1fe/0x680 [ceph] ? new_sync_read+0x115/0x1a0 new_sync_read+0x115/0x1a0 vfs_read+0xf3/0x180 ksys_read+0x5f/0xe0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Changes: ======== ver #2) - Changed an unsigned int to a size_t to reduce the likelihood of an overflow as per Willy's suggestion. - Added an additional patch to fix the maths. Fixes: 3d3c950 ("netfs: Provide readahead and readpage netfs helpers") Reported-by: George Law <[email protected]> Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Jingbo Xu <[email protected]> cc: Matthew Wilcox <[email protected]> cc: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/166749229733.107206.17482609105741691452.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/166757987929.950645.12595273010425381286.stgit@warthog.procyon.org.uk/ # v2
konradybcio
pushed a commit
that referenced
this issue
Nov 18, 2022
Patch series "rapidio: fix three possible memory leaks". This patchset fixes three name leaks in error handling. - patch #1 fixes two name leaks while rio_add_device() fails. - patch #2 fixes a name leak while rio_register_mport() fails. This patch (of 2): If rio_add_device() returns error, the name allocated by dev_set_name() need be freed. It should use put_device() to give up the reference in the error path, so that the name can be freed in kobject_cleanup(), and the 'rdev' can be freed in rio_release_dev(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: e8de370 ("rapidio: add mport char device driver") Fixes: 1fa5ae8 ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Yang Yingliang <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Matt Porter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Nov 18, 2022
The "ib_port" structure must be set before adding the sysfs kobject, and reset after removing it, otherwise it may crash when accessing the sysfs node: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000050 Mem abort info: ESR = 0x96000006 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e85f5ba5 [0000000000000050] pgd=0000000848fd9003, pud=000000085b387003, pmd=0000000000000000 Internal error: Oops: 96000006 [#2] PREEMPT SMP Modules linked in: ib_umad(O) mlx5_ib(O) nfnetlink_cttimeout(E) nfnetlink(E) act_gact(E) cls_flower(E) sch_ingress(E) openvswitch(E) nsh(E) nf_nat_ipv6(E) nf_nat_ipv4(E) nf_conncount(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) mst_pciconf(O) ipmi_devintf(E) ipmi_msghandler(E) ipmb_dev_int(OE) mlx5_core(O) mlxfw(O) mlxdevm(O) auxiliary(O) ib_uverbs(O) ib_core(O) mlx_compat(O) psample(E) sbsa_gwdt(E) uio_pdrv_genirq(E) uio(E) mlxbf_pmc(OE) mlxbf_gige(OE) mlxbf_tmfifo(OE) gpio_mlxbf2(OE) pwr_mlxbf(OE) mlx_trio(OE) i2c_mlxbf(OE) mlx_bootctl(OE) bluefield_edac(OE) knem(O) ip_tables(E) ipv6(E) crc_ccitt(E) [last unloaded: mst_pci] Process grep (pid: 3372, stack limit = 0x0000000022055c92) CPU: 5 PID: 3372 Comm: grep Tainted: G D OE 4.19.161-mlnx.47.gadcd9e3 #1 Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS BlueField:3.9.2-15-ga2403ab Sep 8 2022 pstate: 40000005 (nZcv daif -PAN -UAO) pc : hw_stat_port_show+0x4c/0x80 [ib_core] lr : port_attr_show+0x40/0x58 [ib_core] sp : ffff000029f43b50 x29: ffff000029f43b50 x28: 0000000019375000 x27: ffff8007b821a540 x26: ffff000029f43e30 x25: 0000000000008000 x24: ffff000000eaa958 x23: 0000000000001000 x22: ffff8007a4ce3000 x21: ffff8007baff8000 x20: ffff8007b9066ac0 x19: ffff8007bae97578 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : ffff8007a4ce4000 x7 : 0000000000000000 x6 : 000000000000003f x5 : ffff000000e6a280 x4 : ffff8007a4ce3000 x3 : 0000000000000000 x2 : aaaaaaaaaaaaaaab x1 : ffff8007b9066a10 x0 : ffff8007baff8000 Call trace: hw_stat_port_show+0x4c/0x80 [ib_core] port_attr_show+0x40/0x58 [ib_core] sysfs_kf_seq_show+0x8c/0x150 kernfs_seq_show+0x44/0x50 seq_read+0x1b4/0x45c kernfs_fop_read+0x148/0x1d8 __vfs_read+0x58/0x180 vfs_read+0x94/0x154 ksys_read+0x68/0xd8 __arm64_sys_read+0x28/0x34 el0_svc_common+0x88/0x18c el0_svc_handler+0x78/0x94 el0_svc+0x8/0xe8 Code: f2955562 aa1603e4 aa1503e0 f9405683 (f9402861) Fixes: d8a5883 ("RDMA/core: Replace the ib_port_data hw_stats pointers with a ib_port pointer") Signed-off-by: Mark Zhang <[email protected]> Reviewed-by: Michael Guralnik <[email protected]> Link: https://lore.kernel.org/r/88867e705c42c1cd2011e45201c25eecdb9fef94.1667810736.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Nov 18, 2022
Patch series "rapidio: fix three possible memory leaks". This patchset fixes three name leaks in error handling. - patch #1 fixes two name leaks while rio_add_device() fails. - patch #2 fixes a name leak while rio_register_mport() fails. This patch (of 2): If rio_add_device() returns error, the name allocated by dev_set_name() need be freed. It should use put_device() to give up the reference in the error path, so that the name can be freed in kobject_cleanup(), and the 'rdev' can be freed in rio_release_dev(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: e8de370 ("rapidio: add mport char device driver") Fixes: 1fa5ae8 ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Yang Yingliang <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Matt Porter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 22, 2022
Sahid Orentino Ferdjaoui says: ==================== As part of commit 93b8952 ("libbpf: deprecate legacy BPF map definitions") and commit bd05410 ("libbpf: enforce strict libbpf 1.0 behaviors") The --legacy option is not relevant anymore. #1 is removing it. #4 is cleaning the code from using libbpf_get_error(). About patches #2 and #3 They are changes discovered while working on this series (credits to Quentin Monnet). #2 is cleaning-up usage of an unnecessary PTR_ERR(NULL), finally #3 is fixing an invalid value passed to strerror(). v1 -> v2: - Addressed review comments from Yonghong Song on patch #4 - Added a patch #5 that removes unwanted function noticed by Yonghong Song v2 -> v3 - Addressed review comments from Andrii Nakryiko on patch #2, #3, #4 * clean-up usage of libbpf_get_error() (#2, #3) * fix possible return of an uninitialized local variable err * fix returned errors using errno v3 -> v4 - Addressed review comments from Quentin Monnet * fix line moved from patch #2 to patch #3 * fix missing returned errors using errno * fix some returned values to errno instead of -1 ==================== Reviewed-by: Quentin Monnet <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 22, 2022
…nel/git/at91/linux into soc/dt AT91 DT for 6.2 #2 It contains: - one typo fix for a SAMA7G5 pin; the pin is not used anywhere in the device trees. * tag 'at91-dt-6.2-2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux: ARM: dts: at91: sama7g5: fix signal name of pin PD8 Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 22, 2022
…kernel/git/at91/linux into arm/fixes AT91 fixes for 6.1 #2 It contains: - fix UDC on at91sam9g20ek boards by adding vbus pin * tag 'at91-fixes-6.1-2' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux: ARM: dts: at91: sam9g20ek: enable udc vbus gpio pinctrl Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
Haxk20
pushed a commit
that referenced
this issue
Nov 22, 2022
Patch series "rapidio: fix three possible memory leaks". This patchset fixes three name leaks in error handling. - patch #1 fixes two name leaks while rio_add_device() fails. - patch #2 fixes a name leak while rio_register_mport() fails. This patch (of 2): If rio_add_device() returns error, the name allocated by dev_set_name() need be freed. It should use put_device() to give up the reference in the error path, so that the name can be freed in kobject_cleanup(), and the 'rdev' can be freed in rio_release_dev(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: e8de370 ("rapidio: add mport char device driver") Fixes: 1fa5ae8 ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Yang Yingliang <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Matt Porter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Dec 7, 2022
Matt reported a splat at msk close time: BUG: sleeping function called from invalid context at net/mptcp/protocol.c:2877 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 155, name: packetdrill preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by packetdrill/155: #0: ffff888001536990 (&sb->s_type->i_mutex_key#6){+.+.}-{3:3}, at: __sock_release (net/socket.c:650) #1: ffff88800b498130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) #2: ffff88800b49a130 (sk_lock-AF_INET/1){+.+.}-{0:0}, at: __mptcp_close_ssk (net/mptcp/protocol.c:2363) #3: ffff88800b49a0b0 (slock-AF_INET){+...}-{2:2}, at: __lock_sock_fast (include/net/sock.h:1820) Preemption disabled at: 0x0 CPU: 1 PID: 155 Comm: packetdrill Not tainted 6.1.0-rc5 torvalds#365 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) __might_resched.cold (kernel/sched/core.c:9891) __mptcp_destroy_sock (include/linux/kernel.h:110) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_subflow_queue_clean (include/net/sock.h:1777) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) We can't call mptcp_close under the 'fast' socket lock variant, replace it with a sock_lock_nested() as the relevant code is already under the listening msk socket lock protection. Reported-by: Matthieu Baerts <[email protected]> Closes: multipath-tcp/mptcp_net-next#316 Fixes: 30e51b9 ("mptcp: fix unreleased socket in accept queue") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Dec 7, 2022
QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in address translation service (ATS). These devices may inadvertently issue ATS invalidation completion before posted writes initiated with translated address that utilized translations matching the invalidation address range, violating the invalidation completion ordering. This patch adds an extra device TLB invalidation for the affected devices, it is needed to ensure no more posted writes with translated address following the invalidation completion. Therefore, the ordering is preserved and data-corruption is prevented. Device TLBs are invalidated under the following six conditions: 1. Device driver does DMA API unmap IOVA 2. Device driver unbind a PASID from a process, sva_unbind_device() 3. PASID is torn down, after PASID cache is flushed. e.g. process exit_mmap() due to crash 4. Under SVA usage, called by mmu_notifier.invalidate_range() where VM has to free pages that were unmapped 5. userspace driver unmaps a DMA buffer 6. Cache invalidation in vSVA usage (upcoming) For #1 and #2, device drivers are responsible for stopping DMA traffic before unmap/unbind. For #3, iommu driver gets mmu_notifier to invalidate TLB the same way as normal user unmap which will do an extra invalidation. The dTLB invalidation after PASID cache flush does not need an extra invalidation. Therefore, we only need to deal with #4 and #5 in this patch. #1 is also covered by this patch due to common code path with #5. Tested-by: Yuzhang Luo <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Dec 14, 2022
Fix some checker warnings in the trace code by adding __printf attributes to a number of trace functions and their declarations. Changes: ======== ver #2) - Dropped the fix for the unconditional tracing_max_lat_fops decl[1]. Link: https://lore.kernel.org/r/[email protected]/ [1] Link: https://lore.kernel.org/r/166992525941.1716618.13740663757583361463.stgit@warthog.procyon.org.uk/ # v1 Link: https://lkml.kernel.org/r/167023571258.382307.15314866482834835192.stgit@warthog.procyon.org.uk Signed-off-by: David Howells <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Dec 14, 2022
…g the sock There is a race condition in vxlan that when deleting a vxlan device during receiving packets, there is a possibility that the sock is released after getting vxlan_sock vs from sk_user_data. Then in later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got NULL pointer dereference. e.g. #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757 #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48 #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542 torvalds#6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62 [exception RIP: vxlan_ecn_decapsulate+0x3b] RIP: ffffffffc1014e7b RSP: ffffa25ec6978cb0 RFLAGS: 00010246 RAX: 0000000000000008 RBX: ffff8aa000888000 RCX: 0000000000000000 RDX: 000000000000000e RSI: ffff8a9fc7ab803e RDI: ffff8a9fd1168700 RBP: ffff8a9fc7ab803e R8: 0000000000700000 R9: 00000000000010ae R10: ffff8a9fcb748980 R11: 0000000000000000 R12: ffff8a9fd1168700 R13: ffff8aa000888000 R14: 00000000002a0000 R15: 00000000000010ae ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 torvalds#7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan] torvalds#8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507 torvalds#9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45 torvalds#10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807 torvalds#11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951 torvalds#12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde torvalds#13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b torvalds#14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139 torvalds#15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a torvalds#16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3 torvalds#17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca torvalds#18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3 Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh Fix this by waiting for all sk_user_data reader to finish before releasing the sock. Reported-by: Jianlin Shi <[email protected]> Suggested-by: Jakub Sitnicki <[email protected]> Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs") Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
May 7, 2024
Merge series from Jerome Brunet <[email protected]>: This patchset fixes 2 problems on TDM which both find a solution by properly implementing the .trigger() callback for the TDM backend. ATM, enabling the TDM formatters is done by the .prepare() callback because handling the formatter is slow due to necessary calls to CCF. The first problem affects the TDMIN. Because .prepare() is called on DPCM backend first, the formatter are started before the FIFOs and this may cause a random channel shifts if the TDMIN use multiple lanes with more than 2 slots per lanes. Using trigger() allows to set the FE/BE order, solving the problem. There has already been an attempt to fix this 3y ago [1] and reverted [2] It triggered a 'sleep in irq' error on the period IRQ. The solution is to just use the bottom half of threaded IRQ. This is patch #1. Patch #2 and #3 remain mostly the same as 3y ago. For TDMOUT, the problem is on pause. ATM pause only stops the FIFO and the TDMOUT just starves. When it does, it will actually repeat the last sample continuously. Depending on the platform, if there is no high-pass filter on the analog path, this may translate to a constant position of the speaker membrane. There is no audible glitch but it may damage the speaker coil. Properly stopping the TDMOUT in pause solves the problem. There is behaviour change associated with that fix. Clocks used to be continuous on pause because of the problem above. They will now be gated on pause by default, as they should. The last change introduce the proper support for continuous clocks, if needed. [1]: https://lore.kernel.org/linux-amlogic/[email protected] [2]: https://lore.kernel.org/linux-amlogic/[email protected]
konradybcio
pushed a commit
that referenced
this issue
May 10, 2024
ui_browser__show() is capturing the input title that is stack allocated memory in hist_browser__run(). Avoid a use after return by strdup-ing the string. Committer notes: Further explanation from Ian Rogers: My command line using tui is: $ sudo bash -c 'rm /tmp/asan.log*; export ASAN_OPTIONS="log_path=/tmp/asan.log"; /tmp/perf/perf mem record -a sleep 1; /tmp/perf/perf mem report' I then go to the perf annotate view and quit. This triggers the asan error (from the log file): ``` ==1254591==ERROR: AddressSanitizer: stack-use-after-return on address 0x7f2813331920 at pc 0x7f28180 65991 bp 0x7fff0a21c750 sp 0x7fff0a21bf10 READ of size 80 at 0x7f2813331920 thread T0 #0 0x7f2818065990 in __interceptor_strlen ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:461 #1 0x7f2817698251 in SLsmg_write_wrapped_string (/lib/x86_64-linux-gnu/libslang.so.2+0x98251) #2 0x7f28176984b9 in SLsmg_write_nstring (/lib/x86_64-linux-gnu/libslang.so.2+0x984b9) #3 0x55c94045b365 in ui_browser__write_nstring ui/browser.c:60 #4 0x55c94045c558 in __ui_browser__show_title ui/browser.c:266 #5 0x55c94045c776 in ui_browser__show ui/browser.c:288 torvalds#6 0x55c94045c06d in ui_browser__handle_resize ui/browser.c:206 torvalds#7 0x55c94047979b in do_annotate ui/browsers/hists.c:2458 torvalds#8 0x55c94047fb17 in evsel__hists_browse ui/browsers/hists.c:3412 torvalds#9 0x55c940480a0c in perf_evsel_menu__run ui/browsers/hists.c:3527 torvalds#10 0x55c940481108 in __evlist__tui_browse_hists ui/browsers/hists.c:3613 torvalds#11 0x55c9404813f7 in evlist__tui_browse_hists ui/browsers/hists.c:3661 torvalds#12 0x55c93ffa253f in report__browse_hists tools/perf/builtin-report.c:671 torvalds#13 0x55c93ffa58ca in __cmd_report tools/perf/builtin-report.c:1141 torvalds#14 0x55c93ffaf159 in cmd_report tools/perf/builtin-report.c:1805 torvalds#15 0x55c94000c05c in report_events tools/perf/builtin-mem.c:374 torvalds#16 0x55c94000d96d in cmd_mem tools/perf/builtin-mem.c:516 torvalds#17 0x55c9400e44ee in run_builtin tools/perf/perf.c:350 torvalds#18 0x55c9400e4a5a in handle_internal_command tools/perf/perf.c:403 torvalds#19 0x55c9400e4e22 in run_argv tools/perf/perf.c:447 torvalds#20 0x55c9400e53ad in main tools/perf/perf.c:561 torvalds#21 0x7f28170456c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 torvalds#22 0x7f2817045784 in __libc_start_main_impl ../csu/libc-start.c:360 torvalds#23 0x55c93ff544c0 in _start (/tmp/perf/perf+0x19a4c0) (BuildId: 84899b0e8c7d3a3eaa67b2eb35e3d8b2f8cd4c93) Address 0x7f2813331920 is located in stack of thread T0 at offset 32 in frame #0 0x55c94046e85e in hist_browser__run ui/browsers/hists.c:746 This frame has 1 object(s): [32, 192) 'title' (line 747) <== Memory access at offset 32 is inside this variable HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork ``` hist_browser__run isn't on the stack so the asan error looks legit. There's no clean init/exit on struct ui_browser so I may be trading a use-after-return for a memory leak, but that seems look a good trade anyway. Fixes: 05e8b08 ("perf ui browser: Stop using 'self'") Signed-off-by: Ian Rogers <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Athira Rajeev <[email protected]> Cc: Ben Gainey <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kajol Jain <[email protected]> Cc: Kan Liang <[email protected]> Cc: K Prateek Nayak <[email protected]> Cc: Li Dong <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Paran Lee <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Sun Haiyong <[email protected]> Cc: Tim Chen <[email protected]> Cc: Yanteng Si <[email protected]> Cc: Yicong Yang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
May 10, 2024
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> [Bug #2] Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Kees Cook <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
May 23, 2024
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Xu <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Jann Horn <[email protected]> [Bug #2] Cc: Jeff Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Jorge Lucangeli Obes <[email protected]> Cc: Kees Cook <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Stephen Röttger <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Amer Al Shanawany <[email protected]> Cc: Javier Carrasco <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
May 29, 2024
In dctcp_update_alpha(), we use a module parameter dctcp_shift_g as follows: alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g); ... delivered_ce <<= (10 - dctcp_shift_g); It seems syzkaller started fuzzing module parameters and triggered shift-out-of-bounds [0] by setting 100 to dctcp_shift_g: memcpy((void*)0x20000080, "/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul, /*flags=*/2ul, /*mode=*/0ul); memcpy((void*)0x20000000, "100\000", 4); syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul); Let's limit the max value of dctcp_shift_g by param_set_uint_minmax(). With this patch: # echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g # cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g 10 # echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g -bash: echo: write error: Invalid argument [0]: UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12 shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int') CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f3561 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114 ubsan_epilogue lib/ubsan.c:231 [inline] __ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468 dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143 tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline] tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948 tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711 tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937 sk_backlog_rcv include/net/sock.h:1106 [inline] __release_sock+0x20f/0x350 net/core/sock.c:2983 release_sock+0x61/0x1f0 net/core/sock.c:3549 mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907 mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976 __mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072 mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127 inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437 __sock_release net/socket.c:659 [inline] sock_close+0xc0/0x240 net/socket.c:1421 __fput+0x41b/0x890 fs/file_table.c:422 task_work_run+0x23b/0x300 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x9c8/0x2540 kernel/exit.c:878 do_group_exit+0x201/0x2b0 kernel/exit.c:1027 __do_sys_exit_group kernel/exit.c:1038 [inline] __se_sys_exit_group kernel/exit.c:1036 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x7f6c2b5005b6 Code: Unable to access opcode bytes at 0x7f6c2b50058c. RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6 RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001 RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0 R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Reported-by: syzkaller <[email protected]> Reported-by: Yue Sun <[email protected]> Reported-by: xingwei lee <[email protected]> Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/ Fixes: e3118e8 ("net: tcp: add DCTCP congestion control algorithm") Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 9, 2024
This adds a check before freeing the rx->skb in flush and close functions to handle the kernel crash seen while removing driver after FW download fails or before FW download completes. dmesg log: [ 54.634586] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080 [ 54.643398] Mem abort info: [ 54.646204] ESR = 0x0000000096000004 [ 54.649964] EC = 0x25: DABT (current EL), IL = 32 bits [ 54.655286] SET = 0, FnV = 0 [ 54.658348] EA = 0, S1PTW = 0 [ 54.661498] FSC = 0x04: level 0 translation fault [ 54.666391] Data abort info: [ 54.669273] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 54.674768] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 54.674771] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 54.674775] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000048860000 [ 54.674780] [0000000000000080] pgd=0000000000000000, p4d=0000000000000000 [ 54.703880] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 54.710152] Modules linked in: btnxpuart(-) overlay fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine authenc libdes crct10dif_ce polyval_ce polyval_generic snd_soc_imx_spdif snd_soc_imx_card snd_soc_ak5558 snd_soc_ak4458 caam secvio error snd_soc_fsl_micfil snd_soc_fsl_spdif snd_soc_fsl_sai snd_soc_fsl_utils imx_pcm_dma gpio_ir_recv rc_core sch_fq_codel fuse [ 54.744357] CPU: 3 PID: 72 Comm: kworker/u9:0 Not tainted 6.6.3-otbr-g128004619037 #2 [ 54.744364] Hardware name: FSL i.MX8MM EVK board (DT) [ 54.744368] Workqueue: hci0 hci_power_on [ 54.757244] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 54.757249] pc : kfree_skb_reason+0x18/0xb0 [ 54.772299] lr : btnxpuart_flush+0x40/0x58 [btnxpuart] [ 54.782921] sp : ffff8000805ebca0 [ 54.782923] x29: ffff8000805ebca0 x28: ffffa5c6cf1869c0 x27: ffffa5c6cf186000 [ 54.782931] x26: ffff377b84852400 x25: ffff377b848523c0 x24: ffff377b845e7230 [ 54.782938] x23: ffffa5c6ce8dbe08 x22: ffffa5c6ceb65410 x21: 00000000ffffff92 [ 54.782945] x20: ffffa5c6ce8dbe98 x19: ffffffffffffffac x18: ffffffffffffffff [ 54.807651] x17: 0000000000000000 x16: ffffa5c6ce2824ec x15: ffff8001005eb857 [ 54.821917] x14: 0000000000000000 x13: ffffa5c6cf1a02e0 x12: 0000000000000642 [ 54.821924] x11: 0000000000000040 x10: ffffa5c6cf19d690 x9 : ffffa5c6cf19d688 [ 54.821931] x8 : ffff377b86000028 x7 : 0000000000000000 x6 : 0000000000000000 [ 54.821938] x5 : ffff377b86000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 54.843331] x2 : 0000000000000000 x1 : 0000000000000002 x0 : ffffffffffffffac [ 54.857599] Call trace: [ 54.857601] kfree_skb_reason+0x18/0xb0 [ 54.863878] btnxpuart_flush+0x40/0x58 [btnxpuart] [ 54.863888] hci_dev_open_sync+0x3a8/0xa04 [ 54.872773] hci_power_on+0x54/0x2e4 [ 54.881832] process_one_work+0x138/0x260 [ 54.881842] worker_thread+0x32c/0x438 [ 54.881847] kthread+0x118/0x11c [ 54.881853] ret_from_fork+0x10/0x20 [ 54.896406] Code: a9be7bfd 910003fd f9000bf3 aa0003f3 (b940d400) [ 54.896410] ---[ end trace 0000000000000000 ]--- Signed-off-by: Neeraj Sanjay Kale <[email protected]> Tested-by: Guillaume Legoupil <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
Add a set of tests to validate that stack traces captured from or in the presence of active uprobes and uretprobes are valid and complete. For this we use BPF program that are installed either on entry or exit of user function, plus deep-nested USDT. One of target funtions (target_1) is recursive to generate two different entries in the stack trace for the same uprobe/uretprobe, testing potential edge conditions. If there is no fixes, we get something like this for one of the scenarios: caller: 0x758fff - 0x7595ab target_1: 0x758fd5 - 0x758fff target_2: 0x758fca - 0x758fd5 target_3: 0x758fbf - 0x758fca target_4: 0x758fb3 - 0x758fbf ENTRY #0: 0x758fb3 (in target_4) ENTRY #1: 0x758fd3 (in target_2) ENTRY #2: 0x758ffd (in target_1) ENTRY #3: 0x7fffffffe000 ENTRY #4: 0x7fffffffe000 ENTRY #5: 0x6f8f39 ENTRY torvalds#6: 0x6fa6f0 ENTRY torvalds#7: 0x7f403f229590 Entry #3 and #4 (0x7fffffffe000) are uretprobe trampoline addresses which obscure actual target_1 and another target_1 invocations. Also note that between entry #0 and entry #1 we are missing an entry for target_3. With fixes, we get desired full stack traces: caller: 0x758fff - 0x7595ab target_1: 0x758fd5 - 0x758fff target_2: 0x758fca - 0x758fd5 target_3: 0x758fbf - 0x758fca target_4: 0x758fb3 - 0x758fbf ENTRY #0: 0x758fb7 (in target_4) ENTRY #1: 0x758fc8 (in target_3) ENTRY #2: 0x758fd3 (in target_2) ENTRY #3: 0x758ffd (in target_1) ENTRY #4: 0x758ff3 (in target_1) ENTRY #5: 0x75922c (in caller) ENTRY torvalds#6: 0x6f8f39 ENTRY torvalds#7: 0x6fa6f0 ENTRY torvalds#8: 0x7f986adc4cd0 Now there is a logical and complete sequence of function calls. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: Patch #1 fixes a bogus WARN_ON splat in nfnetlink_queue. Patch #2 fixes a crash due to stack overflow in chain loop detection by using the existing chain validation routines Both patches from Florian Westphal. netfilter pull request 24-07-11 * tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: prefer nft_chain_validate netfilter: nfnetlink_queue: drop bogus WARN_ON ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reported-by: [email protected] Fixes: 956a17d ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
Patch series "ioctl()-based API to query VMAs from /proc/<pid>/maps", v6. Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow applications to query VMA information more efficiently than reading *all* VMAs nonselectively through text-based interface of /proc/<pid>/maps file. Patch #2 goes into a lot of details and background on some common patterns of using /proc/<pid>/maps in the area of performance profiling and subsequent symbolization of captured stack traces. As mentioned in that patch, patterns of VMA querying can differ depending on specific use case, but can generally be grouped into two main categories: the need to query a small subset of VMAs covering a given batch of addresses, or reading/storing/caching all (typically, executable) VMAs upfront for later processing. The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the former pattern of usage. Earlier revisions had a patch adding a tool that faithfully reproduces an efficient VMA matching pass of a symbolizer, collecting a subset of covering VMAs for a given set of addresses as efficiently as possible. This tool served both as a testing ground, as well as a benchmarking tool. It implements everything both for currently existing text-based /proc/<pid>/maps interface, as well as for newly-added PROCMAP_QUERY ioctl(). This revision dropped the tool from the patch set and, once the API lands upstream, this tool might be added separately on Github as an example. Based on discussion on earlier revisions of this patch set, it turned out that this ioctl() API is competitive with highly-optimized text-based pre-processing pattern that perf tool is using. Based on perf discussion, this revision adds more flexibility in specifying a subset of VMAs that are of interest. Now it's possible to specify desired permissions of VMAs (e.g., request only executable ones) and/or restrict to only a subset of VMAs that have file backing. This further improves the efficiency when using this new API thanks to more selective (executable VMAs only) querying. In addition to a custom benchmarking tool, and experimental perf integration (available at [0]), Daniel Mueller has since also implemented an experimental integration into blazesym (see [1]), a library used for stack trace symbolization by our server fleet-wide profiler and another on-device profiler agent that runs on weaker ARM devices. The latter ARM-based device profiler is especially sensitive to performance, and so we benchmarked and compared text-based /proc/<pid>/maps solution to the equivalent one using PROCMAP_QUERY ioctl(). Results are very encouraging, giving us 5x improvement for end-to-end so-called "address normalization" pass, which is the part of the symbolization process that happens locally on ARM device, before being sent out for further heavier-weight processing on more powerful remote server. Note that this is not an artificial microbenchmark. It's a full end-to-end API call being measured with real-world data on real-world device. TEXT-BASED ========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [49.777 µs 49.982 µs 50.250 µs] IOCTL-BASED =========== Benchmarking main/normalize_process_no_build_ids_uncached_maps main/normalize_process_no_build_ids_uncached_maps time: [10.328 µs 10.391 µs 10.457 µs] change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.02) Performance has improved. You can see above that we see the drop from 50µs down to 10µs for exactly the same amount of work, with the same data and target process. With the aforementioned custom tool, we see about ~40x improvement (it might vary a bit, depending on a specific captured set of addresses). And even for perf-based benchmark it's on par or slightly ahead when using permission-based filtering (fetching only executable VMAs). Earlier revisions attempted to use per-VMA locking, if kernel was compiled with CONFIG_PER_VMA_LOCK=y, but it turned out that anon_vma_name() is not yet compatible with per-VMA locking and assumes mmap_lock to be taken, which makes the use of per-VMA locking for this API premature. It was agreed ([2]) to continue for now with just mmap_lock, but the code structure is such that it should be easy to add per-VMA locking support once all the pieces are ready. One thing that did not change was basing this new API as an ioctl() command on /proc/<pid>/maps file. An ioctl-based API on top of pidfd was considered, but has its own downsides. Implementing ioctl() directly on pidfd will cause access permission checks on every single ioctl(), which leads to performance concerns and potential spam of capable() audit messages. It also prevents a nice pattern, possible with /proc/<pid>/maps, in which application opens /proc/self/maps FD (requiring no additional capabilities) and passed this FD to profiling agent for querying. To achieve similar pattern, a new file would have to be created from pidf just for VMA querying, which is considered to be inferior to just querying /proc/<pid>/maps FD as proposed in current approach. These aspects were discussed in the hallway track at recent LSF/MM/BPF 2024 and sticking to procfs ioctl() was the final agreement we arrived at. [0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/ [1] libbpf/blazesym#675 [2] https://lore.kernel.org/bpf/7rm3izyq2vjp5evdjc7c6z4crdd3oerpiknumdnmmemwyiwx7t@hleldw7iozi3/ This patch (of 6): Extract generic logic to fetch relevant pieces of data to describe VMA name. This could be just some string (either special constant or user-provided), or a string with some formatted wrapping text (e.g., "[anon_shmem:<something>]"), or, commonly, file path. seq_file-based logic has different methods to handle all three cases, but they are currently mixed in with extracting underlying sources of data. This patch splits this into data fetching and data formatting, so that data fetching can be reused later on. There should be no functional changes. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
This adds a check before freeing the rx->skb in flush and close functions to handle the kernel crash seen while removing driver after FW download fails or before FW download completes. dmesg log: [ 54.634586] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080 [ 54.643398] Mem abort info: [ 54.646204] ESR = 0x0000000096000004 [ 54.649964] EC = 0x25: DABT (current EL), IL = 32 bits [ 54.655286] SET = 0, FnV = 0 [ 54.658348] EA = 0, S1PTW = 0 [ 54.661498] FSC = 0x04: level 0 translation fault [ 54.666391] Data abort info: [ 54.669273] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 54.674768] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 54.674771] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 54.674775] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000048860000 [ 54.674780] [0000000000000080] pgd=0000000000000000, p4d=0000000000000000 [ 54.703880] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 54.710152] Modules linked in: btnxpuart(-) overlay fsl_jr_uio caam_jr caamkeyblob_desc caamhash_desc caamalg_desc crypto_engine authenc libdes crct10dif_ce polyval_ce polyval_generic snd_soc_imx_spdif snd_soc_imx_card snd_soc_ak5558 snd_soc_ak4458 caam secvio error snd_soc_fsl_micfil snd_soc_fsl_spdif snd_soc_fsl_sai snd_soc_fsl_utils imx_pcm_dma gpio_ir_recv rc_core sch_fq_codel fuse [ 54.744357] CPU: 3 PID: 72 Comm: kworker/u9:0 Not tainted 6.6.3-otbr-g128004619037 #2 [ 54.744364] Hardware name: FSL i.MX8MM EVK board (DT) [ 54.744368] Workqueue: hci0 hci_power_on [ 54.757244] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 54.757249] pc : kfree_skb_reason+0x18/0xb0 [ 54.772299] lr : btnxpuart_flush+0x40/0x58 [btnxpuart] [ 54.782921] sp : ffff8000805ebca0 [ 54.782923] x29: ffff8000805ebca0 x28: ffffa5c6cf1869c0 x27: ffffa5c6cf186000 [ 54.782931] x26: ffff377b84852400 x25: ffff377b848523c0 x24: ffff377b845e7230 [ 54.782938] x23: ffffa5c6ce8dbe08 x22: ffffa5c6ceb65410 x21: 00000000ffffff92 [ 54.782945] x20: ffffa5c6ce8dbe98 x19: ffffffffffffffac x18: ffffffffffffffff [ 54.807651] x17: 0000000000000000 x16: ffffa5c6ce2824ec x15: ffff8001005eb857 [ 54.821917] x14: 0000000000000000 x13: ffffa5c6cf1a02e0 x12: 0000000000000642 [ 54.821924] x11: 0000000000000040 x10: ffffa5c6cf19d690 x9 : ffffa5c6cf19d688 [ 54.821931] x8 : ffff377b86000028 x7 : 0000000000000000 x6 : 0000000000000000 [ 54.821938] x5 : ffff377b86000000 x4 : 0000000000000000 x3 : 0000000000000000 [ 54.843331] x2 : 0000000000000000 x1 : 0000000000000002 x0 : ffffffffffffffac [ 54.857599] Call trace: [ 54.857601] kfree_skb_reason+0x18/0xb0 [ 54.863878] btnxpuart_flush+0x40/0x58 [btnxpuart] [ 54.863888] hci_dev_open_sync+0x3a8/0xa04 [ 54.872773] hci_power_on+0x54/0x2e4 [ 54.881832] process_one_work+0x138/0x260 [ 54.881842] worker_thread+0x32c/0x438 [ 54.881847] kthread+0x118/0x11c [ 54.881853] ret_from_fork+0x10/0x20 [ 54.896406] Code: a9be7bfd 910003fd f9000bf3 aa0003f3 (b940d400) [ 54.896410] ---[ end trace 0000000000000000 ]--- Signed-off-by: Neeraj Sanjay Kale <[email protected]> Tested-by: Guillaume Legoupil <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
When tries to demote 1G hugetlb folios, a lockdep warning is observed: ============================================ WARNING: possible recursive locking detected 6.10.0-rc6-00452-ga4d0275fa660-dirty torvalds#79 Not tainted -------------------------------------------- bash/710 is trying to acquire lock: ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460 but task is already holding lock: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&h->resize_lock); lock(&h->resize_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by bash/710: #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 stack backtrace: CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty torvalds#79 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 __lock_acquire+0x10f2/0x1ca0 lock_acquire+0xbe/0x2d0 __mutex_lock+0x6d/0x400 demote_store+0x244/0x460 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xb9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa61db14887 RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887 RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001 RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00 </TASK> Lockdep considers this an AA deadlock because the different resize_lock mutexes reside in the same lockdep class, but this is a false positive. Place them in distinct classes to avoid these warnings. Link: https://lkml.kernel.org/r/[email protected] Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 19, 2024
When tries to demote 1G hugetlb folios, a lockdep warning is observed: ============================================ WARNING: possible recursive locking detected 6.10.0-rc6-00452-ga4d0275fa660-dirty torvalds#79 Not tainted -------------------------------------------- bash/710 is trying to acquire lock: ffffffff8f0a7850 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0x244/0x460 but task is already holding lock: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&h->resize_lock); lock(&h->resize_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by bash/710: #0: ffff8f118439c3f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 #1: ffff8f11893b9e88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 #2: ffff8f1183dc4428 (kn->active#98){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 #3: ffffffff8f0a6f48 (&h->resize_lock){+.+.}-{3:3}, at: demote_store+0xae/0x460 stack backtrace: CPU: 3 PID: 710 Comm: bash Not tainted 6.10.0-rc6-00452-ga4d0275fa660-dirty torvalds#79 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 __lock_acquire+0x10f2/0x1ca0 lock_acquire+0xbe/0x2d0 __mutex_lock+0x6d/0x400 demote_store+0x244/0x460 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xb9/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa61db14887 RSP: 002b:00007ffc56c48358 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa61db14887 RDX: 0000000000000002 RSI: 000055a030050220 RDI: 0000000000000001 RBP: 000055a030050220 R08: 00007fa61dbd1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 00007fa61dc1b780 R14: 00007fa61dc17600 R15: 00007fa61dc16a00 </TASK> Lockdep considers this an AA deadlock because the different resize_lock mutexes reside in the same lockdep class, but this is a false positive. Place them in distinct classes to avoid these warnings. Link: https://lkml.kernel.org/r/[email protected] Fixes: 8531fc6 ("hugetlb: add hugetlb demote page support") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 25, 2024
…/git/gregkh/tty Pull tty / serial updates from Greg KH: "Here is a small set of tty and serial driver updates for 6.11-rc1. Not much happened this cycle, unlike the previous kernel release which had lots of "excitement" in this part of the kernel. Included in here are the following changes: - dt binding updates for new platforms - 8250 driver updates - various small serial driver fixes and updates - printk/console naming and matching attempt #2 (was reverted for 6.10-final, should be good to go this time around, acked by the relevant maintainers). All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (22 commits) Documentation: kernel-parameters: Add DEVNAME:0.0 format for serial ports serial: core: Add serial_base_match_and_update_preferred_console() printk: Add match_devname_and_update_preferred_console() serial: sc16is7xx: hardware reset chip if reset-gpios is defined in DT dt-bindings: serial: sc16is7xx: add reset-gpios dt-bindings: serial: vt8500-uart: convert to json-schema serial: 8250_platform: Explicitly show we initialise ISA ports only once tty: add missing MODULE_DESCRIPTION() macros dt-bindings: serial: mediatek,uart: add MT7988 serial: sh-sci: Add support for RZ/V2H(P) SoC dt-bindings: serial: Add documentation for Renesas RZ/V2H(P) (R9A09G057) SCIF support dt-bindings: serial: renesas,scif: Make 'interrupt-names' property as required dt-bindings: serial: renesas,scif: Validate 'interrupts' and 'interrupt-names' dt-bindings: serial: renesas,scif: Move ref for serial.yaml at the end riscv: dts: starfive: jh7110: Add the core reset and jh7110 compatible for uarts serial: 8250_dw: Use reset array API to get resets dt-bindings: serial: snps-dw-apb-uart: Add one more reset signal for StarFive JH7110 SoC serial: 8250: Extract platform driver serial: 8250: Extract RSA bits serial: imx: stop casting struct uart_port to struct imx_port ...
konradybcio
pushed a commit
that referenced
this issue
Jul 25, 2024
In z_erofs_get_gbuf(), the current task may be migrated to another CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`. Therefore, z_erofs_put_gbuf() will trigger the following issue which was found by stress test: <2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58! .. <4>[772156.435007] <4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G E 6.10.0-rc7+ #2 <4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017 <4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) <4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs] .. <6>[772156.445958] stress (3127): drop_caches: 1 <4>[772156.446120] Call trace: <4>[772156.446121] z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.446761] z_erofs_lz4_decompress+0x600/0x6a0 [erofs] <4>[772156.446897] z_erofs_decompress_queue+0x740/0xa10 [erofs] <4>[772156.447036] z_erofs_runqueue+0x428/0x8c0 [erofs] <4>[772156.447160] z_erofs_readahead+0x224/0x390 [erofs] .. Fixes: f36f301 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable") Cc: <[email protected]> # 6.10+ Cc: Chunhai Guo <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
konradybcio
pushed a commit
that referenced
this issue
Jul 29, 2024
When using cachefiles, lockdep may emit something similar to the circular locking dependency notice below. The problem appears to stem from the following: (1) Cachefiles manipulates xattrs on the files in its cache when called from ->writepages(). (2) The setxattr() and removexattr() system call handlers get the name (and value) from userspace after taking the sb_writers lock, putting accesses of the vma->vm_lock and mm->mmap_lock inside of that. (3) The afs filesystem uses a per-inode lock to prevent multiple revalidation RPCs and in writeback vs truncate to prevent parallel operations from deadlocking against the server on one side and local page locks on the other. Fix this by moving the getting of the name and value in {get,remove}xattr() outside of the sb_writers lock. This also has the minor benefits that we don't need to reget these in the event of a retry and we never try to take the sb_writers lock in the event we can't pull the name and value into the kernel. Alternative approaches that might fix this include moving the dispatch of a write to the cache off to a workqueue or trying to do without the validation lock in afs. Note that this might also affect other filesystems that use netfslib and/or cachefiles. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-build2+ torvalds#956 Not tainted ------------------------------------------------------ fsstress/6050 is trying to acquire lock: ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0 but task is already holding lock: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&vma->vm_lock->lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_write+0x3b/0x50 vma_start_write+0x6b/0xa0 vma_link+0xcc/0x140 insert_vm_struct+0xb7/0xf0 alloc_bprm+0x2c1/0x390 kernel_execve+0x65/0x1a0 call_usermodehelper_exec_async+0x14d/0x190 ret_from_fork+0x24/0x40 ret_from_fork_asm+0x1a/0x30 -> #3 (&mm->mmap_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 __might_fault+0x7c/0xb0 strncpy_from_user+0x25/0x160 removexattr+0x7f/0x100 __do_sys_fremovexattr+0x7e/0xb0 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (sb_writers#14){.+.+}-{0:0}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 percpu_down_read+0x3c/0x90 vfs_iocb_iter_write+0xe9/0x1d0 __cachefiles_write+0x367/0x430 cachefiles_issue_write+0x299/0x2f0 netfs_advance_write+0x117/0x140 netfs_write_folio.isra.0+0x5ca/0x6e0 netfs_writepages+0x230/0x2f0 afs_writepages+0x4d/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 __filemap_fdatawrite_range+0xa8/0xf0 file_write_and_wait_range+0x59/0x90 afs_release+0x10f/0x270 __fput+0x25f/0x3d0 __do_sys_close+0x43/0x70 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&vnode->validate_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_read+0x95/0x200 afs_writepages+0x37/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 filemap_invalidate_inode+0x167/0x1e0 netfs_unbuffered_write_iter+0x1bd/0x2d0 vfs_write+0x22e/0x320 ksys_write+0xbc/0x130 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (mapping.invalidate_lock#3){++++}-{3:3}: check_noncircular+0x119/0x160 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_read+0x95/0x200 filemap_fault+0x26e/0x8b0 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 other info that might help us debug this: Chain exists of: mapping.invalidate_lock#3 --> &mm->mmap_lock --> &vma->vm_lock->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(&vma->vm_lock->lock); lock(&mm->mmap_lock); lock(&vma->vm_lock->lock); rlock(mapping.invalidate_lock#3); *** DEADLOCK *** 1 lock held by fsstress/6050: #0: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 stack backtrace: CPU: 0 PID: 6050 Comm: fsstress Not tainted 6.10.0-build2+ torvalds#956 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x80 check_noncircular+0x119/0x160 ? queued_spin_lock_slowpath+0x4be/0x510 ? __pfx_check_noncircular+0x10/0x10 ? __pfx_queued_spin_lock_slowpath+0x10/0x10 ? mark_lock+0x47/0x160 ? init_chain_block+0x9c/0xc0 ? add_chain_block+0x84/0xf0 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 ? __pfx___lock_acquire+0x10/0x10 ? __lock_release.isra.0+0x13b/0x230 lock_acquire.part.0+0x103/0x280 ? filemap_fault+0x26e/0x8b0 ? __pfx_lock_acquire.part.0+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? lock_acquire+0xd7/0x120 down_read+0x95/0x200 ? filemap_fault+0x26e/0x8b0 ? __pfx_down_read+0x10/0x10 ? __filemap_get_folio+0x25/0x1a0 filemap_fault+0x26e/0x8b0 ? __pfx_filemap_fault+0x10/0x10 ? find_held_lock+0x7c/0x90 ? __pfx___lock_release.isra.0+0x10/0x10 ? __pte_offset_map+0x99/0x110 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 ? __pfx___handle_mm_fault+0x10/0x10 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 Signed-off-by: David Howells <[email protected]> Link: https://lore.kernel.org/r/[email protected] cc: Alexander Viro <[email protected]> cc: Christian Brauner <[email protected]> cc: Jan Kara <[email protected]> cc: Jeff Layton <[email protected]> cc: Gao Xiang <[email protected]> cc: Matthew Wilcox <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] [brauner: fix minor issues] Signed-off-by: Christian Brauner <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Jul 29, 2024
In z_erofs_get_gbuf(), the current task may be migrated to another CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`. Therefore, z_erofs_put_gbuf() will trigger the following issue which was found by stress test: <2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58! .. <4>[772156.435007] <4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G E 6.10.0-rc7+ #2 <4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017 <4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) <4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs] .. <6>[772156.445958] stress (3127): drop_caches: 1 <4>[772156.446120] Call trace: <4>[772156.446121] z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.446761] z_erofs_lz4_decompress+0x600/0x6a0 [erofs] <4>[772156.446897] z_erofs_decompress_queue+0x740/0xa10 [erofs] <4>[772156.447036] z_erofs_runqueue+0x428/0x8c0 [erofs] <4>[772156.447160] z_erofs_readahead+0x224/0x390 [erofs] .. Fixes: f36f301 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable") Cc: <[email protected]> # 6.10+ Reviewed-by: Chunhai Guo <[email protected]> Reviewed-by: Sandeep Dhavale <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
konradybcio
pushed a commit
that referenced
this issue
Aug 23, 2024
Ido Schimmel says: ==================== Preparations for FIB rule DSCP selector This patchset moves the masking of the upper DSCP bits in 'flowi4_tos' to the core instead of relying on callers of the FIB lookup API to do it. This will allow us to start changing users of the API to initialize the 'flowi4_tos' field with all six bits of the DSCP field. In turn, this will allow us to extend FIB rules with a new DSCP selector. By masking the upper DSCP bits in the core we are able to maintain the behavior of the TOS selector in FIB rules and routes to only match on the lower DSCP bits. While working on this I found two users of the API that do not mask the upper DSCP bits before performing the lookup. The first is an ancient netlink family that is unlikely to be used. It is adjusted in patch #1 to mask both the upper DSCP bits and the ECN bits before calling the API. The second user is a nftables module that differs in this regard from its equivalent iptables module. It is adjusted in patch #2 to invoke the API with the upper DSCP bits masked, like all other callers. The relevant selftest passed, but in the unlikely case that regressions are reported because of this change, we can restore the existing behavior using a new flow information flag as discussed here [1]. The last patch moves the masking of the upper DSCP bits to the core, making the first two patches redundant, but I wanted to post them separately to call attention to the behavior change for these two users of the FIB lookup API. Future patchsets (around 3) will start unmasking the upper DSCP bits throughout the networking stack before adding support for the new FIB rule DSCP selector. Changes from v1 [2]: Patch #3: Include <linux/ip.h> in <linux/in_route.h> instead of including it in net/ip_fib.h [1] https://lore.kernel.org/netdev/ZpqpB8vJU%2FQ6LSqa@debian/ [2] https://lore.kernel.org/netdev/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Aug 23, 2024
Patch series "zram: introduce custom comp backends API", v6. This series introduces support for run-time compression algorithms tuning, so users, for instance, can adjust compression/acceleration levels and provide pre-trained compression/decompression dictionaries which certain algorithms support. At this point we stop supporting (old/deprecated) comp API. We may add new acomp API support in the future, but before that zram needs to undergo some major rework (we are not ready for async compression). Some benchmarks for reference (look at column #2) *** zstd /sys/block/zram0/mm_stat 1750650880 504575194 514392064 0 514392064 1 0 34204 34204 *** zstd level=-1 /sys/block/zram0/mm_stat 1750638592 591816152 603758592 0 603758592 1 0 34288 34288 *** zstd level=8 /sys/block/zram0/mm_stat 1750659072 486924248 496377856 0 496377856 1 0 34204 34204 *** zstd dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750634496 465853994 475230208 0 475230208 1 0 34185 34185 *** zstd level=8 dict=/home/ss/zstd-dict-amd64 /sys/block/zram0/mm_stat 1750650880 430760956 439967744 0 439967744 1 0 34185 34185 *** lz4 /sys/block/zram0/mm_stat 1750663168 664194239 676970496 0 676970496 1 0 34288 34288 *** lz4 dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750650880 619901052 632061952 0 632061952 1 0 34278 34278 *** lz4 level=5 dict=/home/ss/lz4-dict-amd64 /sys/block/zram0/mm_stat 1750650880 727180082 740884480 0 740884480 1 0 34438 34438 This patch (of 6): We need to export a number of API functions that enable advanced zstd usage - C/D dictionaries, dictionaries sharing between contexts, etc. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Nick Terrell <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Aug 23, 2024
…ptions Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/[email protected] This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Russell King (Oracle) <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Aug 23, 2024
Eduard Zingerman says: ==================== __jited test tag to check disassembly after jit Some of the logic in the BPF jits might be non-trivial. It might be useful to allow testing this logic by comparing generated native code with expected code template. This patch set adds a macro __jited() that could be used for test_loader based tests in a following manner: SEC("tp") __arch_x86_64 __jited(" endbr64") __jited(" nopl (%rax,%rax)") __jited(" xorq %rax, %rax") ... __naked void some_test(void) { ... } Also add a test for jit code generated for tail calls handling to demonstrate the feature. The feature uses LLVM libraries to do the disassembly. At selftests compilation time Makefile detects if these libraries are available. When libraries are not available tests using __jit_x86() are skipped. Current CI environment does not include llvm development libraries, but changes to add these are trivial. This was previously discussed here: https://lore.kernel.org/bpf/[email protected]/ Patch-set includes a few auxiliary steps: - patches #2 and #3 fix a few bugs in test_loader behaviour; - patch #4 replaces __regex macro with ability to specify regular expressions in __msg and __xlated using "{{" "}}" escapes; - patch torvalds#8 updates __xlated to match disassembly lines consequently, same way as __jited does. Changes v2->v3: - changed macro name from __jit_x86 to __jited with __arch_* to specify disassembly arch (Yonghong); - __jited matches disassembly lines consequently with "..." allowing to skip some number of lines (Andrii); - __xlated matches disassembly lines consequently, same as __jited; - "{{...}}" regex brackets instead of __regex macro; - bug fixes for old commits. Changes v1->v2: - stylistic changes suggested by Yonghong; - fix for -Wformat-truncation related warning when compiled with llvm15 (Yonghong). v1: https://lore.kernel.org/bpf/[email protected]/ v2: https://lore.kernel.org/bpf/[email protected]/ ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 30, 2024
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock on x86 due to a chain of locks and SRCU synchronizations. Translating the below lockdep splat, CPU1 torvalds#6 will wait on CPU0 #1, CPU0 torvalds#8 will wait on CPU2 #3, and CPU2 torvalds#7 will wait on CPU1 #4 (if there's a writer, due to the fairness of r/w semaphores). CPU0 CPU1 CPU2 1 lock(&kvm->slots_lock); 2 lock(&vcpu->mutex); 3 lock(&kvm->srcu); 4 lock(cpu_hotplug_lock); 5 lock(kvm_lock); 6 lock(&kvm->slots_lock); 7 lock(cpu_hotplug_lock); 8 sync(&kvm->srcu); Note, there are likely more potential deadlocks in KVM x86, e.g. the same pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with __kvmclock_cpufreq_notifier(): cpuhp_cpufreq_online() | -> cpufreq_online() | -> cpufreq_gov_performance_limits() | -> __cpufreq_driver_target() | -> __target_index() | -> cpufreq_freq_transition_begin() | -> cpufreq_notify_transition() | -> ... __kvmclock_cpufreq_notifier() But, actually triggering such deadlocks is beyond rare due to the combination of dependencies and timings involved. E.g. the cpufreq notifier is only used on older CPUs without a constant TSC, mucking with the NX hugepage mitigation while VMs are running is very uncommon, and doing so while also onlining/offlining a CPU (necessary to generate contention on cpu_hotplug_lock) would be even more unusual. The most robust solution to the general cpu_hotplug_lock issue is likely to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq notifier doesn't to take kvm_lock. For now, settle for fixing the most blatant deadlock, as switching to an RCU-protected list is a much more involved change, but add a comment in locking.rst to call out that care needs to be taken when walking holding kvm_lock and walking vm_list. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-smp--c257535a0c9d-pip torvalds#330 Tainted: G S O ------------------------------------------------------ tee/35048 is trying to acquire lock: ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm] but task is already holding lock: ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (kvm_lock){+.+.}-{3:3}: __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 kvm_dev_ioctl+0x4fb/0xe50 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2e/0xb0 static_key_slow_inc+0x16/0x30 kvm_lapic_set_base+0x6a/0x1c0 [kvm] kvm_set_apic_base+0x8f/0xe0 [kvm] kvm_set_msr_common+0x9ae/0xf80 [kvm] vmx_set_msr+0xa54/0xbe0 [kvm_intel] __kvm_set_msr+0xb6/0x1a0 [kvm] kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm] kvm_vcpu_ioctl+0x485/0x5b0 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&kvm->srcu){.+.+}-{0:0}: __synchronize_srcu+0x44/0x1a0 synchronize_srcu_expedited+0x21/0x30 kvm_swap_active_memslots+0x110/0x1c0 [kvm] kvm_set_memslot+0x360/0x620 [kvm] __kvm_set_memory_region+0x27b/0x300 [kvm] kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm] kvm_vm_ioctl+0x295/0x650 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&kvm->slots_lock){+.+.}-{3:3}: __lock_acquire+0x15ef/0x2e30 lock_acquire+0xe0/0x260 __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 set_nx_huge_pages+0x179/0x1e0 [kvm] param_attr_store+0x93/0x100 module_attr_store+0x22/0x40 sysfs_kf_write+0x81/0xb0 kernfs_fop_write_iter+0x133/0x1d0 vfs_write+0x28d/0x380 ksys_write+0x70/0xe0 __x64_sys_write+0x1f/0x30 x64_sys_call+0x281b/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: Chao Gao <[email protected]> Fixes: 0bf5049 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock") Cc: [email protected] Reviewed-by: Kai Huang <[email protected]> Acked-by: Kai Huang <[email protected]> Tested-by: Farrah Chen <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 30, 2024
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net v2: with kdoc fixes per Paolo Abeni. The following patchset contains Netfilter fixes for net: Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP packets to one another, two packets of the same flow in each direction handled by different CPUs that result in two conntrack objects in NEW state, where reply packet loses race. Then, patch #3 adds a testcase for this scenario. Series from Florian Westphal. 1) NAT engine can falsely detect a port collision if it happens to pick up a reply packet as NEW rather than ESTABLISHED. Add extra code to detect this and suppress port reallocation in this case. 2) To complete the clash resolution in the reply direction, extend conntrack logic to detect clashing conntrack in the reply direction to existing entry. 3) Adds a test case. Then, an assorted list of fixes follow: 4) Add a selftest for tproxy, from Antonio Ojea. 5) Guard ctnetlink_*_size() functions under #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS) From Andy Shevchenko. 6) Use -m socket --transparent in iptables tproxy documentation. From XIE Zhibang. 7) Call kfree_rcu() when releasing flowtable hooks to address race with netlink dump path, from Phil Sutter. 8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n. From Simon Horman. 9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which is its only user, to address a compilation warning. From Simon Horman. 10) Use rcu-protected list iteration over basechain hooks from netlink dump path. 11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete. 12) Remove old nfqueue conntrack clash resolution. Instead trying to use same destination address consistently which requires double DNAT, use the existing clash resolution which allows clashing packets go through with different destination. Antonio Ojea originally reported an issue from the postrouting chain, I proposed a fix: https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/ which he reported it did not work for him. 13) Adds a selftest for patch 12. 14) Fixes ipvs.sh selftest. netfilter pull request 24-09-26 * tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: netfilter: Avoid hanging ipvs.sh kselftest: add test for nfqueue induced conntrack race netfilter: nfnetlink_queue: remove old clash resolution logic netfilter: nf_tables: missing objects with no memcg accounting netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n netfilter: nf_tables: Keep deleted flowtable hooks until after RCU docs: tproxy: ignore non-transparent sockets in iptables netfilter: ctnetlink: Guard possible unused functions selftests: netfilter: nft_tproxy.sh: add tcp tests selftests: netfilter: add reverse-clash resolution test case netfilter: conntrack: add clash resolution for reverse collisions netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 30, 2024
Target slot selection for recompression is just a simple iteration over zram->table entries (stored pages) from slot 0 to max slot. Given that zram->table slots are written in random order and are not sorted by size, a simple iteration over slots selects suboptimal targets for recompression. This is not a problem if we recompress every single zram->table slot, but we never do that in reality. In reality we limit the number of slots we can recompress (via max_pages parameter) and hence proper slot selection becomes very important. The strategy is quite simple, suppose we have two candidate slots for recompression, one of size 48 bytes and one of size 2800 bytes, and we can recompress only one, then it certainly makes more sense to pick 2800 entry for recompression. Because even if we manage to compress 48 bytes objects even further the savings are going to be very small. Potential savings after good re-compression of 2800 bytes objects are much higher. This patch reworks slot selection and introduces the strategy described above: among candidate slots always select the biggest ones first. For that the patch introduces zram_pp_ctl (post-processing) structure which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a particular group based on their sizes - the larger the size of the slot the higher the group index. This, basically, sorts slots by size in liner time (we still perform just one iteration over zram->table slots). When we select slot for recompression we always first lookup in higher pp buckets (those that hold the largest slots). Which achieves the desired behavior. TEST ==== A very simple demonstration: zram is configured with zstd, and zstd with dict as a recompression stream. A limited (max 4096 pages) recompression is performed then, with a log of sizes of slots that were recompressed. You can see that patched zram selects slots for recompression in significantly different manner, which leads to higher memory savings (see column #2 of mm_stat output). BASE ---- *** initial state of zram device /sys/block/zram0/mm_stat 1750994944 504491413 514203648 0 514203648 1 0 34204 34204 *** recompress idle max_pages=4096 /sys/block/zram0/mm_stat 1750994944 504262229 514953216 0 514203648 1 0 34204 34204 Sizes of selected objects for recompression: ... 45 58 24 226 91 40 24 24 24 424 2104 93 2078 2078 2078 959 154 ... PATCHED ------- *** initial state of zram device /sys/block/zram0/mm_stat 1750982656 504492801 514170880 0 514170880 1 0 34204 34204 *** recompress idle max_pages=4096 /sys/block/zram0/mm_stat 1750982656 503716710 517586944 0 514170880 1 0 34204 34204 Sizes of selected objects for recompression: ... 3680 3694 3667 3590 3614 3553 3537 3548 3550 3542 3543 3537 ... Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE variation of sizes within particular bucket. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
konradybcio
pushed a commit
that referenced
this issue
Sep 30, 2024
Writeback suffers from the same problem as recompression did before - target slot selection for writeback is just a simple iteration over zram->table entries (stored pages) which selects suboptimal targets for writeback. This is especially problematic for writeback, because we uncompress objects before writeback so each of them takes 4K out of limited writeback storage. For example, when we take a 48 bytes slot and store it as a 4K object to writeback device we only save 48 bytes of memory (release from zsmalloc pool). We naturally want to pick the largest objects for writeback, because then each writeback will release the largest amount of memory. This patch applies the same solution and strategy as for recompression target selection: pp control (post-process) with 16 buckets of candidate pp slots. Slots are assigned to pp buckets based on sizes - the larger the slot the higher the group index. This gives us sorted by size lists of candidate slots (in linear time), so that among post-processing candidate slots we always select the largest ones first and maximize the memory saving. TEST ==== A very simple demonstration: zram is configured with a writeback device. A limited writeback (wb_limit 2500 pages) is performed then, with a log of sizes of slots that were written back. You can see that patched zram selects slots for recompression in significantly different manner, which leads to higher memory savings (see column #2 of mm_stat output). BASE ---- *** initial state of zram device /sys/block/zram0/mm_stat 1750327296 619765836 631902208 0 631902208 1 0 34278 34278 *** writeback idle wb_limit 2500 /sys/block/zram0/mm_stat 1750327296 617622333 631578624 0 631902208 1 0 34278 34278 Sizes of selected objects for writeback: ... 193 349 46 46 46 46 852 1002 543 162 107 49 34 34 34 ... PATCHED ------- *** initial state of zram device /sys/block/zram0/mm_stat 1750319104 619760957 631992320 0 631992320 1 0 34278 34278 *** writeback idle wb_limit 2500 /sys/block/zram0/mm_stat 1750319104 612672056 626135040 0 631992320 1 0 34278 34278 Sizes of selected objects for writeback: ... 3667 3580 3581 3580 3581 3581 3581 3231 3211 3203 3231 3246 ... Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE variation of sizes within particular bucket. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Oct 6, 2024
Remove intel_dp_mst_suspend/resume from runtime suspend resume sequences. It is incorrect as it depends on AUX transfers which itself depend on the device being runtime resumed. This is also why we see a lock_dep splat here. <4> [76.011119] kworker/4:2/192 is trying to acquire lock: <4> [76.011122] ffff8881120b3210 (&mgr->lock#2){+.+.}-{3:3}, at: drm_dp_mst_topology_mgr_suspend+0x33/0xd0 [drm_display_helper] <4> [76.011142] but task is already holding lock: <4> [76.011144] ffffffffa0bc3420 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at: xe_pm_runtime_suspend+0x51/0x3f0 [xe] <4> [76.011223] which lock already depends on the new lock. <4> [76.011226] the existing dependency chain (in reverse order) is: <4> [76.011229] -> #2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}: <4> [76.011233] pm_runtime_lockdep_prime+0x2f/0x50 [xe] <4> [76.011306] xe_pm_runtime_resume_and_get+0x29/0x90 [xe] <4> [76.011377] intel_display_power_get+0x24/0x70 [xe] <4> [76.011466] intel_digital_port_connected_locked+0x4c/0xf0 [xe] <4> [76.011551] intel_dp_aux_xfer+0xb8/0x7c0 [xe] <4> [76.011633] intel_dp_aux_transfer+0x166/0x2e0 [xe] <4> [76.011715] drm_dp_dpcd_access+0x87/0x150 [drm_display_helper] <4> [76.011726] drm_dp_dpcd_probe+0x3d/0xf0 [drm_display_helper] <4> [76.011737] drm_dp_dpcd_read+0xdd/0x130 [drm_display_helper] <4> [76.011747] intel_dp_get_colorimetry_status+0x3a/0x70 [xe] <4> [76.011886] intel_dp_init_connector+0x4ff/0x1030 [xe] <4> [76.011969] intel_ddi_init+0xc5b/0x1030 [xe] <4> [76.012058] intel_bios_for_each_encoder+0x36/0x60 [xe] <4> [76.012145] intel_setup_outputs+0x201/0x460 [xe] <4> [76.012233] intel_display_driver_probe_nogem+0x155/0x1e0 [xe] <4> [76.012320] xe_display_init_noaccel+0x27/0x70 [xe] Signed-off-by: Suraj Kandpal <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
MarijnS95
pushed a commit
that referenced
this issue
Oct 6, 2024
Do not do intel_fbdev_set_suspend during runtime_suspend/resume functions. This cause a big circular lock_dep splat. kworker/0:4/198 is trying to acquire lock: <4> [77.185594] ffffffff83398500 (console_lock){+.+.}-{0:0}, at: intel_fbdev_set_suspend+0x169/0x1f0 [xe] <4> [77.185947] but task is already holding lock: <4> [77.185949] ffffffffa09e9460 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}, at: xe_pm_runtime_suspend+0x51/0x3f0 [xe] <4> [77.186262] which lock already depends on the new lock. <4> [77.186264] the existing dependency chain (in reverse order) is: <4> [77.186266] -> #2 (xe_pm_runtime_lockdep_map){+.+.}-{0:0}: <4> [77.186276] pm_runtime_lockdep_prime+0x2f/0x50 [xe] <4> [77.186572] xe_pm_runtime_resume_and_get+0x29/0x90 [xe] <4> [77.186867] intelfb_create+0x150/0x390 [xe] <4> [77.187197] __drm_fb_helper_initial_config_and_unlock+0x31c/0x5e0 [drm_kms_helper] <4> [77.187243] drm_fb_helper_initial_config+0x3d/0x50 [drm_kms_helper] <4> [77.187274] intel_fbdev_client_hotplug+0xb1/0x140 [xe] <4> [77.187603] drm_client_register+0x87/0xd0 [drm] <4> [77.187704] intel_fbdev_setup+0x51c/0x640 [xe] <4> [77.188033] intel_display_driver_register+0xb7/0xf0 [xe] <4> [77.188438] xe_display_register+0x21/0x40 [xe] <4> [77.188809] xe_device_probe+0xa8d/0xbf0 [xe] <4> [77.189035] xe_pci_probe+0x333/0x5b0 [xe] <4> [77.189330] local_pci_probe+0x48/0xb0 <4> [77.189341] pci_device_probe+0xc8/0x280 <4> [77.189351] really_probe+0xf8/0x390 <4> [77.189362] __driver_probe_device+0x8a/0x170 <4> [77.189373] driver_probe_device+0x23/0xb0 Signed-off-by: Suraj Kandpal <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
MarijnS95
pushed a commit
that referenced
this issue
Oct 6, 2024
On the node of an NFS client, some files saved in the mountpoint of the NFS server were copied to another location of the same NFS server. Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference crash with the following syslog: [232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116 [232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [232066.588586] Mem abort info: [232066.588701] ESR = 0x0000000096000007 [232066.588862] EC = 0x25: DABT (current EL), IL = 32 bits [232066.589084] SET = 0, FnV = 0 [232066.589216] EA = 0, S1PTW = 0 [232066.589340] FSC = 0x07: level 3 translation fault [232066.589559] Data abort info: [232066.589683] ISV = 0, ISS = 0x00000007 [232066.589842] CM = 0, WnR = 0 [232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400 [232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000 [232066.590757] Internal error: Oops: 96000007 [#1] SMP [232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2 [232066.591052] vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs [232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1 [232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06 [232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4] [232066.598595] sp : ffff8000f568fc70 [232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000 [232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001 [232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050 [232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000 [232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000 [232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6 [232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828 [232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a [232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058 [232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000 [232066.601636] Call trace: [232066.601749] nfs4_reclaim_open_state+0x220/0x800 [nfsv4] [232066.601998] nfs4_do_reclaim+0x1b8/0x28c [nfsv4] [232066.602218] nfs4_state_manager+0x928/0x10f0 [nfsv4] [232066.602455] nfs4_run_state_manager+0x78/0x1b0 [nfsv4] [232066.602690] kthread+0x110/0x114 [232066.602830] ret_from_fork+0x10/0x20 [232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00) [232066.603284] SMP: stopping secondary CPUs [232066.606936] Starting crashdump kernel... [232066.607146] Bye! Analysing the vmcore, we know that nfs4_copy_state listed by destination nfs_server->ss_copies was added by the field copies in handle_async_copy(), and we found a waiting copy process with the stack as: PID: 3511963 TASK: ffff710028b47e00 CPU: 0 COMMAND: "cp" #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4 #1 [ffff8001116ef760] __schedule at ffff800008dd0650 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898 torvalds#6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4] torvalds#7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4] torvalds#8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4] torvalds#9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4] The NULL-pointer dereference was due to nfs42_complete_copies() listed the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state. So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and the data accessed through this pointer was also incorrect. Generally, the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state(). When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED and copies are not deleted in nfs_server->ss_copies, the source state may be passed to the nfs42_complete_copies() process earlier, resulting in this crash scene finally. To solve this issue, we add a list_head nfs_server->ss_src_copies for a server-to-server copy specially. Fixes: 0e65a32 ("NFS: handle source server reboot") Signed-off-by: Yanjun Zhang <[email protected]> Reviewed-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Oct 6, 2024
Target slot selection for recompression is just a simple iteration over zram->table entries (stored pages) from slot 0 to max slot. Given that zram->table slots are written in random order and are not sorted by size, a simple iteration over slots selects suboptimal targets for recompression. This is not a problem if we recompress every single zram->table slot, but we never do that in reality. In reality we limit the number of slots we can recompress (via max_pages parameter) and hence proper slot selection becomes very important. The strategy is quite simple, suppose we have two candidate slots for recompression, one of size 48 bytes and one of size 2800 bytes, and we can recompress only one, then it certainly makes more sense to pick 2800 entry for recompression. Because even if we manage to compress 48 bytes objects even further the savings are going to be very small. Potential savings after good re-compression of 2800 bytes objects are much higher. This patch reworks slot selection and introduces the strategy described above: among candidate slots always select the biggest ones first. For that the patch introduces zram_pp_ctl (post-processing) structure which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a particular group based on their sizes - the larger the size of the slot the higher the group index. This, basically, sorts slots by size in liner time (we still perform just one iteration over zram->table slots). When we select slot for recompression we always first lookup in higher pp buckets (those that hold the largest slots). Which achieves the desired behavior. TEST ==== A very simple demonstration: zram is configured with zstd, and zstd with dict as a recompression stream. A limited (max 4096 pages) recompression is performed then, with a log of sizes of slots that were recompressed. You can see that patched zram selects slots for recompression in significantly different manner, which leads to higher memory savings (see column #2 of mm_stat output). BASE ---- *** initial state of zram device /sys/block/zram0/mm_stat 1750994944 504491413 514203648 0 514203648 1 0 34204 34204 *** recompress idle max_pages=4096 /sys/block/zram0/mm_stat 1750994944 504262229 514953216 0 514203648 1 0 34204 34204 Sizes of selected objects for recompression: ... 45 58 24 226 91 40 24 24 24 424 2104 93 2078 2078 2078 959 154 ... PATCHED ------- *** initial state of zram device /sys/block/zram0/mm_stat 1750982656 504492801 514170880 0 514170880 1 0 34204 34204 *** recompress idle max_pages=4096 /sys/block/zram0/mm_stat 1750982656 503716710 517586944 0 514170880 1 0 34204 34204 Sizes of selected objects for recompression: ... 3680 3694 3667 3590 3614 3553 3537 3548 3550 3542 3543 3537 ... Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE variation of sizes within particular bucket. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
MarijnS95
pushed a commit
that referenced
this issue
Oct 6, 2024
Writeback suffers from the same problem as recompression did before - target slot selection for writeback is just a simple iteration over zram->table entries (stored pages) which selects suboptimal targets for writeback. This is especially problematic for writeback, because we uncompress objects before writeback so each of them takes 4K out of limited writeback storage. For example, when we take a 48 bytes slot and store it as a 4K object to writeback device we only save 48 bytes of memory (release from zsmalloc pool). We naturally want to pick the largest objects for writeback, because then each writeback will release the largest amount of memory. This patch applies the same solution and strategy as for recompression target selection: pp control (post-process) with 16 buckets of candidate pp slots. Slots are assigned to pp buckets based on sizes - the larger the slot the higher the group index. This gives us sorted by size lists of candidate slots (in linear time), so that among post-processing candidate slots we always select the largest ones first and maximize the memory saving. TEST ==== A very simple demonstration: zram is configured with a writeback device. A limited writeback (wb_limit 2500 pages) is performed then, with a log of sizes of slots that were written back. You can see that patched zram selects slots for recompression in significantly different manner, which leads to higher memory savings (see column #2 of mm_stat output). BASE ---- *** initial state of zram device /sys/block/zram0/mm_stat 1750327296 619765836 631902208 0 631902208 1 0 34278 34278 *** writeback idle wb_limit 2500 /sys/block/zram0/mm_stat 1750327296 617622333 631578624 0 631902208 1 0 34278 34278 Sizes of selected objects for writeback: ... 193 349 46 46 46 46 852 1002 543 162 107 49 34 34 34 ... PATCHED ------- *** initial state of zram device /sys/block/zram0/mm_stat 1750319104 619760957 631992320 0 631992320 1 0 34278 34278 *** writeback idle wb_limit 2500 /sys/block/zram0/mm_stat 1750319104 612672056 626135040 0 631992320 1 0 34278 34278 Sizes of selected objects for writeback: ... 3667 3580 3581 3580 3581 3581 3581 3231 3211 3203 3231 3246 ... Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE variation of sizes within particular bucket. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
sorry for the stupid question.
im trying to run pmos on my maple. but seems like panel/dsi/gpu/mdss part of dt is just disappeared in mainline code. even they already finished (or actually not?) in some branch of this repo, such as 3ae7ae0 . is there any reason that prevented display part of dt merged into upstream (such as they are still WIP)?
i noticed that there is another project on gitlab (https://gitlab.com/msm8998-mainline/linux) trying to run mainline linux on msm8998. and they have some newer commit for yoshino (such as audio) which authored by member of somainline. but i didn't find where these commit came from. are they come from some private repo that include WIP commits? or that's the new place you contribute to?
i tried 5.18.y branch and some other older branch of msm8998-mainline because they are the branch conntains dt of display and have a .config that able to compile without too much edit. but none of them can make the screen/gpu works.
/sys/kernel/debug/devices_deferred
and dmesg showed there is a dependencies cycle (mdss c900000 -> mnoc 1744000 -> mmcc c8c0000 -> dsi0_phy 994400 -> mmcc c8c0000). i spent few month on figure out how to resolve this. unfortunately, i still didn't figure it out until now. can i get some help here? thanksThe text was updated successfully, but these errors were encountered: