Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SubmittingPatches ヘッダ部の JF 時代の翻訳者リストの扱い #2

Open
akiyks opened this issue Apr 29, 2022 · 2 comments
Open

Comments

@akiyks
Copy link
Owner

akiyks commented Apr 29, 2022

@tgkz

HOWTO では、コミット f012733 の RST 化の際に JF に関する情報が削除され、
翻訳者として柴田さんお一人が残る形になっています。
SubmittingPatches でも同様にしてよいのでしょうか?
記載されている Keiichi KII さんの e-mail アドレスは跳ね返される状態です。
JF が機能していないため問い合わせもできませんが、GPL の精神に準じると元の著作権
表記をそのまま残すことが配布の条件になるとも考えられます。
とすると、削除ではなくメインライン化時点のリストをそのまま残し、その旨を説明する
注釈を追加するのが良いのではないかと考えました。このリストは著作権表記に該当
しませんが。

その主旨の変更を先程プッシュした submitting-patches-1st-batch-rc1 ブランチの
最初のコミット 07e477c でやってみました。コメントいただけませんでしょうか?

2つ目のコミット e37a6e5 は、更新されるべきでなかったパスを戻すものです。
部分的なリバートなので、コミットを分けました。

その後のコミットは、追随リストの6番目 0af5270 ("Documentation/SubmittingPatches:
Request summaries for commit references") までに対応する更新です。
WIP-docs-ja-jp-cacheup-submittingpatches ブランチのものより、少し真面目に訳してみました。
翻訳へのレビューは linux-doc への投稿の後でも構いません。

akiyks pushed a commit that referenced this issue May 1, 2022
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.18, take #2

- Take care of faults occuring between the PARange and
  IPA range by injecting an exception

- Fix S2 faults taken from a host EL0 in protected mode

- Work around Oops caused by a PMU access from a 32bit
  guest when PMU has been created. This is a temporary
  bodge until we fix it for good.
@akiyks
Copy link
Owner Author

akiyks commented May 2, 2022

この件は柴田さんには判断できかねるのだと想像します。RSTまでには方針を決めたいです。
RST であれば、この古い翻訳者情報を RST 内のコメントに埋め込んで HTML や PDF から
隠すことも可能です。

submitting-patches-1st-batch-rc1 から先頭の2つのコミットを省いた5個のパッチをシリーズにして
linux-doc に投稿してみます。

akiyks pushed a commit that referenced this issue May 7, 2022
Current DP driver implementation has adding safe mode done at
dp_hpd_plug_handle() which is expected to be executed under event
thread context.

However there is possible circular locking happen (see blow stack trace)
after edp driver call dp_hpd_plug_handle() from dp_bridge_enable() which
is executed under drm_thread context.

After review all possibilities methods and as discussed on
https://patchwork.freedesktop.org/patch/483155/, supporting EDID
compliance tests in the driver is quite hacky. As seen with other
vendor drivers, supporting these will be much easier with IGT. Hence
removing all the related fail safe code for it so that no possibility
of circular lock will happen.
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Reviewed-by: Dmitry Baryshkov <[email protected]>

======================================================
 WARNING: possible circular locking dependency detected
 5.15.35-lockdep torvalds#6 Tainted: G        W
 ------------------------------------------------------
 frecon/429 is trying to acquire lock:
 ffffff808dc3c4e8 (&dev->mode_config.mutex){+.+.}-{3:3}, at:
dp_panel_add_fail_safe_mode+0x4c/0xa0

 but task is already holding lock:
 ffffff808dc441e0 (&kms->commit_lock[i]){+.+.}-{3:3}, at: lock_crtcs+0xb4/0x124

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #3 (&kms->commit_lock[i]){+.+.}-{3:3}:
        __mutex_lock_common+0x174/0x1a64
        mutex_lock_nested+0x98/0xac
        lock_crtcs+0xb4/0x124
        msm_atomic_commit_tail+0x330/0x748
        commit_tail+0x19c/0x278
        drm_atomic_helper_commit+0x1dc/0x1f0
        drm_atomic_commit+0xc0/0xd8
        drm_atomic_helper_set_config+0xb4/0x134
        drm_mode_setcrtc+0x688/0x1248
        drm_ioctl_kernel+0x1e4/0x338
        drm_ioctl+0x3a4/0x684
        __arm64_sys_ioctl+0x118/0x154
        invoke_syscall+0x78/0x224
        el0_svc_common+0x178/0x200
        do_el0_svc+0x94/0x13c
        el0_svc+0x5c/0xec
        el0t_64_sync_handler+0x78/0x108
        el0t_64_sync+0x1a4/0x1a8

 -> #2 (crtc_ww_class_mutex){+.+.}-{3:3}:
        __mutex_lock_common+0x174/0x1a64
        ww_mutex_lock+0xb8/0x278
        modeset_lock+0x304/0x4ac
        drm_modeset_lock+0x4c/0x7c
        drmm_mode_config_init+0x4a8/0xc50
        msm_drm_init+0x274/0xac0
        msm_drm_bind+0x20/0x2c
        try_to_bring_up_master+0x3dc/0x470
        __component_add+0x18c/0x3c0
        component_add+0x1c/0x28
        dp_display_probe+0x954/0xa98
        platform_probe+0x124/0x15c
        really_probe+0x1b0/0x5f8
        __driver_probe_device+0x174/0x20c
        driver_probe_device+0x70/0x134
        __device_attach_driver+0x130/0x1d0
        bus_for_each_drv+0xfc/0x14c
        __device_attach+0x1bc/0x2bc
        device_initial_probe+0x1c/0x28
        bus_probe_device+0x94/0x178
        deferred_probe_work_func+0x1a4/0x1f0
        process_one_work+0x5d4/0x9dc
        worker_thread+0x898/0xccc
        kthread+0x2d4/0x3d4
        ret_from_fork+0x10/0x20

 -> #1 (crtc_ww_class_acquire){+.+.}-{0:0}:
        ww_acquire_init+0x1c4/0x2c8
        drm_modeset_acquire_init+0x44/0xc8
        drm_helper_probe_single_connector_modes+0xb0/0x12dc
        drm_mode_getconnector+0x5dc/0xfe8
        drm_ioctl_kernel+0x1e4/0x338
        drm_ioctl+0x3a4/0x684
        __arm64_sys_ioctl+0x118/0x154
        invoke_syscall+0x78/0x224
        el0_svc_common+0x178/0x200
        do_el0_svc+0x94/0x13c
        el0_svc+0x5c/0xec
        el0t_64_sync_handler+0x78/0x108
        el0t_64_sync+0x1a4/0x1a8

 -> #0 (&dev->mode_config.mutex){+.+.}-{3:3}:
        __lock_acquire+0x2650/0x672c
        lock_acquire+0x1b4/0x4ac
        __mutex_lock_common+0x174/0x1a64
        mutex_lock_nested+0x98/0xac
        dp_panel_add_fail_safe_mode+0x4c/0xa0
        dp_hpd_plug_handle+0x1f0/0x280
        dp_bridge_enable+0x94/0x2b8
        drm_atomic_bridge_chain_enable+0x11c/0x168
        drm_atomic_helper_commit_modeset_enables+0x500/0x740
        msm_atomic_commit_tail+0x3e4/0x748
        commit_tail+0x19c/0x278
        drm_atomic_helper_commit+0x1dc/0x1f0
        drm_atomic_commit+0xc0/0xd8
        drm_atomic_helper_set_config+0xb4/0x134
        drm_mode_setcrtc+0x688/0x1248
        drm_ioctl_kernel+0x1e4/0x338
        drm_ioctl+0x3a4/0x684
        __arm64_sys_ioctl+0x118/0x154
        invoke_syscall+0x78/0x224
        el0_svc_common+0x178/0x200
        do_el0_svc+0x94/0x13c
        el0_svc+0x5c/0xec
        el0t_64_sync_handler+0x78/0x108
        el0t_64_sync+0x1a4/0x1a8

Changes in v2:
-- re text commit title
-- remove all fail safe mode

Changes in v3:
-- remove dp_panel_add_fail_safe_mode() from dp_panel.h
-- add Fixes

Changes in v5:
--  [email protected]

Changes in v6:
--  fix Fixes commit ID

Fixes: 8b2c181 ("drm/msm/dp: add fail safe mode outside of event_mutex context")
Reported-by: Douglas Anderson <[email protected]>
Signed-off-by: Kuogee Hsieh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Clark <[email protected]>
akiyks pushed a commit that referenced this issue May 8, 2022
As reported by Alan, the CFI (Call Frame Information) in the VDSO time
routines is incorrect since commit ce7d805 ("powerpc/vdso: Prepare
for switching VDSO to generic C implementation.").

DWARF has a concept called the CFA (Canonical Frame Address), which on
powerpc is calculated as an offset from the stack pointer (r1). That
means when the stack pointer is changed there must be a corresponding
CFI directive to update the calculation of the CFA.

The current code is missing those directives for the changes to r1,
which prevents gdb from being able to generate a backtrace from inside
VDSO functions, eg:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  #2  0x00007fffffffd960 in ?? ()
  #3  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  Backtrace stopped: frame did not save the PC

Alan helpfully describes some rules for correctly maintaining the CFI information:

  1) Every adjustment to the current frame address reg (ie. r1) must be
     described, and exactly at the instruction where r1 changes. Why?
     Because stack unwinding might want to access previous frames.

  2) If a function changes LR or any non-volatile register, the save
     location for those regs must be given. The CFI can be at any
     instruction after the saves up to the point that the reg is
     changed.
     (Exception: LR save should be described before a bl. not after)

  3) If asychronous unwind info is needed then restores of LR and
     non-volatile regs must also be described. The CFI can be at any
     instruction after the reg is restored up to the point where the
     save location is (potentially) trashed.

Fix the inability to backtrace by adding CFI directives describing the
changes to r1, ie. satisfying rule 1.

Also change the information for LR to point to the copy saved on the
stack, not the value in r0 that will be overwritten by the function
call.

Finally, add CFI directives describing the save/restore of r2.

With the fix gdb can correctly back trace and navigate up and down the stack:

  Breakpoint 1, 0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb) bt
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  #2  0x0000000100015b60 in gettime ()
  #3  0x000000010000c8bc in print_long_format ()
  #4  0x000000010000d180 in print_current_files ()
  #5  0x00000001000054ac in main ()
  (gdb) up
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #2  0x0000000100015b60 in gettime ()
  (gdb)
  #3  0x000000010000c8bc in print_long_format ()
  (gdb)
  #4  0x000000010000d180 in print_current_files ()
  (gdb)
  #5  0x00000001000054ac in main ()
  (gdb)
  Initial frame selected; you cannot go up.
  (gdb) down
  #4  0x000000010000d180 in print_current_files ()
  (gdb)
  #3  0x000000010000c8bc in print_long_format ()
  (gdb)
  #2  0x0000000100015b60 in gettime ()
  (gdb)
  #1  0x00007ffff7d8872c in clock_gettime@@GLIBC_2.17 () from /lib64/libc.so.6
  (gdb)
  #0  0x00007ffff7f804dc in __kernel_clock_gettime ()
  (gdb)

Fixes: ce7d805 ("powerpc/vdso: Prepare for switching VDSO to generic C implementation.")
Cc: [email protected] # v5.11+
Reported-by: Alan Modra <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Reviewed-by: Segher Boessenkool <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
@tgkz
Copy link

tgkz commented May 10, 2022

連休でストップしてしまいました

著作権上は翻訳後のテキストには翻訳者に二次著作者としての権利が発生するのだと理解しています。この例ではKiiさんがその権利を持っているので、その名前だけはコメント欄に残しておくのが適切だと思います。残りのレビューアなどは不要と思います。
HOWTOを RSTにした際にはJFのヘッダを削除することで他のファイルのRST化のリファレンスにしたいと思っていたので現在の方向性は良いと思います。

akiyks pushed a commit that referenced this issue May 19, 2022
Now when Intel Elkhart Lake uses again common bit timing and there are
no other users for custom bit timing, we can bring back the changes
done by the commit 0ddd83f ("can: m_can: remove support for
custom bit timing").

This effectively reverts commit ea768b2 ("Revert "can: m_can:
remove support for custom bit timing"") while taking into account
commit ea22ba4 ("can: m_can: make custom bittiming fields const")
and commit 7d4a101 ("can: dev: add sanity check in
can_set_static_ctrlmode()").

Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Jarkko Nikula <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
akiyks pushed a commit that referenced this issue May 19, 2022
Do not allow to write timestamps on RX rings if PF is being configured.
When PF is being configured RX rings can be freed or rebuilt. If at the
same time timestamps are updated, the kernel will crash by dereferencing
null RX ring pointer.

PID: 1449   TASK: ff187d28ed658040  CPU: 34  COMMAND: "ice-ptp-0000:51"
 #0 [ff1966a94a713bb0] machine_kexec at ffffffff9d05a0be
 #1 [ff1966a94a713c08] __crash_kexec at ffffffff9d192e9d
 #2 [ff1966a94a713cd0] crash_kexec at ffffffff9d1941bd
 #3 [ff1966a94a713ce8] oops_end at ffffffff9d01bd54
 #4 [ff1966a94a713d08] no_context at ffffffff9d06bda4
 #5 [ff1966a94a713d60] __bad_area_nosemaphore at ffffffff9d06c10c
 torvalds#6 [ff1966a94a713da8] do_page_fault at ffffffff9d06cae4
 torvalds#7 [ff1966a94a713de0] page_fault at ffffffff9da0107e
    [exception RIP: ice_ptp_update_cached_phctime+91]
    RIP: ffffffffc076db8b  RSP: ff1966a94a713e98  RFLAGS: 00010246
    RAX: 16e3db9c6b7ccae4  RBX: ff187d269dd3c180  RCX: ff187d269cd4d018
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ff187d269cfcc644   R8: ff187d339b9641b0   R9: 0000000000000000
    R10: 0000000000000002  R11: 0000000000000000  R12: ff187d269cfcc648
    R13: ffffffff9f128784  R14: ffffffff9d101b70  R15: ff187d269cfcc640
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 torvalds#8 [ff1966a94a713ea0] ice_ptp_periodic_work at ffffffffc076dbef [ice]
 torvalds#9 [ff1966a94a713ee0] kthread_worker_fn at ffffffff9d101c1b
 torvalds#10 [ff1966a94a713f10] kthread at ffffffff9d101b4d
 torvalds#11 [ff1966a94a713f50] ret_from_fork at ffffffff9da0023f

Fixes: 77a7811 ("ice: enable receive hardware timestamping")
Signed-off-by: Arkadiusz Kubalewski <[email protected]>
Reviewed-by: Michal Schmidt <[email protected]>
Tested-by: Dave Cain <[email protected]>
Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
akiyks pushed a commit that referenced this issue May 19, 2022
Currently, software objects of flow steering are created and destroyed
during reload flow. In case a device is unloaded, the following error
is printed during grace period:

 mlx5_core 0000:00:0b.0: mlx5_fw_fatal_reporter_err_work:690:(pid 95):
    Driver is in error state. Unloading

As a solution to fix use-after-free bugs, where we try to access
these objects, when reading the value of flow_steering_mode devlink
param[1], let's split flow steering creation and destruction into two
routines:
    * init and cleanup: memory, cache, and pools allocation/free.
    * create and destroy: namespaces initialization and cleanup.

While at it, re-order the cleanup function to mirror the init function.

[1]
Kasan trace:

[  385.119849 ] BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x3b/0xa0
[  385.119849 ] Read of size 4 at addr ffff888104b79308 by task bash/291
[  385.119849 ]
[  385.119849 ] CPU: 1 PID: 291 Comm: bash Not tainted 5.17.0-rc1+ #2
[  385.119849 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
[  385.119849 ] Call Trace:
[  385.119849 ]  <TASK>
[  385.119849 ]  dump_stack_lvl+0x6e/0x91
[  385.119849 ]  print_address_description.constprop.0+0x1f/0x160
[  385.119849 ]  ? mlx5_devlink_fs_mode_get+0x3b/0xa0
[  385.119849 ]  ? mlx5_devlink_fs_mode_get+0x3b/0xa0
[  385.119849 ]  kasan_report.cold+0x83/0xdf
[  385.119849 ]  ? devlink_param_notify+0x20/0x190
[  385.119849 ]  ? mlx5_devlink_fs_mode_get+0x3b/0xa0
[  385.119849 ]  mlx5_devlink_fs_mode_get+0x3b/0xa0
[  385.119849 ]  devlink_nl_param_fill+0x18a/0xa50
[  385.119849 ]  ? _raw_spin_lock_irqsave+0x8d/0xe0
[  385.119849 ]  ? devlink_flash_update_timeout_notify+0xf0/0xf0
[  385.119849 ]  ? __wake_up_common+0x4b/0x1e0
[  385.119849 ]  ? preempt_count_sub+0x14/0xc0
[  385.119849 ]  ? _raw_spin_unlock_irqrestore+0x28/0x40
[  385.119849 ]  ? __wake_up_common_lock+0xe3/0x140
[  385.119849 ]  ? __wake_up_common+0x1e0/0x1e0
[  385.119849 ]  ? __sanitizer_cov_trace_const_cmp8+0x27/0x80
[  385.119849 ]  ? __rcu_read_unlock+0x48/0x70
[  385.119849 ]  ? kasan_unpoison+0x23/0x50
[  385.119849 ]  ? __kasan_slab_alloc+0x2c/0x80
[  385.119849 ]  ? memset+0x20/0x40
[  385.119849 ]  ? __sanitizer_cov_trace_const_cmp4+0x25/0x80
[  385.119849 ]  devlink_param_notify+0xce/0x190
[  385.119849 ]  devlink_unregister+0x92/0x2b0
[  385.119849 ]  remove_one+0x41/0x140
[  385.119849 ]  pci_device_remove+0x68/0x140
[  385.119849 ]  ? pcibios_free_irq+0x10/0x10
[  385.119849 ]  __device_release_driver+0x294/0x3f0
[  385.119849 ]  device_driver_detach+0x82/0x130
[  385.119849 ]  unbind_store+0x193/0x1b0
[  385.119849 ]  ? subsys_interface_unregister+0x270/0x270
[  385.119849 ]  drv_attr_store+0x4e/0x70
[  385.119849 ]  ? drv_attr_show+0x60/0x60
[  385.119849 ]  sysfs_kf_write+0xa7/0xc0
[  385.119849 ]  kernfs_fop_write_iter+0x23a/0x2f0
[  385.119849 ]  ? sysfs_kf_bin_read+0x160/0x160
[  385.119849 ]  new_sync_write+0x311/0x430
[  385.119849 ]  ? new_sync_read+0x480/0x480
[  385.119849 ]  ? _raw_spin_lock+0x87/0xe0
[  385.119849 ]  ? __sanitizer_cov_trace_cmp4+0x25/0x80
[  385.119849 ]  ? security_file_permission+0x94/0xa0
[  385.119849 ]  vfs_write+0x4c7/0x590
[  385.119849 ]  ksys_write+0xf6/0x1e0
[  385.119849 ]  ? __x64_sys_read+0x50/0x50
[  385.119849 ]  ? fpregs_assert_state_consistent+0x99/0xa0
[  385.119849 ]  do_syscall_64+0x3d/0x90
[  385.119849 ]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  385.119849 ] RIP: 0033:0x7fc36ef38504
[  385.119849 ] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f
80 00 00 00 00 48 8d 05 f9 61 0d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 49 89 d4 55 48 89 f5 53
[  385.119849 ] RSP: 002b:00007ffde0ff3d08 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  385.119849 ] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc36ef38504
[  385.119849 ] RDX: 000000000000000c RSI: 00007fc370521040 RDI: 0000000000000001
[  385.119849 ] RBP: 00007fc370521040 R08: 00007fc36f00b8c0 R09: 00007fc36ee4b740
[  385.119849 ] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc36f00a760
[  385.119849 ] R13: 000000000000000c R14: 00007fc36f005760 R15: 000000000000000c
[  385.119849 ]  </TASK>
[  385.119849 ]
[  385.119849 ] Allocated by task 65:
[  385.119849 ]  kasan_save_stack+0x1e/0x40
[  385.119849 ]  __kasan_kmalloc+0x81/0xa0
[  385.119849 ]  mlx5_init_fs+0x11b/0x1160
[  385.119849 ]  mlx5_load+0x13c/0x220
[  385.119849 ]  mlx5_load_one+0xda/0x160
[  385.119849 ]  mlx5_recover_device+0xb8/0x100
[  385.119849 ]  mlx5_health_try_recover+0x2f9/0x3a1
[  385.119849 ]  devlink_health_reporter_recover+0x75/0x100
[  385.119849 ]  devlink_health_report+0x26c/0x4b0
[  385.275909 ]  mlx5_fw_fatal_reporter_err_work+0x11e/0x1b0
[  385.275909 ]  process_one_work+0x520/0x970
[  385.275909 ]  worker_thread+0x378/0x950
[  385.275909 ]  kthread+0x1bb/0x200
[  385.275909 ]  ret_from_fork+0x1f/0x30
[  385.275909 ]
[  385.275909 ] Freed by task 65:
[  385.275909 ]  kasan_save_stack+0x1e/0x40
[  385.275909 ]  kasan_set_track+0x21/0x30
[  385.275909 ]  kasan_set_free_info+0x20/0x30
[  385.275909 ]  __kasan_slab_free+0xfc/0x140
[  385.275909 ]  kfree+0xa5/0x3b0
[  385.275909 ]  mlx5_unload+0x2e/0xb0
[  385.275909 ]  mlx5_unload_one+0x86/0xb0
[  385.275909 ]  mlx5_fw_fatal_reporter_err_work.cold+0xca/0xcf
[  385.275909 ]  process_one_work+0x520/0x970
[  385.275909 ]  worker_thread+0x378/0x950
[  385.275909 ]  kthread+0x1bb/0x200
[  385.275909 ]  ret_from_fork+0x1f/0x30
[  385.275909 ]
[  385.275909 ] The buggy address belongs to the object at ffff888104b79300
[  385.275909 ]  which belongs to the cache kmalloc-128 of size 128
[  385.275909 ] The buggy address is located 8 bytes inside of
[  385.275909 ]  128-byte region [ffff888104b79300, ffff888104b79380)
[  385.275909 ] The buggy address belongs to the page:
[  385.275909 ] page:00000000de44dd39 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x104b78
[  385.275909 ] head:00000000de44dd39 order:1 compound_mapcount:0
[  385.275909 ] flags: 0x8000000000010200(slab|head|zone=2)
[  385.275909 ] raw: 8000000000010200 0000000000000000 dead000000000122 ffff8881000428c0
[  385.275909 ] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[  385.275909 ] page dumped because: kasan: bad access detected
[  385.275909 ]
[  385.275909 ] Memory state around the buggy address:
[  385.275909 ]  ffff888104b79200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc
[  385.275909 ]  ffff888104b79280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  385.275909 ] >ffff888104b79300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  385.275909 ]                       ^
[  385.275909 ]  ffff888104b79380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  385.275909 ]  ffff888104b79400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  385.275909 ]]

Fixes: e890acd ("net/mlx5: Add devlink flow_steering_mode parameter")
Signed-off-by: Shay Drory <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
akiyks pushed a commit that referenced this issue May 21, 2022
The splat below can be seen when running kvm-unit-test:

     =============================
     WARNING: suspicious RCU usage
     5.18.0-rc7 #5 Tainted: G          IOE
     -----------------------------
     /home/kernel/linux/arch/x86/kvm/../../../virt/kvm/eventfd.c:80 RCU-list traversed in non-reader section!!

     other info that might help us debug this:

     rcu_scheduler_active = 2, debug_locks = 1
     4 locks held by qemu-system-x86/35124:
      #0: ffff9725391d80b8 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x710 [kvm]
      #1: ffffbd25cfb2a0b8 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0xdeb/0x1900 [kvm]
      #2: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: kvm_hv_notify_acked_sint+0x79/0x1e0 [kvm]
      #3: ffffbd25cfb2b920 (&kvm->irq_srcu){....}-{0:0}, at: irqfd_resampler_ack+0x5/0x110 [kvm]

     stack backtrace:
     CPU: 2 PID: 35124 Comm: qemu-system-x86 Tainted: G          IOE     5.18.0-rc7 #5
     Call Trace:
      <TASK>
      dump_stack_lvl+0x6c/0x9b
      irqfd_resampler_ack+0xfd/0x110 [kvm]
      kvm_notify_acked_gsi+0x32/0x90 [kvm]
      kvm_hv_notify_acked_sint+0xc5/0x1e0 [kvm]
      kvm_hv_set_msr_common+0xec1/0x1160 [kvm]
      kvm_set_msr_common+0x7c3/0xf60 [kvm]
      vmx_set_msr+0x394/0x1240 [kvm_intel]
      kvm_set_msr_ignored_check+0x86/0x200 [kvm]
      kvm_emulate_wrmsr+0x4f/0x1f0 [kvm]
      vmx_handle_exit+0x6fb/0x7e0 [kvm_intel]
      vcpu_enter_guest+0xe5a/0x1900 [kvm]
      kvm_arch_vcpu_ioctl_run+0x16e/0xac0 [kvm]
      kvm_vcpu_ioctl+0x279/0x710 [kvm]
      __x64_sys_ioctl+0x83/0xb0
      do_syscall_64+0x3b/0x90
      entry_SYSCALL_64_after_hwframe+0x44/0xae

resampler-list is protected by irq_srcu (see kvm_irqfd_assign), so fix
the false positive by using list_for_each_entry_srcu().

Signed-off-by: Wanpeng Li <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
akiyks pushed a commit that referenced this issue May 24, 2022
In idtentry_vc(), vc_switch_off_ist() determines a safe stack to
switch to, off of the IST stack. Annotate the new stack switch with
ENCODE_FRAME_POINTER in case UNWINDER_FRAME_POINTER is used.

A stack walk before looks like this:

  CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl
   dump_stack
   kernel_exc_vmm_communication
   asm_exc_vmm_communication
   ? native_read_msr
   ? __x2apic_disable.part.0
   ? x2apic_setup
   ? cpu_init
   ? trap_init
   ? start_kernel
   ? x86_64_start_reservations
   ? x86_64_start_kernel
   ? secondary_startup_64_no_verify
   </TASK>

and with the fix, the stack dump is exact:

  CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc7+ #3
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl
   dump_stack
   kernel_exc_vmm_communication
   asm_exc_vmm_communication
  RIP: 0010:native_read_msr
  Code: ...
  < snipped regs >
   ? __x2apic_disable.part.0
   x2apic_setup
   cpu_init
   trap_init
   start_kernel
   x86_64_start_reservations
   x86_64_start_kernel
   secondary_startup_64_no_verify
   </TASK>

  [ bp: Test in a SEV-ES guest and rewrite the commit message to
    explain what exactly this does. ]

Fixes: a13644f ("x86/entry/64: Add entry code for #VC handler")
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
akiyks pushed a commit that referenced this issue May 24, 2022
Problem statement:

Once the user has disabled turbo frequency by

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Step to reproduce:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99

would launch 1 thread and bind it to CPU3, lasting for 100 seconds,
with a CPU utilization of 99%. [1]

top result:
%Cpu3  : 98.4 us,  0.0 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
  .util_avg                      : 611

So the util_avg/cpu capacity is 611/1024, which is much smaller than
98.4% shown in the top result.

This might impact some logic in the scheduler. For example,
group_is_overloaded() would compare the group_capacity and group_util
in the sched group, to check if this sched group is overloaded or not.
With this gap, even when there is a nearly 100% workload, the sched
group will not be regarded as overloaded. Besides group_is_overloaded(),
there are also other victims. There is a ongoing work that aims to
optimize the task wakeup in a LLC domain. The main idea is to stop
searching idle CPUs if the sched domain is overloaded[2]. This proposal
also relies on the util_avg/CPU capacity to decide whether the LLC
domain is overloaded.

Analysis:

CPU frequency invariance has caused this difference. In summary,
the util_sum of cfs rq would decay quite fast when the CPU is in
idle, when the CPU frequency invariance is enabled.

The detail is as followed:

As depicted in update_rq_clock_pelt(), when the frequency invariance
is enabled, there would be two clock variables on each rq, clock_task
and clock_pelt:

   The clock_pelt scales the time to reflect the effective amount of
   computation done during the running delta time but then syncs back to
   clock_task when rq is idle.

   absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
   @ max frequency  ------******---------------******---------------
   @ half frequency ------************---------************---------
   clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16

The fast decay of util_sum during idle is due to:

 1. rq->clock_pelt is always behind rq->clock_task
 2. rq->last_update is updated to rq->clock_pelt' after invoking
    ___update_load_sum()
 3. Then the CPU becomes idle, the rq->clock_pelt' would be suddenly
    increased a lot to rq->clock_task
 4. Enters ___update_load_sum() again, the idle period is calculated by
    rq->clock_task - rq->last_update, AKA, rq->clock_task - rq->clock_pelt'.
    The lower the CPU frequency is, the larger the delta =
    rq->clock_task - rq->clock_pelt' will be. Since the idle period will be
    used to decay the util_sum only, the util_sum drops significantly during
    idle period.

Proposal:

This symptom is not only caused by disabling turbo frequency, but it
would also appear if the user limits the max frequency at runtime.

Because, if the frequency is always lower than the max frequency,
CPU frequency invariance would decay the util_sum quite fast during
idle.

As some end users would disable turbo after boot up, this patch aims to
present this symptom and deals with turbo scenarios for now.

It might be ideal if CPU frequency invariance is aware of the max CPU
frequency (user specified) at runtime in the future.

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/[email protected]/ #2
Signed-off-by: Chen Yu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
akiyks pushed a commit that referenced this issue May 27, 2022
Add the description for the i.MX8MP media blk-ctrl.

Signed-off-by: Paul Elder <[email protected]>
Reviewed-by: Laurent Pinchart <[email protected]>
Signed-off-by: Laurent Pinchart <[email protected]>
Reviewed-by: Marek Vasut <[email protected]>
Tested-by: Marek Vasut <[email protected]> # MX8MP LCDIF #1 and #2
Signed-off-by: Lucas Stach <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
akiyks pushed a commit that referenced this issue May 27, 2022
When shmem_reconfigure() calls __percpu_counter_compare(), the second
parameter is unsigned long long.  But in the definition of
__percpu_counter_compare(), the second parameter is s64.  So when
__percpu_counter_compare() executes abs(count - rhs), UBSAN shows the
following warning:

================================================================================
UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6
signed integer overflow:
0 - -9223372036854775808 cannot be represented in type 'long long int'
CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G                 ---------r-  - 4.18.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline]
 dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117
 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159
 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190
 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209
 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline]
 shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530
 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888
 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline]
 do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844
 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075
 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline]
 __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline]
 __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086
 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298
 entry_SYSCALL_64_after_hwframe+0x6a/0xdf
RIP: 0033:0x46b5e9
Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000
RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000
R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c
================================================================================

[[email protected]: tweak error message text]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luo Meng <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yu Kuai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue May 27, 2022
pty_write() invokes kmalloc() which may invoke a normal printk() to print
failure message.  This can cause a deadlock in the scenario reported by
syz-bot below:

       CPU0              CPU1                    CPU2
       ----              ----                    ----
                         lock(console_owner);
                                                 lock(&port_lock_key);
  lock(&port->lock);
                         lock(&port_lock_key);
                                                 lock(&port->lock);
  lock(console_owner);

As commit dbdda84 ("printk: Add console owner and waiter logic to
load balance console writes") said, such deadlock can be prevented by
using printk_deferred() in kmalloc() (which is invoked in the section
guarded by the port->lock).  But there are too many printk() on the
kmalloc() path, and kmalloc() can be called from anywhere, so changing
printk() to printk_deferred() is too complicated and inelegant.

Therefore, this patch chooses to specify __GFP_NOWARN to kmalloc(), so
that printk() will not be called, and this deadlock problem can be
avoided.

Syzbot reported the following lockdep error:

======================================================
WARNING: possible circular locking dependency detected
5.4.143-00237-g08ccc19a-dirty torvalds#10 Not tainted
------------------------------------------------------
syz-executor.4/29420 is trying to acquire lock:
ffffffff8aedb2a0 (console_owner){....}-{0:0}, at: console_trylock_spinning kernel/printk/printk.c:1752 [inline]
ffffffff8aedb2a0 (console_owner){....}-{0:0}, at: vprintk_emit+0x2ca/0x470 kernel/printk/printk.c:2023

but task is already holding lock:
ffff8880119c9158 (&port->lock){-.-.}-{2:2}, at: pty_write+0xf4/0x1f0 drivers/tty/pty.c:120

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&port->lock){-.-.}-{2:2}:
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x35/0x50 kernel/locking/spinlock.c:159
       tty_port_tty_get drivers/tty/tty_port.c:288 [inline]          		<-- lock(&port->lock);
       tty_port_default_wakeup+0x1d/0xb0 drivers/tty/tty_port.c:47
       serial8250_tx_chars+0x530/0xa80 drivers/tty/serial/8250/8250_port.c:1767
       serial8250_handle_irq.part.0+0x31f/0x3d0 drivers/tty/serial/8250/8250_port.c:1854
       serial8250_handle_irq drivers/tty/serial/8250/8250_port.c:1827 [inline] 	<-- lock(&port_lock_key);
       serial8250_default_handle_irq+0xb2/0x220 drivers/tty/serial/8250/8250_port.c:1870
       serial8250_interrupt+0xfd/0x200 drivers/tty/serial/8250/8250_core.c:126
       __handle_irq_event_percpu+0x109/0xa50 kernel/irq/handle.c:156
       [...]

-> #1 (&port_lock_key){-.-.}-{2:2}:
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x35/0x50 kernel/locking/spinlock.c:159
       serial8250_console_write+0x184/0xa40 drivers/tty/serial/8250/8250_port.c:3198
										<-- lock(&port_lock_key);
       call_console_drivers kernel/printk/printk.c:1819 [inline]
       console_unlock+0x8cb/0xd00 kernel/printk/printk.c:2504
       vprintk_emit+0x1b5/0x470 kernel/printk/printk.c:2024			<-- lock(console_owner);
       vprintk_func+0x8d/0x250 kernel/printk/printk_safe.c:394
       printk+0xba/0xed kernel/printk/printk.c:2084
       register_console+0x8b3/0xc10 kernel/printk/printk.c:2829
       univ8250_console_init+0x3a/0x46 drivers/tty/serial/8250/8250_core.c:681
       console_init+0x49d/0x6d3 kernel/printk/printk.c:2915
       start_kernel+0x5e9/0x879 init/main.c:713
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241

-> #0 (console_owner){....}-{0:0}:
       [...]
       lock_acquire+0x127/0x340 kernel/locking/lockdep.c:4734
       console_trylock_spinning kernel/printk/printk.c:1773 [inline]		<-- lock(console_owner);
       vprintk_emit+0x307/0x470 kernel/printk/printk.c:2023
       vprintk_func+0x8d/0x250 kernel/printk/printk_safe.c:394
       printk+0xba/0xed kernel/printk/printk.c:2084
       fail_dump lib/fault-inject.c:45 [inline]
       should_fail+0x67b/0x7c0 lib/fault-inject.c:144
       __should_failslab+0x152/0x1c0 mm/failslab.c:33
       should_failslab+0x5/0x10 mm/slab_common.c:1224
       slab_pre_alloc_hook mm/slab.h:468 [inline]
       slab_alloc_node mm/slub.c:2723 [inline]
       slab_alloc mm/slub.c:2807 [inline]
       __kmalloc+0x72/0x300 mm/slub.c:3871
       kmalloc include/linux/slab.h:582 [inline]
       tty_buffer_alloc+0x23f/0x2a0 drivers/tty/tty_buffer.c:175
       __tty_buffer_request_room+0x156/0x2a0 drivers/tty/tty_buffer.c:273
       tty_insert_flip_string_fixed_flag+0x93/0x250 drivers/tty/tty_buffer.c:318
       tty_insert_flip_string include/linux/tty_flip.h:37 [inline]
       pty_write+0x126/0x1f0 drivers/tty/pty.c:122				<-- lock(&port->lock);
       n_tty_write+0xa7a/0xfc0 drivers/tty/n_tty.c:2356
       do_tty_write drivers/tty/tty_io.c:961 [inline]
       tty_write+0x512/0x930 drivers/tty/tty_io.c:1045
       __vfs_write+0x76/0x100 fs/read_write.c:494
       [...]

other info that might help us debug this:

Chain exists of:
  console_owner --> &port_lock_key --> &port->lock

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b6da31b ("tty: Fix data race in tty_insert_flip_string_fixed_flag")
Signed-off-by: Qi Zheng <[email protected]>
Acked-by: Jiri Slaby <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Cc: Akinobu Mita <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue May 27, 2022
When user_dlm_destroy_lock failed, it didn't clean up the flags it set
before exit.  For USER_LOCK_IN_TEARDOWN, if this function fails because of
lock is still in used, next time when unlink invokes this function, it
will return succeed, and then unlink will remove inode and dentry if lock
is not in used(file closed), but the dlm lock is still linked in dlm lock
resource, then when bast come in, it will trigger a panic due to
user-after-free.  See the following panic call trace.  To fix this,
USER_LOCK_IN_TEARDOWN should be reverted if fail.  And also error should
be returned if USER_LOCK_IN_TEARDOWN is set to let user know that unlink
fail.

For the case of ocfs2_dlm_unlock failure, besides USER_LOCK_IN_TEARDOWN,
USER_LOCK_BUSY is also required to be cleared.  Even though spin lock is
released in between, but USER_LOCK_IN_TEARDOWN is still set, for
USER_LOCK_BUSY, if before every place that waits on this flag,
USER_LOCK_IN_TEARDOWN is checked to bail out, that will make sure no flow
waits on the busy flag set by user_dlm_destroy_lock(), then we can
simplely revert USER_LOCK_BUSY when ocfs2_dlm_unlock fails.  Fix
user_dlm_cluster_lock() which is the only function not following this.

[  941.336392] (python,26174,16):dlmfs_unlink:562 ERROR: unlink
004fb0000060000b5a90b8c847b72e1, error -16 from destroy
[  989.757536] ------------[ cut here ]------------
[  989.757709] kernel BUG at fs/ocfs2/dlmfs/userdlm.c:173!
[  989.757876] invalid opcode: 0000 [#1] SMP
[  989.758027] Modules linked in: ksplice_2zhuk2jr_ib_ipoib_new(O)
ksplice_2zhuk2jr(O) mptctl mptbase xen_netback xen_blkback xen_gntalloc
xen_gntdev xen_evtchn cdc_ether usbnet mii ocfs2 jbd2 rpcsec_gss_krb5
auth_rpcgss nfsv4 nfsv3 nfs_acl nfs fscache lockd grace ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc
fcoe libfcoe libfc scsi_transport_fc sunrpc ipmi_devintf bridge stp llc
rds_rdma rds bonding ib_sdp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm falcon_lsm_serviceable(PE) falcon_nf_netcontain(PE)
mlx4_vnic falcon_kal(E) falcon_lsm_pinned_13402(E) mlx4_ib ib_sa ib_mad
ib_core ib_addr xenfs xen_privcmd dm_multipath iTCO_wdt iTCO_vendor_support
pcspkr sb_edac edac_core i2c_i801 lpc_ich mfd_core ipmi_ssif i2c_core ipmi_si
ipmi_msghandler
[  989.760686]  ioatdma sg ext3 jbd mbcache sd_mod ahci libahci ixgbe dca ptp
pps_core vxlan udp_tunnel ip6_udp_tunnel megaraid_sas mlx4_core crc32c_intel
be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 mdio
libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi wmi
dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
ksplice_2zhuk2jr_ib_ipoib_old]
[  989.761987] CPU: 10 PID: 19102 Comm: dlm_thread Tainted: P           OE
4.1.12-124.57.1.el6uek.x86_64 #2
[  989.762290] Hardware name: Oracle Corporation ORACLE SERVER
X5-2/ASM,MOTHERBOARD,1U, BIOS 30350100 06/17/2021
[  989.762599] task: ffff880178af6200 ti: ffff88017f7c8000 task.ti:
ffff88017f7c8000
[  989.762848] RIP: e030:[<ffffffffc07d4316>]  [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[  989.763185] RSP: e02b:ffff88017f7cbcb8  EFLAGS: 00010246
[  989.763353] RAX: 0000000000000000 RBX: ffff880174d48008 RCX:
0000000000000003
[  989.763565] RDX: 0000000000120012 RSI: 0000000000000003 RDI:
ffff880174d48170
[  989.763778] RBP: ffff88017f7cbcc8 R08: ffff88021f4293b0 R09:
0000000000000000
[  989.763991] R10: ffff880179c8c000 R11: 0000000000000003 R12:
ffff880174d48008
[  989.764204] R13: 0000000000000003 R14: ffff880179c8c000 R15:
ffff88021db7a000
[  989.764422] FS:  0000000000000000(0000) GS:ffff880247480000(0000)
knlGS:ffff880247480000
[  989.764685] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  989.764865] CR2: ffff8000007f6800 CR3: 0000000001ae0000 CR4:
0000000000042660
[  989.765081] Stack:
[  989.765167]  0000000000000003 ffff880174d48040 ffff88017f7cbd18
ffffffffc07d455f
[  989.765442]  ffff88017f7cbd88 ffffffff816fb639 ffff88017f7cbd38
ffff8800361b5600
[  989.765717]  ffff88021db7a000 ffff88021f429380 0000000000000003
ffffffffc0453020
[  989.765991] Call Trace:
[  989.766093]  [<ffffffffc07d455f>] user_bast+0x5f/0xf0 [ocfs2_dlmfs]
[  989.766287]  [<ffffffff816fb639>] ? schedule_timeout+0x169/0x2d0
[  989.766475]  [<ffffffffc0453020>] ? o2dlm_lock_ast_wrapper+0x20/0x20
[ocfs2_stack_o2cb]
[  989.766738]  [<ffffffffc045303a>] o2dlm_blocking_ast_wrapper+0x1a/0x20
[ocfs2_stack_o2cb]
[  989.767010]  [<ffffffffc0864ec6>] dlm_do_local_bast+0x46/0xe0 [ocfs2_dlm]
[  989.767217]  [<ffffffffc084f5cc>] ? dlm_lockres_calc_usage+0x4c/0x60
[ocfs2_dlm]
[  989.767466]  [<ffffffffc08501f1>] dlm_thread+0xa31/0x1140 [ocfs2_dlm]
[  989.767662]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.767834]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768006]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768178]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768349]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768521]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.768693]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.768893]  [<ffffffff816f78ce>] ? __schedule+0x23e/0x810
[  989.769067]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.769241]  [<ffffffff810ce4d0>] ? wait_woken+0x90/0x90
[  989.769411]  [<ffffffffc084f7c0>] ? dlm_kick_thread+0x80/0x80 [ocfs2_dlm]
[  989.769617]  [<ffffffff810a8bbb>] kthread+0xcb/0xf0
[  989.769774]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.769945]  [<ffffffff816f78da>] ? __schedule+0x24a/0x810
[  989.770117]  [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[  989.770321]  [<ffffffff816fdaa1>] ret_from_fork+0x61/0x90
[  989.770492]  [<ffffffff810a8af0>] ? kthread_create_on_node+0x180/0x180
[  989.770689] Code: d0 00 00 00 f0 45 7d c0 bf 00 20 00 00 48 89 83 c0 00 00
00 48 89 83 c8 00 00 00 e8 55 c1 8c c0 83 4b 04 10 48 83 c4 08 5b 5d c3 <0f>
0b 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 83
[  989.771892] RIP  [<ffffffffc07d4316>]
__user_dlm_queue_lockres.part.4+0x76/0x80 [ocfs2_dlmfs]
[  989.772174]  RSP <ffff88017f7cbcb8>
[  989.772704] ---[ end trace ebd1e38cebcc93a8 ]---
[  989.772907] Kernel panic - not syncing: Fatal exception
[  989.773173] Kernel Offset: disabled

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Junxiao Bi <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue May 28, 2022
The irqchip ops are called with a raw spinlock held, so the subsequent
regmap usage cannot use a plain spinlock.

spi-hid-apple-of spi0.0: spihid_apple_of_probe:74

=============================
[ BUG: Invalid wait context ]
5.18.0-asahi-00176-g0fa3ab03bdea #1337 Not tainted
-----------------------------
kworker/u20:3/86 is trying to lock:
ffff8000166b5018 (pinctrl_apple_gpio:462:(&regmap_config)->lock){....}-{3:3}, at: regmap_lock_spinlock+0x18/0x30
other info that might help us debug this:
context-{5:5}
7 locks held by kworker/u20:3/86:
 #0: ffff800017725d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c8/0x670
 #1: ffff80001e33bdd0 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1c8/0x670
 #2: ffff800017d629a0 (&dev->mutex){....}-{4:4}, at: __device_attach+0x30/0x17c
 #3: ffff80002414e618 (&ctlr->add_lock){+.+.}-{4:4}, at: spi_add_device+0x40/0x80
 #4: ffff800024116990 (&dev->mutex){....}-{4:4}, at: __device_attach+0x30/0x17c
 #5: ffff800022d4be58 (request_class){+.+.}-{4:4}, at: __setup_irq+0xa8/0x720
 torvalds#6: ffff800022d4bcc8 (lock_class){....}-{2:2}, at: __setup_irq+0xcc/0x720

Fixes: a0f160f ("pinctrl: add pinctrl/GPIO driver for Apple SoCs")
Signed-off-by: Hector Martin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
akiyks pushed a commit that referenced this issue May 30, 2022
We see the following GPF when register_ftrace_direct fails:

[ ] general protection fault, probably for non-canonical address \
  0x200000000000010: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[...]
[ ] RIP: 0010:ftrace_find_rec_direct+0x53/0x70
[ ] Code: 48 c1 e0 03 48 03 42 08 48 8b 10 31 c0 48 85 d2 74 [...]
[ ] RSP: 0018:ffffc9000138bc10 EFLAGS: 00010206
[ ] RAX: 0000000000000000 RBX: ffffffff813e0df0 RCX: 000000000000003b
[ ] RDX: 0200000000000000 RSI: 000000000000000c RDI: ffffffff813e0df0
[ ] RBP: ffffffffa00a3000 R08: ffffffff81180ce0 R09: 0000000000000001
[ ] R10: ffffc9000138bc18 R11: 0000000000000001 R12: ffffffff813e0df0
[ ] R13: ffffffff813e0df0 R14: ffff888171b56400 R15: 0000000000000000
[ ] FS:  00007fa9420c7780(0000) GS:ffff888ff6a00000(0000) knlGS:000000000
[ ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ ] CR2: 000000000770d000 CR3: 0000000107d50003 CR4: 0000000000370ee0
[ ] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ ] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ ] Call Trace:
[ ]  <TASK>
[ ]  register_ftrace_direct+0x54/0x290
[ ]  ? render_sigset_t+0xa0/0xa0
[ ]  bpf_trampoline_update+0x3f5/0x4a0
[ ]  ? 0xffffffffa00a3000
[ ]  bpf_trampoline_link_prog+0xa9/0x140
[ ]  bpf_tracing_prog_attach+0x1dc/0x450
[ ]  bpf_raw_tracepoint_open+0x9a/0x1e0
[ ]  ? find_held_lock+0x2d/0x90
[ ]  ? lock_release+0x150/0x430
[ ]  __sys_bpf+0xbd6/0x2700
[ ]  ? lock_is_held_type+0xd8/0x130
[ ]  __x64_sys_bpf+0x1c/0x20
[ ]  do_syscall_64+0x3a/0x80
[ ]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ ] RIP: 0033:0x7fa9421defa9
[ ] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 9 f8 [...]
[ ] RSP: 002b:00007ffed743bd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[ ] RAX: ffffffffffffffda RBX: 00000000069d2480 RCX: 00007fa9421defa9
[ ] RDX: 0000000000000078 RSI: 00007ffed743bd80 RDI: 0000000000000011
[ ] RBP: 00007ffed743be00 R08: 0000000000bb7270 R09: 0000000000000000
[ ] R10: 00000000069da210 R11: 0000000000000246 R12: 0000000000000001
[ ] R13: 00007ffed743c4b0 R14: 00000000069d2480 R15: 0000000000000001
[ ]  </TASK>
[ ] Modules linked in: klp_vm(OK)
[ ] ---[ end trace 0000000000000000 ]---

One way to trigger this is:
  1. load a livepatch that patches kernel function xxx;
  2. run bpftrace -e 'kfunc:xxx {}', this will fail (expected for now);
  3. repeat #2 => gpf.

This is because the entry is added to direct_functions, but not removed.
Fix this by remove the entry from direct_functions when
register_ftrace_direct fails.

Also remove the last trailing space from ftrace.c, so we don't have to
worry about it anymore.

Link: https://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Fixes: 763e34e ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
akiyks pushed a commit that referenced this issue Jun 1, 2022
Let's avoid false-alarmed lockdep warning.

[   58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}:
[   58.915975] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.916738] [T1501146] system_server:        f2fs_quota_sync+0x60/0x1a8
[   58.917563] [T1501146] system_server:        block_operations+0x16c/0x43c
[   58.918410] [T1501146] system_server:        f2fs_write_checkpoint+0x114/0x318
[   58.919312] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.920214] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.920999] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.921862] [T1501146] system_server:        f2fs_sync_file+0x30/0x48
[   58.922667] [T1501146] system_server:        __arm64_sys_fsync+0x84/0xf8
[   58.923506] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.924604] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.925366] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.926094] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.926920] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

[   58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}:
[   58.928889] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.929650] [T1501146] system_server:        f2fs_write_checkpoint+0xbc/0x318
[   58.930541] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.931443] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.932226] [T1501146] system_server:        sync_filesystem+0xac/0x130
[   58.933053] [T1501146] system_server:        generic_shutdown_super+0x38/0x150
[   58.933958] [T1501146] system_server:        kill_block_super+0x24/0x58
[   58.934791] [T1501146] system_server:        kill_f2fs_super+0xcc/0x124
[   58.935618] [T1501146] system_server:        deactivate_locked_super+0x90/0x120
[   58.936529] [T1501146] system_server:        deactivate_super+0x74/0xac
[   58.937356] [T1501146] system_server:        cleanup_mnt+0x128/0x168
[   58.938150] [T1501146] system_server:        __cleanup_mnt+0x18/0x28
[   58.938944] [T1501146] system_server:        task_work_run+0xb8/0x14c
[   58.939749] [T1501146] system_server:        do_notify_resume+0x114/0x1e8
[   58.940595] [T1501146] system_server:        work_pending+0xc/0x5f0

[   58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}:
[   58.942519] [T1501146] system_server:        __lock_acquire+0x1270/0x2868
[   58.943366] [T1501146] system_server:        lock_acquire+0x114/0x294
[   58.944169] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.944930] [T1501146] system_server:        f2fs_issue_checkpoint+0x13c/0x21c
[   58.945831] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.946614] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.947472] [T1501146] system_server:        f2fs_ioc_commit_atomic_write+0xc8/0x14c
[   58.948439] [T1501146] system_server:        __f2fs_ioctl+0x674/0x154c
[   58.949253] [T1501146] system_server:        f2fs_ioctl+0x54/0x88
[   58.950018] [T1501146] system_server:        __arm64_sys_ioctl+0xa8/0x110
[   58.950865] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.951965] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.952727] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.953454] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.954279] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

Cc: [email protected]
Signed-off-by: Jaegeuk Kim <[email protected]>
akiyks pushed a commit that referenced this issue Jun 1, 2022
Send along the already-allocated fattr along with nfs4_fs_locations, and
drop the memcpy of fattr.  We end up growing two more allocations, but this
fixes up a crash as:

PID: 790    TASK: ffff88811b43c000  CPU: 0   COMMAND: "ls"
 #0 [ffffc90000857920] panic at ffffffff81b9bfde
 #1 [ffffc900008579c0] do_trap at ffffffff81023a9b
 #2 [ffffc90000857a10] do_error_trap at ffffffff81023b78
 #3 [ffffc90000857a58] exc_stack_segment at ffffffff81be1f45
 #4 [ffffc90000857a80] asm_exc_stack_segment at ffffffff81c009de
 #5 [ffffc90000857b08] nfs_lookup at ffffffffa0302322 [nfs]
 torvalds#6 [ffffc90000857b70] __lookup_slow at ffffffff813a4a5f
 torvalds#7 [ffffc90000857c60] walk_component at ffffffff813a86c4
 torvalds#8 [ffffc90000857cb8] path_lookupat at ffffffff813a9553
 torvalds#9 [ffffc90000857cf0] filename_lookup at ffffffff813ab86b

Suggested-by: Trond Myklebust <[email protected]>
Fixes: 9558a00 ("NFS: Remove the label from the nfs4_lookup_res struct")
Signed-off-by: Benjamin Coddington <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
akiyks pushed a commit that referenced this issue Jun 2, 2022
A rare BUG_ON triggered in assoc_array_gc:

    [3430308.818153] kernel BUG at lib/assoc_array.c:1609!

Which corresponded to the statement currently at line 1593 upstream:

    BUG_ON(assoc_array_ptr_is_meta(p));

Using the data from the core dump, I was able to generate a userspace
reproducer[1] and determine the cause of the bug.

[1]: https://github.com/brenns10/kernel_stuff/tree/master/assoc_array_gc

After running the iterator on the entire branch, an internal tree node
looked like the following:

    NODE (nr_leaves_on_branch: 3)
      SLOT [0] NODE (2 leaves)
      SLOT [1] NODE (1 leaf)
      SLOT [2..f] NODE (empty)

In the userspace reproducer, the pr_devel output when compressing this
node was:

    -- compress node 0x5607cc089380 --
    free=0, leaves=0
    [0] retain node 2/1 [nx 0]
    [1] fold node 1/1 [nx 0]
    [2] fold node 0/1 [nx 2]
    [3] fold node 0/2 [nx 2]
    [4] fold node 0/3 [nx 2]
    [5] fold node 0/4 [nx 2]
    [6] fold node 0/5 [nx 2]
    [7] fold node 0/6 [nx 2]
    [8] fold node 0/7 [nx 2]
    [9] fold node 0/8 [nx 2]
    [10] fold node 0/9 [nx 2]
    [11] fold node 0/10 [nx 2]
    [12] fold node 0/11 [nx 2]
    [13] fold node 0/12 [nx 2]
    [14] fold node 0/13 [nx 2]
    [15] fold node 0/14 [nx 2]
    after: 3

At slot 0, an internal node with 2 leaves could not be folded into the
node, because there was only one available slot (slot 0). Thus, the
internal node was retained. At slot 1, the node had one leaf, and was
able to be folded in successfully. The remaining nodes had no leaves,
and so were removed. By the end of the compression stage, there were 14
free slots, and only 3 leaf nodes. The tree was ascended and then its
parent node was compressed. When this node was seen, it could not be
folded, due to the internal node it contained.

The invariant for compression in this function is: whenever
nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT, the node should contain all
leaf nodes. The compression step currently cannot guarantee this, given
the corner case shown above.

To fix this issue, retry compression whenever we have retained a node,
and yet nr_leaves_on_branch < ASSOC_ARRAY_FAN_OUT. This second
compression will then allow the node in slot 1 to be folded in,
satisfying the invariant. Below is the output of the reproducer once the
fix is applied:

    -- compress node 0x560e9c562380 --
    free=0, leaves=0
    [0] retain node 2/1 [nx 0]
    [1] fold node 1/1 [nx 0]
    [2] fold node 0/1 [nx 2]
    [3] fold node 0/2 [nx 2]
    [4] fold node 0/3 [nx 2]
    [5] fold node 0/4 [nx 2]
    [6] fold node 0/5 [nx 2]
    [7] fold node 0/6 [nx 2]
    [8] fold node 0/7 [nx 2]
    [9] fold node 0/8 [nx 2]
    [10] fold node 0/9 [nx 2]
    [11] fold node 0/10 [nx 2]
    [12] fold node 0/11 [nx 2]
    [13] fold node 0/12 [nx 2]
    [14] fold node 0/13 [nx 2]
    [15] fold node 0/14 [nx 2]
    internal nodes remain despite enough space, retrying
    -- compress node 0x560e9c562380 --
    free=14, leaves=1
    [0] fold node 2/15 [nx 0]
    after: 3

Changes
=======
DH:
 - Use false instead of 0.
 - Reorder the inserted lines in a couple of places to put retained before
   next_slot.

ver #2)
 - Fix typo in pr_devel, correct comparison to "<="

Fixes: 3cb9895 ("Add a generic associative array implementation.")
Cc: <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: Andrew Morton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
akiyks pushed a commit that referenced this issue Jun 2, 2022
Since 3e135cd ("netfilter: nft_dynset: dynamic stateful expression
instantiation"), it is possible to attach stateful expressions to set
elements.

cd5125d ("netfilter: nf_tables: split set destruction in deactivate
and destroy phase") introduces conditional destruction on the object to
accomodate transaction semantics.

nft_expr_init() calls expr->ops->init() first, then check for
NFT_STATEFUL_EXPR, this stills allows to initialize a non-stateful
lookup expressions which points to a set, which might lead to UAF since
the set is not properly detached from the set->binding for this case.
Anyway, this combination is non-sense from nf_tables perspective.

This patch fixes this problem by checking for NFT_STATEFUL_EXPR before
expr->ops->init() is called.

The reporter provides a KASAN splat and a poc reproducer (similar to
those autogenerated by syzbot to report use-after-free errors). It is
unknown to me if they are using syzbot or if they use similar automated
tool to locate the bug that they are reporting.

For the record, this is the KASAN splat.

[   85.431824] ==================================================================
[   85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20
[   85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776
[   85.434756]
[   85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G        W         5.18.0+ #2
[   85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014

Fixes: 0b2d8a7 ("netfilter: nf_tables: add helper functions for expression handling")
Reported-and-tested-by: Aaron Adams <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
akiyks pushed a commit that referenced this issue Jun 2, 2022
…nel/git/at91/linux into arm/late

AT91 SoC #2 for 5.19:

- One Kconfig fix for random build error

* tag 'at91-soc-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: at91: pm: Fix rand build error

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
akiyks pushed a commit that referenced this issue Jun 2, 2022
…el/git/at91/linux into arm/late

AT91 DT #2 for 5.19:

- at91: more DT compliance updates for RTC and RTT nodes
- at91: sama7g5: add microphone support

* tag 'at91-dt-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: dts: at91: sama7g5ek: add node for PDMC0
  ARM: dts: at91: sama7g5: add nodes for PDMC
  ARM: dts: at91: Use the generic "rtc" node name for the rtt IPs
  ARM: dts: at91: Add the required 'atmel, rtt-rtc-time-reg' property

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
akiyks pushed a commit that referenced this issue Jun 10, 2022
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled.  This was causing the
following complaint[1] from gcc v12:

  In file included from include/linux/string.h:253,
                   from include/linux/ceph/ceph_debug.h:7,
                   from fs/ceph/inode.c:2:
  In function 'fortify_memset_chk',
      inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
      inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
  include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    242 |                         __write_overflow_field(p_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode).  The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

  perl -p -i -e 's/vfs_inode/netfs.inode/'g \
        `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
 - Fix a couple of missed name changes due to a disabled cifs option.
 - Rename nfs_i_context to nfs_inode
 - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
   structs.

[ This also undoes commit 507160f ("netfs: gcc-12: temporarily
  disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
cc: Jonathan Corbet <[email protected]>
cc: Eric Van Hensbergen <[email protected]>
cc: Latchesar Ionkov <[email protected]>
cc: Dominique Martinet <[email protected]>
cc: Christian Schoenebeck <[email protected]>
cc: Marc Dionne <[email protected]>
cc: Ilya Dryomov <[email protected]>
cc: Steve French <[email protected]>
cc: William Kucharski <[email protected]>
cc: "Matthew Wilcox (Oracle)" <[email protected]>
cc: Dave Chinner <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ [1]
Link: https://lore.kernel.org/r/[email protected]/ [2]
Link: https://lore.kernel.org/r/[email protected]/ [3]
Link: https://lore.kernel.org/r/[email protected]/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <[email protected]>
akiyks pushed a commit that referenced this issue Jun 23, 2022
Hulk Robot reports incorrect sp->rx_count_cooked value in decode_std_command().
This should be caused by the subtracting from sp->rx_count_cooked before.
It seems that sp->rx_count_cooked value is changed to 0, which bypassed the
previous judgment.

The situation is shown below:

         (Thread 1)			|  (Thread 2)
decode_std_command()		| resync_tnc()
...					|
if (rest == 2)			|
	sp->rx_count_cooked -= 2;	|
else if (rest == 3)			| ...
					| sp->rx_count_cooked = 0;
	sp->rx_count_cooked -= 1;	|
for (i = 0; i < sp->rx_count_cooked; i++) // report error
	checksum += sp->cooked_buf[i];

sp->rx_count_cooked is a shared variable but is not protected by a lock.
The same applies to sp->rx_count. This patch adds a lock to fix the bug.

The fail log is shown below:
=======================================================================
UBSAN: array-index-out-of-bounds in drivers/net/hamradio/6pack.c:925:31
index 400 is out of range for type 'unsigned char [400]'
CPU: 3 PID: 7433 Comm: kworker/u10:1 Not tainted 5.18.0-rc5-00163-g4b97bac0756a #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: events_unbound flush_to_ldisc
Call Trace:
 <TASK>
 dump_stack_lvl+0xcd/0x134
 ubsan_epilogue+0xb/0x50
 __ubsan_handle_out_of_bounds.cold+0x62/0x6c
 sixpack_receive_buf+0xfda/0x1330
 tty_ldisc_receive_buf+0x13e/0x180
 tty_port_default_receive_buf+0x6d/0xa0
 flush_to_ldisc+0x213/0x3f0
 process_one_work+0x98f/0x1620
 worker_thread+0x665/0x1080
 kthread+0x2e9/0x3a0
 ret_from_fork+0x1f/0x30
 ...

Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Xu Jia <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
akiyks pushed a commit that referenced this issue Jun 24, 2022
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 5.19, take #2

- Fix a regression with pKVM when kmemleak is enabled

- Add Oliver Upton as an official KVM/arm64 reviewer
akiyks pushed a commit that referenced this issue Jul 1, 2022
This was missed in c3ed222 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 torvalds#6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 torvalds#7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 torvalds#8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 torvalds#9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
akiyks pushed a commit that referenced this issue Jul 12, 2022
…nline extents

When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.

So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <[email protected]>
CC: [email protected] # 5.10+
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
akiyks pushed a commit that referenced this issue Jul 14, 2022
…tion

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 torvalds#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during torvalds#6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Mukesh Ojha <[email protected]>
Reported-by: shisiyuan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy")
Cc: [email protected] # v3.16+
akiyks pushed a commit that referenced this issue Jul 15, 2022
…ernel/git/at91/linux into arm/fixes

AT91 fixes for 5.19 #2

It contains 2 DT fixes:
- one for SAMA5D2 to fix the i2s1 assigned-clock-parents property
- one for kswitch-d10 (LAN966 based) enforcing proper settings
  on GPIO pins

* tag 'at91-fixes-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
  ARM: dts: at91: sama5d2: Fix typo in i2s1 node
  ARM: dts: kswitch-d10: use open drain mode for coma-mode pins

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
akiyks pushed a commit that referenced this issue Aug 23, 2024
Prevent the call trace below from happening, by not allowing IPsec
creation over a slave, if master device doesn't support IPsec.

WARNING: CPU: 44 PID: 16136 at kernel/locking/rwsem.c:240 down_read+0x75/0x94
Modules linked in: esp4_offload esp4 act_mirred act_vlan cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa mst_pciconf(OE) nfsv3 nfs_acl nfs lockd grace fscache netfs xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill cuse fuse rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel sha1_ssse3 dell_smbios ib_uverbs aesni_intel crypto_simd dcdbas wmi_bmof dell_wmi_descriptor cryptd pcspkr ib_core acpi_ipmi sp5100_tco ccp i2c_piix4 ipmi_si ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect mlx5_core sysimgblt fb_sys_fops cec
 ahci libahci mlxfw drm pci_hyperv_intf libata tg3 sha256_ssse3 tls megaraid_sas i2c_algo_bit psample wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mst_pci]
CPU: 44 PID: 16136 Comm: kworker/44:3 Kdump: loaded Tainted: GOE 5.15.0-20240509.el8uek.uek7_u3_update_v6.6_ipsec_bf.x86_64 #2
Hardware name: Dell Inc. PowerEdge R7525/074H08, BIOS 2.0.3 01/15/2021
Workqueue: events xfrm_state_gc_task
RIP: 0010:down_read+0x75/0x94
Code: 00 48 8b 45 08 65 48 8b 14 25 80 fc 01 00 83 e0 02 48 09 d0 48 83 c8 01 48 89 45 08 5d 31 c0 89 c2 89 c6 89 c7 e9 cb 88 3b 00 <0f> 0b 48 8b 45 08 a8 01 74 b2 a8 02 75 ae 48 89 c2 48 83 ca 02 f0
RSP: 0018:ffffb26387773da8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffa08b658af900 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ff886bc5e1366f2f RDI: 0000000000000000
RBP: ffffa08b658af940 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0a9bfb31540
R13: ffffa0a9bfb37900 R14: 0000000000000000 R15: ffffa0a9bfb37905
FS:  0000000000000000(0000) GS:ffffa0a9bfb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a45ed814e8 CR3: 000000109038a000 CR4: 0000000000350ee0
Call Trace:
 <TASK>
 ? show_trace_log_lvl+0x1d6/0x2f9
 ? show_trace_log_lvl+0x1d6/0x2f9
 ? mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
 ? down_read+0x75/0x94
 ? __warn+0x80/0x113
 ? down_read+0x75/0x94
 ? report_bug+0xa4/0x11d
 ? handle_bug+0x35/0x8b
 ? exc_invalid_op+0x14/0x75
 ? asm_exc_invalid_op+0x16/0x1b
 ? down_read+0x75/0x94
 ? down_read+0xe/0x94
 mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
 mlx5_ipsec_fs_roce_tx_destroy+0xb1/0x130 [mlx5_core]
 tx_destroy+0x1b/0xc0 [mlx5_core]
 tx_ft_put+0x53/0xc0 [mlx5_core]
 mlx5e_xfrm_free_state+0x45/0x90 [mlx5_core]
 ___xfrm_state_destroy+0x10f/0x1a2
 xfrm_state_gc_task+0x81/0xa9
 process_one_work+0x1f1/0x3c6
 worker_thread+0x53/0x3e4
 ? process_one_work.cold+0x46/0x3c
 kthread+0x127/0x144
 ? set_kthread_struct+0x60/0x52
 ret_from_fork+0x22/0x2d
 </TASK>
---[ end trace 5ef7896144d398e1 ]---

Fixes: dfbd229 ("net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic")
Reviewed-by: Leon Romanovsky <[email protected]>
Signed-off-by: Patrisious Haddad <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Aug 23, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 disable BH when collecting stats via hardware offload to ensure
         concurrent updates from packet path do not result in losing stats.
         From Sebastian Andrzej Siewior.

Patch #2 uses write seqcount to reset counters serialize against reader.
         Also from Sebastian Andrzej Siewior.

Patch #3 ensures vlan header is in place before accessing its fields,
         according to KMSAN splat triggered by syzbot.

* tag 'nf-24-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: flowtable: validate vlan header
  netfilter: nft_counter: Synchronize nft_counter_reset() against reader.
  netfilter: nft_counter: Disable BH in nft_counter_offload_stats().
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Aug 24, 2024
…/kernel/git/kvmarm/kvmarm into for-next/fixes

KVM/arm64 fixes for 6.11, round #2

 - Don't drop references on LPIs that weren't visited by the
   vgic-debug iterator

 - Cure lock ordering issue when unregistering vgic redistributors

 - Fix for misaligned stage-2 mappings when VMs are backed by hugetlb
   pages

 - Treat SGI registers as UNDEFINED if a VM hasn't been configured for
   GICv3

* tag 'kvmarm-fixes-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
  KVM: arm64: Make ICC_*SGI*_EL1 undef in the absence of a vGICv3
  KVM: arm64: Ensure canonical IPA is hugepage-aligned when handling fault
  KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors
  KVM: arm64: vgic-debug: Don't put unmarked LPIs
  KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
  KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
  KVM: arm64: Tidying up PAuth code in KVM
  KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
  KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
  docs: KVM: Fix register ID of SPSR_FIQ
  KVM: arm64: vgic: fix unexpected unlock sparse warnings
  KVM: arm64: fix kdoc warnings in W=1 builds
  KVM: arm64: fix override-init warnings in W=1 builds
  KVM: arm64: free kvm->arch.nested_mmus with kvfree()
akiyks pushed a commit that referenced this issue Aug 30, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 sets on NFT_PKTINFO_L4PROTO for UDP packets less than 4 bytes
payload from netdev/egress by subtracting skb_network_offset() when
validating IPv4 packet length, otherwise 'meta l4proto udp' never
matches.

Patch #2 subtracts skb_network_offset() when validating IPv6 packet
length for netdev/egress.

netfilter pull request 24-08-28

* tag 'nf-24-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables_ipv6: consider network offset in netdev/egress validation
  netfilter: nf_tables: restore IP sanity checks for netdev/egress
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
akiyks pushed a commit that referenced this issue Sep 6, 2024
Currently, netif_queue_set_napi() is called from ice_vsi_rebuild() that is
not rtnl-locked when called from the reset. This creates the need to take
the rtnl_lock just for a single function and complicates the
synchronization with .ndo_bpf. At the same time, there no actual need to
fill napi-to-queue information at this exact point.

Fill napi-to-queue information when opening the VSI and clear it when the
VSI is being closed. Those routines are already rtnl-locked.

Also, rewrite napi-to-queue assignment in a way that prevents inclusion of
XDP queues, as this leads to out-of-bounds writes, such as one below.

[  +0.000004] BUG: KASAN: slab-out-of-bounds in netif_queue_set_napi+0x1c2/0x1e0
[  +0.000012] Write of size 8 at addr ffff889881727c80 by task bash/7047
[  +0.000006] CPU: 24 PID: 7047 Comm: bash Not tainted 6.10.0-rc2+ #2
[  +0.000004] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  +0.000003] Call Trace:
[  +0.000003]  <TASK>
[  +0.000002]  dump_stack_lvl+0x60/0x80
[  +0.000007]  print_report+0xce/0x630
[  +0.000007]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[  +0.000007]  ? __virt_addr_valid+0x1c9/0x2c0
[  +0.000005]  ? netif_queue_set_napi+0x1c2/0x1e0
[  +0.000003]  kasan_report+0xe9/0x120
[  +0.000004]  ? netif_queue_set_napi+0x1c2/0x1e0
[  +0.000004]  netif_queue_set_napi+0x1c2/0x1e0
[  +0.000005]  ice_vsi_close+0x161/0x670 [ice]
[  +0.000114]  ice_dis_vsi+0x22f/0x270 [ice]
[  +0.000095]  ice_pf_dis_all_vsi.constprop.0+0xae/0x1c0 [ice]
[  +0.000086]  ice_prepare_for_reset+0x299/0x750 [ice]
[  +0.000087]  pci_dev_save_and_disable+0x82/0xd0
[  +0.000006]  pci_reset_function+0x12d/0x230
[  +0.000004]  reset_store+0xa0/0x100
[  +0.000006]  ? __pfx_reset_store+0x10/0x10
[  +0.000002]  ? __pfx_mutex_lock+0x10/0x10
[  +0.000004]  ? __check_object_size+0x4c1/0x640
[  +0.000007]  kernfs_fop_write_iter+0x30b/0x4a0
[  +0.000006]  vfs_write+0x5d6/0xdf0
[  +0.000005]  ? fd_install+0x180/0x350
[  +0.000005]  ? __pfx_vfs_write+0x10/0xA10
[  +0.000004]  ? do_fcntl+0x52c/0xcd0
[  +0.000004]  ? kasan_save_track+0x13/0x60
[  +0.000003]  ? kasan_save_free_info+0x37/0x60
[  +0.000006]  ksys_write+0xfa/0x1d0
[  +0.000003]  ? __pfx_ksys_write+0x10/0x10
[  +0.000002]  ? __x64_sys_fcntl+0x121/0x180
[  +0.000004]  ? _raw_spin_lock+0x87/0xe0
[  +0.000005]  do_syscall_64+0x80/0x170
[  +0.000007]  ? _raw_spin_lock+0x87/0xe0
[  +0.000004]  ? __pfx__raw_spin_lock+0x10/0x10
[  +0.000003]  ? file_close_fd_locked+0x167/0x230
[  +0.000005]  ? syscall_exit_to_user_mode+0x7d/0x220
[  +0.000005]  ? do_syscall_64+0x8c/0x170
[  +0.000004]  ? do_syscall_64+0x8c/0x170
[  +0.000003]  ? do_syscall_64+0x8c/0x170
[  +0.000003]  ? fput+0x1a/0x2c0
[  +0.000004]  ? filp_close+0x19/0x30
[  +0.000004]  ? do_dup2+0x25a/0x4c0
[  +0.000004]  ? __x64_sys_dup2+0x6e/0x2e0
[  +0.000002]  ? syscall_exit_to_user_mode+0x7d/0x220
[  +0.000004]  ? do_syscall_64+0x8c/0x170
[  +0.000003]  ? __count_memcg_events+0x113/0x380
[  +0.000005]  ? handle_mm_fault+0x136/0x820
[  +0.000005]  ? do_user_addr_fault+0x444/0xa80
[  +0.000004]  ? clear_bhb_loop+0x25/0x80
[  +0.000004]  ? clear_bhb_loop+0x25/0x80
[  +0.000002]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  +0.000005] RIP: 0033:0x7f2033593154

Fixes: 080b0c8 ("ice: Fix ASSERT_RTNL() warning during certain scenarios")
Fixes: 91fdbce ("ice: Add support in the driver for associating queue with napi")
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Amritha Nambiar <[email protected]>
Signed-off-by: Larysa Zaremba <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Tested-by: George Kuruvinakunnel <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
akiyks pushed a commit that referenced this issue Sep 6, 2024
Chi Zhiling reported:

  We found a null pointer accessing in tracefs[1], the reason is that the
  variable 'ei_child' is set to LIST_POISON1, that means the list was
  removed in eventfs_remove_rec. so when access the ei_child->is_freed, the
  panic triggered.

  by the way, the following script can reproduce this panic

  loop1 (){
      while true
      do
          echo "p:kp submit_bio" > /sys/kernel/debug/tracing/kprobe_events
          echo "" > /sys/kernel/debug/tracing/kprobe_events
      done
  }
  loop2 (){
      while true
      do
          tree /sys/kernel/debug/tracing/events/kprobes/
      done
  }
  loop1 &
  loop2

  [1]:
  [ 1147.959632][T17331] Unable to handle kernel paging request at virtual address dead000000000150
  [ 1147.968239][T17331] Mem abort info:
  [ 1147.971739][T17331]   ESR = 0x0000000096000004
  [ 1147.976172][T17331]   EC = 0x25: DABT (current EL), IL = 32 bits
  [ 1147.982171][T17331]   SET = 0, FnV = 0
  [ 1147.985906][T17331]   EA = 0, S1PTW = 0
  [ 1147.989734][T17331]   FSC = 0x04: level 0 translation fault
  [ 1147.995292][T17331] Data abort info:
  [ 1147.998858][T17331]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  [ 1148.005023][T17331]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  [ 1148.010759][T17331]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
  [ 1148.016752][T17331] [dead000000000150] address between user and kernel address ranges
  [ 1148.024571][T17331] Internal error: Oops: 0000000096000004 [#1] SMP
  [ 1148.030825][T17331] Modules linked in: team_mode_loadbalance team nlmon act_gact cls_flower sch_ingress bonding tls macvlan dummy ib_core bridge stp llc veth amdgpu amdxcp mfd_core gpu_sched drm_exec drm_buddy radeon crct10dif_ce video drm_suballoc_helper ghash_ce drm_ttm_helper sha2_ce ttm sha256_arm64 i2c_algo_bit sha1_ce sbsa_gwdt cp210x drm_display_helper cec sr_mod cdrom drm_kms_helper binfmt_misc sg loop fuse drm dm_mod nfnetlink ip_tables autofs4 [last unloaded: tls]
  [ 1148.072808][T17331] CPU: 3 PID: 17331 Comm: ls Tainted: G        W         ------- ----  6.6.43 #2
  [ 1148.081751][T17331] Source Version: 21b3b386e948bedd29369af66f3e98ab01b1c650
  [ 1148.088783][T17331] Hardware name: Greatwall GW-001M1A-FTF/GW-001M1A-FTF, BIOS KunLun BIOS V4.0 07/16/2020
  [ 1148.098419][T17331] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [ 1148.106060][T17331] pc : eventfs_iterate+0x2c0/0x398
  [ 1148.111017][T17331] lr : eventfs_iterate+0x2fc/0x398
  [ 1148.115969][T17331] sp : ffff80008d56bbd0
  [ 1148.119964][T17331] x29: ffff80008d56bbf0 x28: ffff001ff5be2600 x27: 0000000000000000
  [ 1148.127781][T17331] x26: ffff001ff52ca4e0 x25: 0000000000009977 x24: dead000000000100
  [ 1148.135598][T17331] x23: 0000000000000000 x22: 000000000000000b x21: ffff800082645f10
  [ 1148.143415][T17331] x20: ffff001fddf87c70 x19: ffff80008d56bc90 x18: 0000000000000000
  [ 1148.151231][T17331] x17: 0000000000000000 x16: 0000000000000000 x15: ffff001ff52ca4e0
  [ 1148.159048][T17331] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
  [ 1148.166864][T17331] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000804391d0
  [ 1148.174680][T17331] x8 : 0000000180000000 x7 : 0000000000000018 x6 : 0000aaab04b92862
  [ 1148.182498][T17331] x5 : 0000aaab04b92862 x4 : 0000000080000000 x3 : 0000000000000068
  [ 1148.190314][T17331] x2 : 000000000000000f x1 : 0000000000007ea8 x0 : 0000000000000001
  [ 1148.198131][T17331] Call trace:
  [ 1148.201259][T17331]  eventfs_iterate+0x2c0/0x398
  [ 1148.205864][T17331]  iterate_dir+0x98/0x188
  [ 1148.210036][T17331]  __arm64_sys_getdents64+0x78/0x160
  [ 1148.215161][T17331]  invoke_syscall+0x78/0x108
  [ 1148.219593][T17331]  el0_svc_common.constprop.0+0x48/0xf0
  [ 1148.224977][T17331]  do_el0_svc+0x24/0x38
  [ 1148.228974][T17331]  el0_svc+0x40/0x168
  [ 1148.232798][T17331]  el0t_64_sync_handler+0x120/0x130
  [ 1148.237836][T17331]  el0t_64_sync+0x1a4/0x1a8
  [ 1148.242182][T17331] Code: 54ffff6c f9400676 910006d6 f900067 (b9405300)
  [ 1148.248955][T17331] ---[ end trace 0000000000000000 ]---

The issue is that list_del() is used on an SRCU protected list variable
before the synchronization occurs. This can poison the list pointers while
there is a reader iterating the list.

This is simply fixed by using list_del_rcu() that is specifically made for
this purpose.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: 43aa6f9 ("eventfs: Get rid of dentry pointers without refcounts")
Reported-by: Chi Zhiling <[email protected]>
Tested-by: Chi Zhiling <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
akiyks pushed a commit that referenced this issue Sep 6, 2024
…nux/kernel/git/ukleinek/linux

Pull pwm fix from Uwe Kleine-König:
 "Fix an off-by-one in the stm32 driver.

  Hardware engineers tend to start counting at 1 while the software guys
  usually start with 0. This isn't so nice because that results in
  drivers where pwm device #2 needs to use the hardware registers with
  index 3.

  This was noticed by Fabrice Gasnier.

  A small patch fixing that mismatch is the only change included here"

* tag 'pwm/for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
  pwm: stm32: Use the right CCxNP bit in stm32_pwm_enable()
akiyks pushed a commit that referenced this issue Sep 8, 2024
If smb2_compound_op() is called with a valid @CFILE and returned
-EINVAL, we need to call cifs_get_writable_path() before retrying it
as the reference of @CFILE was already dropped by previous call.

This fixes the following KASAN splat when running fstests generic/013
against Windows Server 2022:

  CIFS: Attempting to mount //w22-fs0/scratch
  run fstests generic/013 at 2024-09-02 19:48:59
  ==================================================================
  BUG: KASAN: slab-use-after-free in detach_if_pending+0xab/0x200
  Write of size 8 at addr ffff88811f1a3730 by task kworker/3:2/176

  CPU: 3 UID: 0 PID: 176 Comm: kworker/3:2 Not tainted 6.11.0-rc6 #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40
  04/01/2014
  Workqueue: cifsoplockd cifs_oplock_break [cifs]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   ? detach_if_pending+0xab/0x200
   print_report+0x156/0x4d9
   ? detach_if_pending+0xab/0x200
   ? __virt_addr_valid+0x145/0x300
   ? __phys_addr+0x46/0x90
   ? detach_if_pending+0xab/0x200
   kasan_report+0xda/0x110
   ? detach_if_pending+0xab/0x200
   detach_if_pending+0xab/0x200
   timer_delete+0x96/0xe0
   ? __pfx_timer_delete+0x10/0x10
   ? rcu_is_watching+0x20/0x50
   try_to_grab_pending+0x46/0x3b0
   __cancel_work+0x89/0x1b0
   ? __pfx___cancel_work+0x10/0x10
   ? kasan_save_track+0x14/0x30
   cifs_close_deferred_file+0x110/0x2c0 [cifs]
   ? __pfx_cifs_close_deferred_file+0x10/0x10 [cifs]
   ? __pfx_down_read+0x10/0x10
   cifs_oplock_break+0x4c1/0xa50 [cifs]
   ? __pfx_cifs_oplock_break+0x10/0x10 [cifs]
   ? lock_is_held_type+0x85/0xf0
   ? mark_held_locks+0x1a/0x90
   process_one_work+0x4c6/0x9f0
   ? find_held_lock+0x8a/0xa0
   ? __pfx_process_one_work+0x10/0x10
   ? lock_acquired+0x220/0x550
   ? __list_add_valid_or_report+0x37/0x100
   worker_thread+0x2e4/0x570
   ? __kthread_parkme+0xd1/0xf0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x17f/0x1c0
   ? kthread+0xda/0x1c0
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x60
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

  Allocated by task 1118:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   __kasan_kmalloc+0xaa/0xb0
   cifs_new_fileinfo+0xc8/0x9d0 [cifs]
   cifs_atomic_open+0x467/0x770 [cifs]
   lookup_open.isra.0+0x665/0x8b0
   path_openat+0x4c3/0x1380
   do_filp_open+0x167/0x270
   do_sys_openat2+0x129/0x160
   __x64_sys_creat+0xad/0xe0
   do_syscall_64+0xbb/0x1d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Freed by task 83:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   kasan_save_free_info+0x3b/0x70
   poison_slab_object+0xe9/0x160
   __kasan_slab_free+0x32/0x50
   kfree+0xf2/0x300
   process_one_work+0x4c6/0x9f0
   worker_thread+0x2e4/0x570
   kthread+0x17f/0x1c0
   ret_from_fork+0x31/0x60
   ret_from_fork_asm+0x1a/0x30

  Last potentially related work creation:
   kasan_save_stack+0x30/0x50
   __kasan_record_aux_stack+0xad/0xc0
   insert_work+0x29/0xe0
   __queue_work+0x5ea/0x760
   queue_work_on+0x6d/0x90
   _cifsFileInfo_put+0x3f6/0x770 [cifs]
   smb2_compound_op+0x911/0x3940 [cifs]
   smb2_set_path_size+0x228/0x270 [cifs]
   cifs_set_file_size+0x197/0x460 [cifs]
   cifs_setattr+0xd9c/0x14b0 [cifs]
   notify_change+0x4e3/0x740
   do_truncate+0xfa/0x180
   vfs_truncate+0x195/0x200
   __x64_sys_truncate+0x109/0x150
   do_syscall_64+0xbb/0x1d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 71f15c9 ("smb: client: retry compound request without reusing lease")
Cc: [email protected]
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Cc: David Howells <[email protected]>
Signed-off-by: Steve French <[email protected]>
akiyks pushed a commit that referenced this issue Sep 13, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains two fixes from Florian Westphal:

Patch #1 fixes a sk refcount leak in nft_socket on mismatch.

Patch #2 fixes cgroupsv2 matching from containers due to incorrect
	 level in subtree.

netfilter pull request 24-09-12

* tag 'nf-24-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: make cgroupsv2 matching work with namespaces
  netfilter: nft_socket: fix sk refcount leaks
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
Daniel Machon says:

====================
net: microchip: add FDMA library and use it for Sparx5

This patch series is the first of a 2-part series, that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a
common library with a common implementation.  This also has the benefit
of removing a lot open-coded bookkeeping and duplicate code for the two
drivers.

Additionally, upstreaming efforts for a third chip, lan969x, will begin
in the near future. This chip will use the new library too.

In this first series, the FDMA library is introduced and used by the
Sparx5 switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  introduces library and selects it for Sparx5.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic is changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch torvalds#6:  uses the library for freeing rx buffers.

Patch torvalds#7:  uses the library helpers in the rx path.

Patch torvalds#8:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch torvalds#9:  uses the library for adding DCB's in the tx path.

Patch torvalds#10: uses the library helpers in the tx path.

Patch torvalds#11: ditches the existing linked list for storing buffer addresses,
           and instead uses offsets into contiguous memory.

Patch torvalds#12: modifies existing rx and tx functions to be direction
           independent.
====================

Signed-off-by: David S. Miller <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
…rnel/git/netfilter/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

Patch #1 adds ctnetlink support for kernel side filtering for
	 deletions, from Changliang Wu.

Patch #2 updates nft_counter support to Use u64_stats_t,
	 from Sebastian Andrzej Siewior.

Patch #3 uses kmemdup_array() in all xtables frontends,
	 from Yan Zhen.

Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead
	 opencoded casting, from Shen Lichuan.

Patch #5 removes unused argument in nftables .validate interface,
	 from Florian Westphal.

Patch torvalds#6 is a oneliner to correct a typo in nftables kdoc,
	 from Simon Horman.

Patch torvalds#7 fixes missing kdoc in nftables, also from Simon.

Patch torvalds#8 updates nftables to handle timeout less than CONFIG_HZ.

Patch torvalds#9 rejects element expiration if timeout is zero,
	 otherwise it is silently ignored.

Patch torvalds#10 disallows element expiration larger than timeout.

Patch torvalds#11 removes unnecessary READ_ONCE annotation while mutex is held.

Patch torvalds#12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset.

Patch torvalds#13 annotates data-races around element expiration.

Patch torvalds#14 allocates timeout and expiration in one single set element
	  extension, they are tighly couple, no reason to keep them
	  separated anymore.

Patch torvalds#15 updates nftables to interpret zero timeout element as never
	  times out. Note that it is already possible to declare sets
	  with elements that never time out but this generalizes to all
	  kind of set with timeouts.

Patch torvalds#16 supports for element timeout and expiration updates.

* tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_tables: set element timeout update support
  netfilter: nf_tables: zero timeout means element never times out
  netfilter: nf_tables: consolidate timeout extension for elements
  netfilter: nf_tables: annotate data-races around element expiration
  netfilter: nft_dynset: annotate data-races around set timeout
  netfilter: nf_tables: remove annotation to access set timeout while holding lock
  netfilter: nf_tables: reject expiration higher than timeout
  netfilter: nf_tables: reject element expiration with no timeout
  netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire
  netfilter: nf_tables: Add missing Kernel doc
  netfilter: nf_tables: Correct spelling in nf_tables.h
  netfilter: nf_tables: drop unused 3rd argument from validate callback ops
  netfilter: conntrack: Convert to use ERR_CAST()
  netfilter: Use kmemdup_array instead of kmemdup for multiple allocation
  netfilter: nft_counter: Use u64_stats_t for statistic.
  netfilter: ctnetlink: support CTA_FILTER for flush
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
Daniel Machon says:

====================
net: lan966x: use the newly introduced FDMA library

This patch series is the second of a 2-part series [1], that adds a new
common FDMA library for Microchip switch chips Sparx5 and lan966x. These
chips share the same FDMA engine, and as such will benefit from a common
library with a common implementation.  This also has the benefit of
removing a lot of open-coded bookkeeping and duplicate code for the two
drivers.

In this second series, the FDMA library will be taken into use by the
lan966x switch driver.

 ###################
 # Example of use: #
 ###################

- Initialize the rx and tx fdma structs with values for: number of
  DCB's, number of DB's, channel ID, DB size (data buffer size), and
  total size of the requested memory. Also provide two callbacks:
  nextptr_cb() and dataptr_cb() for getting the nextptr and dataptr.

- Allocate memory using fdma_alloc_phys() or fdma_alloc_coherent().

- Initialize the DCB's with fdma_dcb_init().

- Add new DCB's with fdma_dcb_add().

- Free memory with fdma_free_phys() or fdma_free_coherent().

 #####################
 # Patch  breakdown: #
 #####################

Patch #1:  select FDMA library for lan966x.

Patch #2:  includes the fdma_api.h header and removes old symbols.

Patch #3:  replaces old rx and tx variables with equivalent ones from the
           fdma struct. Only the variables that can be changed without
           breaking traffic is changed in this patch.

Patch #4:  uses the library for allocation of rx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch #5:  uses the library for adding DCB's in the rx path.

Patch torvalds#6:  uses the library for freeing rx buffers.

Patch torvalds#7:  uses the library for allocation of tx buffers. This requires
           quite a bit of refactoring in this single patch.

Patch torvalds#8:  uses the library for adding DCB's in the tx path.

Patch torvalds#9:  uses the library helpers in the tx path.

Patch torvalds#10: ditch last_in_use variable and use library instead.

Patch torvalds#11: uses library helpers throughout.

Patch torvalds#12: refactor lan966x_fdma_reload() function.

[1] https://lore.kernel.org/netdev/[email protected]/

Signed-off-by: Daniel Machon <[email protected]>
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
Mina Almasry says:

====================
Device Memory TCP

Device memory TCP (devmem TCP) is a proposal for transferring data
to and/or from device memory efficiently, without bouncing the data
to a host memory buffer.

* Problem:

A large amount of data transfers have device memory as the source
and/or destination. Accelerators drastically increased the volume
of such transfers. Some examples include:

- ML accelerators transferring large amounts of training data from storage
  into GPU/TPU memory. In some cases ML training setup time can be as long
  as 50% of TPU compute time, improving data transfer throughput &
  efficiency can help improving GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different
  hosts, exchange data among them.

- Distributed raw block storage applications transfer large amounts of
  data with remote SSDs, much of this data does not require host
  processing.

Today, the majority of the Device-to-Device data transfers the network
are implemented as the following low level operations: Device-to-Host
copy, Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources, such as host memory
bandwidth, PCIe bandwidth, etc. One important reason behind the current
state is the kernel’s lack of semantics to express device to network
transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive
and from device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP
stack normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
 network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to traditional path which sends data through
  the root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, networking stack (skbs, drivers,
and page pool) operate on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.

** part 3: page pool support

We piggy back on page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.

** part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughput the networking stack to correctly handle skbs with unreadable
frags.

** Part 5: recvmsg() APIs

We define user APIs for the user to send and receive device memory.

Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP require the NIC to support header split, i.e. the
   capability to split incoming packets into a header + payload and to put
   each into a separate buffer. Devmem TCP works by using device memory
   for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering support & RSS
   support, i.e. the NIC's ability to steer flows into certain rx queues.
   This allows the sysadmin to enable devmem TCP on a subset of the rx
   queues, and steer devmem TCP traffic onto these queues and non devmem
   TCP elsewhere.

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would
suffice. I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that show a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.

** Test Setup

Kernel: net-next with this series and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
Ido Schimmel says:

====================
net: fib_rules: Add DSCP selector support

Currently, the kernel rejects IPv4 FIB rules that try to match on the
upper three DSCP bits:

 # ip -4 rule add tos 0x1c table 100
 # ip -4 rule add tos 0x3c table 100
 Error: Invalid tos.

The reason for that is that historically users of the FIB lookup API
only populated the lower three DSCP bits in the TOS field of the IPv4
flow key ('flowi4_tos'), which fits the TOS definition from the initial
IPv4 specification (RFC 791).

This is not very useful nowadays and instead some users want to be able
to match on the six bits DSCP field, which replaced the TOS and IP
precedence fields over 25 years ago (RFC 2474). In addition, the current
behavior differs between IPv4 and IPv6 which does allow users to match
on the entire DSCP field using the TOS selector.

Recent patchsets made sure that callers of the FIB lookup API now
populate the entire DSCP field in the IPv4 flow key. Therefore, it is
now possible to extend FIB rules to match on DSCP.

This is done by adding a new DSCP attribute which is implemented for
both IPv4 and IPv6 to provide user space programs a consistent behavior
between both address families.

The behavior of the old TOS selector is unchanged and IPv4 FIB rules
using it will only match on the lower three DSCP bits. The kernel will
reject rules that try to use both selectors.

Patch #1 adds the new DSCP attribute but rejects its usage.

Patches #2-#3 implement IPv4 and IPv6 support.

Patch #4 allows user space to use the new attribute.

Patches #5-torvalds#6 add selftests.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Sep 16, 2024
Nelson Escobar says:

====================
enic: Report per queue stats

Patch #1: Use a macro instead of static const variables for array sizes.  I
          didn't want to add more static const variables in the next patch
          so clean up the existing ones first.

Patch #2: Collect per queue statistics

Patch #3: Report per queue stats in netdev qstats

Patch #4: Report some per queue stats in ethtool

 # NETIF="eno6" tools/testing/selftests/drivers/net/stats.py
KTAP version 1
1..5
ok 1 stats.check_pause # XFAIL pause not supported by the device
ok 2 stats.check_fec # XFAIL FEC not supported by the device
ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
ok 5 stats.check_down

 # tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
     --dump qstats-get --json '{"ifindex": "34"}'
[{'ifindex': 34,
  'rx-bytes': 66762680,
  'rx-csum-unnecessary': 1009345,
  'rx-hw-drop-overruns': 0,
  'rx-hw-drops': 0,
  'rx-packets': 1009673,
  'tx-bytes': 137936674899,
  'tx-csum-none': 125,
  'tx-hw-gso-packets': 2408712,
  'tx-needs-csum': 2431531,
  'tx-packets': 15475466,
  'tx-stop': 0,
  'tx-wake': 0}]

v2: https://lore.kernel.org/[email protected]
v1: https://lore.kernel.org/[email protected]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
akiyks pushed a commit that referenced this issue Sep 17, 2024
Once the kthread is running and available
(i.e. @printk_kthreads_running is set), the kthread becomes
responsible for flushing any pending messages which are added
in NBCON_PRIO_NORMAL context. Namely the legacy
console_flush_all() and device_release() no longer flush the
console. And nbcon_atomic_flush_pending() used by
nbcon_cpu_emergency_exit() no longer flushes messages added
after the emergency messages.

The console context is safe when used by the kthread only when
one of the following conditions are true:

  1. Other caller acquires the console context with
     NBCON_PRIO_NORMAL with preemption disabled. It will
     release the context before rescheduling.

  2. Other caller acquires the console context with
     NBCON_PRIO_NORMAL under the device_lock.

  3. The kthread is the only context which acquires the console
     with NBCON_PRIO_NORMAL.

This is satisfied for all atomic printing call sites:

nbcon_legacy_emit_next_record() (#1)

nbcon_atomic_flush_pending_con() (#1)

nbcon_device_release() (#2)

It is even double guaranteed when @printk_kthreads_running
is set because then _only_ the kthread will print for
NBCON_PRIO_NORMAL. (#3)

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
akiyks pushed a commit that referenced this issue Sep 18, 2024
I have done a lot of analysis for these type of devices and collaborated
quite a bit with Nick Weihs (author of the first patch submitted for this
including adding samsung_helper.c). More information can be found in the
issue on Github [1] including additional rationale and testing.

The existing implementation includes a large number of equalizer coef
values that are not necessary to actually init and enable the speaker
amps, as well as create a somewhat worse sound profile. Users have
reported "muffled" or "muddy" sound; more information about this including
my analysis of the differences can be found in the linked Github issue.

This patch refactors the "v2" version of ALC298_FIXUP_SAMSUNG_AMP to a much
simpler implementation which removes the new samsung_helper.c, reuses more
of the existing patch_realtek.c, and sends significantly fewer unnecessary
coef values (including removing all of these EQ-specific coef values).

A pcm_playback_hook is used to dynamically enable and disable the speaker
amps only when there will be audio playback; this is to match the behavior
of how the driver for these devices is working in Windows, and is
suspected but not yet tested or confirmed to help with power consumption.

Support for models with 2 speaker amps vs 4 speaker amps is controlled by
a specific quirk name for both types. A new int num_speaker_amps has been
added to alc_spec so that the hooks can know how many speaker amps to
enable or disable. This design was chosen to limit the number of places
that subsystem ids will need to be maintained: like this, they can be
maintained only once in the quirk table and there will not be another
separate list of subsystem ids to maintain elsewhere in the code.

Also updated the quirk name from ALC298_FIXUP_SAMSUNG_AMP2 to
ALC298_FIXUP_SAMSUNG_AMP_V2_.. as this is not a quirk for "Amp #2" on
ALC298 but is instead a different version of how to handle it.

More devices have been added (see Github issue for testing confirmation),
as well as a small cleanup to existing names.

[1]: thesofproject#4055 (comment)

Signed-off-by: Joshua Grisham <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
Add nested locking with I_MUTEX_XATTR subclass to avoid lockdep warning
while handling xattr inode on file open syscall at ext4_xattr_inode_iget.

Backtrace
EXT4-fs (loop0): Ignoring removed oldalloc option
======================================================
WARNING: possible circular locking dependency detected
5.10.0-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor543/2794 is trying to acquire lock:
ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline]
ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425

but task is already holding lock:
ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&ei->i_data_sem/3){++++}-{3:3}:
       lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
       down_write+0x93/0x180 kernel/locking/rwsem.c:1564
       ext4_update_i_disksize fs/ext4/ext4.h:3267 [inline]
       ext4_xattr_inode_write fs/ext4/xattr.c:1390 [inline]
       ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1538 [inline]
       ext4_xattr_set_entry+0x331a/0x3d80 fs/ext4/xattr.c:1662
       ext4_xattr_ibody_set+0x124/0x390 fs/ext4/xattr.c:2228
       ext4_xattr_set_handle+0xc27/0x14e0 fs/ext4/xattr.c:2385
       ext4_xattr_set+0x219/0x390 fs/ext4/xattr.c:2498
       ext4_xattr_user_set+0xc9/0xf0 fs/ext4/xattr_user.c:40
       __vfs_setxattr+0x404/0x450 fs/xattr.c:177
       __vfs_setxattr_noperm+0x11d/0x4f0 fs/xattr.c:208
       __vfs_setxattr_locked+0x1f9/0x210 fs/xattr.c:266
       vfs_setxattr+0x112/0x2c0 fs/xattr.c:283
       setxattr+0x1db/0x3e0 fs/xattr.c:548
       path_setxattr+0x15a/0x240 fs/xattr.c:567
       __do_sys_setxattr fs/xattr.c:582 [inline]
       __se_sys_setxattr fs/xattr.c:578 [inline]
       __x64_sys_setxattr+0xc5/0xe0 fs/xattr.c:578
       do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
       entry_SYSCALL_64_after_hwframe+0x61/0xcb

-> #0 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}:
       check_prev_add kernel/locking/lockdep.c:2988 [inline]
       check_prevs_add kernel/locking/lockdep.c:3113 [inline]
       validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729
       __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955
       lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
       down_write+0x93/0x180 kernel/locking/rwsem.c:1564
       inode_lock include/linux/fs.h:782 [inline]
       ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425
       ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485
       ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline]
       ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline]
       ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774
       __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898
       ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline]
       __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018
       ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562
       notify_change+0xbb6/0xe60 fs/attr.c:435
       do_truncate+0x1de/0x2c0 fs/open.c:64
       handle_truncate fs/namei.c:2970 [inline]
       do_open fs/namei.c:3311 [inline]
       path_openat+0x29f3/0x3290 fs/namei.c:3425
       do_filp_open+0x20b/0x450 fs/namei.c:3452
       do_sys_openat2+0x124/0x460 fs/open.c:1207
       do_sys_open fs/open.c:1223 [inline]
       __do_sys_open fs/open.c:1231 [inline]
       __se_sys_open fs/open.c:1227 [inline]
       __x64_sys_open+0x221/0x270 fs/open.c:1227
       do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
       entry_SYSCALL_64_after_hwframe+0x61/0xcb

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_data_sem/3);
                               lock(&ea_inode->i_rwsem#7/1);
                               lock(&ei->i_data_sem/3);
  lock(&ea_inode->i_rwsem#7/1);

 *** DEADLOCK ***

5 locks held by syz-executor543/2794:
 #0: ffff888026fbc448 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x4a/0x2a0 fs/namespace.c:365
 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline]
 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: do_truncate+0x1cf/0x2c0 fs/open.c:62
 #2: ffff8880215e3310 (&ei->i_mmap_sem){++++}-{3:3}, at: ext4_setattr+0xec4/0x19c0 fs/ext4/inode.c:5519
 #3: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_trylock_xattr fs/ext4/xattr.h:162 [inline]
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_expand_extra_isize fs/ext4/inode.c:5938 [inline]
 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: __ext4_mark_inode_dirty+0x4fb/0x810 fs/ext4/inode.c:6018

stack backtrace:
CPU: 1 PID: 2794 Comm: syz-executor543 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x177/0x211 lib/dump_stack.c:118
 print_circular_bug+0x146/0x1b0 kernel/locking/lockdep.c:2002
 check_noncircular+0x2cc/0x390 kernel/locking/lockdep.c:2123
 check_prev_add kernel/locking/lockdep.c:2988 [inline]
 check_prevs_add kernel/locking/lockdep.c:3113 [inline]
 validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729
 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955
 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566
 down_write+0x93/0x180 kernel/locking/rwsem.c:1564
 inode_lock include/linux/fs.h:782 [inline]
 ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425
 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485
 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline]
 ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline]
 ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774
 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898
 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline]
 __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018
 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562
 notify_change+0xbb6/0xe60 fs/attr.c:435
 do_truncate+0x1de/0x2c0 fs/open.c:64
 handle_truncate fs/namei.c:2970 [inline]
 do_open fs/namei.c:3311 [inline]
 path_openat+0x29f3/0x3290 fs/namei.c:3425
 do_filp_open+0x20b/0x450 fs/namei.c:3452
 do_sys_openat2+0x124/0x460 fs/open.c:1207
 do_sys_open fs/open.c:1223 [inline]
 __do_sys_open fs/open.c:1231 [inline]
 __se_sys_open fs/open.c:1227 [inline]
 __x64_sys_open+0x221/0x270 fs/open.c:1227
 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f0cde4ea229
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd81d1c978 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
RAX: ffffffffffffffda RBX: 0030656c69662f30 RCX: 00007f0cde4ea229
RDX: 0000000000000089 RSI: 00000000000a0a00 RDI: 00000000200001c0
RBP: 2f30656c69662f2e R08: 0000000000208000 R09: 0000000000208000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd81d1c9c0
R13: 00007ffd81d1ca00 R14: 0000000000080000 R15: 0000000000000003
EXT4-fs error (device loop0): ext4_expand_extra_isize_ea:2730: inode torvalds#13: comm syz-executor543: corrupted in-inode xattr

Signed-off-by: Wojciech Gładysz <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
…-level'

Eduard Zingerman says:

====================
bpf: track find_equal_scalars history on per-instruction level

This is a fix for precision tracking bug reported in [0].
It supersedes my previous attempt to fix similar issue in commit [1].
Here is a minimized test case from [0]:

    0:  call bpf_get_prandom_u32;
    1:  r7 = r0;
    2:  r8 = r0;
    3:  call bpf_get_prandom_u32;
    4:  if r0 > 1 goto +0;
    /* --- checkpoint #1: r7.id=1, r8.id=1 --- */
    5:  if r8 >= r0 goto 9f;
    6:  r8 += r8;
    /* --- checkpoint #2: r7.id=1, r8.id=0 --- */
    7:  if r7 == 0 goto 9f;
    8:  r0 /= 0;
    /* --- checkpoint #3 --- */
    9:  r0 = 42;
    10: exit;

W/o this fix verifier incorrectly assumes that instruction at label
(8) is unreachable. The issue is caused by failure to infer
precision mark for r0 at checkpoint #1:
- first verification path is:
  - (0-4): r0 range [0,1];
  - (5): r8 range [0,0], propagated to r7;
  - (6): r8.id is reset;
  - (7): jump is predicted to happen;
  - (9-10): safe exit.
- when jump at (7) is predicted mark_chain_precision() for r7 is
  called and backtrack_insn() proceeds as follows:
  - at (7) r7 is marked as precise;
  - at (5) r8 is not currently tracked and thus r0 is not marked;
  - at (4-5) boundary logic from [1] is triggered and r7,r8 are marked
    as precise;
  - => r0 precision mark is missed.
- when second branch of (4) is considered, verifier prunes the state
  because r0 is not marked as precise in the visited state.

Basically, backtracking logic fails to notice that at (5)
range information is gained for both r7 and r8, and thus both
r8 and r0 have to be marked as precise.
This happens because [1] can only account for such range
transfers at parent/child state boundaries.

The solution suggested by Andrii Nakryiko in [0] is to use jump
history to remember which registers gained range as a result of
find_equal_scalars() [renamed to sync_linked_regs()] and use
this information in backtrack_insn().
Which is what this patch-set does.

The patch-set uses u64 value as a vector of 10-bit values that
identify registers gaining range in find_equal_scalars().
This amounts to maximum of 6 possible values.
To check if such capacity is sufficient I've instrumented kernel
to track a histogram for maximal amount of registers that gain range
in find_equal_scalars per program verification [2].
Measurements done for verifier selftests and Cilium bpf object files
from [3] show that number of such registers is *always* <= 4 and
in 98% of cases it is <= 2.

When tested on a subset of selftests identified by
selftests/bpf/veristat.cfg and Cilium bpf object files from [3]
this patch-set has minimal verification performance impact:

File                      Program                   Insns   (DIFF)  States (DIFF)
------------------------  ------------------------  --------------  -------------
bpf_host.o                tail_handle_nat_fwd_ipv4    -75 (-0.61%)    -3 (-0.39%)
pyperf600_nounroll.bpf.o  on_event                  +1673 (+0.33%)    +3 (+0.01%)

[0] https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
[1] commit 904e6dd ("bpf: Use scalar ids in mark_chain_precision()")
[2] https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history-with-stats
[3] https://github.com/anakryiko/cilium

Changes:
- v2 -> v3:
  A number of stylistic changes suggested by Andrii:
  - renamings:
    - struct reg_or_spill   -> linked_reg;
    - find_equal_scalars()  -> collect_linked_regs;
    - copy_known_reg()      -> sync_linked_regs;
  - collect_linked_regs() now returns linked regs set of
    size 2 or larger;
  - dropped usage of bit fields in struct linked_reg;
  - added a patch changing references to find_equal_scalars() in
    selftests comments.
- v1 -> v2:
  - patch "bpf: replace env->cur_hist_ent with a getter function" is
    dropped (Andrii);
  - added structure linked_regs and helper functions to [de]serialize
    u64 value as such structure (Andrii);
  - bt_set_equal_scalars() renamed to bt_sync_linked_regs(), moved to
    start and end of backtrack_insn() in order to untie linked
    register logic from conditional jumps backtracking.
    Andrii requested a more radical change of moving linked registers
    processing to bt_set_xxx() functions, I did an experiment in this
    direction:
    https://github.com/eddyz87/bpf/tree/find-equal-scalars-in-jump-history--linked-regs-in-bt-set-reg
    the end result of the experiment seems much uglier than version
    presented in v2.

Revisions:
- v1: https://lore.kernel.org/bpf/[email protected]/
- v2: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrii Nakryiko <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
Eduard Zingerman says:

====================
__jited test tag to check disassembly after jit

Some of the logic in the BPF jits might be non-trivial.
It might be useful to allow testing this logic by comparing
generated native code with expected code template.
This patch set adds a macro __jited() that could be used for
test_loader based tests in a following manner:

    SEC("tp")
    __arch_x86_64
    __jited("   endbr64")
    __jited("   nopl    (%rax,%rax)")
    __jited("   xorq    %rax, %rax")
    ...
    __naked void some_test(void) { ... }

Also add a test for jit code generated for tail calls handling to
demonstrate the feature.

The feature uses LLVM libraries to do the disassembly.
At selftests compilation time Makefile detects if these libraries are
available. When libraries are not available tests using __jit_x86()
are skipped.
Current CI environment does not include llvm development libraries,
but changes to add these are trivial.

This was previously discussed here:
https://lore.kernel.org/bpf/[email protected]/

Patch-set includes a few auxiliary steps:
- patches #2 and #3 fix a few bugs in test_loader behaviour;
- patch #4 replaces __regex macro with ability to specify regular
  expressions in __msg and __xlated using "{{" "}}" escapes;
- patch torvalds#8 updates __xlated to match disassembly lines consequently,
  same way as __jited does.

Changes v2->v3:
- changed macro name from __jit_x86 to __jited with __arch_* to
  specify disassembly arch (Yonghong);
- __jited matches disassembly lines consequently with "..."
  allowing to skip some number of lines (Andrii);
- __xlated matches disassembly lines consequently, same as __jited;
- "{{...}}" regex brackets instead of __regex macro;
- bug fixes for old commits.

Changes v1->v2:
- stylistic changes suggested by Yonghong;
- fix for -Wformat-truncation related warning when compiled with
  llvm15 (Yonghong).

v1: https://lore.kernel.org/bpf/[email protected]/
v2: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
…ptions

Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications".

This series is a follow up to the fixes:
	"[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking"

When working on the fixes, I wondered why 8xx is fine (-> never uses split
PT locks) and how PT locking even works properly with PMD page table
sharing (-> always requires split PMD PT locks).

Let's improve the split PT lock detection, make hugetlb properly depend on
it and make 8xx bail out if it would ever get enabled by accident.

As an alternative to patch #3 we could extend the Kconfig
SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the
code that actually implements it feels a bit nicer for documentation
purposes, and there is no need to actually disable it because it should
always be disabled (!SMP).

Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks
are still getting used where we would expect them.

[1] https://lkml.kernel.org/r/[email protected]


This patch (of 3):

Let's clean that up a bit and prepare for depending on
CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options.

More cleanups would be reasonable (like the arch-specific "depends on" for
CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Qi Zheng <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: "Naveen N. Rao" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
commit 823430c ("memory tier: consolidate the initialization of
memory tiers") introduces a locking change that use guard(mutex) to
instead of mutex_lock/unlock() for memory_tier_lock.  It unexpectedly
expanded the locked region to include the hotplug_memory_notifier(), as a
result, it triggers an locking dependency detected of ABBA deadlock. 
Exclude hotplug_memory_notifier() from the locked region to fixing it.

The deadlock scenario is that when a memory online event occurs, the
execution of memory notifier will access the read lock of the
memory_chain.rwsem, then the reigistration of the memory notifier in
memory_tier_init() acquires the write lock of the memory_chain.rwsem while
holding memory_tier_lock.  Then the memory online event continues to
invoke the memory hotplug callback registered by memory_tier_init(). 
Since this callback tries to acquire the memory_tier_lock, a deadlock
occurs.

In fact, this deadlock can't happen because memory_tier_init() always
executes before memory online events happen due to the subsys_initcall()
has an higher priority than module_init().

[  133.491106] WARNING: possible circular locking dependency detected
[  133.493656] 6.11.0-rc2+ torvalds#146 Tainted: G           O     N
[  133.504290] ------------------------------------------------------
[  133.515194] (udev-worker)/1133 is trying to acquire lock:
[  133.525715] ffffffff87044e28 (memory_tier_lock){+.+.}-{3:3}, at: memtier_hotplug_callback+0x383/0x4b0
[  133.536449]
[  133.536449] but task is already holding lock:
[  133.549847] ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[  133.556781]
[  133.556781] which lock already depends on the new lock.
[  133.556781]
[  133.569957]
[  133.569957] the existing dependency chain (in reverse order) is:
[  133.577618]
[  133.577618] -> #1 ((memory_chain).rwsem){++++}-{3:3}:
[  133.584997]        down_write+0x97/0x210
[  133.588647]        blocking_notifier_chain_register+0x71/0xd0
[  133.592537]        register_memory_notifier+0x26/0x30
[  133.596314]        memory_tier_init+0x187/0x300
[  133.599864]        do_one_initcall+0x117/0x5d0
[  133.603399]        kernel_init_freeable+0xab0/0xeb0
[  133.606986]        kernel_init+0x28/0x2f0
[  133.610312]        ret_from_fork+0x59/0x90
[  133.613652]        ret_from_fork_asm+0x1a/0x30
[  133.617012]
[  133.617012] -> #0 (memory_tier_lock){+.+.}-{3:3}:
[  133.623390]        __lock_acquire+0x2efd/0x5c60
[  133.626730]        lock_acquire+0x1ce/0x580
[  133.629757]        __mutex_lock+0x15c/0x1490
[  133.632731]        mutex_lock_nested+0x1f/0x30
[  133.635717]        memtier_hotplug_callback+0x383/0x4b0
[  133.638748]        notifier_call_chain+0xbf/0x370
[  133.641647]        blocking_notifier_call_chain+0x76/0xb0
[  133.644636]        memory_notify+0x2e/0x40
[  133.647427]        online_pages+0x597/0x720
[  133.650246]        memory_subsys_online+0x4f6/0x7f0
[  133.653107]        device_online+0x141/0x1d0
[  133.655831]        online_memory_block+0x4d/0x60
[  133.658616]        walk_memory_blocks+0xc0/0x120
[  133.661419]        add_memory_resource+0x51d/0x6c0
[  133.664202]        add_memory_driver_managed+0xf5/0x180
[  133.667060]        dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[  133.669949]        dax_bus_probe+0x147/0x230
[  133.672687]        really_probe+0x27f/0xac0
[  133.675463]        __driver_probe_device+0x1f3/0x460
[  133.678493]        driver_probe_device+0x56/0x1b0
[  133.681366]        __driver_attach+0x277/0x570
[  133.684149]        bus_for_each_dev+0x145/0x1e0
[  133.686937]        driver_attach+0x49/0x60
[  133.689673]        bus_add_driver+0x2f3/0x6b0
[  133.692421]        driver_register+0x170/0x4b0
[  133.695118]        __dax_driver_register+0x141/0x1b0
[  133.697910]        dax_kmem_init+0x54/0xff0 [kmem]
[  133.700794]        do_one_initcall+0x117/0x5d0
[  133.703455]        do_init_module+0x277/0x750
[  133.706054]        load_module+0x5d1d/0x74f0
[  133.708602]        init_module_from_file+0x12c/0x1a0
[  133.711234]        idempotent_init_module+0x3f1/0x690
[  133.713937]        __x64_sys_finit_module+0x10e/0x1a0
[  133.716492]        x64_sys_call+0x184d/0x20d0
[  133.719053]        do_syscall_64+0x6d/0x140
[  133.721537]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  133.724239]
[  133.724239] other info that might help us debug this:
[  133.724239]
[  133.730832]  Possible unsafe locking scenario:
[  133.730832]
[  133.735298]        CPU0                    CPU1
[  133.737759]        ----                    ----
[  133.740165]   rlock((memory_chain).rwsem);
[  133.742623]                                lock(memory_tier_lock);
[  133.745357]                                lock((memory_chain).rwsem);
[  133.748141]   lock(memory_tier_lock);
[  133.750489]
[  133.750489]  *** DEADLOCK ***
[  133.750489]
[  133.756742] 6 locks held by (udev-worker)/1133:
[  133.759179]  #0: ffff888207be6158 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x26c/0x570
[  133.762299]  #1: ffffffff875b5868 (device_hotplug_lock){+.+.}-{3:3}, at: lock_device_hotplug+0x20/0x30
[  133.765565]  #2: ffff88820cf6a108 (&dev->mutex){....}-{3:3}, at: device_online+0x2f/0x1d0
[  133.768978]  #3: ffffffff86d08ff0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x17/0x30
[  133.772312]  #4: ffffffff8702dfb0 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x23/0x30
[  133.775544]  #5: ffffffff875d3310 ((memory_chain).rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x60/0xb0
[  133.779113]
[  133.779113] stack backtrace:
[  133.783728] CPU: 5 UID: 0 PID: 1133 Comm: (udev-worker) Tainted: G           O     N 6.11.0-rc2+ torvalds#146
[  133.787220] Tainted: [O]=OOT_MODULE, [N]=TEST
[  133.789948] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  133.793291] Call Trace:
[  133.795826]  <TASK>
[  133.798284]  dump_stack_lvl+0xea/0x150
[  133.801025]  dump_stack+0x19/0x20
[  133.803609]  print_circular_bug+0x477/0x740
[  133.806341]  check_noncircular+0x2f4/0x3e0
[  133.809056]  ? __pfx_check_noncircular+0x10/0x10
[  133.811866]  ? __pfx_lockdep_lock+0x10/0x10
[  133.814670]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[  133.817610]  __lock_acquire+0x2efd/0x5c60
[  133.820339]  ? __pfx___lock_acquire+0x10/0x10
[  133.823128]  ? __dax_driver_register+0x141/0x1b0
[  133.825926]  ? do_one_initcall+0x117/0x5d0
[  133.828648]  lock_acquire+0x1ce/0x580
[  133.831349]  ? memtier_hotplug_callback+0x383/0x4b0
[  133.834293]  ? __pfx_lock_acquire+0x10/0x10
[  133.837134]  __mutex_lock+0x15c/0x1490
[  133.839829]  ? memtier_hotplug_callback+0x383/0x4b0
[  133.842753]  ? memtier_hotplug_callback+0x383/0x4b0
[  133.845602]  ? __this_cpu_preempt_check+0x21/0x30
[  133.848438]  ? __pfx___mutex_lock+0x10/0x10
[  133.851200]  ? __pfx_lock_acquire+0x10/0x10
[  133.853935]  ? global_dirty_limits+0xc0/0x160
[  133.856699]  ? __sanitizer_cov_trace_switch+0x58/0xa0
[  133.859564]  mutex_lock_nested+0x1f/0x30
[  133.862251]  ? mutex_lock_nested+0x1f/0x30
[  133.864964]  memtier_hotplug_callback+0x383/0x4b0
[  133.867752]  notifier_call_chain+0xbf/0x370
[  133.870550]  ? writeback_set_ratelimit+0xe8/0x160
[  133.873372]  blocking_notifier_call_chain+0x76/0xb0
[  133.876311]  memory_notify+0x2e/0x40
[  133.879013]  online_pages+0x597/0x720
[  133.881686]  ? irqentry_exit+0x3e/0xa0
[  133.884397]  ? __pfx_online_pages+0x10/0x10
[  133.887244]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[  133.890299]  ? mhp_init_memmap_on_memory+0x7a/0x1c0
[  133.893203]  memory_subsys_online+0x4f6/0x7f0
[  133.896099]  ? __pfx_memory_subsys_online+0x10/0x10
[  133.899039]  ? xa_load+0x16d/0x2e0
[  133.901667]  ? __pfx_xa_load+0x10/0x10
[  133.904366]  ? __pfx_memory_subsys_online+0x10/0x10
[  133.907218]  device_online+0x141/0x1d0
[  133.909845]  online_memory_block+0x4d/0x60
[  133.912494]  walk_memory_blocks+0xc0/0x120
[  133.915104]  ? __pfx_online_memory_block+0x10/0x10
[  133.917776]  add_memory_resource+0x51d/0x6c0
[  133.920404]  ? __pfx_add_memory_resource+0x10/0x10
[  133.923104]  ? _raw_write_unlock+0x31/0x60
[  133.925781]  ? register_memory_resource+0x119/0x180
[  133.928450]  add_memory_driver_managed+0xf5/0x180
[  133.931036]  dev_dax_kmem_probe+0x7f7/0xb40 [kmem]
[  133.933665]  ? __pfx_dev_dax_kmem_probe+0x10/0x10 [kmem]
[  133.936332]  ? __pfx___up_read+0x10/0x10
[  133.938878]  dax_bus_probe+0x147/0x230
[  133.941332]  ? __pfx_dax_bus_probe+0x10/0x10
[  133.943954]  really_probe+0x27f/0xac0
[  133.946387]  ? __sanitizer_cov_trace_const_cmp1+0x1e/0x30
[  133.949106]  __driver_probe_device+0x1f3/0x460
[  133.951704]  ? parse_option_str+0x149/0x190
[  133.954241]  driver_probe_device+0x56/0x1b0
[  133.956749]  __driver_attach+0x277/0x570
[  133.959228]  ? __pfx___driver_attach+0x10/0x10
[  133.961776]  bus_for_each_dev+0x145/0x1e0
[  133.964367]  ? __pfx_bus_for_each_dev+0x10/0x10
[  133.967019]  ? __kasan_check_read+0x15/0x20
[  133.969543]  ? _raw_spin_unlock+0x31/0x60
[  133.972132]  driver_attach+0x49/0x60
[  133.974536]  bus_add_driver+0x2f3/0x6b0
[  133.977044]  driver_register+0x170/0x4b0
[  133.979480]  __dax_driver_register+0x141/0x1b0
[  133.982126]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[  133.984724]  dax_kmem_init+0x54/0xff0 [kmem]
[  133.987284]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[  133.989965]  do_one_initcall+0x117/0x5d0
[  133.992506]  ? __pfx_do_one_initcall+0x10/0x10
[  133.995185]  ? __kasan_kmalloc+0x88/0xa0
[  133.997748]  ? kasan_poison+0x3e/0x60
[  134.000288]  ? kasan_unpoison+0x2c/0x60
[  134.002762]  ? kasan_poison+0x3e/0x60
[  134.005202]  ? __asan_register_globals+0x62/0x80
[  134.007753]  ? __pfx_dax_kmem_init+0x10/0x10 [kmem]
[  134.010439]  do_init_module+0x277/0x750
[  134.012953]  load_module+0x5d1d/0x74f0
[  134.015406]  ? __pfx_load_module+0x10/0x10
[  134.017887]  ? __pfx_ima_post_read_file+0x10/0x10
[  134.020470]  ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[  134.023127]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  134.025767]  ? security_kernel_post_read_file+0xa2/0xd0
[  134.028429]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  134.031162]  ? kernel_read_file+0x503/0x820
[  134.033645]  ? __pfx_kernel_read_file+0x10/0x10
[  134.036232]  ? __pfx___lock_acquire+0x10/0x10
[  134.038766]  init_module_from_file+0x12c/0x1a0
[  134.041291]  ? init_module_from_file+0x12c/0x1a0
[  134.043936]  ? __pfx_init_module_from_file+0x10/0x10
[  134.046516]  ? __this_cpu_preempt_check+0x21/0x30
[  134.049091]  ? __kasan_check_read+0x15/0x20
[  134.051551]  ? do_raw_spin_unlock+0x60/0x210
[  134.054077]  idempotent_init_module+0x3f1/0x690
[  134.056643]  ? __pfx_idempotent_init_module+0x10/0x10
[  134.059318]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  134.061995]  ? __fget_light+0x17d/0x210
[  134.064428]  __x64_sys_finit_module+0x10e/0x1a0
[  134.066976]  x64_sys_call+0x184d/0x20d0
[  134.069405]  do_syscall_64+0x6d/0x140
[  134.071926]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

[[email protected]: add mutex_lock/unlock() pair back]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 823430c ("memory tier: consolidate the initialization of memory tiers")
Signed-off-by: Yanfei Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Ho-Ren (Jack) Chuang <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
Patch series "zram: introduce custom comp backends API", v7.

This series introduces support for run-time compression algorithms tuning,
so users, for instance, can adjust compression/acceleration levels and
provide pre-trained compression/decompression dictionaries which certain
algorithms support.

At this point we stop supporting (old/deprecated) comp API.  We may add
new acomp API support in the future, but before that zram needs to undergo
some major rework (we are not ready for async compression).

Some benchmarks for reference (look at column #2)

*** init zstd
/sys/block/zram0/mm_stat
1750659072 504622188 514355200        0 514355200        1        0    34204    34204

*** init zstd dict=/home/ss/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750650880 465908890 475398144        0 475398144        1        0    34185    34185

*** init zstd level=8 dict=/home/ss/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750654976 430803319 439873536        0 439873536        1        0    34185    34185

*** init lz4
/sys/block/zram0/mm_stat
1750646784 664266564 677060608        0 677060608        1        0    34288    34288

*** init lz4 dict=/home/ss/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750650880 619990300 632102912        0 632102912        1        0    34278    34278

*** init lz4hc
/sys/block/zram0/mm_stat
1750630400 609023822 621232128        0 621232128        1        0    34288    34288

*** init lz4hc dict=/home/ss/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750659072 505133172 515231744        0 515231744        1        0    34278    34278


Recompress
init zram zstd (prio=0), zstd level=5 (prio 1), zstd with dict (prio 2)

*** zstd
/sys/block/zram0/mm_stat
1750982656 504630584 514269184        0 514269184        1        0    34204    34204

*** idle recompress priority=1 (zstd level=5)
/sys/block/zram0/mm_stat
1750982656 488645601 525438976        0 514269184        1        0    34204    34204

*** idle recompress priority=2 (zstd dict)
/sys/block/zram0/mm_stat
1750982656 460869640 517914624        0 514269184        1        0    34185    34204


This patch (of 24):

We need to export a number of API functions that enable advanced zstd
usage - C/D dictionaries, dictionaries sharing between contexts, etc.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Nick Terrell <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
akiyks pushed a commit that referenced this issue Sep 21, 2024
Hou Tao says:

====================
The tiny patch set aims to fix two problems found during the development
of supporting dynptr key in hash table. Patch #1 fixes the missed
btf_record_free() when map creation fails and patch #2 fixes the missed
kfree() when there is no special field in the passed btf.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
akiyks pushed a commit that referenced this issue Sep 23, 2024
Leak sanitizer will report memory leaks from python and the leak
sanitizer output causes tests to fail. For example:

  ```
  $ perf test 98 -v
   98: perf script tests:
  --- start ---
  test child forked, pid 1272962
  DB test
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.046 MB /tmp/perf-test-script.x0EktdCel8/perf.data (8 samples) ]
  call_path_table((1, 0, 0, 0)
  call_path_table((2, 1, 0, 140339508617447)
  call_path_table((3, 2, 2, 0)
  call_path_table((4, 3, 3, 0)
  call_path_table((5, 4, 4, 0)
  call_path_table((6, 5, 5, 0)
  call_path_table((7, 6, 6, 0)
  call_path_table((8, 7, 7, 0)
  call_path_table((9, 8, 8, 0)
  call_path_table((10, 9, 9, 0)
  call_path_table((11, 10, 10, 0)
  call_path_table((12, 11, 11, 0)
  call_path_table((13, 12, 1, 0)
  sample_table((1, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954119000, -1, 0, 0, 0, 0, 1, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  sample_table((2, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954137053, -1, 0, 0, 0, 0, 1, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  sample_table((3, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954140089, -1, 0, 0, 0, 0, 9, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  sample_table((4, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954142376, -1, 0, 0, 0, 0, 155, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  sample_table((5, 1, 1, 1, 1, 1, 1, 8, -2058824120, 588306954144045, -1, 0, 0, 0, 0, 2493, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  sample_table((6, 1, 1, 1, 1, 1, 12, 77, -2046828595, 588306954145722, -1, 0, 0, 0, 0, 47555, 0, 0, 128933429281, 0, 0, 13, 0, 0, 0, -1, -1))
  call_path_table((14, 9, 14, 0)
  call_path_table((15, 14, 15, 0)
  call_path_table((16, 15, 0, -1040969624)
  call_path_table((17, 16, 16, 0)
  call_path_table((18, 17, 17, 0)
  call_path_table((19, 18, 18, 0)
  call_path_table((20, 19, 19, 0)
  call_path_table((21, 20, 13, 0)
  sample_table((7, 1, 1, 1, 2, 1, 13, 46, -2053700898, 588306954157436, -1, 0, 0, 0, 0, 964078, 0, 0, 128933429281, 0, 0, 21, 0, 0, 0, -1, -1))
  call_path_table((22, 1, 21, 0)
  call_path_table((23, 22, 22, 0)
  call_path_table((24, 23, 23, 0)
  call_path_table((25, 24, 24, 0)
  call_path_table((26, 25, 25, 0)
  call_path_table((27, 26, 26, 0)
  call_path_table((28, 27, 27, 0)
  call_path_table((29, 28, 28, 0)
  call_path_table((30, 29, 29, 0)
  call_path_table((31, 30, 30, 0)
  call_path_table((32, 31, 31, 0)
  call_path_table((33, 32, 32, 0)
  call_path_table((34, 33, 33, 0)
  call_path_table((35, 34, 20, 0)
  sample_table((8, 1, 1, 1, 2, 1, 20, 49, -2046878127, 588306954378624, -1, 0, 0, 0, 0, 2534317, 0, 0, 128933429281, 0, 0, 35, 0, 0, 0, -1, -1))

  =================================================================
  ==1272975==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 13628 byte(s) in 6 object(s) allocated from:
      #0 0x56354f60c092 in malloc (/tmp/perf/perf+0x29c092)
      #1 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:2003:11
      #2 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:1996:1

  SUMMARY: AddressSanitizer: 13628 byte(s) leaked in 6 allocation(s).
  --- Cleaning up ---
  ---- end(-1) ----
   98: perf script tests                                               : FAILED!
  ```

Disable leak sanitizer when running specific perf+python tests to
avoid this. This causes the tests to pass when run with leak
sanitizer.

Reviewed-by: Aditya Gupta <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Richter <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
akiyks pushed a commit that referenced this issue Sep 23, 2024
iter_finish_branch_entry() doesn't put the branch_info from/to map
elements creating memory leaks. This can be seen with:

```
$ perf record -e cycles -b perf test -w noploop
$ perf report -D
...
Direct leak of 984344 byte(s) in 123043 object(s) allocated from:
    #0 0x7fb2654f3bd7 in malloc libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x564d3400d10b in map__get util/map.h:186
    #2 0x564d3400d10b in ip__resolve_ams util/machine.c:1981
    #3 0x564d34014d81 in sample__resolve_bstack util/machine.c:2151
    #4 0x564d34094790 in iter_prepare_branch_entry util/hist.c:898
    #5 0x564d34098fa4 in hist_entry_iter__add util/hist.c:1238
    torvalds#6 0x564d33d1f0c7 in process_sample_event tools/perf/builtin-report.c:334
    torvalds#7 0x564d34031eb7 in perf_session__deliver_event util/session.c:1655
    torvalds#8 0x564d3403ba52 in do_flush util/ordered-events.c:245
    torvalds#9 0x564d3403ba52 in __ordered_events__flush util/ordered-events.c:324
    torvalds#10 0x564d3402d32e in perf_session__process_user_event util/session.c:1708
    torvalds#11 0x564d34032480 in perf_session__process_event util/session.c:1877
    torvalds#12 0x564d340336ad in reader__read_event util/session.c:2399
    torvalds#13 0x564d34033fdc in reader__process_events util/session.c:2448
    torvalds#14 0x564d34033fdc in __perf_session__process_events util/session.c:2495
    torvalds#15 0x564d34033fdc in perf_session__process_events util/session.c:2661
    torvalds#16 0x564d33d27113 in __cmd_report tools/perf/builtin-report.c:1065
    torvalds#17 0x564d33d27113 in cmd_report tools/perf/builtin-report.c:1805
    torvalds#18 0x564d33e0ccb7 in run_builtin tools/perf/perf.c:350
    torvalds#19 0x564d33e0d45e in handle_internal_command tools/perf/perf.c:403
    torvalds#20 0x564d33cdd827 in run_argv tools/perf/perf.c:447
    torvalds#21 0x564d33cdd827 in main tools/perf/perf.c:561
...
```

Clearing up the map_symbols properly creates maps reference count
issues so resolve those. Resolving this issue doesn't improve peak
heap consumption for the test above.

Committer testing:

  $ sudo dnf install libasan
  $ make -k CORESIGHT=1 EXTRA_CFLAGS="-fsanitize=address" CC=clang O=/tmp/build/$(basename $PWD)/ -C tools/perf install-bin

Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Yanteng Si <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
akiyks pushed a commit that referenced this issue Sep 23, 2024
AddressSanitizer found a use-after-free bug in the symbol code which
manifested as 'perf top' segfaulting.

  ==1238389==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b00c48844b at pc 0x5650d8035961 bp 0x7f751aaecc90 sp 0x7f751aaecc80
  READ of size 1 at 0x60b00c48844b thread T193
      #0 0x5650d8035960 in _sort__sym_cmp util/sort.c:310
      #1 0x5650d8043744 in hist_entry__cmp util/hist.c:1286
      #2 0x5650d8043951 in hists__findnew_entry util/hist.c:614
      #3 0x5650d804568f in __hists__add_entry util/hist.c:754
      #4 0x5650d8045bf9 in hists__add_entry util/hist.c:772
      #5 0x5650d8045df1 in iter_add_single_normal_entry util/hist.c:997
      torvalds#6 0x5650d8043326 in hist_entry_iter__add util/hist.c:1242
      torvalds#7 0x5650d7ceeefe in perf_event__process_sample /home/matt/src/linux/tools/perf/builtin-top.c:845
      torvalds#8 0x5650d7ceeefe in deliver_event /home/matt/src/linux/tools/perf/builtin-top.c:1208
      torvalds#9 0x5650d7fdb51b in do_flush util/ordered-events.c:245
      torvalds#10 0x5650d7fdb51b in __ordered_events__flush util/ordered-events.c:324
      torvalds#11 0x5650d7ced743 in process_thread /home/matt/src/linux/tools/perf/builtin-top.c:1120
      torvalds#12 0x7f757ef1f133 in start_thread nptl/pthread_create.c:442
      torvalds#13 0x7f757ef9f7db in clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

When updating hist maps it's also necessary to update the hist symbol
reference because the old one gets freed in map__put().

While this bug was probably introduced with 5c24b67 ("perf
tools: Replace map->referenced & maps->removed_maps with map->refcnt"),
the symbol objects were leaked until c087e94 ("perf machine:
Fix refcount usage when processing PERF_RECORD_KSYMBOL") was merged so
the bug was masked.

Fixes: c087e94 ("perf machine: Fix refcount usage when processing PERF_RECORD_KSYMBOL")
Reported-by: Yunzhao Li <[email protected]>
Signed-off-by: Matt Fleming (Cloudflare) <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: [email protected]
Cc: Namhyung Kim <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: [email protected] # v5.13+
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
akiyks pushed a commit that referenced this issue Sep 23, 2024
The fields in the hist_entry are filled on-demand which means they only
have meaningful values when relevant sort keys are used.

So if neither of 'dso' nor 'sym' sort keys are used, the map/symbols in
the hist entry can be garbage.  So it shouldn't access it
unconditionally.

I got a segfault, when I wanted to see cgroup profiles.

  $ sudo perf record -a --all-cgroups --synth=cgroup true

  $ sudo perf report -s cgroup

  Program received signal SIGSEGV, Segmentation fault.
  0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
  48		return RC_CHK_ACCESS(map)->dso;
  (gdb) bt
  #0  0x00005555557a8d90 in map__dso (map=0x0) at util/map.h:48
  #1  0x00005555557aa39b in map__load (map=0x0) at util/map.c:344
  #2  0x00005555557aa592 in map__find_symbol (map=0x0, addr=140736115941088) at util/map.c:385
  #3  0x00005555557ef000 in hists__findnew_entry (hists=0x555556039d60, entry=0x7fffffffa4c0, al=0x7fffffffa8c0, sample_self=true)
      at util/hist.c:644
  #4  0x00005555557ef61c in __hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
      block_info=0x0, sample=0x7fffffffaa90, sample_self=true, ops=0x0) at util/hist.c:761
  #5  0x00005555557ef71f in hists__add_entry (hists=0x555556039d60, al=0x7fffffffa8c0, sym_parent=0x0, bi=0x0, mi=0x0, ki=0x0,
      sample=0x7fffffffaa90, sample_self=true) at util/hist.c:779
  torvalds#6  0x00005555557f00fb in iter_add_single_normal_entry (iter=0x7fffffffa900, al=0x7fffffffa8c0) at util/hist.c:1015
  torvalds#7  0x00005555557f09a7 in hist_entry_iter__add (iter=0x7fffffffa900, al=0x7fffffffa8c0, max_stack_depth=127, arg=0x7fffffffbce0)
      at util/hist.c:1260
  torvalds#8  0x00005555555ba7ce in process_sample_event (tool=0x7fffffffbce0, event=0x7ffff7c14128, sample=0x7fffffffaa90, evsel=0x555556039ad0,
      machine=0x5555560388e8) at builtin-report.c:334
  torvalds#9  0x00005555557b30c8 in evlist__deliver_sample (evlist=0x555556039010, tool=0x7fffffffbce0, event=0x7ffff7c14128,
      sample=0x7fffffffaa90, evsel=0x555556039ad0, machine=0x5555560388e8) at util/session.c:1232
  torvalds#10 0x00005555557b32bc in machines__deliver_event (machines=0x5555560388e8, evlist=0x555556039010, event=0x7ffff7c14128,
      sample=0x7fffffffaa90, tool=0x7fffffffbce0, file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1271
  torvalds#11 0x00005555557b3848 in perf_session__deliver_event (session=0x5555560386d0, event=0x7ffff7c14128, tool=0x7fffffffbce0,
      file_offset=110888, file_path=0x555556038ff0 "perf.data") at util/session.c:1354
  torvalds#12 0x00005555557affaf in ordered_events__deliver_event (oe=0x555556038e60, event=0x555556135aa0) at util/session.c:132
  torvalds#13 0x00005555557bb605 in do_flush (oe=0x555556038e60, show_progress=false) at util/ordered-events.c:245
  torvalds#14 0x00005555557bb95c in __ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND, timestamp=0) at util/ordered-events.c:324
  torvalds#15 0x00005555557bba46 in ordered_events__flush (oe=0x555556038e60, how=OE_FLUSH__ROUND) at util/ordered-events.c:342
  torvalds#16 0x00005555557b1b3b in perf_event__process_finished_round (tool=0x7fffffffbce0, event=0x7ffff7c15bb8, oe=0x555556038e60)
      at util/session.c:780
  torvalds#17 0x00005555557b3b27 in perf_session__process_user_event (session=0x5555560386d0, event=0x7ffff7c15bb8, file_offset=117688,
      file_path=0x555556038ff0 "perf.data") at util/session.c:1406

As you can see the entry->ms.map was NULL even if he->ms.map has a
value.  This is because 'sym' sort key is not given, so it cannot assume
whether he->ms.sym and entry->ms.sym is the same.  I only checked the
'sym' sort key here as it implies 'dso' behavior (so maps are the same).

Fixes: ac01c8c ("perf hist: Update hist symbol when updating maps")
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Matt Fleming <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
akiyks pushed a commit that referenced this issue Sep 25, 2024
Case #1:
SQLite App		GC Thread		Kworker		Shrinker
- f2fs_ioc_start_atomic_write

- f2fs_ioc_commit_atomic_write
 - f2fs_commit_atomic_write
  - filemap_write_and_wait_range
  : write atomic_file's data to cow_inode
								echo 3 > drop_caches
								to drop atomic_file's
								cache.
			- f2fs_gc
			 - gc_data_segment
			  - move_data_page
			   - set_page_dirty

						- writepages
						 - f2fs_do_write_data_page
						 : overwrite atomic_file's data
						   to cow_inode
  - f2fs_down_write(&fi->i_gc_rwsem[WRITE])
  - __f2fs_commit_atomic_write
  - f2fs_up_write(&fi->i_gc_rwsem[WRITE])

Case #2:
SQLite App		GC Thread		Kworker
- f2fs_ioc_start_atomic_write

						- __writeback_single_inode
						 - do_writepages
						  - f2fs_write_cache_pages
						   - f2fs_write_single_data_page
						    - f2fs_do_write_data_page
						    : write atomic_file's data to cow_inode
			- f2fs_gc
			 - gc_data_segment
			  - move_data_page
			   - set_page_dirty

						- writepages
						 - f2fs_do_write_data_page
						 : overwrite atomic_file's data to cow_inode
- f2fs_ioc_commit_atomic_write

In above cases racing in between atomic_write and GC, previous
data in atomic_file may be overwrited to cow_file, result in
data corruption.

This patch introduces PAGE_PRIVATE_ATOMIC_WRITE bit flag in page.private,
and use it to indicate that there is last dirty data in atomic file,
and the data should be writebacked into cow_file, if the flag is not
tagged in page, we should never write data across files.

Fixes: 3db1de0 ("f2fs: change the current atomic write way")
Cc: Daeho Jeong <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
akiyks pushed a commit that referenced this issue Sep 27, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

v2: with kdoc fixes per Paolo Abeni.

The following patchset contains Netfilter fixes for net:

Patch #1 and #2 handle an esoteric scenario: Given two tasks sending UDP
packets to one another, two packets of the same flow in each direction
handled by different CPUs that result in two conntrack objects in NEW
state, where reply packet loses race. Then, patch #3 adds a testcase for
this scenario. Series from Florian Westphal.

1) NAT engine can falsely detect a port collision if it happens to pick
   up a reply packet as NEW rather than ESTABLISHED. Add extra code to
   detect this and suppress port reallocation in this case.

2) To complete the clash resolution in the reply direction, extend conntrack
   logic to detect clashing conntrack in the reply direction to existing entry.

3) Adds a test case.

Then, an assorted list of fixes follow:

4) Add a selftest for tproxy, from Antonio Ojea.

5) Guard ctnetlink_*_size() functions under
   #if defined(CONFIG_NETFILTER_NETLINK_GLUE_CT) || defined(CONFIG_NF_CONNTRACK_EVENTS)
   From Andy Shevchenko.

6) Use -m socket --transparent in iptables tproxy documentation.
   From XIE Zhibang.

7) Call kfree_rcu() when releasing flowtable hooks to address race with
   netlink dump path, from Phil Sutter.

8) Fix compilation warning in nf_reject with CONFIG_BRIDGE_NETFILTER=n.
   From Simon Horman.

9) Guard ctnetlink_label_size() under CONFIG_NF_CONNTRACK_EVENTS which
   is its only user, to address a compilation warning. From Simon Horman.

10) Use rcu-protected list iteration over basechain hooks from netlink
    dump path.

11) Fix memcg for nf_tables, use GFP_KERNEL_ACCOUNT is not complete.

12) Remove old nfqueue conntrack clash resolution. Instead trying to
    use same destination address consistently which requires double DNAT,
    use the existing clash resolution which allows clashing packets
    go through with different destination. Antonio Ojea originally
    reported an issue from the postrouting chain, I proposed a fix:
    https://lore.kernel.org/netfilter-devel/ZuwSwAqKgCB2a51-@calendula/T/
    which he reported it did not work for him.

13) Adds a selftest for patch 12.

14) Fixes ipvs.sh selftest.

netfilter pull request 24-09-26

* tag 'nf-24-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: netfilter: Avoid hanging ipvs.sh
  kselftest: add test for nfqueue induced conntrack race
  netfilter: nfnetlink_queue: remove old clash resolution logic
  netfilter: nf_tables: missing objects with no memcg accounting
  netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path
  netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS
  netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n
  netfilter: nf_tables: Keep deleted flowtable hooks until after RCU
  docs: tproxy: ignore non-transparent sockets in iptables
  netfilter: ctnetlink: Guard possible unused functions
  selftests: netfilter: nft_tproxy.sh: add tcp tests
  selftests: netfilter: add reverse-clash resolution test case
  netfilter: conntrack: add clash resolution for reverse collisions
  netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants