Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmc: handling of REQ_FUA #1

Open
svenkatr opened this issue Jun 14, 2012 · 0 comments
Open

mmc: handling of REQ_FUA #1

svenkatr opened this issue Jun 14, 2012 · 0 comments
Assignees

Comments

@svenkatr
Copy link
Owner

Handling of REQ_FUA -> should flush the cache when such request is received.

@ghost ghost assigned svenkatr Jun 14, 2012
svenkatr pushed a commit that referenced this issue Jun 22, 2012
usb_ep_ops.disable must clear external copy of the endpoint descriptor,
otherwise musb crashes after loading/unloading several gadget modules
in a row:

Unable to handle kernel paging request at virtual address bf013730
pgd = c0004000
[bf013730] *pgd=8f26d811, *pte=00000000, *ppte=00000000
Internal error: Oops: 7 [#1]
Modules linked in: g_cdc [last unloaded: g_file_storage]
CPU: 0    Not tainted  (3.2.17 #647)
PC is at musb_gadget_enable+0x4c/0x24c
LR is at _raw_spin_lock_irqsave+0x4c/0x58
[<c027c030>] (musb_gadget_enable+0x4c/0x24c) from [<bf01b760>] (gether_connect+0x3c/0x19c [g_cdc])
[<bf01b760>] (gether_connect+0x3c/0x19c [g_cdc]) from [<bf01ba1c>] (ecm_set_alt+0x15c/0x180 [g_cdc])
[<bf01ba1c>] (ecm_set_alt+0x15c/0x180 [g_cdc]) from [<bf01ecd4>] (composite_setup+0x85c/0xac4 [g_cdc])
[<bf01ecd4>] (composite_setup+0x85c/0xac4 [g_cdc]) from [<c027b744>] (musb_g_ep0_irq+0x844/0x924)
[<c027b744>] (musb_g_ep0_irq+0x844/0x924) from [<c027a97c>] (musb_interrupt+0x79c/0x864)
[<c027a97c>] (musb_interrupt+0x79c/0x864) from [<c027aaa8>] (generic_interrupt+0x64/0x7c)
[<c027aaa8>] (generic_interrupt+0x64/0x7c) from [<c00797cc>] (handle_irq_event_percpu+0x28/0x178)
...

Cc: [email protected] # v3.1+
Signed-off-by: Grazvydas Ignotas <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
Remove spinlock as atomic_t can be used instead. Note we use only 16
lower bits, upper bits are changed but we impilcilty cast to u16.

This fix possible deadlock on IBSS mode reproted by lockdep:

=================================
[ INFO: inconsistent lock state ]
3.4.0-wl+ #4 Not tainted
---------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/u:2/30374 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&intf->seqlock)->rlock){+.?...}, at: [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
{IN-SOFTIRQ-W} state was registered at:
  [<c04978ab>] __lock_acquire+0x47b/0x1050
  [<c0498504>] lock_acquire+0x84/0xf0
  [<c0835733>] _raw_spin_lock+0x33/0x40
  [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
  [<f9979f2a>] rt2x00queue_write_tx_frame+0x1a/0x300 [rt2x00lib]
  [<f997834f>] rt2x00mac_tx+0x7f/0x380 [rt2x00lib]
  [<f98fe363>] __ieee80211_tx+0x1b3/0x300 [mac80211]
  [<f98ffdf5>] ieee80211_tx+0x105/0x130 [mac80211]
  [<f99000dd>] ieee80211_xmit+0xad/0x100 [mac80211]
  [<f9900519>] ieee80211_subif_start_xmit+0x2d9/0x930 [mac80211]
  [<c0782e87>] dev_hard_start_xmit+0x307/0x660
  [<c079bb71>] sch_direct_xmit+0xa1/0x1e0
  [<c0784bb3>] dev_queue_xmit+0x183/0x730
  [<c078c27a>] neigh_resolve_output+0xfa/0x1e0
  [<c07b436a>] ip_finish_output+0x24a/0x460
  [<c07b4897>] ip_output+0xb7/0x100
  [<c07b2d60>] ip_local_out+0x20/0x60
  [<c07e01ff>] igmpv3_sendpack+0x4f/0x60
  [<c07e108f>] igmp_ifc_timer_expire+0x29f/0x330
  [<c04520fc>] run_timer_softirq+0x15c/0x2f0
  [<c0449e3e>] __do_softirq+0xae/0x1e0
irq event stamp: 18380437
hardirqs last  enabled at (18380437): [<c0526027>] __slab_alloc.clone.3+0x67/0x5f0
hardirqs last disabled at (18380436): [<c0525ff3>] __slab_alloc.clone.3+0x33/0x5f0
softirqs last  enabled at (18377616): [<c0449eb3>] __do_softirq+0x123/0x1e0
softirqs last disabled at (18377611): [<c041278d>] do_softirq+0x9d/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&intf->seqlock)->rlock);
  <Interrupt>
    lock(&(&intf->seqlock)->rlock);

 *** DEADLOCK ***

4 locks held by kworker/u:2/30374:
 #0:  (wiphy_name(local->hw.wiphy)){++++.+}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #1:  ((&sdata->work)){+.+.+.}, at: [<c045cf99>] process_one_work+0x109/0x3f0
 #2:  (&ifibss->mtx){+.+.+.}, at: [<f98f005b>] ieee80211_ibss_work+0x1b/0x470 [mac80211]
 #3:  (&intf->beacon_skb_mutex){+.+...}, at: [<f997a644>] rt2x00queue_update_beacon+0x24/0x50 [rt2x00lib]

stack backtrace:
Pid: 30374, comm: kworker/u:2 Not tainted 3.4.0-wl+ #4
Call Trace:
 [<c04962a6>] print_usage_bug+0x1f6/0x220
 [<c0496a12>] mark_lock+0x2c2/0x300
 [<c0495ff0>] ? check_usage_forwards+0xc0/0xc0
 [<c04978ec>] __lock_acquire+0x4bc/0x1050
 [<c0527890>] ? __kmalloc_track_caller+0x1c0/0x1d0
 [<c0777fb6>] ? copy_skb_header+0x26/0x90
 [<c0498504>] lock_acquire+0x84/0xf0
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<c0835733>] _raw_spin_lock+0x33/0x40
 [<f9979a20>] ? rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f9979a20>] rt2x00queue_create_tx_descriptor+0x380/0x490 [rt2x00lib]
 [<f997a5cf>] rt2x00queue_update_beacon_locked+0x5f/0xb0 [rt2x00lib]
 [<f997a64d>] rt2x00queue_update_beacon+0x2d/0x50 [rt2x00lib]
 [<f9977e3a>] rt2x00mac_bss_info_changed+0x1ca/0x200 [rt2x00lib]
 [<f9977c70>] ? rt2x00mac_remove_interface+0x70/0x70 [rt2x00lib]
 [<f98e4dd0>] ieee80211_bss_info_change_notify+0xe0/0x1d0 [mac80211]
 [<f98ef7b8>] __ieee80211_sta_join_ibss+0x3b8/0x610 [mac80211]
 [<c0496ab4>] ? mark_held_locks+0x64/0xc0
 [<c0440012>] ? virt_efi_query_capsule_caps+0x12/0x50
 [<f98efb09>] ieee80211_sta_join_ibss+0xf9/0x140 [mac80211]
 [<f98f0456>] ieee80211_ibss_work+0x416/0x470 [mac80211]
 [<c0496d8b>] ? trace_hardirqs_on+0xb/0x10
 [<c077683b>] ? skb_dequeue+0x4b/0x70
 [<f98f207f>] ieee80211_iface_work+0x13f/0x230 [mac80211]
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<c045d015>] process_one_work+0x185/0x3f0
 [<c045cf99>] ? process_one_work+0x109/0x3f0
 [<f98f1f40>] ? ieee80211_teardown_sdata+0xa0/0xa0 [mac80211]
 [<c045ed86>] worker_thread+0x116/0x270
 [<c045ec70>] ? manage_workers+0x1e0/0x1e0
 [<c0462f64>] kthread+0x84/0x90
 [<c0462ee0>] ? __init_kthread_worker+0x60/0x60
 [<c083d382>] kernel_thread_helper+0x6/0x10

Cc: [email protected]
Signed-off-by: Stanislaw Gruszka <[email protected]>
Acked-by: Helmut Schaa <[email protected]>
Acked-by: Gertjan van Wingerde <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
The omapdss arch initialization code registers all the output devices as
omap_devices. However, DPI and SDI are not proper omap_devices, as they
do not have any corresponding HWMOD. This leads to crashes or problems
when the platform code tries to use omap_device functions for DPI and
SDI devices.

One such crash was reported by John Stultz <[email protected]>:

[   18.756835] Unable to handle kernel NULL pointer dereference at
virtual addr8
[   18.765319] pgd = ea6b8000
[   18.768188] [00000018] *pgd=aa942831, *pte=00000000, *ppte=00000000
[   18.774749] Internal error: Oops: 17 [#1] SMP ARM
[   18.779663] Modules linked in:
[   18.782836] CPU: 0    Not tainted  (3.5.0-rc1-dirty #456)
[   18.788482] PC is at _od_resume_noirq+0x1c/0x78
[   18.793212] LR is at _od_resume_noirq+0x6c/0x78
[   18.797943] pc : [<c00307ec>]    lr : [<c003083c>]    psr: 20000113
[   18.797943] sp : ec3abe80  ip : ec3abdb8  fp : 00000006
[   18.809936] r10: ec1148b8  r9 : c08a48f0  r8 : c00307d0
[   18.815368] r7 : 00000000  r6 : 00000000  r5 : ec114800  r4 :
ec114808
[   18.822174] r3 : 00000000  r2 : 00000000  r1 : ec154fe8  r0 :
00000006
[   18.829010] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
Segment user
[   18.836456] Control: 10c5387d  Table: aa6b804a  DAC: 00000015
[   18.842437] Process sh (pid: 1139, stack limit = 0xec3aa2f0)
[   18.848358] Stack: (0xec3abe80 to 0xec3ac000)

DPI and SDI can be plain platform_devices. This patch changes the
registration from omap_device_register() to platform_device_add().

Signed-off-by: Tomi Valkeinen <[email protected]>
Reported-by: John Stultz <[email protected]>
Tested-by: Jean Pihet <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
Commit b01a4f1 (mmc: omap: convert to per instance workqueue) initializes
the workqueue too late causing the following:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0    Not tainted  (3.4.0-08218-gb48b2c3 #158)
PC is at __queue_work+0x8/0x46c
LR is at queue_work_on+0x38/0x40
pc : [<c005bb4c>]    lr : [<c005c00c>]    psr: 60000193
sp : c0691e1c  ip : 00000000  fp : c07374ac
r10: c7aae400  r9 : c0395700  r8 : 00000100
r7 : c0691e70  r6 : 00000000  r5 : 00000000  r4 : c7aae440
r3 : 00000001  r2 : c7aae440  r1 : 00000000  r0 : 00000000
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 00c5387d  Table: 80004000  DAC: 00000017
Process swapper/0 (pid: 0, stack limit = 0xc06902f8)
Stack: (0xc0691e1c to 0xc0692000)

Fix this by initializing the workqueue before mmc_omap_remove_slot()
get called. Tested on n770, looks like n800 at least still has some
other issue with MMC.

Signed-off-by: Tony Lindgren <[email protected]>
Signed-off-by: Chris Ball <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
Fix:

BUG: sleeping function called from invalid context at kernel/workqueue.c:2547
in_atomic(): 1, irqs_disabled(): 0, pid: 629, name: wpa_supplicant
2 locks held by wpa_supplicant/629:
 #0:  (rtnl_mutex){+.+.+.}, at: [<c08b2b84>] rtnl_lock+0x14/0x20
 #1:  (&trigger->leddev_list_lock){.+.?..}, at: [<c0867f41>] led_trigger_event+0x21/0x80
Pid: 629, comm: wpa_supplicant Not tainted 3.3.0-0.rc3.git5.1.fc17.i686
Call Trace:
 [<c046a9f6>] __might_sleep+0x126/0x1d0
 [<c0457d6c>] wait_on_work+0x2c/0x1d0
 [<c045a09a>] __cancel_work_timer+0x6a/0x120
 [<c045a160>] cancel_delayed_work_sync+0x10/0x20
 [<f7dd3c22>] rtl8187_led_brightness_set+0x82/0xf0 [rtl8187]
 [<c0867f7c>] led_trigger_event+0x5c/0x80
 [<f7ff5e6d>] ieee80211_led_radio+0x1d/0x40 [mac80211]
 [<f7ff3583>] ieee80211_stop_device+0x13/0x230 [mac80211]

Removing _sync is ok, because if led_on work is currently running
it will be finished before led_off work start to perform, since
they are always queued on the same mac80211 local->workqueue.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=795176

Signed-off-by: Stanislaw Gruszka <[email protected]>
Acked-by: Larry Finger <[email protected]>
Acked-by: Hin-Tak Leung <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
this patch fixes kernel Oops on "rmmod b43" if firmware was not loaded:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffff8104e988>] drain_workqueue+0x25/0x142
PGD 153ac6067 PUD 153b82067 PMD 0
Oops: 0000 [#1] SMP

Signed-off-by: Oleksij Rempel <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[    0.444413] Booting Node   0, Processors  #1 #2 #3 #4 #5 Ok.
[    0.461388] ------------[ cut here ]------------
[    0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[    0.473960] Hardware name: Dinar
[    0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.486860] Booting Node   1, Processors  #6
[    0.491104] Modules linked in:
[    0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[    0.499510] Call Trace:
[    0.501946]  [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[    0.508185]  [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[    0.514163]  [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[    0.519881]  [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[    0.525943]  [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[    0.532004]  [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[    0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.628197]  #7 #8 #9 #10 #11 Ok.
[    0.807108] Booting Node   3, Processors  #12 #13 #14 #15 #16 #17 Ok.
[    0.897587] Booting Node   2, Processors  #18 #19 #20 #21 #22 #23 Ok.
[    0.917443] Brought up 24 CPUs

We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andreas Herrmann <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
We need to make sure that the USB serial driver we find
matches the USB driver whose probe we are currently
executing. Otherwise we will end up with USB serial
devices bound to the correct serial driver but wrong
USB driver.

An example of such cross-probing, where the usbserial_generic
USB driver has found the sierra serial driver:

May 29 18:26:15 nemi kernel: [ 4442.559246] usbserial_generic 4-4:1.0: Sierra USB modem converter detected
May 29 18:26:20 nemi kernel: [ 4447.556747] usbserial_generic 4-4:1.2: Sierra USB modem converter detected
May 29 18:26:25 nemi kernel: [ 4452.557288] usbserial_generic 4-4:1.3: Sierra USB modem converter detected

sysfs view of the same problem:

bjorn@nemi:~$ ls -l /sys/bus/usb/drivers/sierra/
total 0
--w------- 1 root root 4096 May 29 18:23 bind
lrwxrwxrwx 1 root root    0 May 29 18:23 module -> ../../../../module/usbserial
--w------- 1 root root 4096 May 29 18:23 uevent
--w------- 1 root root 4096 May 29 18:23 unbind
bjorn@nemi:~$ ls -l /sys/bus/usb-serial/drivers/sierra/
total 0
--w------- 1 root root 4096 May 29 18:23 bind
lrwxrwxrwx 1 root root    0 May 29 18:23 module -> ../../../../module/sierra
-rw-r--r-- 1 root root 4096 May 29 18:23 new_id
lrwxrwxrwx 1 root root    0 May 29 18:32 ttyUSB0 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.0/ttyUSB0
lrwxrwxrwx 1 root root    0 May 29 18:32 ttyUSB1 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.2/ttyUSB1
lrwxrwxrwx 1 root root    0 May 29 18:32 ttyUSB2 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.3/ttyUSB2
--w------- 1 root root 4096 May 29 18:23 uevent
--w------- 1 root root 4096 May 29 18:23 unbind

bjorn@nemi:~$ ls -l /sys/bus/usb/drivers/usbserial_generic/
total 0
lrwxrwxrwx 1 root root    0 May 29 18:33 4-4:1.0 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.0
lrwxrwxrwx 1 root root    0 May 29 18:33 4-4:1.2 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.2
lrwxrwxrwx 1 root root    0 May 29 18:33 4-4:1.3 -> ../../../../devices/pci0000:00/0000:00:1d.7/usb4/4-4/4-4:1.3
--w------- 1 root root 4096 May 29 18:33 bind
lrwxrwxrwx 1 root root    0 May 29 18:33 module -> ../../../../module/usbserial
--w------- 1 root root 4096 May 29 18:22 uevent
--w------- 1 root root 4096 May 29 18:33 unbind
bjorn@nemi:~$ ls -l /sys/bus/usb-serial/drivers/generic/
total 0
--w------- 1 root root 4096 May 29 18:33 bind
lrwxrwxrwx 1 root root    0 May 29 18:33 module -> ../../../../module/usbserial
-rw-r--r-- 1 root root 4096 May 29 18:33 new_id
--w------- 1 root root 4096 May 29 18:22 uevent
--w------- 1 root root 4096 May 29 18:33 unbind

So we end up with a mismatch between the USB driver and the
USB serial driver.  The reason for the above is simple: The
USB driver probe will succeed if *any* registered serial
driver matches, and will use that serial driver for all
serial driver functions.

This makes ref counting go wrong. We count the USB driver
as used, but not the USB serial driver.  This may result
in Oops'es as demonstrated by Johan Hovold <[email protected]>:

[11811.646396] drivers/usb/serial/usb-serial.c: get_free_serial 1
[11811.646443] drivers/usb/serial/usb-serial.c: get_free_serial - minor base = 0
[11811.646460] drivers/usb/serial/usb-serial.c: usb_serial_probe - registering ttyUSB0
[11811.646766] usb 6-1: pl2303 converter now attached to ttyUSB0
[11812.264197] USB Serial deregistering driver FTDI USB Serial Device
[11812.264865] usbcore: deregistering interface driver ftdi_sio
[11812.282180] USB Serial deregistering driver pl2303
[11812.283141] pl2303 ttyUSB0: pl2303 converter now disconnected from ttyUSB0
[11812.283272] usbcore: deregistering interface driver pl2303
[11812.301056] USB Serial deregistering driver generic
[11812.301186] usbcore: deregistering interface driver usbserial_generic
[11812.301259] drivers/usb/serial/usb-serial.c: usb_serial_disconnect
[11812.301823] BUG: unable to handle kernel paging request at f8e7438c
[11812.301845] IP: [<f8e38445>] usb_serial_disconnect+0xb5/0x100 [usbserial]
[11812.301871] *pde = 357ef067 *pte = 00000000
[11812.301957] Oops: 0000 [#1] PREEMPT SMP
[11812.301983] Modules linked in: usbserial(-) [last unloaded: pl2303]
[11812.302008]
[11812.302019] Pid: 1323, comm: modprobe Tainted: G        W    3.4.0-rc7+ #101 Dell Inc. Vostro 1520/0T816J
[11812.302115] EIP: 0060:[<f8e38445>] EFLAGS: 00010246 CPU: 1
[11812.302130] EIP is at usb_serial_disconnect+0xb5/0x100 [usbserial]
[11812.302141] EAX: f508a180 EBX: f508a180 ECX: 00000000 EDX: f8e74300
[11812.302151] ESI: f5050800 EDI: 00000001 EBP: f5141e78 ESP: f5141e58
[11812.302160]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[11812.302170] CR0: 8005003b CR2: f8e7438c CR3: 34848000 CR4: 000007d0
[11812.302180] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[11812.302189] DR6: ffff0ff0 DR7: 00000400
[11812.302199] Process modprobe (pid: 1323, ti=f5140000 task=f61e2bc0 task.ti=f5140000)
[11812.302209] Stack:
[11812.302216]  f8e3be0f f8e3b29c f8e3ae00 00000000 f513641c f5136400 f513641c f507a540
[11812.302325]  f5141e98 c133d2c1 00000000 00000000 f509c400 f513641c f507a590 f5136450
[11812.302372]  f5141ea8 c12f0344 f513641c f507a590 f5141ebc c12f0c67 00000000 f507a590
[11812.302419] Call Trace:
[11812.302439]  [<c133d2c1>] usb_unbind_interface+0x51/0x190
[11812.302456]  [<c12f0344>] __device_release_driver+0x64/0xb0
[11812.302469]  [<c12f0c67>] driver_detach+0x97/0xa0
[11812.302483]  [<c12f001c>] bus_remove_driver+0x6c/0xe0
[11812.302500]  [<c145938d>] ? __mutex_unlock_slowpath+0xcd/0x140
[11812.302514]  [<c12f0ff9>] driver_unregister+0x49/0x80
[11812.302528]  [<c1457df6>] ? printk+0x1d/0x1f
[11812.302540]  [<c133c50d>] usb_deregister+0x5d/0xb0
[11812.302557]  [<f8e37c55>] ? usb_serial_deregister+0x45/0x50 [usbserial]
[11812.302575]  [<f8e37c8d>] usb_serial_deregister_drivers+0x2d/0x40 [usbserial]
[11812.302593]  [<f8e3a6e2>] usb_serial_generic_deregister+0x12/0x20 [usbserial]
[11812.302611]  [<f8e3acf0>] usb_serial_exit+0x8/0x32 [usbserial]
[11812.302716]  [<c1080b48>] sys_delete_module+0x158/0x260
[11812.302730]  [<c110594e>] ? mntput+0x1e/0x30
[11812.302746]  [<c145c3c3>] ? sysenter_exit+0xf/0x18
[11812.302746]  [<c107777c>] ? trace_hardirqs_on_caller+0xec/0x170
[11812.302746]  [<c145c390>] sysenter_do_call+0x12/0x36
[11812.302746] Code: 24 02 00 00 e8 dd f3 20 c8 f6 86 74 02 00 00 02 74 b4 8d 86 4c 02 00 00 47 e8 78 55 4b c8 0f b6 43 0e 39 f8 7f a9 8b 53 04 89 d8 <ff> 92 8c 00 00 00 89 d8 e8 0e ff ff ff 8b 45 f0 c7 44 24 04 2f
[11812.302746] EIP: [<f8e38445>] usb_serial_disconnect+0xb5/0x100 [usbserial] SS:ESP 0068:f5141e58
[11812.302746] CR2: 00000000f8e7438c

Fix by only evaluating serial drivers pointing back to the
USB driver we are currently probing.  This still allows two
or more drivers to match the same device, running their
serial driver probes to sort out which one to use.

Cc: [email protected] # 3.0, 3.2, 3.3, 3.4
Signed-off-by: Bjørn Mork <[email protected]>
Reviewed-by: Felipe Balbi <[email protected]>
Tested-by: Johan Hovold <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
A bunch of bugzillas have complained how noisy the nmi_watchdog
is during boot-up especially with its expected failure cases
(like virt and bios resource contention).

This is my attempt to quiet them down and keep it less confusing
for the end user.  What I did is print the message for cpu0 and
save it for future comparisons.  If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.

Before the change, you would see something like:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog enabled, takes one hw-pmu counter.
    Booting Node   0, Processors  #1
    NMI watchdog enabled, takes one hw-pmu counter.
     #2
    NMI watchdog enabled, takes one hw-pmu counter.
     #3 Ok.
    NMI watchdog enabled, takes one hw-pmu counter.
    Brought up 4 CPUs
    Total of 4 processors activated (22607.24 BogoMIPS).

After the change, it is simplified to:

    ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
    CPU0: Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz stepping 0a
    Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
    ... version:                2
    ... bit width:              40
    ... generic registers:      2
    ... value mask:             000000ffffffffff
    ... max period:             000000007fffffff
    ... fixed-purpose events:   3
    ... event mask:             0000000700000003
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    Booting Node   0, Processors  #1 #2 #3 Ok.
    Brought up 4 CPUs

V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups

Reported-and-tested-by: Nathan Zimmer <[email protected]>
Signed-off-by: Don Zickus <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 22, 2012
This patch modifies the gdm72xx driver to properly release a netlink
socket using netlink_kernel_release. It fixes the following kernel
crash, which occurs after repeatedly suspending and resuming a system.

   kernel BUG at /home/benchan/trunk/src/third_party/kernel/files/mm/slub.c:3471!
   invalid opcode: 0000 [#1] SMP
   CPU 2
   Modules linked in: asix usbnet snd_hda_codec_hdmi
   snd_hda_codec_cirrus i2c_dev uinput snd_hda_intel snd_hda_codec
   snd_hwdep snd_pcm snd_timer bluetooth snd_page_alloc fuse aesni_intel
   cryptd isl29018(C) aes_x86_64 industrialio(C) memconsole nm10_gpio
   rtc_cmos nf_conntrack_ipv6 nf_defrag_ipv6 r8169 ath9k mac80211
   ip6table_filter ath9k_common ath9k_hw ath cfg80211 xt_mark ip6_tables
   uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops
   gdmwm(C) joydev

   Pid: 3125, comm: kworker/u:30 Tainted: G        WC   3.4.0 #1
   RIP: 0010:[<ffffffff810cda19>]  [<ffffffff810cda19>] kfree+0x67/0xca
   RSP: 0018:ffff880134977d60  EFLAGS: 00010246
   RAX: 4000000000000400 RBX: ffffffff818832a0 RCX: 0000000000000000
   RDX: 4000000000000000 RSI: 0000000000000000 RDI: ffffffff818832a0
   RBP: ffff880134977d80 R08: 00000000ffffffff R09: ffffea00000620c0
   R10: ffffffff8111b729 R11: ffff880149fb3840 R12: ffffffff81a08840
   R13: ffffffff813f5bc3 R14: ffffffff8138ed84 R15: 0000000000000000
   FS:  0000000000000000(0000) GS:ffff88014fb00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
   CR2: 00007f7cad963110 CR3: 000000000180b000 CR4: 00000000000407e0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
   Process kworker/u:30 (pid: 3125, threadinfo ffff880134976000, task ffff8801330647e0)
   Stack:
    0000000000000002 ffffffff818832a0 ffffffff81a08840 ffff880134977df0
    ffff880134977da0 ffffffff813f5bc3 ffff880134977df0 ffffffff81883250
    ffff880134977dd0 ffffffff8138e64c 0000000180150010 ffffffff81883250
   Call Trace:
    [<ffffffff813f5bc3>] ipv4_sysctl_exit_net+0x23/0x27
    [<ffffffff8138e64c>] ops_exit_list+0x27/0x50
    [<ffffffff8138ee72>] cleanup_net+0xee/0x17c
    [<ffffffff81040c64>] process_one_work+0x199/0x2b8
    [<ffffffff810416e4>] worker_thread+0x13c/0x222
    [<ffffffff810415a8>] ? manage_workers.isra.26+0x171/0x171
    [<ffffffff8104506d>] kthread+0x8b/0x93
    [<ffffffff8145b414>] kernel_thread_helper+0x4/0x10
    [<ffffffff81044fe2>] ? __init_kthread_worker+0x39/0x39
    [<ffffffff8145b410>] ? gs_change+0xb/0xb
   Code: 83 c4 10 49 83 3c 24 00 eb e4 48 83 fb 10 76 76 48 89 df e8 17
   e1 ff ff 49 89 c1 48 8b 00 a8 80 75 15 49 f7 01 00 c0 00 00 75 02
   <0f> 0b 4c 89 cf e8 b8 b4 fd ff eb 4f 4c 8b 55 08 49 8b 79 30 48
   RIP  [<ffffffff810cda19>] kfree+0x67/0xca
    RSP <ffff880134977d60>

Signed-off-by: Ben Chan <[email protected]>
Cc: Sage Ahn <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
svenkatr pushed a commit that referenced this issue Jun 25, 2012
We need to use the per event info snapshoted at record time to
synthesize the events name, so do it just after reading the perf.data
headers, when we already processed the /sys events data, otherwise we'll
end up using the local /sys that only by sheer luck will have the same
tracepoint ID -> real event association.

Example:

  # uname -a
  Linux felicio.ghostprotocols.net 3.4.0-rc5+ #1 SMP Sat May 19 15:27:11 BRT 2012 x86_64 x86_64 x86_64 GNU/Linux
  # perf record -e sched:sched_switch usleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.015 MB perf.data (~648 samples) ]
  # cat /t/events/sched/sched_switch/id
  279
  # perf evlist -v
  sched:sched_switch: sample_freq=1, type: 2, config: 279, size: 80, sample_type: 1159, read_format: 7, disabled: 1, inherit: 1, mmap: 1, comm: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
  #

So on the above machine the sched:sched_switch has tracepoint id 279, but on
the machine were we'll analyse it it has a different id:

  $ cat /t/events/sched/sched_switch/id
  56
  $ perf evlist -i /tmp/perf.data
  kmem:mm_balancedirty_writeout
  $ cat /t/events/kmem/mm_balancedirty_writeout/id
  279

With this fix:

  $ perf evlist -i /tmp/perf.data
  sched:sched_switch

Reported-by: Dmitry Antipov <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 2, 2012
reg_timeout_work() calls restore_regulatory_settings() which
takes cfg80211_mutex.

reg_set_request_processed() already holds cfg80211_mutex
before calling cancel_delayed_work_sync(reg_timeout),
so it might deadlock.

Call the async cancel_delayed_work instead, in order
to avoid the potential deadlock.

This is the relevant lockdep warning:

cfg80211: Calling CRDA for country: XX

======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc5-wl+ #26 Not tainted
-------------------------------------------------------
kworker/0:2/1391 is trying to acquire lock:
 (cfg80211_mutex){+.+.+.}, at: [<bf28ae00>] restore_regulatory_settings+0x34/0x418 [cfg80211]

but task is already holding lock:
 ((reg_timeout).work){+.+...}, at: [<c0059e94>] process_one_work+0x1f0/0x480

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 ((reg_timeout).work){+.+...}:
       [<c008fd44>] validate_chain+0xb94/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c005b600>] wait_on_work+0x4c/0x154
       [<c005c000>] __cancel_work_timer+0xd4/0x11c
       [<c005c064>] cancel_delayed_work_sync+0x1c/0x20
       [<bf28b274>] reg_set_request_processed+0x50/0x78 [cfg80211]
       [<bf28bd84>] set_regdom+0x550/0x600 [cfg80211]
       [<bf294cd8>] nl80211_set_reg+0x218/0x258 [cfg80211]
       [<c03c7738>] genl_rcv_msg+0x1a8/0x1e8
       [<c03c6a00>] netlink_rcv_skb+0x5c/0xc0
       [<c03c7584>] genl_rcv+0x28/0x34
       [<c03c6720>] netlink_unicast+0x15c/0x228
       [<c03c6c7c>] netlink_sendmsg+0x218/0x298
       [<c03933c8>] sock_sendmsg+0xa4/0xc0
       [<c039406c>] __sys_sendmsg+0x1e4/0x268
       [<c0394228>] sys_sendmsg+0x4c/0x70
       [<c0013840>] ret_fast_syscall+0x0/0x3c

-> #1 (reg_mutex){+.+.+.}:
       [<c008fd44>] validate_chain+0xb94/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c04734dc>] mutex_lock_nested+0x48/0x320
       [<bf28b2cc>] reg_todo+0x30/0x538 [cfg80211]
       [<c0059f44>] process_one_work+0x2a0/0x480
       [<c005a4b4>] worker_thread+0x1bc/0x2bc
       [<c0061148>] kthread+0x98/0xa4
       [<c0014af4>] kernel_thread_exit+0x0/0x8

-> #0 (cfg80211_mutex){+.+.+.}:
       [<c008ed58>] print_circular_bug+0x68/0x2cc
       [<c008fb28>] validate_chain+0x978/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c04734dc>] mutex_lock_nested+0x48/0x320
       [<bf28ae00>] restore_regulatory_settings+0x34/0x418 [cfg80211]
       [<bf28b200>] reg_timeout_work+0x1c/0x20 [cfg80211]
       [<c0059f44>] process_one_work+0x2a0/0x480
       [<c005a4b4>] worker_thread+0x1bc/0x2bc
       [<c0061148>] kthread+0x98/0xa4
       [<c0014af4>] kernel_thread_exit+0x0/0x8

other info that might help us debug this:

Chain exists of:
  cfg80211_mutex --> reg_mutex --> (reg_timeout).work

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((reg_timeout).work);
                               lock(reg_mutex);
                               lock((reg_timeout).work);
  lock(cfg80211_mutex);

 *** DEADLOCK ***

2 locks held by kworker/0:2/1391:
 #0:  (events){.+.+.+}, at: [<c0059e94>] process_one_work+0x1f0/0x480
 #1:  ((reg_timeout).work){+.+...}, at: [<c0059e94>] process_one_work+0x1f0/0x480

stack backtrace:
[<c001b928>] (unwind_backtrace+0x0/0x12c) from [<c0471d3c>] (dump_stack+0x20/0x24)
[<c0471d3c>] (dump_stack+0x20/0x24) from [<c008ef70>] (print_circular_bug+0x280/0x2cc)
[<c008ef70>] (print_circular_bug+0x280/0x2cc) from [<c008fb28>] (validate_chain+0x978/0x10f0)
[<c008fb28>] (validate_chain+0x978/0x10f0) from [<c0090b68>] (__lock_acquire+0x8c8/0x9b0)
[<c0090b68>] (__lock_acquire+0x8c8/0x9b0) from [<c0090d40>] (lock_acquire+0xf0/0x114)
[<c0090d40>] (lock_acquire+0xf0/0x114) from [<c04734dc>] (mutex_lock_nested+0x48/0x320)
[<c04734dc>] (mutex_lock_nested+0x48/0x320) from [<bf28ae00>] (restore_regulatory_settings+0x34/0x418 [cfg80211])
[<bf28ae00>] (restore_regulatory_settings+0x34/0x418 [cfg80211]) from [<bf28b200>] (reg_timeout_work+0x1c/0x20 [cfg80211])
[<bf28b200>] (reg_timeout_work+0x1c/0x20 [cfg80211]) from [<c0059f44>] (process_one_work+0x2a0/0x480)
[<c0059f44>] (process_one_work+0x2a0/0x480) from [<c005a4b4>] (worker_thread+0x1bc/0x2bc)
[<c005a4b4>] (worker_thread+0x1bc/0x2bc) from [<c0061148>] (kthread+0x98/0xa4)
[<c0061148>] (kthread+0x98/0xa4) from [<c0014af4>] (kernel_thread_exit+0x0/0x8)
cfg80211: Calling CRDA to update world regulatory domain
cfg80211: World regulatory domain updated:
cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)

Cc: [email protected]
Signed-off-by: Eliad Peller <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 2, 2012
Denys Fedoryshchenko reported a LOCKDEP issue with l2tp code.

[ 8683.927442] ======================================================
[ 8683.927555] [ INFO: possible circular locking dependency detected ]
[ 8683.927672] 3.4.1-build-0061 #14 Not tainted
[ 8683.927782] -------------------------------------------------------
[ 8683.927895] swapper/0/0 is trying to acquire lock:
[ 8683.928007]  (slock-AF_INET){+.-...}, at: [<e0fc73ec>]
l2tp_xmit_skb+0x173/0x47e [l2tp_core]
[ 8683.928121]
[ 8683.928121] but task is already holding lock:
[ 8683.928121]  (_xmit_ETHER#2){+.-...}, at: [<c02f062d>]
sch_direct_xmit+0x36/0x119
[ 8683.928121]
[ 8683.928121] which lock already depends on the new lock.
[ 8683.928121]
[ 8683.928121]
[ 8683.928121] the existing dependency chain (in reverse order) is:
[ 8683.928121]
[ 8683.928121] -> #1 (_xmit_ETHER#2){+.-...}:
[ 8683.928121]        [<c015a561>] lock_acquire+0x71/0x85
[ 8683.928121]        [<c034da2d>] _raw_spin_lock+0x33/0x40
[ 8683.928121]        [<c0304e0c>] ip_send_reply+0xf2/0x1ce
[ 8683.928121]        [<c0317dbc>] tcp_v4_send_reset+0x153/0x16f
[ 8683.928121]        [<c0317f4a>] tcp_v4_do_rcv+0x172/0x194
[ 8683.928121]        [<c031929b>] tcp_v4_rcv+0x387/0x5a0
[ 8683.928121]        [<c03001d0>] ip_local_deliver_finish+0x13a/0x1e9
[ 8683.928121]        [<c0300645>] NF_HOOK.clone.11+0x46/0x4d
[ 8683.928121]        [<c030075b>] ip_local_deliver+0x41/0x45
[ 8683.928121]        [<c03005dd>] ip_rcv_finish+0x31a/0x33c
[ 8683.928121]        [<c0300645>] NF_HOOK.clone.11+0x46/0x4d
[ 8683.928121]        [<c0300960>] ip_rcv+0x201/0x23d
[ 8683.928121]        [<c02de91b>] __netif_receive_skb+0x329/0x378
[ 8683.928121]        [<c02deae8>] netif_receive_skb+0x4e/0x7d
[ 8683.928121]        [<e08d5ef3>] rtl8139_poll+0x243/0x33d [8139too]
[ 8683.928121]        [<c02df103>] net_rx_action+0x90/0x15d
[ 8683.928121]        [<c012b2b5>] __do_softirq+0x7b/0x118
[ 8683.928121]
[ 8683.928121] -> #0 (slock-AF_INET){+.-...}:
[ 8683.928121]        [<c0159f1b>] __lock_acquire+0x9a3/0xc27
[ 8683.928121]        [<c015a561>] lock_acquire+0x71/0x85
[ 8683.928121]        [<c034da2d>] _raw_spin_lock+0x33/0x40
[ 8683.928121]        [<e0fc73ec>] l2tp_xmit_skb+0x173/0x47e
[l2tp_core]
[ 8683.928121]        [<e0fe31fb>] l2tp_eth_dev_xmit+0x1a/0x2f
[l2tp_eth]
[ 8683.928121]        [<c02e01e7>] dev_hard_start_xmit+0x333/0x3f2
[ 8683.928121]        [<c02f064c>] sch_direct_xmit+0x55/0x119
[ 8683.928121]        [<c02e0528>] dev_queue_xmit+0x282/0x418
[ 8683.928121]        [<c031f4fb>] NF_HOOK.clone.19+0x45/0x4c
[ 8683.928121]        [<c031f524>] arp_xmit+0x22/0x24
[ 8683.928121]        [<c031f567>] arp_send+0x41/0x48
[ 8683.928121]        [<c031fa7d>] arp_process+0x289/0x491
[ 8683.928121]        [<c031f4fb>] NF_HOOK.clone.19+0x45/0x4c
[ 8683.928121]        [<c031f7a0>] arp_rcv+0xb1/0xc3
[ 8683.928121]        [<c02de91b>] __netif_receive_skb+0x329/0x378
[ 8683.928121]        [<c02de9d3>] process_backlog+0x69/0x130
[ 8683.928121]        [<c02df103>] net_rx_action+0x90/0x15d
[ 8683.928121]        [<c012b2b5>] __do_softirq+0x7b/0x118
[ 8683.928121]
[ 8683.928121] other info that might help us debug this:
[ 8683.928121]
[ 8683.928121]  Possible unsafe locking scenario:
[ 8683.928121]
[ 8683.928121]        CPU0                    CPU1
[ 8683.928121]        ----                    ----
[ 8683.928121]   lock(_xmit_ETHER#2);
[ 8683.928121]                                lock(slock-AF_INET);
[ 8683.928121]                                lock(_xmit_ETHER#2);
[ 8683.928121]   lock(slock-AF_INET);
[ 8683.928121]
[ 8683.928121]  *** DEADLOCK ***
[ 8683.928121]
[ 8683.928121] 3 locks held by swapper/0/0:
[ 8683.928121]  #0:  (rcu_read_lock){.+.+..}, at: [<c02dbc10>]
rcu_lock_acquire+0x0/0x30
[ 8683.928121]  #1:  (rcu_read_lock_bh){.+....}, at: [<c02dbc10>]
rcu_lock_acquire+0x0/0x30
[ 8683.928121]  #2:  (_xmit_ETHER#2){+.-...}, at: [<c02f062d>]
sch_direct_xmit+0x36/0x119
[ 8683.928121]
[ 8683.928121] stack backtrace:
[ 8683.928121] Pid: 0, comm: swapper/0 Not tainted 3.4.1-build-0061 #14
[ 8683.928121] Call Trace:
[ 8683.928121]  [<c034bdd2>] ? printk+0x18/0x1a
[ 8683.928121]  [<c0158904>] print_circular_bug+0x1ac/0x1b6
[ 8683.928121]  [<c0159f1b>] __lock_acquire+0x9a3/0xc27
[ 8683.928121]  [<c015a561>] lock_acquire+0x71/0x85
[ 8683.928121]  [<e0fc73ec>] ? l2tp_xmit_skb+0x173/0x47e [l2tp_core]
[ 8683.928121]  [<c034da2d>] _raw_spin_lock+0x33/0x40
[ 8683.928121]  [<e0fc73ec>] ? l2tp_xmit_skb+0x173/0x47e [l2tp_core]
[ 8683.928121]  [<e0fc73ec>] l2tp_xmit_skb+0x173/0x47e [l2tp_core]
[ 8683.928121]  [<e0fe31fb>] l2tp_eth_dev_xmit+0x1a/0x2f [l2tp_eth]
[ 8683.928121]  [<c02e01e7>] dev_hard_start_xmit+0x333/0x3f2
[ 8683.928121]  [<c02f064c>] sch_direct_xmit+0x55/0x119
[ 8683.928121]  [<c02e0528>] dev_queue_xmit+0x282/0x418
[ 8683.928121]  [<c02e02a6>] ? dev_hard_start_xmit+0x3f2/0x3f2
[ 8683.928121]  [<c031f4fb>] NF_HOOK.clone.19+0x45/0x4c
[ 8683.928121]  [<c031f524>] arp_xmit+0x22/0x24
[ 8683.928121]  [<c02e02a6>] ? dev_hard_start_xmit+0x3f2/0x3f2
[ 8683.928121]  [<c031f567>] arp_send+0x41/0x48
[ 8683.928121]  [<c031fa7d>] arp_process+0x289/0x491
[ 8683.928121]  [<c031f7f4>] ? __neigh_lookup.clone.20+0x42/0x42
[ 8683.928121]  [<c031f4fb>] NF_HOOK.clone.19+0x45/0x4c
[ 8683.928121]  [<c031f7a0>] arp_rcv+0xb1/0xc3
[ 8683.928121]  [<c031f7f4>] ? __neigh_lookup.clone.20+0x42/0x42
[ 8683.928121]  [<c02de91b>] __netif_receive_skb+0x329/0x378
[ 8683.928121]  [<c02de9d3>] process_backlog+0x69/0x130
[ 8683.928121]  [<c02df103>] net_rx_action+0x90/0x15d
[ 8683.928121]  [<c012b2b5>] __do_softirq+0x7b/0x118
[ 8683.928121]  [<c012b23a>] ? local_bh_enable+0xd/0xd
[ 8683.928121]  <IRQ>  [<c012b4d0>] ? irq_exit+0x41/0x91
[ 8683.928121]  [<c0103c6f>] ? do_IRQ+0x79/0x8d
[ 8683.928121]  [<c0157ea1>] ? trace_hardirqs_off_caller+0x2e/0x86
[ 8683.928121]  [<c034ef6e>] ? common_interrupt+0x2e/0x34
[ 8683.928121]  [<c0108a33>] ? default_idle+0x23/0x38
[ 8683.928121]  [<c01091a8>] ? cpu_idle+0x55/0x6f
[ 8683.928121]  [<c033df25>] ? rest_init+0xa1/0xa7
[ 8683.928121]  [<c033de84>] ? __read_lock_failed+0x14/0x14
[ 8683.928121]  [<c0498745>] ? start_kernel+0x303/0x30a
[ 8683.928121]  [<c0498209>] ? repair_env_string+0x51/0x51
[ 8683.928121]  [<c04980a8>] ? i386_start_kernel+0xa8/0xaf

It appears that like most virtual devices, l2tp should be converted to
LLTX mode.

This patch takes care of statistics using atomic_long in both RX and TX
paths, and fix a bug in l2tp_eth_dev_recv(), which was caching skb->data
before a pskb_may_pull() call.

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Denys Fedoryshchenko <[email protected]>
Cc: James Chapman <[email protected]>
Cc: Hong zhi guo <[email protected]>
Cc: Francois Romieu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 2, 2012
Fixes:
[   15.470311] WARNING: at /local/scratch/ianc/devel/kernels/linux/fs/sysfs/file.c:498 sysfs_attr_ns+0x95/0xa0()
[   15.470326] sysfs: kobject eth0 without dirent
[   15.470333] Modules linked in:
[   15.470342] Pid: 12, comm: xenwatch Not tainted 3.4.0-x86_32p-xenU #93
and
[    9.150554] BUG: unable to handle kernel paging request at 2b359000
[    9.150577] IP: [<c1279561>] linkwatch_do_dev+0x81/0xc0
[    9.150592] *pdpt = 000000002c3c9027 *pde = 0000000000000000
[    9.150604] Oops: 0002 [#1] SMP
[    9.150613] Modules linked in:

This is http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675190

Reported-by: George Shuklin <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Tested-by: William Dauchy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: David S. Miller <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
cfq may be built w/ or w/o blkcg support depending on
CONFIG_CFQ_CGROUP_IOSCHED.  If blkcg support is disabled, most of
related code is ifdef'd out but some part is left dangling -
blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
calls are made on it.

Feeding zero filled policy to blkcg_policy_register() is incorrect and
triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
!CONFIG_CFQ_GROUP_IOSCHED.

 ------------[ cut here ]------------
 WARNING: at block/blk-cgroup.c:867
 Modules linked in:
 Modules linked in:
 CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
 Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
 Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
 Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
 Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
	   #00000000003d76c6: a7f40001           brc     15,3d76c8
	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
	    00000000003d76ce: a7f4ffce           brc     15,3d766a
	    00000000003d76d2: a7f40001           brc     15,3d76d4
	    00000000003d76d6: a7c80000           lhi     %r12,0
	    00000000003d76da: a7f4ffc2           brc     15,3d765e
 Call Trace:
 ([<0000000000b6f000>] initcall_debug+0x0/0x4)
  [<0000000000989e8a>] cfq_init+0x62/0xd4
  [<00000000001000ba>] do_one_initcall+0x3a/0x170
  [<000000000096fb60>] kernel_init+0x214/0x2bc
  [<0000000000623202>] kernel_thread_starter+0x6/0xc
  [<00000000006231fc>] kernel_thread_starter+0x0/0xc
 no locks held by swapper/0/1.
 Last Breaking-Event-Address:
  [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
 ---[ end trace b8ef4903fcbf9dd3 ]---

This patch fixes the problem by ensuring all blkcg support code is
inside CONFIG_CFQ_GROUP_IOSCHED.

* blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
  inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
  dropped from blkcg_policy_cfq decl.

* blkcg_deactivate_poilcy() invocation is moved inside ifdef.  This
  also makes the activation logic match cfq_init_queue().

* All blkcg_policy_[un]register() invocations are moved inside ifdef.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Heiko Carstens <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
Sasha Levin reported following panic :

[ 2136.383310] BUG: unable to handle kernel NULL pointer dereference at
00000000000003b0
[ 2136.384022] IP: [<ffffffff8114e400>] __lock_acquire+0xc0/0x4b0
[ 2136.384022] PGD 131c4067 PUD 11c0c067 PMD 0
[ 2136.388106] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2136.388106] CPU 1
[ 2136.388106] Pid: 24855, comm: trinity-child1 Tainted: G        W
3.5.0-rc2-sasha-00015-g7b268f7 #374
[ 2136.388106] RIP: 0010:[<ffffffff8114e400>]  [<ffffffff8114e400>]
__lock_acquire+0xc0/0x4b0
[ 2136.388106] RSP: 0018:ffff8800130b3ca8  EFLAGS: 00010046
[ 2136.388106] RAX: 0000000000000086 RBX: ffff88001186b000 RCX:
0000000000000000
[ 2136.388106] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
[ 2136.388106] RBP: ffff8800130b3d08 R08: 0000000000000001 R09:
0000000000000000
[ 2136.388106] R10: 0000000000000000 R11: 0000000000000001 R12:
0000000000000002
[ 2136.388106] R13: 00000000000003b0 R14: 0000000000000000 R15:
0000000000000000
[ 2136.388106] FS:  00007fa5b1bd4700(0000) GS:ffff88001b800000(0000)
knlGS:0000000000000000
[ 2136.388106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2136.388106] CR2: 00000000000003b0 CR3: 0000000011d1f000 CR4:
00000000000406e0
[ 2136.388106] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2136.388106] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2136.388106] Process trinity-child1 (pid: 24855, threadinfo
ffff8800130b2000, task ffff88001186b000)
[ 2136.388106] Stack:
[ 2136.388106]  ffff8800130b3cd8 ffffffff81121785 ffffffff81236774
000080d000000001
[ 2136.388106]  ffff88001b9d6c00 00000000001d6c00 ffffffff130b3d08
ffff88001186b000
[ 2136.388106]  0000000000000000 0000000000000002 0000000000000000
0000000000000000
[ 2136.388106] Call Trace:
[ 2136.388106]  [<ffffffff81121785>] ? sched_clock_local+0x25/0x90
[ 2136.388106]  [<ffffffff81236774>] ? get_empty_filp+0x74/0x220
[ 2136.388106]  [<ffffffff8114e97a>] lock_acquire+0x18a/0x1e0
[ 2136.388106]  [<ffffffff836b37df>] ? rawsock_release+0x4f/0xa0
[ 2136.388106]  [<ffffffff837c0ef0>] _raw_write_lock_bh+0x40/0x80
[ 2136.388106]  [<ffffffff836b37df>] ? rawsock_release+0x4f/0xa0
[ 2136.388106]  [<ffffffff836b37df>] rawsock_release+0x4f/0xa0
[ 2136.388106]  [<ffffffff8321cfe8>] sock_release+0x18/0x70
[ 2136.388106]  [<ffffffff8321d069>] sock_close+0x29/0x30
[ 2136.388106]  [<ffffffff81236bca>] __fput+0x11a/0x2c0
[ 2136.388106]  [<ffffffff81236d85>] fput+0x15/0x20
[ 2136.388106]  [<ffffffff8321de34>] sys_accept4+0x1b4/0x200
[ 2136.388106]  [<ffffffff837c165c>] ? _raw_spin_unlock_irq+0x4c/0x80
[ 2136.388106]  [<ffffffff837c1669>] ? _raw_spin_unlock_irq+0x59/0x80
[ 2136.388106]  [<ffffffff837c2565>] ? sysret_check+0x22/0x5d
[ 2136.388106]  [<ffffffff8321de8b>] sys_accept+0xb/0x10
[ 2136.388106]  [<ffffffff837c2539>] system_call_fastpath+0x16/0x1b
[ 2136.388106] Code: ec 04 00 0f 85 ea 03 00 00 be d5 0b 00 00 48 c7 c7
8a c1 40 84 e8 b1 a5 f8 ff 31 c0 e9 d4 03 00 00 66 2e 0f 1f 84 00 00 00
00 00 <49> 81 7d 00 60 73 5e 85 b8 01 00 00 00 44 0f 44 e0 83 fe 01 77
[ 2136.388106] RIP  [<ffffffff8114e400>] __lock_acquire+0xc0/0x4b0
[ 2136.388106]  RSP <ffff8800130b3ca8>
[ 2136.388106] CR2: 00000000000003b0
[ 2136.388106] ---[ end trace 6d450e935ee18982 ]---
[ 2136.388106] Kernel panic - not syncing: Fatal exception in interrupt

rawsock_release() should test if sock->sk is NULL before calling
sock_orphan()/sock_put()

Reported-by: Sasha Levin <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Cc: [email protected]
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
usbnet_disconnect() will set intfdata to NULL before calling
the minidriver unbind function.  The cdc_wdm subdriver cannot
know that it is disconnecting until the qmi_wwan unbind
function has called its disconnect function.  This means that
we must be able to support the cdc_wdm subdriver operating
normally while usbnet_disconnect() is running, and in
particular that intfdata may be NULL.

The only place this matters is in qmi_wwan_cdc_wdm_manage_power
which is called from cdc_wdm.  Simply testing for NULL
intfdata there is sufficient to allow it to continue working
at all times.

Fixes this Oops where a cdc-wdm device was closed while the
USB device was disconnecting, causing wdm_release to call
qmi_wwan_cdc_wdm_manage_power after intfdata was set to
NULL by usbnet_disconnect:

[41819.087460] BUG: unable to handle kernel NULL pointer dereference at 00000080
[41819.087815] IP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] *pdpt = 000000000314f001 *pde = 0000000000000000
[41819.088028] Oops: 0002 [#1] SMP
[41819.088028] Modules linked in: qmi_wwan option usb_wwan usbserial usbnet
cdc_wdm nls_iso8859_1 nls_cp437 vfat fat usb_storage bnep rfcomm bluetooth
parport_pc ppdev binfmt_misc iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 iptable_mangle iptable_filter ip_tables
x_tables dm_crypt uvcvideo snd_hda_codec_realtek snd_hda_intel
videobuf2_core snd_hda_codec joydev videodev videobuf2_vmalloc
hid_multitouch snd_hwdep arc4 videobuf2_memops snd_pcm snd_seq_midi
snd_rawmidi snd_seq_midi_event ath9k mac80211 snd_seq ath9k_common ath9k_hw
ath snd_timer snd_seq_device sparse_keymap dm_multipath scsi_dh coretemp
mac_hid snd soundcore cfg80211 snd_page_alloc psmouse serio_raw microcode
lp parport dm_mirror dm_region_hash dm_log usbhid hid i915 drm_kms_helper
drm r8169 i2c_algo_bit wmi video [last unloaded: qmi_wwan]
[41819.088028]
[41819.088028] Pid: 23292, comm: qmicli Not tainted 3.4.0-5-generic #11-Ubuntu GIGABYTE T1005/T1005
[41819.088028] EIP: 0060:[<f8640458>] EFLAGS: 00010246 CPU: 1
[41819.088028] EIP is at qmi_wwan_manage_power+0x68/0x90 [qmi_wwan]
[41819.088028] EAX: 00000000 EBX: 00000000 ECX: 000000c3 EDX: 00000000
[41819.088028] ESI: c3b27658 EDI: 00000000 EBP: c298bea4 ESP: c298be98
[41819.088028]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[41819.088028] CR0: 8005003b CR2: 00000080 CR3: 3605e00 CR4: 000007f0
[41819.088028] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[41819.088028] DR6: ffff0ff0 DR7: 00000400
[41819.088028] Process qmicli (pid: 23292, ti=c298a000 task=f343b280 task.ti=c298a000)
[41819.088028] Stack:
[41819.088028]  00000000 c3b27658 e2a80d00 c298beb0 f864051a c3b27600 c298bec0 f9027099
[41819.088028]  c2fd6000 00000008 c298bef0 c1147f96 00000001 00000000 00000000 f4e54790
[41819.088028]  ecf43a00 ecf43a00 c2fd6008 c2fd6000 ebbd7600 ffffffb9 c298bf08 c1144474
[41819.088028] Call Trace:
[41819.088028]  [<f864051a>] qmi_wwan_cdc_wdm_manage_power+0x1a/0x20 [qmi_wwan]
[41819.088028]  [<f9027099>] wdm_release+0x69/0x70 [cdc_wdm]
[41819.088028]  [<c1147f96>] fput+0xe6/0x210
[41819.088028]  [<c1144474>] filp_close+0x54/0x80
[41819.088028]  [<c1046a65>] put_files_struct+0x75/0xc0
[41819.088028]  [<c1046b56>] exit_files+0x46/0x60
[41819.088028]  [<c1046f81>] do_exit+0x141/0x780
[41819.088028]  [<c107248f>] ? wake_up_state+0xf/0x20
[41819.088028]  [<c1053f48>] ? signal_wake_up+0x28/0x40
[41819.088028]  [<c1054f3b>] ? zap_other_threads+0x6b/0x80
[41819.088028]  [<c1047864>] do_group_exit+0x34/0xa0
[41819.088028]  [<c10478e8>] sys_exit_group+0x18/0x20
[41819.088028]  [<c15bb7df>] sysenter_do_call+0x12/0x28
[41819.088028] Code: 04 83 e7 01 c1 e7 03 0f b6 42 18 83 e0 f7 09 f8 88 42
18 8b 43 04 e8 48 9a dd c8 89 f0 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90
<f0> ff 88 80 00 00 00 0f 94 c0 84 c0 75 b7 31 f6 8b 5d f4 89 f0
[41819.088028] EIP: [<f8640458>] qmi_wwan_manage_power+0x68/0x90 [qmi_wwan] SS:ESP 0068:c298be98
[41819.088028] CR2: 0000000000000080
[41819.149492] ---[ end trace 0944479ff8257f55 ]---

Reported-by: Marius Bjørnstad Kotsbak <[email protected]>
Cc: <[email protected]> # v3.4
Signed-off-by: Bjørn Mork <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create()
fails, dm_tm_create_internal() would still return success even though it
cleaned up all resources it was supposed to have created.  This will
lead to a kernel crash:

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
...
RIP: 0010:[<ffffffff81593659>]  [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20
Call Trace:
  [<ffffffff81599bae>] dm_bm_block_size+0xe/0x10
  [<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0
  [<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0
  [<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160
  [<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0
  [<ffffffff815aa010>] pool_ctr+0x3f0/0x900
  [<ffffffff8158d565>] dm_table_add_target+0x195/0x450
  [<ffffffff815904c4>] table_load+0xe4/0x330
  [<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0
  [<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20
  [<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560
  [<ffffffff8116aa51>] sys_ioctl+0x91/0xa0
  [<ffffffff81869f52>] system_call_fastpath+0x16/0x1b

Fix the space map checker code to return an appropriate ERR_PTR and have
dm_sm_disk_create() and dm_tm_create_internal() check for it with
IS_ERR.

Reported-by: Vivek Goyal <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
Cc: [email protected]
Signed-off-by: Alasdair G Kergon <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
A recursive lockdep warning occurs if you call
regulator_set_optimum_mode() on a regulator with a supply because
there is no nesting annotation for the rdev->mutex. To avoid this
warning, get the supply's load before locking the regulator's
mutex to avoid grabbing the same class of lock twice.

=============================================
[ INFO: possible recursive locking detected ]
3.4.0 #3257 Tainted: G        W
---------------------------------------------
swapper/0/1 is trying to acquire lock:
 (&rdev->mutex){+.+.+.}, at: [<c036e9e0>] regulator_get_voltage+0x18/0x38

but task is already holding lock:
 (&rdev->mutex){+.+.+.}, at: [<c036ef38>] regulator_set_optimum_mode+0x24/0x224

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&rdev->mutex);
  lock(&rdev->mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by swapper/0/1:
 #0:  (&__lockdep_no_validate__){......}, at: [<c03dbb48>] __driver_attach+0x40/0x8c
 #1:  (&__lockdep_no_validate__){......}, at: [<c03dbb58>] __driver_attach+0x50/0x8c
 #2:  (&rdev->mutex){+.+.+.}, at: [<c036ef38>] regulator_set_optimum_mode+0x24/0x224

stack backtrace:
[<c001521c>] (unwind_backtrace+0x0/0x12c) from [<c00cc4d4>] (validate_chain+0x760/0x1080)
[<c00cc4d4>] (validate_chain+0x760/0x1080) from [<c00cd744>] (__lock_acquire+0x950/0xa10)
[<c00cd744>] (__lock_acquire+0x950/0xa10) from [<c00cd990>] (lock_acquire+0x18c/0x1e8)
[<c00cd990>] (lock_acquire+0x18c/0x1e8) from [<c080c248>] (mutex_lock_nested+0x68/0x3c4)
[<c080c248>] (mutex_lock_nested+0x68/0x3c4) from [<c036e9e0>] (regulator_get_voltage+0x18/0x38)
[<c036e9e0>] (regulator_get_voltage+0x18/0x38) from [<c036efb8>] (regulator_set_optimum_mode+0xa4/0x224)
...

Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
Fix:

 [ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at           (null)
 [ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm]
 [ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0
 [ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
 [ 3190.066860] CPU 1
[ ...... ]
 [ 3190.109629] Call Trace:
 [ 3190.111342]  [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm]
 [ 3190.113091]  [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm]
 [ 3190.114844]  [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm]
 [ 3190.116598]  [<ffffffff81150c9d>] ? prune_super+0x142/0x154
 [ 3190.118333]  [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e
 [ 3190.120043]  [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e
 [ 3190.121718]  [<ffffffff8110ca1d>] do_try_to_free_pages

This is caused by shrinking page from the empty mmu, although we have
checked n_used_mmu_pages, it is useless since the check is out of mmu-lock

Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
Don't grab the daemon mutex while holding the message context mutex.
Addresses this lockdep warning:

 ecryptfsd/2141 is trying to acquire lock:
  (&ecryptfs_msg_ctx_arr[i].mux){+.+.+.}, at: [<ffffffffa029c213>] ecryptfs_miscdev_read+0x143/0x470 [ecryptfs]

 but task is already holding lock:
  (&(*daemon)->mux){+.+...}, at: [<ffffffffa029c2ec>] ecryptfs_miscdev_read+0x21c/0x470 [ecryptfs]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&(*daemon)->mux){+.+...}:
        [<ffffffff810a3b8d>] lock_acquire+0x9d/0x220
        [<ffffffff8151c6da>] __mutex_lock_common+0x5a/0x4b0
        [<ffffffff8151cc64>] mutex_lock_nested+0x44/0x50
        [<ffffffffa029c5d7>] ecryptfs_send_miscdev+0x97/0x120 [ecryptfs]
        [<ffffffffa029b744>] ecryptfs_send_message+0x134/0x1e0 [ecryptfs]
        [<ffffffffa029a24e>] ecryptfs_generate_key_packet_set+0x2fe/0xa80 [ecryptfs]
        [<ffffffffa02960f8>] ecryptfs_write_metadata+0x108/0x250 [ecryptfs]
        [<ffffffffa0290f80>] ecryptfs_create+0x130/0x250 [ecryptfs]
        [<ffffffff811963a4>] vfs_create+0xb4/0x120
        [<ffffffff81197865>] do_last+0x8c5/0xa10
        [<ffffffff811998f9>] path_openat+0xd9/0x460
        [<ffffffff81199da2>] do_filp_open+0x42/0xa0
        [<ffffffff81187998>] do_sys_open+0xf8/0x1d0
        [<ffffffff81187a91>] sys_open+0x21/0x30
        [<ffffffff81527d69>] system_call_fastpath+0x16/0x1b

 -> #0 (&ecryptfs_msg_ctx_arr[i].mux){+.+.+.}:
        [<ffffffff810a3418>] __lock_acquire+0x1bf8/0x1c50
        [<ffffffff810a3b8d>] lock_acquire+0x9d/0x220
        [<ffffffff8151c6da>] __mutex_lock_common+0x5a/0x4b0
        [<ffffffff8151cc64>] mutex_lock_nested+0x44/0x50
        [<ffffffffa029c213>] ecryptfs_miscdev_read+0x143/0x470 [ecryptfs]
        [<ffffffff811887d3>] vfs_read+0xb3/0x180
        [<ffffffff811888ed>] sys_read+0x4d/0x90
        [<ffffffff81527d69>] system_call_fastpath+0x16/0x1b

Signed-off-by: Tyler Hicks <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 10, 2012
With commit 49dca5a I introduced
a bug (visible if CONFIG_PROVE_RCU is enabled) which occures when a panic
has happened:

[ 1526.520230] ===============================
[ 1526.520230] [ INFO: suspicious RCU usage. ]
[ 1526.520230] 3.5.0-rc1+ #12 Not tainted
[ 1526.520230] -------------------------------
[ 1526.520230] /c/kernel-tests/mm/include/linux/rcupdate.h:436 Illegal context switch in RCU read-side critical section!
[ 1526.520230]
[ 1526.520230] other info that might help us debug this:
[ 1526.520230]
[ 1526.520230]
[ 1526.520230] rcu_scheduler_active = 1, debug_locks = 0
[ 1526.520230] 3 locks held by net.agent/3279:
[ 1526.520230]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff82f85962>] do_page_fault+0x193/0x390
[ 1526.520230]  #1:  (panic_lock){+.+...}, at: [<ffffffff82ed2830>] panic+0x37/0x1d3
[ 1526.520230]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff810b9b28>] rcu_lock_acquire+0x0/0x29
[ 1526.520230]
[ 1526.520230] stack backtrace:
[ 1526.520230] Pid: 3279, comm: net.agent Not tainted 3.5.0-rc1+ #12
[ 1526.520230] Call Trace:
[ 1526.520230]  [<ffffffff810e1570>] lockdep_rcu_suspicious+0x109/0x112
[ 1526.520230]  [<ffffffff810bfe3a>] rcu_preempt_sleep_check+0x45/0x47
[ 1526.520230]  [<ffffffff810bfe5a>] __might_sleep+0x1e/0x19a
[ 1526.520230]  [<ffffffff82f8010e>] down_write+0x26/0x81
[ 1526.520230]  [<ffffffff8276a966>] led_trigger_unregister+0x1f/0x9c
[ 1526.520230]  [<ffffffff8276def5>] heartbeat_reboot_notifier+0x15/0x19
[ 1526.520230]  [<ffffffff82f85bf5>] notifier_call_chain+0x96/0xcd
[ 1526.520230]  [<ffffffff82f85cba>] __atomic_notifier_call_chain+0x8e/0xff
[ 1526.520230]  [<ffffffff81094b7c>] ? kmsg_dump+0x37/0x1eb
[ 1526.520230]  [<ffffffff82f85d3f>] atomic_notifier_call_chain+0x14/0x16
[ 1526.520230]  [<ffffffff82ed28e1>] panic+0xe8/0x1d3
[ 1526.520230]  [<ffffffff811473e2>] out_of_memory+0x15d/0x1d3

So in case of a panic, now just turn of the LED. Other approaches like
scheduling a work to unregister the trigger aren't working because there
isn't much which still runs after a panic occured (except timers).

Signed-off-by: Alexander Holler <[email protected]>
Signed-off-by: Bryan Wu <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
The filesystem layer expects pages in the block device's mapping to not
be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
currently replace lowmem pages with highmem pages, leading to crashes in
filesystem code such as the one below:

  Unable to handle kernel NULL pointer dereference at virtual address 00000400
  pgd = c0c98000
  [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
  Internal error: Oops: 817 [#1] PREEMPT SMP ARM
  CPU: 0    Not tainted  (3.5.0-rc5+ #80)
  PC is at __memzero+0x24/0x80
  ...
  Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
  Backtrace:
  [<c010e3f0>] (ext4_getblk+0x0/0x180) from [<c010e58c>] (ext4_bread+0x1c/0x98)
  [<c010e570>] (ext4_bread+0x0/0x98) from [<c0117944>] (ext4_mkdir+0x160/0x3bc)
   r4:c15337f0
  [<c01177e4>] (ext4_mkdir+0x0/0x3bc) from [<c00c29e0>] (vfs_mkdir+0x8c/0x98)
  [<c00c2954>] (vfs_mkdir+0x0/0x98) from [<c00c2a60>] (sys_mkdirat+0x74/0xac)
   r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
  [<c00c29ec>] (sys_mkdirat+0x0/0xac) from [<c00c2ab8>] (sys_mkdir+0x20/0x24)
   r6:beccdcf0 r5:00074000 r4:beccdbbc
  [<c00c2a98>] (sys_mkdir+0x0/0x24) from [<c000e3c0>] (ret_fast_syscall+0x0/0x30)

Fix this by replacing only highmem pages with highmem.

Reported-by: Laura Abbott <[email protected]>
Signed-off-by: Rabin Vincent <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.

An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP <ffff8806044f1d78>
  CR2: 0000000000000000

[[email protected]: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() would double reserved.regions too, so we could free the
old range for reserved.regions.

Also tj said there is another bug which could be related to this.

| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing.  We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.

in that case, when DEBUG_PAGEALLOC, will get panic:

     memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
  BUG: unable to handle kernel paging request at ffff88102febd948
  IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155
  PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
  RIP: 0010:[<ffffffff836a5774>]  [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155

See the discussion at https://lkml.org/lkml/2012/6/13/469

So try to allocate with PAGE_SIZE alignment and free it later.

Reported-by: Sasha Levin <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
This can be trivially triggered from userspace by passing in something unexpected.

    kernel BUG at fs/locks.c:1468!
    invalid opcode: 0000 [#1] SMP
    RIP: 0010:generic_setlease+0xc2/0x100
    Call Trace:
      __vfs_setlease+0x35/0x40
      fcntl_setlease+0x76/0x150
      sys_fcntl+0x1c6/0x810
      system_call_fastpath+0x1a/0x1f

Signed-off-by: Dave Jones <[email protected]>
Cc: [email protected] # 3.2+
Signed-off-by: Linus Torvalds <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
Jian found that when he ran fsx on a 32 bit arch with a large wsize the
process and one of the bdi writeback kthreads would sometimes deadlock
with a stack trace like this:

crash> bt
PID: 2789   TASK: f02edaa0  CPU: 3   COMMAND: "fsx"
 #0 [eed63cbc] schedule at c083c5b3
 #1 [eed63d80] kmap_high at c0500ec8
 #2 [eed63db0] cifs_async_writev at f7fabcd7 [cifs]
 #3 [eed63df0] cifs_writepages at f7fb7f5c [cifs]
 #4 [eed63e50] do_writepages at c04f3e32
 #5 [eed63e54] __filemap_fdatawrite_range at c04e152a
 #6 [eed63ea4] filemap_fdatawrite at c04e1b3e
 #7 [eed63eb4] cifs_file_aio_write at f7fa111a [cifs]
 #8 [eed63ecc] do_sync_write at c052d202
 #9 [eed63f74] vfs_write at c052d4ee
#10 [eed63f94] sys_write at c052df4c
#11 [eed63fb0] ia32_sysenter_target at c0409a98
    EAX: 00000004  EBX: 00000003  ECX: abd73b73  EDX: 012a65c6
    DS:  007b      ESI: 012a65c6  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bf8db178  EBP: bf8db1f8  GS:  0033
    CS:  0073      EIP: 40000424  ERR: 00000004  EFLAGS: 00000246

Each task would kmap part of its address array before getting stuck, but
not enough to actually issue the write.

This patch fixes this by serializing the marshal_iov operations for
async reads and writes. The idea here is to ensure that cifs
aggressively tries to populate a request before attempting to fulfill
another one. As soon as all of the pages are kmapped for a request, then
we can unlock and allow another one to proceed.

There's no need to do this serialization on non-CONFIG_HIGHMEM arches
however, so optimize all of this out when CONFIG_HIGHMEM isn't set.

Cc: <[email protected]>
Reported-by: Jian Li <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
…list

A few days ago Dave Jones reported this oops:

[22766.294255] general protection fault: 0000 [#1] PREEMPT SMP
[22766.295376] CPU 0
[22766.295384] Modules linked in:
[22766.387137]  ffffffffa169f292 6b6b6b6b6b6b6b6b ffff880147c03a90
ffff880147c03a74
[22766.387135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000
[22766.387136] Process trinity-watchdo (pid: 10896, threadinfo ffff88013e7d2000,
[22766.387137] Stack:
[22766.387140]  ffff880147c03a10
[22766.387140]  ffffffffa169f2b6
[22766.387140]  ffff88013ed95728
[22766.387143]  0000000000000002
[22766.387143]  0000000000000000
[22766.387143]  ffff880003fad062
[22766.387144]  ffff88013c120000
[22766.387144]
[22766.387145] Call Trace:
[22766.387145]  <IRQ>
[22766.387150]  [<ffffffffa169f292>] ? __sctp_lookup_association+0x62/0xd0
[sctp]
[22766.387154]  [<ffffffffa169f2b6>] __sctp_lookup_association+0x86/0xd0 [sctp]
[22766.387157]  [<ffffffffa169f597>] sctp_rcv+0x207/0xbb0 [sctp]
[22766.387161]  [<ffffffff810d4da8>] ? trace_hardirqs_off_caller+0x28/0xd0
[22766.387163]  [<ffffffff815827e3>] ? nf_hook_slow+0x133/0x210
[22766.387166]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387168]  [<ffffffff8159043d>] ip_local_deliver_finish+0x18d/0x4c0
[22766.387169]  [<ffffffff815902fc>] ? ip_local_deliver_finish+0x4c/0x4c0
[22766.387171]  [<ffffffff81590a07>] ip_local_deliver+0x47/0x80
[22766.387172]  [<ffffffff8158fd80>] ip_rcv_finish+0x150/0x680
[22766.387174]  [<ffffffff81590c54>] ip_rcv+0x214/0x320
[22766.387176]  [<ffffffff81558c07>] __netif_receive_skb+0x7b7/0x910
[22766.387178]  [<ffffffff8155856c>] ? __netif_receive_skb+0x11c/0x910
[22766.387180]  [<ffffffff810d423e>] ? put_lock_stats.isra.25+0xe/0x40
[22766.387182]  [<ffffffff81558f83>] netif_receive_skb+0x23/0x1f0
[22766.387183]  [<ffffffff815596a9>] ? dev_gro_receive+0x139/0x440
[22766.387185]  [<ffffffff81559280>] napi_skb_finish+0x70/0xa0
[22766.387187]  [<ffffffff81559cb5>] napi_gro_receive+0xf5/0x130
[22766.387218]  [<ffffffffa01c4679>] e1000_receive_skb+0x59/0x70 [e1000e]
[22766.387242]  [<ffffffffa01c5aab>] e1000_clean_rx_irq+0x28b/0x460 [e1000e]
[22766.387266]  [<ffffffffa01c9c18>] e1000e_poll+0x78/0x430 [e1000e]
[22766.387268]  [<ffffffff81559fea>] net_rx_action+0x1aa/0x3d0
[22766.387270]  [<ffffffff810a495f>] ? account_system_vtime+0x10f/0x130
[22766.387273]  [<ffffffff810734d0>] __do_softirq+0xe0/0x420
[22766.387275]  [<ffffffff8169826c>] call_softirq+0x1c/0x30
[22766.387278]  [<ffffffff8101db15>] do_softirq+0xd5/0x110
[22766.387279]  [<ffffffff81073bc5>] irq_exit+0xd5/0xe0
[22766.387281]  [<ffffffff81698b03>] do_IRQ+0x63/0xd0
[22766.387283]  [<ffffffff8168ee2f>] common_interrupt+0x6f/0x6f
[22766.387283]  <EOI>
[22766.387284]
[22766.387285]  [<ffffffff8168eed9>] ? retint_swapgs+0x13/0x1b
[22766.387285] Code: c0 90 5d c3 66 0f 1f 44 00 00 4c 89 c8 5d c3 0f 1f 00 55 48
89 e5 48 83
ec 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 <0f> b7 87 98 00 00 00
48 89 fb
49 89 f5 66 c1 c0 08 66 39 46 02
[22766.387307]
[22766.387307] RIP
[22766.387311]  [<ffffffffa168a2c9>] sctp_assoc_is_match+0x19/0x90 [sctp]
[22766.387311]  RSP <ffff880147c039b0>
[22766.387142]  ffffffffa16ab120
[22766.599537] ---[ end trace 3f6dae82e37b17f5 ]---
[22766.601221] Kernel panic - not syncing: Fatal exception in interrupt

It appears from his analysis and some staring at the code that this is likely
occuring because an association is getting freed while still on the
sctp_assoc_hashtable.  As a result, we get a gpf when traversing the hashtable
while a freed node corrupts part of the list.

Nominally I would think that an mibalanced refcount was responsible for this,
but I can't seem to find any obvious imbalance.  What I did note however was
that the two places where we create an association using
sctp_primitive_ASSOCIATE (__sctp_connect and sctp_sendmsg), have failure paths
which free a newly created association after calling sctp_primitive_ASSOCIATE.
sctp_primitive_ASSOCIATE brings us into the sctp_sf_do_prm_asoc path, which
issues a SCTP_CMD_NEW_ASOC side effect, which in turn adds a new association to
the aforementioned hash table.  the sctp command interpreter that process side
effects has not way to unwind previously processed commands, so freeing the
association from the __sctp_connect or sctp_sendmsg error path would lead to a
freed association remaining on this hash table.

I've fixed this but modifying sctp_[un]hash_established to use hlist_del_init,
which allows us to proerly use hlist_unhashed to check if the node is on a
hashlist safely during a delete.  That in turn alows us to safely call
sctp_unhash_established in the __sctp_connect and sctp_sendmsg error paths
before freeing them, regardles of what the associations state is on the hash
list.

I noted, while I was doing this, that the __sctp_unhash_endpoint was using
hlist_unhsashed in a simmilar fashion, but never nullified any removed nodes
pointers to make that function work properly, so I fixed that up in a simmilar
fashion.

I attempted to test this using a virtual guest running the SCTP_RR test from
netperf in a loop while running the trinity fuzzer, both in a loop.  I wasn't
able to recreate the problem prior to this fix, nor was I able to trigger the
failure after (neither of which I suppose is suprising).  Given the trace above
however, I think its likely that this is what we hit.

Signed-off-by: Neil Horman <[email protected]>
Reported-by: [email protected]
CC: [email protected]
CC: "David S. Miller" <[email protected]>
CC: Vlad Yasevich <[email protected]>
CC: Sridhar Samudrala <[email protected]>
CC: [email protected]
Signed-off-by: David S. Miller <[email protected]>
svenkatr pushed a commit that referenced this issue Jul 23, 2012
IPVS should not reset skb->nf_bridge in FORWARD hook
by calling nf_reset for NAT replies. It triggers oops in
br_nf_forward_finish.

[  579.781508] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[  579.781669] IP: [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
[  579.781792] PGD 218f9067 PUD 0
[  579.781865] Oops: 0000 [#1] SMP
[  579.781945] CPU 0
[  579.781983] Modules linked in:
[  579.782047]
[  579.782080]
[  579.782114] Pid: 4644, comm: qemu Tainted: G        W    3.5.0-rc5-00006-g95e69f9 #282 Hewlett-Packard  /30E8
[  579.782300] RIP: 0010:[<ffffffff817b1ca5>]  [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
[  579.782455] RSP: 0018:ffff88007b003a98  EFLAGS: 00010287
[  579.782541] RAX: 0000000000000008 RBX: ffff8800762ead00 RCX: 000000000001670a
[  579.782653] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff8800762ead00
[  579.782845] RBP: ffff88007b003ac8 R08: 0000000000016630 R09: ffff88007b003a90
[  579.782957] R10: ffff88007b0038e8 R11: ffff88002da37540 R12: ffff88002da01a02
[  579.783066] R13: ffff88002da01a80 R14: ffff88002d83c000 R15: ffff88002d82a000
[  579.783177] FS:  0000000000000000(0000) GS:ffff88007b000000(0063) knlGS:00000000f62d1b70
[  579.783306] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[  579.783395] CR2: 0000000000000004 CR3: 00000000218fe000 CR4: 00000000000027f0
[  579.783505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  579.783684] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  579.783795] Process qemu (pid: 4644, threadinfo ffff880021b20000, task ffff880021aba760)
[  579.783919] Stack:
[  579.783959]  ffff88007693cedc ffff8800762ead00 ffff88002da01a02 ffff8800762ead00
[  579.784110]  ffff88002da01a02 ffff88002da01a80 ffff88007b003b18 ffffffff817b26c7
[  579.784260]  ffff880080000000 ffffffff81ef59f0 ffff8800762ead00 ffffffff81ef58b0
[  579.784477] Call Trace:
[  579.784523]  <IRQ>
[  579.784562]
[  579.784603]  [<ffffffff817b26c7>] br_nf_forward_ip+0x275/0x2c8
[  579.784707]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
[  579.784797]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
[  579.784906]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
[  579.784995]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
[  579.785175]  [<ffffffff8187fa95>] ? _raw_write_unlock_bh+0x19/0x1b
[  579.785179]  [<ffffffff817ac417>] __br_forward+0x97/0xa2
[  579.785179]  [<ffffffff817ad366>] br_handle_frame_finish+0x1a6/0x257
[  579.785179]  [<ffffffff817b2386>] br_nf_pre_routing_finish+0x26d/0x2cb
[  579.785179]  [<ffffffff817b2cf0>] br_nf_pre_routing+0x55d/0x5c1
[  579.785179]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
[  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
[  579.785179]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
[  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
[  579.785179]  [<ffffffff81551525>] ? sky2_poll+0xb35/0xb54
[  579.785179]  [<ffffffff817ad62a>] br_handle_frame+0x213/0x229
[  579.785179]  [<ffffffff817ad417>] ? br_handle_frame_finish+0x257/0x257
[  579.785179]  [<ffffffff816e3b47>] __netif_receive_skb+0x2b4/0x3f1
[  579.785179]  [<ffffffff816e69fc>] process_backlog+0x99/0x1e2
[  579.785179]  [<ffffffff816e6800>] net_rx_action+0xdf/0x242
[  579.785179]  [<ffffffff8107e8a8>] __do_softirq+0xc1/0x1e0
[  579.785179]  [<ffffffff8135a5ba>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[  579.785179]  [<ffffffff8188812c>] call_softirq+0x1c/0x30

The steps to reproduce as follow,

1. On Host1, setup brige br0(192.168.1.106)
2. Boot a kvm guest(192.168.1.105) on Host1 and start httpd
3. Start IPVS service on Host1
   ipvsadm -A -t 192.168.1.106:80 -s rr
   ipvsadm -a -t 192.168.1.106:80 -r 192.168.1.105:80 -m
4. Run apache benchmark on Host2(192.168.1.101)
   ab -n 1000 http://192.168.1.106/

ip_vs_reply4
  ip_vs_out
    handle_response
      ip_vs_notrack
        nf_reset()
        {
          skb->nf_bridge = NULL;
        }

Actually, IPVS wants in this case just to replace nfct
with untracked version. So replace the nf_reset(skb) call
in ip_vs_notrack() with a nf_conntrack_put(skb->nfct) call.

Signed-off-by: Lin Ming <[email protected]>
Signed-off-by: Julian Anastasov <[email protected]>
Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Oct 28, 2024
The variable wwan_rtnl_link_ops assign a *bigger* maxtype which leads to
a global out-of-bounds read when parsing the netlink attributes. Exactly
same bug cause as the oob fixed in commit b33fb5b ("net: qualcomm:
rmnet: fix global oob in rmnet_policy").

==================================================================
BUG: KASAN: global-out-of-bounds in validate_nla lib/nlattr.c:388 [inline]
BUG: KASAN: global-out-of-bounds in __nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603
Read of size 1 at addr ffffffff8b09cb60 by task syz.1.66276/323862

CPU: 0 PID: 323862 Comm: syz.1.66276 Not tainted 6.1.70 svenkatr#1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106
 print_address_description mm/kasan/report.c:284 [inline]
 print_report+0x14f/0x750 mm/kasan/report.c:395
 kasan_report+0x139/0x170 mm/kasan/report.c:495
 validate_nla lib/nlattr.c:388 [inline]
 __nla_validate_parse+0x19d7/0x29a0 lib/nlattr.c:603
 __nla_parse+0x3c/0x50 lib/nlattr.c:700
 nla_parse_nested_deprecated include/net/netlink.h:1269 [inline]
 __rtnl_newlink net/core/rtnetlink.c:3514 [inline]
 rtnl_newlink+0x7bc/0x1fd0 net/core/rtnetlink.c:3623
 rtnetlink_rcv_msg+0x794/0xef0 net/core/rtnetlink.c:6122
 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508
 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline]
 netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352
 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874
 sock_sendmsg_nosec net/socket.c:716 [inline]
 __sock_sendmsg net/socket.c:728 [inline]
 ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499
 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553
 __sys_sendmsg net/socket.c:2582 [inline]
 __do_sys_sendmsg net/socket.c:2591 [inline]
 __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f67b19a24ad
RSP: 002b:00007f67b17febb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f67b1b45f80 RCX: 00007f67b19a24ad
RDX: 0000000000000000 RSI: 0000000020005e40 RDI: 0000000000000004
RBP: 00007f67b1a1e01d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd2513764f R14: 00007ffd251376e0 R15: 00007f67b17fed40
 </TASK>

The buggy address belongs to the variable:
 wwan_rtnl_policy+0x20/0x40

The buggy address belongs to the physical page:
page:ffffea00002c2700 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb09c
flags: 0xfff00000001000(reserved|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000001000 ffffea00002c2708 ffffea00002c2708 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner info is not present (never set?)

Memory state around the buggy address:
 ffffffff8b09ca00: 05 f9 f9 f9 05 f9 f9 f9 00 01 f9 f9 00 01 f9 f9
 ffffffff8b09ca80: 00 00 00 05 f9 f9 f9 f9 00 00 03 f9 f9 f9 f9 f9
>ffffffff8b09cb00: 00 00 00 00 05 f9 f9 f9 00 00 00 00 f9 f9 f9 f9
                                                       ^
 ffffffff8b09cb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

According to the comment of `nla_parse_nested_deprecated`, use correct size
`IFLA_WWAN_MAX` here to fix this issue.

Fixes: 88b7105 ("wwan: add interface creation support")
Signed-off-by: Lin Ma <[email protected]>
Reviewed-by: Loic Poulain <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Oct 28, 2024
…n_net

In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.

nfsd_client_shrinker         echo 0 > /proc/fs/nfsd/threads

expire_client                nfsd_shutdown_net
  unhash_client                ...
                               nfs4_state_shutdown_net
                                 /* won't wait shrinker exit */
  /*                             cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this          /* won't destroy unhashed client1 */
   * client1 still alive         nfs4_state_destroy_net
   */

                               nfsd_file_cache_shutdown
                                 /* trigger warning */
                                 kmem_cache_destroy(nfsd_file_slab)
                                 kmem_cache_destroy(nfsd_file_mark_slab)
  /* release nfsd_file and mark */
  __destroy_client

====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ svenkatr#1

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

====================================================================
BUG nfsd_file_mark (Tainted: G    B   W         ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.

Fixes: 7c24fa2 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Oct 28, 2024
[BUG]
Syzbot reports the following crash:

  BTRFS info (device loop0 state MCS): disabling free space tree
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [svenkatr#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
  RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
  Call Trace:
   <TASK>
   btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
   btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
   btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
   btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
   btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
   btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
   btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
   btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
   vfs_get_tree+0x90/0x2b0 fs/super.c:1800
   do_new_mount+0x2be/0xb40 fs/namespace.c:3472
   do_mount fs/namespace.c:3812 [inline]
   __do_sys_mount fs/namespace.c:4020 [inline]
   __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
To support mounting different subvolume with different RO/RW flags for
the new mount APIs, btrfs introduced two workaround to support this feature:

- Skip mount option/feature checks if we are mounting a different
  subvolume

- Reconfigure the fs to RW if the initial mount is RO

Combining these two, we can have the following sequence:

- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
  rescue=all will mark the fs as hard read-only, so no v2 cache clearing
  will happen.

- Mount a subvolume rw of the same fs.
  We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
  because our new fc is RW, different from the original fs.

  Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
  first so that we can grab the existing fs_info.
  Then we reconfigure the fs to RW.

- During reconfiguration, option/features check is skipped
  This means we will restart the v2 cache clearing, and convert back to
  v1 cache.
  This will trigger fs writes, and since the original fs has "rescue=all"
  option, it skips the csum tree read.

  And eventually causing NULL pointer dereference in super block
  writeback.

[FIX]
For reconfiguration caused by different subvolume RO/RW flags, ensure we
always run btrfs_check_options() to ensure we have proper hard RO
requirements met.

In fact the function btrfs_check_options() doesn't really do many
complex checks, but hard RO requirement and some feature dependency
checks, thus there is no special reason not to do the check for mount
reconfiguration.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: f044b31 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: [email protected] # 6.8+
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Oct 28, 2024
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.

This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [svenkatr#1] PREEMPT SMP PTI
  CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ svenkatr#8
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
  RIP: 0010:__set_print_fmt+0x134/0x330

Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.

Link: https://lore.kernel.org/all/[email protected]/

Fixes: 035ba76 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init")
Signed-off-by: Mikel Rychliski <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Currently, when configuring TMU (Time Management Unit) mode of a given
router, we take into account only its own TMU requirements ignoring
other routers in the domain. This is problematic if the router we are
configuring has lower TMU requirements than what is already configured
in the domain.

In the scenario below, we have a host router with two USB4 ports: A and
B. Port A connected to device router svenkatr#1 (which supports CL states) and
existing DisplayPort tunnel, thus, the TMU mode is HiFi uni-directional.

1. Initial topology

          [Host]
         A/
         /
 [Device svenkatr#1]
   /
Monitor

2. Plug in device svenkatr#2 (that supports CL states) to downstream port B of
   the host router

         [Host]
        A/    B\
        /       \
 [Device svenkatr#1]    [Device svenkatr#2]
   /
Monitor

The TMU mode on port B and port A will be configured to LowRes which is
not what we want and will cause monitor to start flickering.

To address this we first scan the domain and search for any router
configured to HiFi uni-directional mode, and if found, configure TMU
mode of the given router to HiFi uni-directional as well.

Cc: [email protected]
Signed-off-by: Gil Fine <[email protected]>
Signed-off-by: Mika Westerberg <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
The scmi_dev->name is released prematurely in __scmi_device_destroy(),
which causes slab-use-after-free when accessing scmi_dev->name in
scmi_bus_notifier(). So move the release of scmi_dev->name to
scmi_device_release() to avoid slab-use-after-free.

  |  BUG: KASAN: slab-use-after-free in strncmp+0xe4/0xec
  |  Read of size 1 at addr ffffff80a482bcc0 by task swapper/0/1
  |
  |  CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.6.38-debug svenkatr#1
  |  Hardware name: Qualcomm Technologies, Inc. SA8775P Ride (DT)
  |  Call trace:
  |   dump_backtrace+0x94/0x114
  |   show_stack+0x18/0x24
  |   dump_stack_lvl+0x48/0x60
  |   print_report+0xf4/0x5b0
  |   kasan_report+0xa4/0xec
  |   __asan_report_load1_noabort+0x20/0x2c
  |   strncmp+0xe4/0xec
  |   scmi_bus_notifier+0x5c/0x54c
  |   notifier_call_chain+0xb4/0x31c
  |   blocking_notifier_call_chain+0x68/0x9c
  |   bus_notify+0x54/0x78
  |   device_del+0x1bc/0x840
  |   device_unregister+0x20/0xb4
  |   __scmi_device_destroy+0xac/0x280
  |   scmi_device_destroy+0x94/0xd0
  |   scmi_chan_setup+0x524/0x750
  |   scmi_probe+0x7fc/0x1508
  |   platform_probe+0xc4/0x19c
  |   really_probe+0x32c/0x99c
  |   __driver_probe_device+0x15c/0x3c4
  |   driver_probe_device+0x5c/0x170
  |   __driver_attach+0x1c8/0x440
  |   bus_for_each_dev+0xf4/0x178
  |   driver_attach+0x3c/0x58
  |   bus_add_driver+0x234/0x4d4
  |   driver_register+0xf4/0x3c0
  |   __platform_driver_register+0x60/0x88
  |   scmi_driver_init+0xb0/0x104
  |   do_one_initcall+0xb4/0x664
  |   kernel_init_freeable+0x3c8/0x894
  |   kernel_init+0x24/0x1e8
  |   ret_from_fork+0x10/0x20
  |
  |  Allocated by task 1:
  |   kasan_save_stack+0x2c/0x54
  |   kasan_set_track+0x2c/0x40
  |   kasan_save_alloc_info+0x24/0x34
  |   __kasan_kmalloc+0xa0/0xb8
  |   __kmalloc_node_track_caller+0x6c/0x104
  |   kstrdup+0x48/0x84
  |   kstrdup_const+0x34/0x40
  |   __scmi_device_create.part.0+0x8c/0x408
  |   scmi_device_create+0x104/0x370
  |   scmi_chan_setup+0x2a0/0x750
  |   scmi_probe+0x7fc/0x1508
  |   platform_probe+0xc4/0x19c
  |   really_probe+0x32c/0x99c
  |   __driver_probe_device+0x15c/0x3c4
  |   driver_probe_device+0x5c/0x170
  |   __driver_attach+0x1c8/0x440
  |   bus_for_each_dev+0xf4/0x178
  |   driver_attach+0x3c/0x58
  |   bus_add_driver+0x234/0x4d4
  |   driver_register+0xf4/0x3c0
  |   __platform_driver_register+0x60/0x88
  |   scmi_driver_init+0xb0/0x104
  |   do_one_initcall+0xb4/0x664
  |   kernel_init_freeable+0x3c8/0x894
  |   kernel_init+0x24/0x1e8
  |   ret_from_fork+0x10/0x20
  |
  |  Freed by task 1:
  |   kasan_save_stack+0x2c/0x54
  |   kasan_set_track+0x2c/0x40
  |   kasan_save_free_info+0x38/0x5c
  |   __kasan_slab_free+0xe8/0x164
  |   __kmem_cache_free+0x11c/0x230
  |   kfree+0x70/0x130
  |   kfree_const+0x20/0x40
  |   __scmi_device_destroy+0x70/0x280
  |   scmi_device_destroy+0x94/0xd0
  |   scmi_chan_setup+0x524/0x750
  |   scmi_probe+0x7fc/0x1508
  |   platform_probe+0xc4/0x19c
  |   really_probe+0x32c/0x99c
  |   __driver_probe_device+0x15c/0x3c4
  |   driver_probe_device+0x5c/0x170
  |   __driver_attach+0x1c8/0x440
  |   bus_for_each_dev+0xf4/0x178
  |   driver_attach+0x3c/0x58
  |   bus_add_driver+0x234/0x4d4
  |   driver_register+0xf4/0x3c0
  |   __platform_driver_register+0x60/0x88
  |   scmi_driver_init+0xb0/0x104
  |   do_one_initcall+0xb4/0x664
  |   kernel_init_freeable+0x3c8/0x894
  |   kernel_init+0x24/0x1e8
  |   ret_from_fork+0x10/0x20

Fixes: ee7a9c9 ("firmware: arm_scmi: Add support for multiple device per protocol")
Signed-off-by: Xinqi Zhang <[email protected]>
Reviewed-by: Cristian Marussi <[email protected]>
Reviewed-by: Bjorn Andersson <[email protected]>
Message-Id: <20241016-fix-arm-scmi-slab-use-after-free-v2-1-1783685ef90d@quicinc.com>
Signed-off-by: Sudeep Holla <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
…ing to satisfy some BPF verifiers

In a RHEL8 kernel (4.18.0-513.11.1.el8_9.x86_64), that, as enterprise
kernels go, have backports from modern kernels, the verifier complains
about lack of bounds check for the index into the array of syscall
arguments, on a BPF bytecode generated by clang 17, with:

  ; } else if (size < 0 && size >= -6) { /* buffer */
  116: (b7) r1 = -6
  117: (2d) if r1 > r6 goto pc-30
   R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv-6 R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6=inv(id=0,umin_value=18446744073709551610,var_off=(0xffffffff00000000; 0xffffffff)) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value
  ; index = -(size + 1);
  118: (a7) r6 ^= -1
  119: (67) r6 <<= 32
  120: (77) r6 >>= 32
  ; aug_size = args->args[index];
  121: (67) r6 <<= 3
  122: (79) r1 = *(u64 *)(r10 -24)
  123: (0f) r1 += r6
  last_idx 123 first_idx 116
  regs=40 stack=0 before 122: (79) r1 = *(u64 *)(r10 -24)
  regs=40 stack=0 before 121: (67) r6 <<= 3
  regs=40 stack=0 before 120: (77) r6 >>= 32
  regs=40 stack=0 before 119: (67) r6 <<= 32
  regs=40 stack=0 before 118: (a7) r6 ^= -1
  regs=40 stack=0 before 117: (2d) if r1 > r6 goto pc-30
  regs=42 stack=0 before 116: (b7) r1 = -6
   R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=inv1 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_w=inv(id=0) R5_w=inv40 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=0) R7_w=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8_w=invP6 R9_w=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_w=map_value fp-24_r=map_value fp-32_w=inv40 fp-40=ctx fp-48=map_value fp-56_w=inv1 fp-64_w=map_value fp-72=map_value fp-80=map_value
  parent didn't have regs=40 stack=0 marks
  last_idx 110 first_idx 98
  regs=40 stack=0 before 110: (6d) if r1 s> r6 goto pc+5
  regs=42 stack=0 before 109: (b7) r1 = 1
  regs=40 stack=0 before 108: (65) if r6 s> 0x1000 goto pc+7
  regs=40 stack=0 before 98: (55) if r6 != 0x1 goto pc+9
   R0_w=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=invP12 R2_w=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3_rw=inv(id=0) R5_w=inv24 R6_rw=invP(id=0,smin_value=-2147483648,smax_value=2147483647) R7_w=map_value(id=0,off=40,ks=4,vs=8272,imm=0) R8_rw=invP4 R9_w=map_value(id=0,off=12,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16_rw=map_value fp-24_r=map_value fp-32_rw=invP24 fp-40_r=ctx fp-48_r=map_value fp-56_w=invP1 fp-64_rw=map_value fp-72_r=map_value fp-80_r=map_value
  parent already had regs=40 stack=0 marks
  124: (79) r6 = *(u64 *)(r1 +16)
   R0=map_value(id=0,off=0,ks=4,vs=24688,imm=0) R1_w=map_value(id=0,off=0,ks=4,vs=8272,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R2=map_value(id=0,off=16,ks=4,vs=8272,imm=0) R3=inv(id=0) R5=inv40 R6_w=invP(id=0,umax_value=34359738360,var_off=(0x0; 0x7fffffff8),s32_max_value=2147483640,u32_max_value=-8) R7=map_value(id=0,off=56,ks=4,vs=8272,imm=0) R8=invP6 R9=map_value(id=0,off=20,ks=4,vs=24,imm=0) R10=fp0 fp-8=mmmmmmmm fp-16=map_value fp-24=map_value fp-32=inv40 fp-40=ctx fp-48=map_value fp-56=inv1 fp-64=map_value fp-72=map_value fp-80=map_value
  R1 unbounded memory access, make sure to bounds check any such access
  processed 466 insns (limit 1000000) max_states_per_insn 2 total_states 20 peak_states 20 mark_read 3

If we add this line, as used in other BPF programs, to cap that index:

   index &= 7;

The generated BPF program is considered safe by that version of the BPF
verifier, allowing perf to collect the syscall args in one more kernel
using the BPF based pointer contents collector.

With the above one-liner it works with that kernel:

  [root@dell-per740-01 ~]# uname -a
  Linux dell-per740-01.khw.eng.rdu2.dc.redhat.com 4.18.0-513.11.1.el8_9.x86_64 svenkatr#1 SMP Thu Dec 7 03:06:13 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
  [root@dell-per740-01 ~]# ~acme/bin/perf trace -e *sleep* sleep 1.234567890
       0.000 (1234.704 ms): sleep/3863610 nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 })                  = 0
  [root@dell-per740-01 ~]#

As well as with the one in Fedora 40:

  root@number:~# uname -a
  Linux number 6.11.3-200.fc40.x86_64 svenkatr#1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:31:19 UTC 2024 x86_64 GNU/Linux
  root@number:~# perf trace -e *sleep* sleep 1.234567890
       0.000 (1234.722 ms): sleep/14873 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 234567890 }, rmtp: 0x7ffe87311a40) = 0
  root@number:~#

Song Liu reported that this one-liner was being optimized out by clang
18, so I suggested and he tested that adding a compiler barrier before
it made clang v18 to keep it and the verifier in the kernel in Song's
case (Meta's 5.12 based kernel) also was happy with the resulting
bytecode.

I'll investigate using virtme-ng[1] to have all the perf BPF based
functionality thoroughly tested over multiple kernels and clang
versions.

[1] https://kernel-recipes.org/en/2024/virtme-ng/

Cc: Adrian Hunter <[email protected]>
Cc: Alan Maguire <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Howard Chu <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Song Liu <[email protected]>
Link: https://lore.kernel.org/lkml/Zw7JgJc0LOwSpuvx@x1
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
The purpose of btrfs_bbio_propagate_error() shall be propagating an error
of split bio to its original btrfs_bio, and tell the error to the upper
layer. However, it's not working well on some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called enough
fast, orig_bbio's (remaining part after split) bio context may not be
properly set at that time. Since the bio context is set when the orig_bbio
(the last btrfs_bio) is sent to devices, that might be too late for earlier
split btrfs_bio's completion.  That will result in NULL pointer
dereference.

That bug is easily reproducible by running btrfs/146 on zoned devices [1]
and it shows the following trace.

[1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [svenkatr#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
called last among split bios. In that case, btrfs_orig_write_end_io() sets
the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2].
Otherwise, the increased orig_bio's bioc->error is not checked by anyone
and return BLK_STS_OK to the upper layer.

[2] Actually, this is not true. Because we only increases orig_bioc->errors
by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() is not working
well if un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily
over-written by orig_bbio's completion status. If the status is BLK_STS_OK,
the upper layer would not know the failure.

* Solution

Considering the above cases, we can only save the error status in the
orig_bbio (remaining part after split) itself as it is always
available. Also, the saved error status should be propagated when all the
split btrfs_bios are finished (i.e, bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error of
split bios to original btrfs_bio's "status" variable. When all the split
bios are finished, the saved status is loaded into original btrfs_bio's
status.

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.

Fixes: 852eee6 ("btrfs: allow btrfs_submit_bio to split bios")
CC: [email protected] # 6.6+
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
Signed-off-by: David Sterba <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Running rcutorture scenario TREE05, the below warning is triggered.

[   32.604594] WARNING: suspicious RCU usage
[   32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted
[   32.607812] -----------------------------
[   32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!!
[   32.611595] other info that might help us debug this:
[   32.614247] rcu_scheduler_active = 2, debug_locks = 1
[   32.616392] 3 locks held by cpuhp/4/35:
[   32.617687]  #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.620563]  svenkatr#1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.623412]  svenkatr#2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0

In perf_event_clear_cpumask(), uses list_for_each_entry_rcu() without an
obvious RCU read-side critical section.

Either pmus_srcu or pmus_lock is good enough to protect the pmus list.
In the current context, pmus_lock is already held. The
list_for_each_entry_rcu() is not required.

Fixes: 4ba4f1a ("perf: Generic hotplug support for a PMU with a scope")
Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Reported-by: "Paul E. McKenney" <[email protected]>
Reported-by: kernel test robot <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: "Paul E. McKenney" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Add check for the return value of spi_get_csgpiod() to avoid passing a NULL
pointer to gpiod_direction_output(), preventing a crash when GPIO chip
select is not used.

Fix below crash:
[    4.251960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[    4.260762] Mem abort info:
[    4.263556]   ESR = 0x0000000096000004
[    4.267308]   EC = 0x25: DABT (current EL), IL = 32 bits
[    4.272624]   SET = 0, FnV = 0
[    4.275681]   EA = 0, S1PTW = 0
[    4.278822]   FSC = 0x04: level 0 translation fault
[    4.283704] Data abort info:
[    4.286583]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    4.292074]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    4.297130]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    4.302445] [0000000000000000] user address but active_mm is swapper
[    4.308805] Internal error: Oops: 0000000096000004 [svenkatr#1] PREEMPT SMP
[    4.315072] Modules linked in:
[    4.318124] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc4-next-20241023-00008-ga20ec42c5fc1 #359
[    4.328130] Hardware name: LS1046A QDS Board (DT)
[    4.332832] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.339794] pc : gpiod_direction_output+0x34/0x5c
[    4.344505] lr : gpiod_direction_output+0x18/0x5c
[    4.349208] sp : ffff80008003b8f0
[    4.352517] x29: ffff80008003b8f0 x28: 0000000000000000 x27: ffffc96bcc7e9068
[    4.359659] x26: ffffc96bcc6e00b0 x25: ffffc96bcc598398 x24: ffff447400132810
[    4.366800] x23: 0000000000000000 x22: 0000000011e1a300 x21: 0000000000020002
[    4.373940] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff
[    4.381081] x17: ffff44740016e600 x16: 0000000500000003 x15: 0000000000000007
[    4.388221] x14: 0000000000989680 x13: 0000000000020000 x12: 000000000000001e
[    4.395362] x11: 0044b82fa09b5a53 x10: 0000000000000019 x9 : 0000000000000008
[    4.402502] x8 : 0000000000000002 x7 : 0000000000000007 x6 : 0000000000000000
[    4.409641] x5 : 0000000000000200 x4 : 0000000002000000 x3 : 0000000000000000
[    4.416781] x2 : 0000000000022202 x1 : 0000000000000000 x0 : 0000000000000000
[    4.423921] Call trace:
[    4.426362]  gpiod_direction_output+0x34/0x5c (P)
[    4.431067]  gpiod_direction_output+0x18/0x5c (L)
[    4.435771]  dspi_setup+0x220/0x334

Fixes: 9e264f3 ("spi: Replace all spi->chip_select and spi->cs_gpiod references with function call")
Cc: [email protected]
Signed-off-by: Frank Li <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Mark Brown <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
… non-PCI device

The function cxl_endpoint_gather_bandwidth() invokes
pci_bus_read/write_XXX(), however, not all CXL devices are presently
implemented via PCI. It is recognized that the cxl_test has realized a CXL
device using a platform device.

Calling pci_bus_read/write_XXX() in cxl_test will cause kernel panic:
 platform cxl_host_bridge.3: host supports CXL (restricted)
 Oops: general protection fault, probably for non-canonical address 0x3ef17856fcae4fbd: 0000 [svenkatr#1] PREEMPT SMP PTI
 Call Trace:
  <TASK>
  ? __die_body.cold+0x19/0x27
  ? die_addr+0x38/0x60
  ? exc_general_protection+0x1f5/0x4b0
  ? asm_exc_general_protection+0x22/0x30
  ? pci_bus_read_config_word+0x1c/0x60
  pcie_capability_read_word+0x93/0xb0
  pcie_link_speed_mbps+0x18/0x50
  cxl_pci_get_bandwidth+0x18/0x60 [cxl_core]
  cxl_endpoint_gather_bandwidth.constprop.0+0xf4/0x230 [cxl_core]
  ? xas_store+0x54/0x660
  ? preempt_count_add+0x69/0xa0
  ? _raw_spin_lock+0x13/0x40
  ? __kmalloc_cache_noprof+0xe7/0x270
  cxl_region_shared_upstream_bandwidth_update+0x9c/0x790 [cxl_core]
  cxl_region_attach+0x520/0x7e0 [cxl_core]
  store_targetN+0xf2/0x120 [cxl_core]
  kernfs_fop_write_iter+0x13a/0x1f0
  vfs_write+0x23b/0x410
  ksys_write+0x53/0xd0
  do_syscall_64+0x62/0x180
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

And Ying also reported a KASAN error with similar calltrace.

Reported-by: Huang, Ying <[email protected]>
Closes: http://lore.kernel.org/[email protected]
Fixes: a5ab0de ("cxl: Calculate region bandwidth of targets with shared upstream link")
Signed-off-by: Li Zhijian <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Tested-by: Huang, Ying <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Ira Weiny <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
In support of investigating an initialization failure report [1],
cxl_test was updated to register mock memory-devices after the mock
root-port/bus device had been registered. That led to cxl_test crashing
with a use-after-free bug with the following signature:

    cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
    cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
    cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
1)  cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
    [..]
    cxld_unregister: cxl decoder14.0:
    cxl_region_decode_reset: cxl_region region3:
    mock_decoder_reset: cxl_port port3: decoder3.0 reset
2)  mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
    cxl_endpoint_decoder_release: cxl decoder14.0:
    [..]
    cxld_unregister: cxl decoder7.0:
3)  cxl_region_decode_reset: cxl_region region3:
    Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [svenkatr#1] PREEMPT SMP PTI
    [..]
    RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
    [..]
    Call Trace:
     <TASK>
     cxl_region_decode_reset+0x69/0x190 [cxl_core]
     cxl_region_detach+0xe8/0x210 [cxl_core]
     cxl_decoder_kill_region+0x27/0x40 [cxl_core]
     cxld_unregister+0x5d/0x60 [cxl_core]

At 1) a region has been established with 2 endpoint decoders (7.0 and
14.0). Those endpoints share a common switch-decoder in the topology
(3.0). At teardown, 2), decoder14.0 is the first to be removed and hits
the "out of order reset case" in the switch decoder. The effect though
is that region3 cleanup is aborted leaving it in-tact and
referencing decoder14.0. At 3) the second attempt to teardown region3
trips over the stale decoder14.0 object which has long since been
deleted.

The fix here is to recognize that the CXL specification places no
mandate on in-order shutdown of switch-decoders, the driver enforces
in-order allocation, and hardware enforces in-order commit. So, rather
than fail and leave objects dangling, always remove them.

In support of making cxl_region_decode_reset() always succeed,
cxl_region_invalidate_memregion() failures are turned into warnings.
Crashing the kernel is ok there since system integrity is at risk if
caches cannot be managed around physical address mutation events like
CXL region destruction.

A new device_for_each_child_reverse_from() is added to cleanup
port->commit_end after all dependent decoders have been disabled. In
other words if decoders are allocated 0->1->2 and disabled 1->2->0 then
port->commit_end only decrements from 2 after 2 has been disabled, and
it decrements all the way to zero since 1 was disabled previously.

Link: http://lore.kernel.org/[email protected] [1]
Cc: [email protected]
Fixes: 176baef ("cxl/hdm: Commit decoder state to hardware")
Reviewed-by: Jonathan Cameron <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Alison Schofield <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Zijun Hu <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Link: https://patch.msgid.link/172964782781.81806.17902885593105284330.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Ira Weiny <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to
fail even though free pages are available in the highatomic reserves. 
GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock()
since it's only run from reclaim.

Given that such allocations will pass the watermarks in
__zone_watermark_unusable_free(), it makes sense to fallback to highatomic
reserves the same way that ALLOC_OOM can.

This fixes order-0 page allocation failures observed on Cloudflare's fleet
when handling network packets:

  kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC),
  nodemask=(null),cpuset=/,mems_allowed=0-7
  CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G           O 6.6.43-CUSTOM svenkatr#1
  Hardware name: MACHINE
  Call Trace:
   <IRQ>
   dump_stack_lvl+0x3c/0x50
   warn_alloc+0x13a/0x1c0
   __alloc_pages_slowpath.constprop.0+0xc9d/0xd10
   __alloc_pages+0x327/0x340
   __napi_alloc_skb+0x16d/0x1f0
   bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en]
   bnxt_rx_pkt+0x201/0x15e0 [bnxt_en]
   __bnxt_poll_work+0x156/0x2b0 [bnxt_en]
   bnxt_poll+0xd9/0x1c0 [bnxt_en]
   __napi_poll+0x2b/0x1b0
   bpf_trampoline_6442524138+0x7d/0x1000
   __napi_poll+0x5/0x1b0
   net_rx_action+0x342/0x740
   handle_softirqs+0xcf/0x2b0
   irq_exit_rcu+0x6c/0x90
   sysvec_apic_timer_interrupt+0x72/0x90
   </IRQ>

[[email protected]: update comment]
  Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/
Fixes: 1d91df8 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs")
Signed-off-by: Matt Fleming <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
walk_system_ram_res_rev() erroneously discards resource flags when passing
the information to the callback.

This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have
these resources selected during kexec to store kexec buffers if that
memory happens to be at placed above normal system ram.

This leads to undefined behavior after reboot.  If the kexec buffer is
never touched, nothing happens.  If the kexec buffer is touched, it could
lead to a crash (like below) or undefined behavior.

Tested on a system with CXL memory expanders with driver managed memory,
TPM enabled, and CONFIG_IMA_KEXEC=y.  Adding printk's showed the flags
were being discarded and as a result the check for
IORESOURCE_SYSRAM_DRIVER_MANAGED passes.

find_next_iomem_res: name(System RAM (kmem))
		     start(10000000000)
		     end(1034fffffff)
		     flags(83000200)

locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0)

[.] BUG: unable to handle page fault for address: ffff89834ffff000
[.] #PF: supervisor read access in kernel mode
[.] #PF: error_code(0x0000) - not-present page
[.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0
[.] Oops: 0000 [svenkatr#1] SMP
[.] RIP: 0010:ima_restore_measurement_list+0x95/0x4b0
[.] RSP: 0018:ffffc900000d3a80 EFLAGS: 00010286
[.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000
[.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018
[.] RBP: ffffc900000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900
[.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000
[.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000
[.] FS:  0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000
[.] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[.] ata5: SATA link down (SStatus 0 SControl 300)
[.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0
[.] PKRU: 55555554
[.] Call Trace:
[.]  <TASK>
[.]  ? __die+0x78/0xc0
[.]  ? page_fault_oops+0x2a8/0x3a0
[.]  ? exc_page_fault+0x84/0x130
[.]  ? asm_exc_page_fault+0x22/0x30
[.]  ? ima_restore_measurement_list+0x95/0x4b0
[.]  ? template_desc_init_fields+0x317/0x410
[.]  ? crypto_alloc_tfm_node+0x9c/0xc0
[.]  ? init_ima_lsm+0x30/0x30
[.]  ima_load_kexec_buffer+0x72/0xa0
[.]  ima_init+0x44/0xa0
[.]  __initstub__kmod_ima__373_1201_init_ima7+0x1e/0xb0
[.]  ? init_ima_lsm+0x30/0x30
[.]  do_one_initcall+0xad/0x200
[.]  ? idr_alloc_cyclic+0xaa/0x110
[.]  ? new_slab+0x12c/0x420
[.]  ? new_slab+0x12c/0x420
[.]  ? number+0x12a/0x430
[.]  ? sysvec_apic_timer_interrupt+0xa/0x80
[.]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[.]  ? parse_args+0xd4/0x380
[.]  ? parse_args+0x14b/0x380
[.]  kernel_init_freeable+0x1c1/0x2b0
[.]  ? rest_init+0xb0/0xb0
[.]  kernel_init+0x16/0x1a0
[.]  ret_from_fork+0x2f/0x40
[.]  ? rest_init+0xb0/0xb0
[.]  ret_from_fork_asm+0x11/0x20
[.]  </TASK>

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7acf164 ("resource: add walk_system_ram_res_rev()")
Signed-off-by: Gregory Price <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: AKASHI Takahiro <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Ilpo Järvinen <[email protected]>
Cc: Mika Westerberg <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
The following BUG was triggered:

=============================
[ BUG: Invalid wait context ]
6.12.0-rc2-XXX #406 Not tainted
-----------------------------
kworker/1:1/62 is trying to lock:
ffffff8801593030 (&cpc_ptr->rmw_lock){+.+.}-{3:3}, at: cpc_write+0xcc/0x370
other info that might help us debug this:
context-{5:5}
2 locks held by kworker/1:1/62:
  #0: ffffff897ef5ec98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2c/0x50
  svenkatr#1: ffffff880154e238 (&sg_policy->update_lock){....}-{2:2}, at: sugov_update_shared+0x3c/0x280
stack backtrace:
CPU: 1 UID: 0 PID: 62 Comm: kworker/1:1 Not tainted 6.12.0-rc2-g9654bd3e8806 #406
Workqueue:  0x0 (events)
Call trace:
  dump_backtrace+0xa4/0x130
  show_stack+0x20/0x38
  dump_stack_lvl+0x90/0xd0
  dump_stack+0x18/0x28
  __lock_acquire+0x480/0x1ad8
  lock_acquire+0x114/0x310
  _raw_spin_lock+0x50/0x70
  cpc_write+0xcc/0x370
  cppc_set_perf+0xa0/0x3a8
  cppc_cpufreq_fast_switch+0x40/0xc0
  cpufreq_driver_fast_switch+0x4c/0x218
  sugov_update_shared+0x234/0x280
  update_load_avg+0x6ec/0x7b8
  dequeue_entities+0x108/0x830
  dequeue_task_fair+0x58/0x408
  __schedule+0x4f0/0x1070
  schedule+0x54/0x130
  worker_thread+0xc0/0x2e8
  kthread+0x130/0x148
  ret_from_fork+0x10/0x20

sugov_update_shared() locks a raw_spinlock while cpc_write() locks a
spinlock.

To have a correct wait-type order, update rmw_lock to a raw spinlock and
ensure that interrupts will be disabled on the CPU holding it.

Fixes: 60949b7 ("ACPI: CPPC: Fix MASK_VAL() usage")
Signed-off-by: Pierre Gondois <[email protected]>
Link: https://patch.msgid.link/[email protected]
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
I got a syzbot report without a repro [1] crashing in nf_send_reset6()

I think the issue is that dev->hard_header_len is zero, and we attempt
later to push an Ethernet header.

Use LL_MAX_HEADER, as other functions in net/ipv6/netfilter/nf_reject_ipv6.c.

[1]

skbuff: skb_under_panic: text:ffffffff89b1d008 len:74 put:14 head:ffff88803123aa00 data:ffff88803123a9f2 tail:0x3c end:0x140 dev:syz_tun
 kernel BUG at net/core/skbuff.c:206 !
Oops: invalid opcode: 0000 [svenkatr#1] PREEMPT SMP KASAN PTI
CPU: 0 UID: 0 PID: 7373 Comm: syz.1.568 Not tainted 6.12.0-rc2-syzkaller-00631-g6d858708d465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
 RIP: 0010:skb_panic net/core/skbuff.c:206 [inline]
 RIP: 0010:skb_under_panic+0x14b/0x150 net/core/skbuff.c:216
Code: 0d 8d 48 c7 c6 60 a6 29 8e 48 8b 54 24 08 8b 0c 24 44 8b 44 24 04 4d 89 e9 50 41 54 41 57 41 56 e8 ba 30 38 02 48 83 c4 20 90 <0f> 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
RSP: 0018:ffffc900045269b0 EFLAGS: 00010282
RAX: 0000000000000088 RBX: dffffc0000000000 RCX: cd66dacdc5d8e800
RDX: 0000000000000000 RSI: 0000000000000200 RDI: 0000000000000000
RBP: ffff88802d39a3d0 R08: ffffffff8174afec R09: 1ffff920008a4ccc
R10: dffffc0000000000 R11: fffff520008a4ccd R12: 0000000000000140
R13: ffff88803123aa00 R14: ffff88803123a9f2 R15: 000000000000003c
FS:  00007fdbee5ff6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000005d322000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  skb_push+0xe5/0x100 net/core/skbuff.c:2636
  eth_header+0x38/0x1f0 net/ethernet/eth.c:83
  dev_hard_header include/linux/netdevice.h:3208 [inline]
  nf_send_reset6+0xce6/0x1270 net/ipv6/netfilter/nf_reject_ipv6.c:358
  nft_reject_inet_eval+0x3b9/0x690 net/netfilter/nft_reject_inet.c:48
  expr_call_ops_eval net/netfilter/nf_tables_core.c:240 [inline]
  nft_do_chain+0x4ad/0x1da0 net/netfilter/nf_tables_core.c:288
  nft_do_chain_inet+0x418/0x6b0 net/netfilter/nft_chain_filter.c:161
  nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
  nf_hook_slow+0xc3/0x220 net/netfilter/core.c:626
  nf_hook include/linux/netfilter.h:269 [inline]
  NF_HOOK include/linux/netfilter.h:312 [inline]
  br_nf_pre_routing_ipv6+0x63e/0x770 net/bridge/br_netfilter_ipv6.c:184
  nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline]
  nf_hook_bridge_pre net/bridge/br_input.c:277 [inline]
  br_handle_frame+0x9fd/0x1530 net/bridge/br_input.c:424
  __netif_receive_skb_core+0x13e8/0x4570 net/core/dev.c:5562
  __netif_receive_skb_one_core net/core/dev.c:5666 [inline]
  __netif_receive_skb+0x12f/0x650 net/core/dev.c:5781
  netif_receive_skb_internal net/core/dev.c:5867 [inline]
  netif_receive_skb+0x1e8/0x890 net/core/dev.c:5926
  tun_rx_batched+0x1b7/0x8f0 drivers/net/tun.c:1550
  tun_get_user+0x3056/0x47e0 drivers/net/tun.c:2007
  tun_chr_write_iter+0x10d/0x1f0 drivers/net/tun.c:2053
  new_sync_write fs/read_write.c:590 [inline]
  vfs_write+0xa6d/0xc90 fs/read_write.c:683
  ksys_write+0x183/0x2b0 fs/read_write.c:736
  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fdbeeb7d1ff
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 8d 02 00 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 1c 8e 02 00 48
RSP: 002b:00007fdbee5ff000 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fdbeed36058 RCX: 00007fdbeeb7d1ff
RDX: 000000000000008e RSI: 0000000020000040 RDI: 00000000000000c8
RBP: 00007fdbeebf12be R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000008e R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fdbeed36058 R15: 00007ffc38de06e8
 </TASK>

Fixes: c8d7b98 ("netfilter: move nf_send_resetX() code to nf_reject_ipvX modules")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Hou Tao says:

====================
The patch set fixes several issues in bits iterator. Patch svenkatr#1 fixes the
kmemleak problem of bits iterator. Patch svenkatr#2~svenkatr#3 fix the overflow problem
of nr_bits. Patch svenkatr#4 fixes the potential stack corruption when bits
iterator is used on 32-bit host. Patch svenkatr#5 adds more test cases for bits
iterator.

Please see the individual patches for more details. And comments are
always welcome.
---
v4:
 * patch svenkatr#1: add ack from Yafang
 * patch svenkatr#3: revert code-churn like changes:
   (1) compute nr_bytes and nr_bits before the check of nr_words.
   (2) use nr_bits == 64 to check for single u64, preventing build
       warning on 32-bit hosts.
 * patch svenkatr#4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)"

v3: https://lore.kernel.org/bpf/[email protected]/T/#t
  * split the bits-iterator related patches from "Misc fixes for bpf"
    patch set
  * patch svenkatr#1: use "!nr_bits || bits >= nr_bits" to stop the iteration
  * patch svenkatr#2: add a new helper for the overflow problem
  * patch svenkatr#3: decrease the limitation from 512 to 511 and check whether
    nr_bytes is too large for bpf memory allocator explicitly
  * patch svenkatr#5: add two more test cases for bit iterator

v2: http://lore.kernel.org/bpf/[email protected]
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Petr Machata says:

====================
mlxsw: Fixes

In this patchset:

- Tx header should be pushed for each packet which is transmitted via
  Spectrum ASICs. Patch svenkatr#1 adds a missing call to skb_cow_head() to make
  sure that there is both enough room to push the Tx header and that the
  SKB header is not cloned and can be modified.

- Commit b5b60bb ("mlxsw: pci: Use page pool for Rx buffers
  allocation") converted mlxsw to use page pool for Rx buffers allocation.
  Sync for CPU and for device should be done for Rx pages. In patches svenkatr#2
  and svenkatr#3, add the missing calls to sync pages for, respectively, CPU and
  the device.

- Patch svenkatr#4 then fixes a bug to IPv6 GRE forwarding offload. Patch svenkatr#5 adds
  a generic forwarding test that fails with mlxsw ports prior to the fix.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
When we compile and load lib/slub_kunit.c,it will cause a panic.

The root cause is that __kmalloc_cache_noprof was directly called instead
of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This
caused current->alloc_tag to be null,leading to a null pointer dereference
in alloc_tag_ref_set.

Despite the fact that my colleague Pei Xiao will later fix the code in
slub_kunit.c,we still need fix null pointer check logic for ref and tag to
avoid panic caused by a null pointer dereference.

Here is the log for the panic:

[   74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[   74.780130][ T2158] Mem abort info:
[   74.780406][ T2158]   ESR = 0x0000000096000004
[   74.780756][ T2158]   EC = 0x25: DABT (current EL), IL = 32 bits
[   74.781225][ T2158]   SET = 0, FnV = 0
[   74.781529][ T2158]   EA = 0, S1PTW = 0
[   74.781836][ T2158]   FSC = 0x04: level 0 translation fault
[   74.782288][ T2158] Data abort info:
[   74.782577][ T2158]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[   74.783068][ T2158]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   74.783533][ T2158]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000
[   74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000
[   74.785293][ T2158] Internal error: Oops: 0000000096000004 [svenkatr#1] SMP
[   74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4
[   74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G        W        N 6.12.0-rc3+ svenkatr#2
[   74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST
[   74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[   74.792479][ T2158] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.794095][ T2158] sp : ffff800084d33cd0
[   74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000
[   74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314
[   74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088
[   74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff
[   74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00
[   74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537
[   74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0
[   74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff
[   74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001
[   74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000
[   74.801656][ T2158] Call trace:
[   74.801954][ T2158]  alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.802494][ T2158]  __kmalloc_cache_noprof+0x148/0x33c
[   74.802976][ T2158]  test_kmalloc_redzone_access+0x4c/0x104 [slub_kunit]
[   74.803607][ T2158]  kunit_try_run_case+0x70/0x17c [kunit]
[   74.804124][ T2158]  kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit]
[   74.804768][ T2158]  kthread+0x10c/0x118
[   74.805141][ T2158]  ret_from_fork+0x10/0x20
[   74.805540][ T2158] Code: b9400a80 11000400 b9000a80 97ffd858 (f94012d3)
[   74.806176][ T2158] SMP: stopping secondary CPUs
[   74.808130][ T2158] Starting crashdump kernel...

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e0a955b ("mm/codetag: add pgalloc_tag_copy()")
Signed-off-by: Hao Ge <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Suren Baghdasaryan <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Enqueue packets in dql after dma engine starts causes race condition.
Tx transfer starts once dma engine is started and may execute dql dequeue
in completion before it gets queued. It results in following kernel crash
while running iperf stress test:

kernel BUG at lib/dynamic_queue_limits.c:99!
<snip>
Internal error: Oops - BUG: 00000000f2000800 [svenkatr#1] SMP
pc : dql_completed+0x238/0x248
lr : dql_completed+0x3c/0x248

Call trace:
  dql_completed+0x238/0x248
  axienet_dma_tx_cb+0xa0/0x170
  xilinx_dma_do_tasklet+0xdc/0x290
  tasklet_action_common+0xf8/0x11c
  tasklet_action+0x30/0x3c
  handle_softirqs+0xf8/0x230
<snip>

Start dmaengine after enqueue in dql fixes the crash.

Fixes: 6a91b84 ("net: axienet: Introduce dmaengine support")
Signed-off-by: Suraj Gupta <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Recently, we got a customer report that CIFS triggers oops while
reconnecting to a server.  [0]

The workload runs on Kubernetes, and some pods mount CIFS servers
in non-root network namespaces.  The problem rarely happened, but
it was always while the pod was dying.

The root cause is wrong reference counting for network namespace.

CIFS uses kernel sockets, which do not hold refcnt of the netns that
the socket belongs to.  That means CIFS must ensure the socket is
always freed before its netns; otherwise, use-after-free happens.

The repro steps are roughly:

  1. mount CIFS in a non-root netns
  2. drop packets from the netns
  3. destroy the netns
  4. unmount CIFS

We can reproduce the issue quickly with the script [1] below and see
the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled.

When the socket is TCP, it is hard to guarantee the netns lifetime
without holding refcnt due to async timers.

Let's hold netns refcnt for each socket as done for SMC in commit
9744d2b ("smc: Fix use-after-free in tcp_write_timer_handler().").

Note that we need to move put_net() from cifs_put_tcp_session() to
clean_demultiplex_info(); otherwise, __sock_create() still could touch a
freed netns while cifsd tries to reconnect from cifs_demultiplex_thread().

Also, maybe_get_net() cannot be put just before __sock_create() because
the code is not under RCU and there is a small chance that the same
address happened to be reallocated to another netns.

[0]:
CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting...
CIFS: Serverclose failed 4 times, giving up
Unable to handle kernel paging request at virtual address 14de99e461f84a07
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
[14de99e461f84a07] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [svenkatr#1] SMP
Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs
CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 svenkatr#1
Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fib_rules_lookup+0x44/0x238
lr : __fib_lookup+0x64/0xbc
sp : ffff8000265db790
x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01
x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580
x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500
x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002
x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294
x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000
x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0
x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500
Call trace:
 fib_rules_lookup+0x44/0x238
 __fib_lookup+0x64/0xbc
 ip_route_output_key_hash_rcu+0x2c4/0x398
 ip_route_output_key_hash+0x60/0x8c
 tcp_v4_connect+0x290/0x488
 __inet_stream_connect+0x108/0x3d0
 inet_stream_connect+0x50/0x78
 kernel_connect+0x6c/0xac
 generic_ip_connect+0x10c/0x6c8 [cifs]
 __reconnect_target_unlocked+0xa0/0x214 [cifs]
 reconnect_dfs_server+0x144/0x460 [cifs]
 cifs_reconnect+0x88/0x148 [cifs]
 cifs_readv_from_socket+0x230/0x430 [cifs]
 cifs_read_from_socket+0x74/0xa8 [cifs]
 cifs_demultiplex_thread+0xf8/0x704 [cifs]
 kthread+0xd0/0xd4
Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264)

[1]:
CIFS_CRED="/root/cred.cifs"
CIFS_USER="Administrator"
CIFS_PASS="Password"
CIFS_IP="X.X.X.X"
CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST"
CIFS_MNT="/mnt/smb"
DEV="enp0s3"

cat <<EOF > ${CIFS_CRED}
username=${CIFS_USER}
password=${CIFS_PASS}
domain=EXAMPLE.COM
EOF

unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
"

umount ${CIFS_MNT}

[2]:
ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at
     sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227)
     inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252)
     __sock_create (net/socket.c:1576)
     generic_ip_connect (fs/smb/client/connect.c:3075)
     cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798)
     cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366)
     dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285)
     cifs_mount (fs/smb/client/connect.c:3622)
     cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949)
     smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794)
     vfs_get_tree (fs/super.c:1800)
     path_mount (fs/namespace.c:3508 fs/namespace.c:3834)
     __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034)
     do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
     entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Fixes: 26abe14 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Tom Talpey <[email protected]>
Signed-off-by: Steve French <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Unloading the ice driver while switchdev port representors are added to
a bridge can lead to kernel panic. Reproducer:

  modprobe ice

  devlink dev eswitch set $PF1_PCI mode switchdev

  ip link add $BR type bridge
  ip link set $BR up

  echo 2 > /sys/class/net/$PF1/device/sriov_numvfs
  sleep 2

  ip link set $PF1 master $BR
  ip link set $VF1_PR master $BR
  ip link set $VF2_PR master $BR
  ip link set $PF1 up
  ip link set $VF1_PR up
  ip link set $VF2_PR up
  ip link set $VF1 up

  rmmod irdma ice

When unloading the driver, ice_eswitch_detach() is eventually called as
part of VF freeing. First, it removes a port representor from xarray,
then unregister_netdev() is called (via repr->ops.rem()), finally
representor is deallocated. The problem comes from the bridge doing its
own deinit at the same time. unregister_netdev() triggers a notifier
chain, resulting in ice_eswitch_br_port_deinit() being called. It should
set repr->br_port = NULL, but this does not happen since repr has
already been removed from xarray and is not found. Regardless, it
finishes up deallocating br_port. At this point, repr is still not freed
and an fdb event can happen, in which ice_eswitch_br_fdb_event_work()
takes repr->br_port and tries to use it, which causes a panic (use after
free).

Note that this only happens with 2 or more port representors added to
the bridge, since with only one representor port, the bridge deinit is
slightly different (ice_eswitch_br_port_deinit() is called via
ice_eswitch_br_ports_flush(), not ice_eswitch_br_port_unlink()).

Trace:
  Oops: general protection fault, probably for non-canonical address 0xf129010fd1a93284: 0000 [svenkatr#1] PREEMPT SMP KASAN NOPTI
  KASAN: maybe wild-memory-access in range [0x8948287e8d499420-0x8948287e8d499427]
  (...)
  Workqueue: ice_bridge_wq ice_eswitch_br_fdb_event_work [ice]
  RIP: 0010:__rht_bucket_nested+0xb4/0x180
  (...)
  Call Trace:
   (...)
   ice_eswitch_br_fdb_find+0x3fa/0x550 [ice]
   ? __pfx_ice_eswitch_br_fdb_find+0x10/0x10 [ice]
   ice_eswitch_br_fdb_event_work+0x2de/0x1e60 [ice]
   ? __schedule+0xf60/0x5210
   ? mutex_lock+0x91/0xe0
   ? __pfx_ice_eswitch_br_fdb_event_work+0x10/0x10 [ice]
   ? ice_eswitch_br_update_work+0x1f4/0x310 [ice]
   (...)

A workaround is available: brctl setageing $BR 0, which stops the bridge
from adding fdb entries altogether.

Change the order of operations in ice_eswitch_detach(): move the call to
unregister_netdev() before removing repr from xarray. This way
repr->br_port will be correctly set to NULL in
ice_eswitch_br_port_deinit(), preventing a panic.

Fixes: fff292b ("ice: add VF representors one by one")
Reviewed-by: Michal Swiatkowski <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Signed-off-by: Marcin Szycik <[email protected]>
Tested-by: Sujai Buvaneswaran <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
The RTC update work involves runtime resuming the UFS controller. Hence,
only start the RTC update work after runtime power management in the UFS
driver has been fully initialized. This patch fixes the following kernel
crash:

Internal error: Oops: 0000000096000006 [svenkatr#1] PREEMPT SMP
Workqueue: events ufshcd_rtc_work
Call trace:
 _raw_spin_lock_irqsave+0x34/0x8c (P)
 pm_runtime_get_if_active+0x24/0x9c (L)
 pm_runtime_get_if_active+0x24/0x9c
 ufshcd_rtc_work+0x138/0x1b4
 process_one_work+0x148/0x288
 worker_thread+0x2cc/0x3d4
 kthread+0x110/0x114
 ret_from_fork+0x10/0x20

Reported-by: Neil Armstrong <[email protected]>
Closes: https://lore.kernel.org/linux-scsi/[email protected]/
Fixes: 6bf999e ("scsi: ufs: core: Add UFS RTC support")
Cc: Bean Huo <[email protected]>
Cc: [email protected]
Signed-off-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Peter Wang <[email protected]>
Reviewed-by: Bean Huo <[email protected]>
Tested-by: Neil Armstrong <[email protected]> # on SM8650-HDK
Signed-off-by: Martin K. Petersen <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
In unlikely event that we fail during sending the new VF GGTT
configuration to the GuC, we will free only the GGTT node data
struct but will miss to release the actual GGTT allocation.

This will later lead to list corruption, GGTT space leak and
finally risking crash when unloading the driver:

 [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-EIO)
 [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT

 [ ] list_add corruption. next->prev should be prev (ffff88813cfcd628), but was 0000000000000000. (next=ffff88813cfe2028).
 [ ] RIP: 0010:__list_add_valid_or_report+0x6b/0xb0
 [ ] Call Trace:
 [ ]  drm_mm_insert_node_in_range+0x2c0/0x4e0
 [ ]  xe_ggtt_node_insert+0x46/0x70 [xe]
 [ ]  pf_provision_vf_ggtt+0x7f5/0xa70 [xe]
 [ ]  xe_gt_sriov_pf_config_set_ggtt+0x5e/0x770 [xe]
 [ ]  ggtt_set+0x4b/0x70 [xe]
 [ ]  simple_attr_write_xsigned.constprop.0.isra.0+0xb0/0x110

 [ ] ... [drm] GT0: PF: Failed to provision VF1 with 1073741824 (1.00 GiB) GGTT (-ENOSPC)
 [ ] ... [drm] GT0: PF: VF1 provisioning remains at 0 (0 B) GGTT

 [ ] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [svenkatr#1] PREEMPT SMP NOPTI
 [ ] RIP: 0010:drm_mm_remove_node+0x1b7/0x390
 [ ] Call Trace:
 [ ]  <TASK>
 [ ]  ? die_addr+0x2e/0x80
 [ ]  ? exc_general_protection+0x1a1/0x3e0
 [ ]  ? asm_exc_general_protection+0x22/0x30
 [ ]  ? drm_mm_remove_node+0x1b7/0x390
 [ ]  ggtt_node_remove+0xa5/0xf0 [xe]
 [ ]  xe_ggtt_node_remove+0x35/0x70 [xe]
 [ ]  xe_ttm_bo_destroy+0x123/0x220 [xe]
 [ ]  intel_user_framebuffer_destroy+0x44/0x70 [xe]
 [ ]  intel_plane_destroy_state+0x3b/0xc0 [xe]
 [ ]  drm_atomic_state_default_clear+0x1cd/0x2f0
 [ ]  intel_atomic_state_clear+0x9/0x20 [xe]
 [ ]  __drm_atomic_state_free+0x1d/0xb0

Fix that by using pf_release_ggtt() on the error path, which now
works regardless if the node has GGTT allocation or not.

Fixes: 34e8042 ("drm/xe: Make xe_ggtt_node struct independent")
Signed-off-by: Michal Wajdeczko <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Matthew Auld <[email protected]>
Reviewed-by: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 43b1dd2)
Signed-off-by: Lucas De Marchi <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
vp_modern_avq_cleanup() and vp_del_vqs() clean up admin vq
resources by virtio_pci_vq_info pointer. The info pointer of admin
vq is stored in vp_dev->admin_vq.info instead of vp_dev->vqs[].
Using the info pointer from vp_dev->vqs[] for admin vq causes a
kernel NULL pointer dereference bug.
In vp_modern_avq_cleanup() and vp_del_vqs(), get the info pointer
from vp_dev->admin_vq.info for admin vq to clean up the resources.
Also make info ptr as argument of vp_del_vq() to be symmetric with
vp_setup_vq().

vp_reset calls vp_modern_avq_cleanup, and causes the Call Trace:
==================================================================
BUG: kernel NULL pointer dereference, address:0000000000000000
...
CPU: 49 UID: 0 PID: 4439 Comm: modprobe Not tainted 6.11.0-rc5 svenkatr#1
RIP: 0010:vp_reset+0x57/0x90 [virtio_pci]
Call Trace:
 <TASK>
...
 ? vp_reset+0x57/0x90 [virtio_pci]
 ? vp_reset+0x38/0x90 [virtio_pci]
 virtio_reset_device+0x1d/0x30
 remove_vq_common+0x1c/0x1a0 [virtio_net]
 virtnet_remove+0xa1/0xc0 [virtio_net]
 virtio_dev_remove+0x46/0xa0
...
 virtio_pci_driver_exit+0x14/0x810 [virtio_pci]
==================================================================

Fixes: 4c3b54a ("virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result")
Signed-off-by: Feng Liu <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Parav Pandit <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
In the error recovery path of mlx5_vdpa_dev_add(), the cleanup is
executed and at the end put_device() is called which ends up calling
mlx5_vdpa_free(). This function will execute the same cleanup all over
again. Most resources support being cleaned up twice, but the recent
mlx5_vdpa_destroy_mr_resources() doesn't.

This change drops the explicit cleanup from within the
mlx5_vdpa_dev_add() and lets mlx5_vdpa_free() do its work.

This issue was discovered while trying to add 2 vdpa devices with the
same name:
$> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.2
$> vdpa dev add name vdpa-0 mgmtdev auxiliary/mlx5_core.sf.3

... yields the following dump:

  BUG: kernel NULL pointer dereference, address: 00000000000000b8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [svenkatr#1] SMP
  CPU: 4 UID: 0 PID: 2811 Comm: vdpa Not tainted 6.12.0-rc6 svenkatr#1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:destroy_workqueue+0xe/0x2a0
  Code: ...
  RSP: 0018:ffff88814920b9a8 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff888105c10000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff888100400168 RDI: 0000000000000000
  RBP: 0000000000000000 R08: ffff888100120c00 R09: ffffffff828578c0
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff888131fd99a0 R14: 0000000000000000 R15: ffff888105c10580
  FS:  00007fdfa6b4f740(0000) GS:ffff88852ca00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b8 CR3: 000000018db09006 CR4: 0000000000372eb0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die+0x20/0x60
   ? page_fault_oops+0x150/0x3e0
   ? exc_page_fault+0x74/0x130
   ? asm_exc_page_fault+0x22/0x30
   ? destroy_workqueue+0xe/0x2a0
   mlx5_vdpa_destroy_mr_resources+0x2b/0x40 [mlx5_vdpa]
   mlx5_vdpa_free+0x45/0x150 [mlx5_vdpa]
   vdpa_release_dev+0x1e/0x50 [vdpa]
   device_release+0x31/0x90
   kobject_put+0x8d/0x230
   mlx5_vdpa_dev_add+0x328/0x8b0 [mlx5_vdpa]
   vdpa_nl_cmd_dev_add_set_doit+0x2b8/0x4c0 [vdpa]
   genl_family_rcv_msg_doit+0xd0/0x120
   genl_rcv_msg+0x180/0x2b0
   ? __vdpa_alloc_device+0x1b0/0x1b0 [vdpa]
   ? genl_family_rcv_msg_dumpit+0xf0/0xf0
   netlink_rcv_skb+0x54/0x100
   genl_rcv+0x24/0x40
   netlink_unicast+0x1fc/0x2d0
   netlink_sendmsg+0x1e4/0x410
   __sock_sendmsg+0x38/0x60
   ? sockfd_lookup_light+0x12/0x60
   __sys_sendto+0x105/0x160
   ? __count_memcg_events+0x53/0xe0
   ? handle_mm_fault+0x100/0x220
   ? do_user_addr_fault+0x40d/0x620
   __x64_sys_sendto+0x20/0x30
   do_syscall_64+0x4c/0x100
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7fdfa6c66b57
  Code: ...
  RSP: 002b:00007ffeace22998 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
  RAX: ffffffffffffffda RBX: 000055a498608350 RCX: 00007fdfa6c66b57
  RDX: 000000000000006c RSI: 000055a498608350 RDI: 0000000000000003
  RBP: 00007ffeace229c0 R08: 00007fdfa6d35200 R09: 000000000000000c
  R10: 0000000000000000 R11: 0000000000000202 R12: 000055a4986082a0
  R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeace233f3
   </TASK>
  Modules linked in: ...
  CR2: 00000000000000b8

Fixes: 6211165 ("vdpa/mlx5: Postpone MR deletion")
Signed-off-by: Dragos Tatulea <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Acked-by: Eugenio Pérez <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Eric reported a division by zero splat in the MPTCP protocol:

Oops: divide error: 0000 [svenkatr#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted
6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163
Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8
0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c
24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89
RSP: 0018:ffffc900041f7930 EFLAGS: 00010293
RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b
RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004
RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67
R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80
R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000
FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493
mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline]
mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289
inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885
sock_recvmsg_nosec net/socket.c:1051 [inline]
sock_recvmsg+0x1b2/0x250 net/socket.c:1073
__sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265
__do_sys_recvfrom net/socket.c:2283 [inline]
__se_sys_recvfrom net/socket.c:2279 [inline]
__x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7feb5d857559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559
RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000
R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c
R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef

and provided a nice reproducer.

The root cause is the current bad handling of racing disconnect.
After the blamed commit below, sk_wait_data() can return (with
error) with the underlying socket disconnected and a zero rcv_mss.

Catch the error and return without performing any additional
operations on the current socket.

Reported-by: Eric Dumazet <[email protected]>
Fixes: 419ce13 ("tcp: allow again tcp_disconnect() when threads are waiting")
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:

> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [svenkatr#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  __list_del_entry_valid include/linux/list.h:124 [inline]
>  __list_del_entry include/linux/list.h:215 [inline]
>  list_move_tail include/linux/list.h:310 [inline]
>  swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
>  swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile
is on a rotational device.  In this case, clusters aren't actually used,
and si->full_clusters is not initialized.  Daan's report is from qemu, so
likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are
actually in use.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: systemd/systemd#35044
Fixes: 5168a68 ("mm, swap: avoid over reclaim of full clusters")
Reported-by: [email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Daan De Meyer <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
The kdump kernel is broken on SME systems with CONFIG_IMA_KEXEC=y enabled.
Debugging traced the issue back to

  b69a2af ("x86/kexec: Carry forward IMA measurement log on kexec").

Testing was previously not conducted on SME systems with CONFIG_IMA_KEXEC
enabled, which led to the oversight, with the following incarnation:

...
  ima: No TPM chip found, activating TPM-bypass!
  Loading compiled-in module X.509 certificates
  Loaded X.509 cert 'Build time autogenerated kernel key: 18ae0bc7e79b64700122bb1d6a904b070fef2656'
  ima: Allocated hash algorithm: sha256
  Oops: general protection fault, probably for non-canonical address 0xcfacfdfe6660003e: 0000 [svenkatr#1] PREEMPT SMP NOPTI
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc2+ #14
  Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.20.0 05/03/2023
  RIP: 0010:ima_restore_measurement_list
  Call Trace:
   <TASK>
   ? show_trace_log_lvl
   ? show_trace_log_lvl
   ? ima_load_kexec_buffer
   ? __die_body.cold
   ? die_addr
   ? exc_general_protection
   ? asm_exc_general_protection
   ? ima_restore_measurement_list
   ? vprintk_emit
   ? ima_load_kexec_buffer
   ima_load_kexec_buffer
   ima_init
   ? __pfx_init_ima
   init_ima
   ? __pfx_init_ima
   do_one_initcall
   do_initcalls
   ? __pfx_kernel_init
   kernel_init_freeable
   kernel_init
   ret_from_fork
   ? __pfx_kernel_init
   ret_from_fork_asm
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---
  ...
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled
  Rebooting in 10 seconds..

Adding debug printks showed that the stored addr and size of ima_kexec buffer
are not decrypted correctly like:

  ima: ima_load_kexec_buffer, buffer:0xcfacfdfe6660003e, size:0xe48066052d5df359

Three types of setup_data info

  — SETUP_EFI,
  - SETUP_IMA, and
  - SETUP_RNG_SEED

are passed to the kexec/kdump kernel. Only the ima_kexec buffer
experienced incorrect decryption. Debugging identified a bug in
early_memremap_is_setup_data(), where an incorrect range calculation
occurred due to the len variable in struct setup_data ended up only
representing the length of the data field, excluding the struct's size,
and thus leading to miscalculation.

Address a similar issue in memremap_is_setup_data() while at it.

  [ bp: Heavily massage. ]

Fixes: b3c72fc ("x86/boot: Introduce setup_indirect")
Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: Tom Lendacky <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
mhiramat pushed a commit to mhiramat/linux that referenced this issue Nov 25, 2024
Accessing `mr_table->mfc_cache_list` is protected by an RCU lock. In the
following code flow, the RCU read lock is not held, causing the
following error when `RCU_PROVE` is not held. The same problem might
show up in the IPv6 code path.

	6.12.0-rc5-kbuilder-01145-gbac17284bdcb #33 Tainted: G            E    N
	-----------------------------
	net/ipv4/ipmr_base.c:313 RCU-list traversed in non-reader section!!

	rcu_scheduler_active = 2, debug_locks = 1
		   2 locks held by RetransmitAggre/3519:
		    #0: ffff88816188c6c0 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x8a/0x290
		    svenkatr#1: ffffffff83fcf7a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_dumpit+0x6b/0x90

	stack backtrace:
		    lockdep_rcu_suspicious
		    mr_table_dump
		    ipmr_rtm_dumproute
		    rtnl_dump_all
		    rtnl_dumpit
		    netlink_dump
		    __netlink_dump_start
		    rtnetlink_rcv_msg
		    netlink_rcv_skb
		    netlink_unicast
		    netlink_sendmsg

This is not a problem per see, since the RTNL lock is held here, so, it
is safe to iterate in the list without the RCU read lock, as suggested
by Eric.

To alleviate the concern, modify the code to use
list_for_each_entry_rcu() with the RTNL-held argument.

The annotation will raise an error only if RTNL or RCU read lock are
missing during iteration, signaling a legitimate problem, otherwise it
will avoid this false positive.

This will solve the IPv6 case as well, since ip6mr_rtm_dumproute() calls
this function as well.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant