BeagleBone DMTimer2 unexpected stop after one or more days #203
Comments
@mgkiller7 did you find a resolution? I see the last post is:
|
@mgkiller7 Please re-open if still an issue. You may also be interested in the Debian images and kernel builds that we are currently testing for the next release: |
…before setting skb ownership

commit e940e08 upstream.

There are two ref count variables controlling the free()ing of a socket:
- struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
- struct sock::sk_wmem_alloc - which accounts the memory allocated by the skbs in the send path.

In case there are still TX skbs on the fly and the socket() is closed, the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack clones an "echo" skb, calls sock_hold() on the original socket and references it. This produces the following back trace:

| WARNING: CPU: 0 PID: 280 at lib/refcount.c:25 refcount_warn_saturate+0x114/0x134
| refcount_t: addition on 0; use-after-free.
| Modules linked in: coda_vpu(E) v4l2_jpeg(E) videobuf2_vmalloc(E) imx_vdoa(E)
| CPU: 0 PID: 280 Comm: test_can.sh Tainted: G E 5.11.0-04577-gf8ff6603c617 #203
| Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
| Backtrace:
| [<80bafea4>] (dump_backtrace) from [<80bb0280>] (show_stack+0x20/0x24) r7:00000000 r6:600f0113 r5:00000000 r4:81441220
| [<80bb0260>] (show_stack) from [<80bb593c>] (dump_stack+0xa0/0xc8)
| [<80bb589c>] (dump_stack) from [<8012b268>] (__warn+0xd4/0x114) r9:00000019 r8:80f4a8c2 r7:83e4150c r6:00000000 r5:00000009 r4:80528f90
| [<8012b194>] (__warn) from [<80bb09c4>] (warn_slowpath_fmt+0x88/0xc8) r9:83f26400 r8:80f4a8d1 r7:00000009 r6:80528f90 r5:00000019 r4:80f4a8c2
| [<80bb0940>] (warn_slowpath_fmt) from [<80528f90>] (refcount_warn_saturate+0x114/0x134) r8:00000000 r7:00000000 r6:82b44000 r5:834e5600 r4:83f4d540
| [<80528e7c>] (refcount_warn_saturate) from [<8079a4c8>] (__refcount_add.constprop.0+0x4c/0x50)
| [<8079a47c>] (__refcount_add.constprop.0) from [<8079a57c>] (can_put_echo_skb+0xb0/0x13c)
| [<8079a4cc>] (can_put_echo_skb) from [<8079ba98>] (flexcan_start_xmit+0x1c4/0x230) r9:00000010 r8:83f48610 r7:0fdc0000 r6:0c080000 r5:82b44000 r4:834e5600
| [<8079b8d4>] (flexcan_start_xmit) from [<80969078>] (netdev_start_xmit+0x44/0x70) r9:814c0ba0 r8:80c8790c r7:00000000 r6:834e5600 r5:82b44000 r4:82ab1f00
| [<80969034>] (netdev_start_xmit) from [<809725a4>] (dev_hard_start_xmit+0x19c/0x318) r9:814c0ba0 r8:00000000 r7:82ab1f00 r6:82b44000 r5:00000000 r4:834e5600
| [<80972408>] (dev_hard_start_xmit) from [<809c6584>] (sch_direct_xmit+0xcc/0x264) r10:834e5600 r9:00000000 r8:00000000 r7:82b44000 r6:82ab1f00 r5:834e5600 r4:83f27400
| [<809c64b8>] (sch_direct_xmit) from [<809c6c0c>] (__qdisc_run+0x4f0/0x534)

To fix this problem, only set skb ownership to sockets which have still a ref count > 0.

Fixes: 0ae89be ("can: add destructor for self generated skbs")
Cc: Oliver Hartkopp <[email protected]>
Cc: Andre Naujoks <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Oliver Hartkopp <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Would love to know if you found a solution to this problem, we are seeing a very similar problem |
@wiltshiretom Please run
which will show the uboot and linux versions and what device tree overlays are present. |
Apologies, my post lacked some detail! I'm seeing an identical issue but we are using a custom board (not beagle) also using the AM335x part. I found this thread and was wondering if you had identified a workaround. Sorry to resurrect something that is already closed on your platform but I am looking for inspiration. Also identical symptoms here: https://e2e.ti.com/support/processors/f/processors-forum/237808/am335x-system-time-looping |
Please check the PMU voltage output for the AM335x on your custom board. On my board the issue turned out to be undervoltage on the CPU power supply.
Hope this helps.
----- Original Message -----
From: wiltshiretom ***@***.***>
To: beagleboard/linux ***@***.***>
Cc: mgkiller7 ***@***.***>, Mention ***@***.***>
Subject: Re: [beagleboard/linux] BeagleBone DMTimer2 unexpected stop after one or more days (#203)
Date: 2021-04-02 04:07
Would love to know if you found a solution to this problem, we are seeing a very similar problem
|
The set channel operation "ethtool -L tx <n>" broke with the recent suspend/resume changes.

Revert back to original driver behaviour of not freeing the TX/RX IRQs at am65_cpsw_nuss_common_stop(). We will now free them only on .suspend() as we need to release the DMA channels (as DMA looses context) and re-acquiring them on .resume() may not necessarily give us the same IRQs.

Introduce am65_cpsw_nuss_remove_rx_chns() which is similar to am65_cpsw_nuss_remove_tx_chns() and invoke them both in .suspend().

At .resume() call am65_cpsw_nuss_init_rx/tx_chns() to acquire the DMA channels.

To as IRQs need to be requested after knowing the IRQ numbers, move am65_cpsw_nuss_ndev_add_tx_napi() call to am65_cpsw_nuss_init_tx_chns().

Also fixes the below warning during suspend/resume on multi CPU system.

[ 67.347684] ------------[ cut here ]------------
[ 67.347700] Unbalanced enable for IRQ 119
[ 67.347726] WARNING: CPU: 0 PID: 1080 at kernel/irq/manage.c:781 __enable_irq+0x4c/0x80
[ 67.347754] Modules linked in: wlcore_sdio wl18xx wlcore mac80211 libarc4 cfg80211 rfkill crct10dif_ce sch_fq_codel ipv6
[ 67.347803] CPU: 0 PID: 1080 Comm: rtcwake Not tainted 6.1.0-rc4-00023-gc826e5480732-dirty #203
[ 67.347812] Hardware name: Texas Instruments AM625 (DT)
[ 67.347818] pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 67.347829] pc : __enable_irq+0x4c/0x80
[ 67.347838] lr : __enable_irq+0x4c/0x80
[ 67.347846] sp : ffff80000999ba00
[ 67.347850] x29: ffff80000999ba00 x28: ffff0000011c1c80 x27: 0000000000000000
[ 67.347863] x26: 00000000000001f4 x25: ffff000001058358 x24: ffff000001059080
[ 67.347876] x23: ffff000001058080 x22: ffff000001060000 x21: 0000000000000077
[ 67.347888] x20: ffff0000011c1c80 x19: ffff000001429600 x18: 0000000000000001
[ 67.347900] x17: 0000000000000080 x16: fffffc000176e008 x15: ffff0000011c21b0
[ 67.347913] x14: 0000000000000000 x13: 3931312051524920 x12: 726f6620656c6261
[ 67.347925] x11: 656820747563205b x10: 000000000000000a x9 : ffff80000999ba00
[ 67.347938] x8 : ffff800009121068 x7 : ffff80000999b810 x6 : 00000000fffff17f
[ 67.347950] x5 : ffff00007fb99b18 x4 : 0000000000000000 x3 : 0000000000000027
[ 67.347962] x2 : ffff00007fb99b20 x1 : 50dd48f7f19deb00 x0 : 0000000000000000
[ 67.347975] Call trace:
[ 67.347980] __enable_irq+0x4c/0x80
[ 67.347989] enable_irq+0x4c/0xa0
[ 67.347999] am65_cpsw_nuss_ndo_slave_open+0x4b0/0x568
[ 67.348015] am65_cpsw_nuss_resume+0x68/0x160
[ 67.348025] dpm_run_callback.isra.0+0x28/0x88
[ 67.348040] device_resume+0x78/0x160
[ 67.348050] dpm_resume+0xc0/0x1f8
[ 67.348057] dpm_resume_end+0x18/0x30
[ 67.348063] suspend_devices_and_enter+0x1cc/0x4e0
[ 67.348075] pm_suspend+0x1f8/0x268
[ 67.348084] state_store+0x8c/0x118
[ 67.348092] kobj_attr_store+0x18/0x30
[ 67.348104] sysfs_kf_write+0x44/0x58
[ 67.348117] kernfs_fop_write_iter+0x118/0x1a8
[ 67.348127] vfs_write+0x31c/0x418
[ 67.348140] ksys_write+0x6c/0xf8
[ 67.348150] __arm64_sys_write+0x1c/0x28
[ 67.348160] invoke_syscall+0x44/0x108
[ 67.348172] el0_svc_common.constprop.0+0x44/0xf0
[ 67.348182] do_el0_svc+0x2c/0xc8
[ 67.348191] el0_svc+0x2c/0x88
[ 67.348201] el0t_64_sync_handler+0xb8/0xc0
[ 67.348209] el0t_64_sync+0x18c/0x190
[ 67.348218] ---[ end trace 0000000000000000 ]---

Fixes: cbdde66 ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
Signed-off-by: Roger Quadros <[email protected]>
Signed-off-by: Vignesh Raghavendra <[email protected]>
We are seeing DMTimer2 stop unexpectedly on the AM335x after running for one or more days. Once it happens, the gp_timer count in /proc/interrupts never increases again.
We have tried the BeagleBoard GitHub kernel versions 4.4.113 and 4.4.155 with our own rootfs, on both a BeagleBone Black and our custom board. The behavior is the same on both, even though I have not made any changes to the kernel source.
This timer is initialized as the clockevent device in omap2_gp_clockevent_init(clkev_nr, clkev_src, clkev_prop) in arch/arm/mach-omap2/timer.c.
The related call stack is below:
omap3_gptimer_timer_init(void) =>
__omap_sync32k_timer_init(2, "timer_sys_ck", NULL, 1, "timer_sys_ck", "ti,timer-alwon", true) =>
omap2_gp_clockevent_init(clkev_nr, clkev_src, clkev_prop);
After DMTimer2 stops unexpectedly, the following happens:
1. gp_timer in /proc/interrupts never increases.
2. The time reported by the date command may jump back by some seconds or minutes.
3. User applications no longer print debug logs to the console; it seems the kernel scheduler is not working correctly.
4. The CPU load of every thread shown by top is 0%.
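Symptom 1 can be checked automatically by sampling /proc/interrupts twice and comparing the gp_timer count. The following is only a minimal sketch in Python: the helper names `irq_count` and `timer_stalled` are hypothetical, and the sample lines in the comments are illustrative, not captured from a real board.

```python
def irq_count(proc_interrupts_text, name):
    """Sum the per-CPU counts on the first line that mentions `name`.

    Returns None if no such line exists. Assumes the usual layout:
    "<irq>: <count per CPU> ... <controller> <hwirq> <type> <name>".
    """
    for line in proc_interrupts_text.splitlines():
        if name not in line:
            continue
        _, _, rest = line.partition(":")
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # first non-numeric field ends the per-CPU counts
            counts.append(int(field))
        return sum(counts) if counts else None
    return None


def timer_stalled(before_text, after_text, name="gp_timer"):
    """True if the interrupt count for `name` did not advance between samples."""
    before = irq_count(before_text, name)
    after = irq_count(after_text, name)
    if before is None or after is None:
        raise ValueError(f"no interrupt line found for {name!r}")
    return after <= before


# Illustrative (made-up) samples taken a few seconds apart:
# first  = " 16:   1000   INTC  68 Level   gp_timer\n"
# second = " 16:   1000   INTC  68 Level   gp_timer\n"
# timer_stalled(first, second) would report a stall for these two samples.
```

On a target, the two texts would come from reading /proc/interrupts a few seconds apart. Note that if the scheduler itself has stopped, a user-space monitor like this may never get to run, so it is more useful from a process that is still responsive (for example an interactive shell).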
By the way, after the problem occurs I checked that the ST bit of DMTimer2's TCLR register is still 1 (that is, the timer is started).
However, if I stop DMTimer2 manually from the console shell with the command: devmem 0x48040038 32 0x0
then I can reproduce symptoms 1, 2 and 3 above, except that the shell hangs when I run top.
So I think DMTimer2 on my AM335x stops working correctly after running for one or more days.
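To make the TCLR check repeatable, the value read back with devmem can be decoded in a small script. This is a sketch under stated assumptions: the split of 0x48040038 into a base address and a 0x38 offset is inferred from the address in the report, ST is taken as bit 0, and AR (auto-reload) as bit 1 from my reading of the AM335x TRM; verify both against the TRM before relying on them. The function names are hypothetical.

```python
# Assumed split of the address used in the report (0x48040038).
DMTIMER2_BASE = 0x48040000
TCLR_OFFSET = 0x38

# Assumed bit positions; check against the AM335x TRM.
TCLR_ST = 1 << 0  # 1 = timer started, 0 = timer stopped
TCLR_AR = 1 << 1  # 1 = auto-reload mode, 0 = one-shot mode


def parse_devmem_output(text):
    """Convert a devmem read result such as '0x00000003' to an int."""
    return int(text.strip(), 16)


def decode_tclr(value):
    """Return the ST and AR fields of a TCLR value as booleans."""
    return {
        "started": bool(value & TCLR_ST),
        "auto_reload": bool(value & TCLR_AR),
    }


if __name__ == "__main__":
    # Example value only; on a board it would come from
    # `devmem 0x48040038 32` run as root.
    print(decode_tclr(parse_devmem_output("0x00000003")))
```

A TCLR value with ST still set while gp_timer interrupts have stopped, as described above, suggests the timer was not simply disabled by software, which fits the later finding in this thread that the supply voltage was at fault.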
We also tried commenting out __omap_dm_timer_override_errata() in omap2_gp_clockevent_init(), which forces OMAP_TIMER_ERRATA_I103_I767 to be enabled, but then the kernel does not boot at all.
We also posted this problem to the TI community at https://e2e.ti.com/support/processors/f/791/t/796508