Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic + stack overflow in i2c_driver on c71 #2004

Open
mkeeter opened this issue Feb 6, 2025 · 1 comment
Open

Panic + stack overflow in i2c_driver on c71 #2004

mkeeter opened this issue Feb 6, 2025 · 1 comment

Comments

@mkeeter
Copy link
Collaborator

mkeeter commented Feb 6, 2025

On cadbury:c71, @rmustacc noticed Gimlet Issues when he connected a particular driver (??).

The thermal, gimlet_seq and i2c_driver generation counts are spinning up:

matt@cadbury ~ () $ export HUMILITY_TARGET=c71
matt@cadbury ~ (c71) $ h tasks
humility: attached to 0483:3754:001100184741500820383733 via ST-Link V3
system time = 563017
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+87)
 1 net                          0   5 recv, notif: eth-irq(irq61) wake-timer(T+429)
 2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 3 spi2_driver                  0   3 recv
 4 i2c_driver                4174   3 FAULT: stack overflow; sp=0x24058fa8 (was: ready)
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                   2086   5 wait: send to i2c_driver/gen14
 8 power                        1   6 wait: send to gimlet_seq/gen20
 9 hiffy                        0   5 notif: bit31(T+128)
10 gimlet_seq                1044   4 wait: send to i2c_driver/gen14
11 gimlet_inspector             0   6 notif: socket
12 hash_driver                  0   2 recv
13 hf                           0   3 recv
14 update_server                0   3 recv
15 sensor                       0   4 recv
16 host_sp_comms                0   8 recv, notif: jefe-state-change usart-irq(irq82) multitimer(T+37) control-plane-agent
17 udpecho                      0   6 notif: socket
18 udpbroadcast                 0   6 notif: bit31(T+358)
19 control_plane_agent          0   7 recv, notif: usart-irq(irq37) socket timer
20 sprot                        0   4 recv
21 validate                     0   5 recv
22 vpd                          0   4 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 recv, notif: socket
25 sbrmi                        0   4 recv
26 idle                         0   9 RUNNING
27 udprpc                       0   6 notif: socket

The actual crash is a stack overflow in i2c_driver, which is seemingly not found by our static detector (#1890). In the dump, it seems like it's an overflow while panicking, but I can't make sense of the full trace:

matt@cadbury ~ (c71) $ h -d hubris.core.i2c_driver.0 tasks -sl --guess
humility: attached to dump
system time = 117420
ID TASK                       GEN PRI STATE
 4 i2c_driver                   0   3 FAULT: stack overflow; sp=0x24058fa8 (was: ready)
   could not read registers: read of 32 bytes from invalid address: 0x24058fa8
   guessing at stack trace using saved frame pointer
   |
   +--->  0x24059028 0x080493d0 core::fmt::Formatter::pad_integral
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:1340
          0x24059028 0x080492fa core::fmt::num::imp::fmt_u32
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/num.rs:213
          0x24059028 0x08049444 core::fmt::num::imp::<impl core::fmt::Display for u32>::fmt
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/num.rs:313
          0x24059088 0x080493d0 core::fmt::Formatter::pad_integral
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:1340
          0x24059088 0x080492fa core::fmt::num::imp::fmt_u32
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/num.rs:213
          0x24059088 0x08049444 core::fmt::num::imp::<impl core::fmt::Display for u32>::fmt
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/num.rs:313
          0x240590e0 0x08049d06 core::fmt::write
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:1168
          0x24059128 0x0804a61a <core::result::Result<T,E> as core::ops::try_trait::Try>::branch
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/result.rs:1985
          0x24059128 0x0804a5dc <core::panic::panic_info::PanicInfo as core::fmt::Display>::fmt
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/panic/panic_info.rs:145
          0x24059128 0x0804a61a <&T as core::fmt::Display>::fmt
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:2390
          0x24059180 0x08049d06 core::fmt::write
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:1168
          0x240591b8 0x0804a8bc rust_begin_unwind
                     @ /hubris/sys/userlib/src/lib.rs:1382
          0x240591b8 0x0804954a core::fmt::Formatter::pad_integral::write_prefix
                     @ /rustc/c52c23b6f44cd19718721a5e3b2eeb169e9c96ff/library/core/src/fmt/mod.rs:1361

(note that this is using --guess to use the saved frame pointer, so we may be able to stare at the memory by hand)

The I2C ringbuf shows a call to stm32xx_i2c::panic, which logged the whole state and then panicked:

 NDX LINE      GEN    COUNT PAYLOAD
  30  671     7016        1 WriteWait(ISR, 0x8020)
  31  671     7016        2 WriteWait(ISR, 0x8021)
  32  671     7016        1 WriteWait(ISR, 0x8061)
  33  736     7016        2 Read(ISR, 0x8021)
  34  736     7016        1 Read(ISR, 0x8025)
  35  779     7016        1 ReadWait(ISR, 0x8021)
  36  779     7016        1 ReadWait(ISR, 0x8061)
  37  555     7016        1 Wait(ISR, 0x21)
  38  641     7016        1 Write(ISR, 0x21)
  39  641     7016        2 Write(ISR, 0x8021)
  40  641     7016        1 Write(ISR, 0x8023)
  41  671     7016        1 WriteWait(ISR, 0x8020)
  42  671     7016        2 WriteWait(ISR, 0x8021)
  43  671     7016        1 WriteWait(ISR, 0x8061)
  44  736     7016        2 Read(ISR, 0x8021)
  45  736     7016        1 Read(ISR, 0x8025)
  46  779     7016        1 ReadWait(ISR, 0x8021)
  47  779     7016        1 ReadWait(ISR, 0x8061)
   0  555     7017        1 Wait(ISR, 0x21)
   1  641     7017        1 Write(ISR, 0x21)
   2  641     7017        2 Write(ISR, 0x8021)
   3  641     7017        1 Write(ISR, 0x8023)
   4  671     7017        1 WriteWait(ISR, 0x8020)
   5  671     7017        2 WriteWait(ISR, 0x8021)
   6  671     7017        1 WriteWait(ISR, 0x8061)
   7  736     7017        2 Read(ISR, 0x8021)
   8  736     7017        1 Read(ISR, 0x8025)
   9  779     7017        1 ReadWait(ISR, 0x8021)
  10  779     7017        1 ReadWait(ISR, 0x8061)
  11  555     7017        1 Wait(ISR, 0x21)
  12  641     7017        1 Write(ISR, 0x21)
  13  641     7017        2 Write(ISR, 0x8021)
  14  641     7017        1 Write(ISR, 0x8121)
  15  431     7017        1 Reset(CR1, 0x1000d6)
  16  440     7017        1 Reset(CR2, 0x100e2)
  17  555     7017        3 Wait(ISR, 0x8001)
  18  555     7017        1 Wait(ISR, 0x1)
  19  641     7017        3 Write(ISR, 0x8001)
  20  530     7017        1 LostInterrupt
  21  481     7017        1 Panic(CR1, 0x1000d7)
  22  482     7017        1 Panic(CR2, 0x120e0)
  23  483     7017        1 Panic(OAR1, 0x0)
  24  484     7017        1 Panic(OAR2, 0x0)
  25  485     7017        1 Panic(TIMINGR, 0x3060767f)
  26  486     7017        1 Panic(TIMEOUTR, 0x84c4)
  27  487     7017        1 Panic(ISR, 0x8001)
  28  488     7017        1 Panic(PECR, 0x0)
  29  491     7017        1 IrqStatus { notification: 0x1, pending: false, enabled: true, posted: false }

Cores are in /staff/cores/hubris-2004/

@mkeeter
Copy link
Collaborator Author

mkeeter commented Feb 6, 2025

The stack overflow can't be detected statically because wfi is passed in as a lambda function. We should consider reworking this (e.g. with a trait), but the initial panic is still a problem.

Robert pointed out that we have a handful of unresolved I2C glitches already filed: #1821, #1822, #1823, #1824. It's possible that one of these glitches is interacting badly with the new SSD controller IC.

mkeeter added a commit that referenced this issue Feb 11, 2025
In some cases, the deepest stack involves a tail call. For example,
consider `write_fmt`, which has no stack of its own and branches
directly to `core::fmt::write` (instead of using `bl`):

```asm
0804a704 <core::fmt::Write::write_fmt>:
 804a704:   460a        mov r2, r1
 804a706:   6849        ldr r1, [r1, #4]
 804a708:   2901        cmp r1, #1
 804a70a:   bf18        it  ne
 804a70c:   2900        cmpne   r1, #0
 804a70e:   4901        ldr r1, [pc, #4]    ; (804a714 <core::fmt::Write::write_fmt+0x10>)
 804a710:   f7ff ba68   b.w 8049be4 <core::fmt::write>
 804a714:   0804b024    .word   0x0804b024
```

This PR adds basic tail call support to the disassembler. The main
effect is adding 80 bytes to panic stack depth.

Before:
```
udprpc: 2464 bytes (limit is 4096)
     [+8] _start
  [+2264] main
    [+16] core::array::<impl core::ops::index::IndexMut<I> for [T; N]>::index_mut
    [+16] <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index_mut
    [+56] core::slice::index::slice_end_index_len_fail
    [+40] core::panicking::panic_fmt
    [+56] rust_begin_unwind
     [+8] userlib::sys_panic_stub
```

After:
```
udprpc: 2544 bytes (limit is 4096)
     [+8] _start
  [+2264] main
    [+16] core::array::<impl core::ops::index::IndexMut<I> for [T; N]>::index_mut
    [+16] <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index_mut
    [+56] core::slice::index::slice_end_index_len_fail
    [+40] core::panicking::panic_fmt
    [+56] rust_begin_unwind
     [+0] core::fmt::Write::write_fmt
    [+88] core::fmt::write
```

Unfortunately, this _does not_ detect the stack overflow in #2004, which
happens during dynamic dispatch. Still, it's a step in the right
direction!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant