Hang in malloc with Pico 2 after upgrade from SDK 2.0 to SDK 2.1 #2198

Open
ikjordan opened this issue Jan 19, 2025 · 9 comments

Comments

@ikjordan

I have an application that uses scanvideo on core 1 and processing logic on core 0. After updating to SDK 2.1 I see a likely race condition which sometimes results in core 1 hanging at the first malloc. Core 1 is blocked at mutex_enter_blocking(&malloc_mutex).

I have only seen this issue when using a Pico 2. This condition only manifests itself on the Pico 2 after a cold boot, with a release build. It does not occur when connected to a debugger, or if a reboot is triggered via the watchdog. The hanging line in malloc.c was identified by changing the pico LED state before and after the call.
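The LED bracketing described above can be sketched roughly as follows. This is a host-runnable illustration, not the code used in the application: `led_set_fn`, `host_led_set` and `traced_malloc` are hypothetical names, and on a Pico the hook would typically wrap a GPIO write to the on-board LED instead of updating a variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook type: on a Pico this would wrap a GPIO write to the LED. */
typedef void (*led_set_fn)(bool on);

/* Host-side stand-in that records LED activity instead of driving a pin. */
static bool host_led_on = false;
static int host_led_writes = 0;

static void host_led_set(bool on) {
    host_led_on = on;
    host_led_writes++;
}

/* Light the LED for the duration of the allocation.
 * If the LED stays lit, execution never returned from malloc(). */
static void *traced_malloc(size_t size, led_set_fn led_set) {
    assert(led_set != NULL);
    led_set(true);            /* marker: about to call malloc */
    void *p = malloc(size);
    led_set(false);           /* marker: malloc returned */
    return p;
}
```

Because the marker needs no debugger, it still works in the cold-boot case where attaching a probe changes the behaviour.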

The issue has been reproduced using the master and develop SDK 2.1 branches. However, it does not occur if I move back to the version of the TinyUSB library that was in SDK 2.0. I therefore suspect that TinyUSB 0.17.0 may be triggering this. As further evidence of this, I can reproduce the issue with SDK 2.0 if I use it with the TinyUSB submodule version that is in SDK 2.1. It does not occur if I use SDK 2.0 with the SDK 2.0 version of TinyUSB.

I am at a loss as to how to trace this further. For now my workaround is to use SDK 2.1 with the TinyUSB submodule from SDK 2.0. I've raised the issue in case others experience it too.

@ikjordan ikjordan changed the title Hang in Malloc with Pico 2 after upgrade for SDK 2.0 to SDK 2.1 Hang in Malloc with Pico 2 after upgrade from SDK 2.0 to SDK 2.1 Jan 19, 2025
@ikjordan ikjordan changed the title Hang in Malloc with Pico 2 after upgrade from SDK 2.0 to SDK 2.1 Hang in malloc with Pico 2 after upgrade from SDK 2.0 to SDK 2.1 Jan 19, 2025
@peterharperuk
Contributor

Do you have a stack trace of both cores?

@ikjordan
Author

Unfortunately this issue only occurs when the Pico 2 starts from power on. It does not occur if I launch the executable from a debugger or trigger a restart via the hardware watchdog. Therefore I cannot attach a Debug Probe prior to the crash occurring and retrieve the call stack.

Is there a way to attach a Debug Probe post crash to obtain the call stacks?

@peterharperuk
Contributor

peterharperuk commented Jan 20, 2025

I would open gdb

gdb-multiarch <name of elf file>

Then tell gdb how to reach openocd, e.g.

target remote 127.0.0.1:3333

It might be worth testing with the very tip of the develop branch (from today) just to rule that out.

@ikjordan
Author

The Pico 2 is in a Pimoroni VGA demo board. I trigger the cold reboot by pressing the Run button on the Pimoroni board.

I tried the tip of develop with the fixes submitted today. The program locked on each of the 10 cold boots I tried.

Annoyingly, it started correctly every time I downloaded the executable via the debugger. It also restarted correctly every time I triggered a restart from the debug menu in VS Code.

I then reverted tinyusb to '4232642899362fa5e9cf0dc59bad6f1f6d32c563' (the version in 2.0.0), keeping pico-sdk on the tip of develop. Cold start succeeded on each of the 10 attempts I made.

I did attempt to reconnect to gdb as described above. When entering target remote 127.0.0.1:3333 from inside gdb, I got "127.0.0.1:3333: Connection timed out".

I'll work more on trying to reattach gdb tomorrow. If there is any guide you could recommend to help with that please let me know :-)

@ikjordan
Author

Some more positive news.
I moved the tinyusb submodule forward to tag 0.18.0 and rebuilt. The application runs correctly. If I then move to tag 0.17.0 and rebuild it hangs.

I will attempt to bisect the tinyusb submissions between 0.17.0 and 0.18.0 to see if I can find the submission that "fixes" the issue I see with 0.17.0. That may take some time, as there are many submissions.

I double-checked with an RP2040 today. This issue only occurs with an RP2350.

@ikjordan
Author

After bisecting the submits into TinyUSB release 0.18.0, the submit after which my program starts correctly is:
hathach/tinyusb@be25aa3

The earlier code submit after which the program hangs is:
hathach/tinyusb@dd1822b

I suppose this issue will be fixed in pico-sdk when tag 0.18.0 of TinyUSB is taken.

@peterharperuk
Contributor

Hmm, interesting indeed. Can you give me some idea what the program is doing with USB on core 0? I feel like I should try and reproduce this?

@ikjordan
Author

The program is a Sinclair ZX81 emulator. It uses USB to read a keyboard and (optionally) a joystick. The code is at: https://github.com/ikjordan/picozx81

The hang occurs very early in the start-up process. The code is in pico81.cpp

At start-up the code reads info from the SD Card. It then sets the processor speed (typically 270MHz), the display resolution and the refresh rate based on the info it has read.

The code hangs with core 1 in malloc while attempting to initialise scanvideo, and core 0 waiting for core 1 to set a semaphore indicating that scanvideo start-up is complete. This is called from line 41 of pico81.cpp. Strangely, all of this occurs before the first call to initialise the USB (tuh_init) at line 47. Although possibly there is some static initialisation in TinyUSB?

@ikjordan
Author

Some more information:
Using the LED as an indicator I can see that the hang always occurs in core 1, waiting for the malloc mutex. However, it is not always the first core 1 malloc that hangs.

I saw that core 0 always releases the mutex (so no race condition with a core 0 interrupt).

The malloc mutex is created via auto_init_mutex. If I changed the code so that the malloc mutex was initialised explicitly at the start of main, the code started correctly.
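The thread does not establish why the auto-initialised mutex misbehaves. Purely as an illustration of one failure mode that would be consistent with this observation, here is a single-threaded host-side model (not the SDK's mutex_t or its implementation) of a lock whose state is only meaningful after an explicit init call. All names (`model_mutex_t`, `MODEL_NO_OWNER`, and the `model_mutex_*` functions) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NO_OWNER (-1)   /* hypothetical "lock is free" marker */

typedef struct {
    int8_t owner;             /* core number of the holder, or MODEL_NO_OWNER */
} model_mutex_t;

/* Put the lock into a known "free" state. */
static void model_mutex_init(model_mutex_t *m) {
    assert(m != NULL);
    m->owner = MODEL_NO_OWNER;
}

/* Non-blocking acquire: succeeds only if the lock looks free. */
static bool model_mutex_try_enter(model_mutex_t *m, int8_t core) {
    if (m->owner != MODEL_NO_OWNER) {
        return false;         /* looks held, so a blocking caller would spin here */
    }
    m->owner = core;
    return true;
}

static void model_mutex_exit(model_mutex_t *m) {
    m->owner = MODEL_NO_OWNER;
}

/* Assumption for illustration: stand-in for memory holding arbitrary
 * contents because no initialisation has run yet. */
static void model_mutex_scramble(model_mutex_t *m) {
    memset(m, 0x5A, sizeof *m);
}
```

In this model a lock that has not been initialised looks permanently held, so a blocking acquire would never return, while an explicit init before first use avoids that. Whether anything similar happens in the SDK is not confirmed here.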

This seems to be a very tricky issue to pin down. I cannot rule out that it may be caused by a subtle memory issue in the application. However, the issue is very reproducible. I also ran many different compilations of the code while bisecting the TinyUSB submissions, and they behaved in a reproducible manner too.
