Fix race condition between WiFi receive and consume #1614
The change seems fine, since it's essentially a noop. But I'm really puzzled how the condition could happen. It seems like this would be caused by an issue at a higher layer in the stack, somewhere between Ethernet packet reception. But generally that would end up blowing things up pretty frequently. (Since as you saw this just makes a Would you be able to patch ClientContext to cause a
Something like that would help pinpoint if there was something obvious, like a re-entrant LWIP call (i.e. illegal) in the code somewhere...
I'm not certain how this happens either, haven't been able to nail that down. Would arguably be better to fix the root cause rather than implementing this patch, but it at least addresses our problem and shouldn't negatively affect anyone else. Don't have my debugger with me right now, but yesterday I did manage to stop execution once it got stuck in There's some asynchronous function calls in there, so I wonder if there's some kind of race condition occurring. Since our application is sending lots of small packets very quickly, it may be possible for 2 to be received at the same time, and both attempt to make a new |
One other sanity check, are you doing any multicore calls to anything at all WiFi/WiFiClient related? That's definitely guaranteed to go poorly as there's IRQ protection, but not multicore protection throughout the library and the SDK. Looking at your trace, I think you caught it after things went wrong. The pbuf linked list is already circular so it's gonna loop forever trying to find an end to update the end element's FWIW, the |
Our main project does have multi-core functionality, but all the wireless processing is done on just one core. This repo contains a stripped down version of the project (no multi-core), and the issue exists there too. It's basically just a single main.cpp that sets up a websocket server with its own AP, and just does nothing with the received messages. After anywhere from 10 seconds to >15 minutes of receiving messages, it would hit that infinite loop. I suppose it's possible that the websocket library could be causing this bug. Although this is the second library we've used, and both exhibit the same behavior, so we stopped investigating the websocket libraries. Links2004/arduinoWebSockets is what's used in the stripped down repo above, and gilmaimon/ArduinoWebsockets is what we used before. Yes, that stack trace was from after the first call to |
Ok, managed to catch it before calling So both Here's what the call stack shows: Looks like it's not actually showing all the way up the call stack for some reason. Clicking "Load More Stack Frames" doesn't actually do anything. I've not done much debugging in VS Code, and no debugging with the Pico W, so not sure how to fix that. For comparison, I also set a break point just before the call to
It gets there through a very different path. I've tested the "normal" situation a few times, and this call stack is always the same. So it seems that only the other path is problematic. Interesting! Hope this helps! Please let me know if you need anything else, thank you!
Thanks! For some reason I didn't see the email with your update a couple days ago, sorry. My first hunch looking at the trace is a race condition between the ClientContext returning some data to the app and taking it out of the pbuf, and another packet coming in. So the CC is in the middle of updating its state to say "I've read this bit" but before it can clear the existing pbuf another packet comes in. Because the pbuf allocator is very simple the free'd pbuf address and the newly received pbuf are the same (i.e. the pbuf was free'd and a new one allocated between when the CC did If that's the case, the pbuf being added really is new and your workaround is really doing the right thing by just assigning the pbuf! 👏 On the ESP8266, where I stole this from, there is no async LWIP processing (i.e. you need to call But as of now that's only my conjecture and I need to examine the code a bit to be sure...
No problem! We've got this implemented in our application now that it's been identified, so no rush on making any changes.
Oh! I think I might know what's actually happening! In every test I've done, when the code freezes, it appears to get stuck exactly here: arduino-pico/libraries/WiFi/src/include/ClientContext.h Lines 614 to 615 in 36839cb
I believe it really is a race condition - another packet comes in during
This would make it so |
Great! Please update with any results you get from that reordering so we can fix it at the root cause. :)
If another packet comes in between freeing `_rx_buf` and setting `_rx_buf` to 0, that new packet could get put into the same memory address and get concatenated to itself, which leads to an infinite loop. New solution assigns a temp pointer, sets `_rx_buf` to 0, then frees the memory, which guarantees `_rx_buf` never points to memory that was just free'd.
Yep, that was it! The proposed change appears to behave well, so I think this is the best solution. My latest commit implements the change, so please give it a quick read and test to verify it works on your end. Thank you for taking the time to discuss this with me, and for all the other work you've done with this project! |
Thanks again!
I have a project that runs smoothly on ESP32 using the WebSockets library installed from https://github.com/Links2004/arduinoWebSockets, but it does not work on the Pico W, even though it compiles without errors in the latest (to date) Arduino IDE 2.3.2. The WiFi access point works on the Pico W and is reachable from the client app, but the WebSocket does not work.
Fixes #1613
No idea if this is actually the best solution, please feel free to suggest alternative implementations!
After discussion below, discovered the root problem is actually a race condition in the WiFi ClientContext.h. In these 2 lines:
arduino-pico/libraries/WiFi/src/include/ClientContext.h
Lines 614 to 615 in 36839cb
If a new packet is received after the `_rx_buf` is free'd but before the `_rx_buf` pointer is set to 0, that new packet could get stored at the exact same memory location that was just free'd. The receive handler would try to concatenate the new data onto the buffer here:
arduino-pico/libraries/WiFi/src/include/ClientContext.h
Line 651 in 36839cb
This would assign the `next` pointer of `_rx_buf` to itself, resulting in `pbuf_cat()` going into an infinite loop.
The solution here is to change the consume function to first assign a temporary pointer to `_rx_buf`, then set `_rx_buf` to 0, then free the memory using the temporary pointer. This guarantees that `_rx_buf` doesn't point to memory that was just free'd.
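For anyone who wants to see the ordering issue in isolation, here is a rough host-side sketch of it. This is not the actual `ClientContext.h` code: `Buf`, `buf_alloc()`, `buf_free()`, `on_receive()` and the `g_*` variables are all hypothetical stand-ins. The one-slot allocator is an assumption that exaggerates the address reuse, and the "packet arriving mid-consume" is modelled as a plain function call made right after the free.

```cpp
#include <cassert>

// Hypothetical stand-in for an lwIP pbuf: only the chain link matters here.
struct Buf {
    Buf* next = nullptr;
};

// Hypothetical one-slot allocator (an assumption for illustration): a freed
// buffer's address is handed out again by the very next allocation.
static Buf  g_slot;
static bool g_slot_used = false;

static Buf* g_rx_buf         = nullptr;  // plays the role of _rx_buf
static bool g_self_linked    = false;    // set when a buffer is chained onto itself
static bool g_packet_pending = false;    // a packet waiting to arrive mid-consume

Buf* buf_alloc() {
    assert(!g_slot_used);
    g_slot_used = true;
    g_slot.next = nullptr;
    return &g_slot;
}

// Simulated receive handler: assign if empty, otherwise chain onto existing data.
void on_receive() {
    Buf* incoming = buf_alloc();
    if (g_rx_buf == nullptr) {
        g_rx_buf = incoming;
    } else {
        if (g_rx_buf == incoming) g_self_linked = true;  // list is now circular
        g_rx_buf->next = incoming;  // stands in for the pbuf_cat() call
    }
}

// Release the slot, then deliver any pending packet straight away to model a
// packet that arrives immediately after the free.
void buf_free(Buf* b) {
    (void)b;
    g_slot_used = false;
    if (g_packet_pending) {
        g_packet_pending = false;
        on_receive();
    }
}

// Original order: free first, clear the pointer afterwards.
void consume_original_order() {
    buf_free(g_rx_buf);   // g_rx_buf still holds the freed address here
    g_rx_buf = nullptr;
}

// Reordered: take a temporary, clear the pointer, then free via the temporary.
void consume_reordered() {
    Buf* head = g_rx_buf;
    g_rx_buf = nullptr;
    buf_free(head);
}

// Returns true if a buffer ended up linked to itself.
bool simulate(bool reordered) {
    g_slot_used = false;
    g_rx_buf = nullptr;
    g_self_linked = false;
    on_receive();              // first packet is buffered
    g_packet_pending = true;   // second packet will arrive during consume
    if (reordered) consume_reordered(); else consume_original_order();
    return g_self_linked;
}
```

Under these assumptions, `simulate(false)` reports a self-linked buffer, while `simulate(true)` does not and leaves `g_rx_buf` pointing at the newly received packet.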