
Fix race condition between WiFi receive and consume #1614

Merged: 4 commits merged into earlephilhower:master on Aug 4, 2023

Conversation

@sfe-SparkFro (Contributor) commented Jul 29, 2023

Fixes #1613

No idea if this is actually the best solution, please feel free to suggest alternative implementations!

After the discussion below, we discovered that the root problem is actually a race condition in the WiFi ClientContext.h, in these two lines:

pbuf_free(_rx_buf);
_rx_buf = 0;

If a new packet is received after _rx_buf is freed but before the _rx_buf pointer is set to 0, that new packet can be stored at the exact same memory location that was just freed. The receive handler then tries to concatenate the new data onto the stale buffer, effectively calling pbuf_cat(_rx_buf, _rx_buf).

This assigns the next pointer of _rx_buf to itself, which sends pbuf_cat() into an infinite loop.
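
To see why that loops, this is roughly what lwIP's pbuf_cat() does (simplified from lwIP's pbuf.c, assertions elided):

// Simplified shape of lwIP's pbuf_cat() (types from lwip/pbuf.h):
// walk to the tail of chain h, then link chain t onto it.
void pbuf_cat(struct pbuf *h, struct pbuf *t) {
    struct pbuf *p;
    // Add t's total length into every link while walking to the tail of h.
    for (p = h; p->next != NULL; p = p->next) {
        p->tot_len = (u16_t)(p->tot_len + t->tot_len);
    }
    p->tot_len = (u16_t)(p->tot_len + t->tot_len);
    p->next = t;  // if h == t, the tail now points at itself
}

// The first pbuf_cat(p, p) returns, but leaves p->next == p. Every later walk
// of the chain, including the next pbuf_cat(), then spins forever.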

The solution here is to change the consume function to first assign a temporary pointer to _rx_buf, then set _rx_buf to 0, and only then free the memory through the temporary pointer. This guarantees that _rx_buf never points to memory that was just freed.
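
In other words, the change boils down to this reordering (a minimal sketch of the consume path, surrounding code elided):

// Before: _rx_buf dangles between the free and the reset.
pbuf_free(_rx_buf);  // an async receive can fire right here...
_rx_buf = 0;         // ...and find _rx_buf pointing at freed, reused memory

// After: detach first, then free, so the receive handler sees either a
// valid chain or a null pointer, never a stale one.
auto head = _rx_buf;
_rx_buf = 0;
pbuf_free(head);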

@earlephilhower (Owner)

The change seems fine, since it's essentially a noop. But I'm really puzzled how the condition could happen.

It seems like this would be caused by an issue at a higher layer in the stack, somewhere in the Ethernet packet reception path. But generally that would end up blowing things up pretty frequently. (Since, as you saw, this just makes a pbuf that's got a loop in it, so anything using that data would hang...)

Would you be able to patch ClientContext to cause a panic and get a stack dump thru SWD/GDB when this happens?

        if (_rx_buf == pb)  {
            panic("boom");
        }

Something like that would help pinpoint if there was something obvious, like a re-entrant LWIP call (i.e. illegal) in the code somewhere...
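
For context, the check would sit in ClientContext's receive callback right before the concatenation. Roughly like this (a sketch of the ESP8266-derived handler shape, with _rx_buf and _rx_buf_offset being ClientContext members; not the exact source):

err_t _recv(tcp_pcb* pcb, pbuf* pb, err_t err) {
    (void) pcb;
    (void) err;
    if (!pb) {
        return ERR_OK;  // peer closed; the real handler tears the connection down
    }
    if (_rx_buf) {
        if (_rx_buf == pb) {
            panic("boom");  // incoming pbuf landed at the address _rx_buf still holds
        }
        pbuf_cat(_rx_buf, pb);  // append the new data to the pending chain
    } else {
        _rx_buf = pb;  // first pending data: take ownership of the pbuf
        _rx_buf_offset = 0;
    }
    return ERR_OK;
}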

@sfe-SparkFro (Contributor, Author)

I'm not certain how this happens either; I haven't been able to nail that down. It would arguably be better to fix the root cause rather than implement this patch, but it at least addresses our problem and shouldn't negatively affect anyone else.

I don't have my debugger with me right now, but yesterday I did manage to stop execution once it got stuck in pbuf_cat() and captured this info:

[screenshot: debugger state while stuck in pbuf_cat()]

There are some asynchronous function calls in there, so I wonder if some kind of race condition is occurring. Since our application sends lots of small packets very quickly, it may be possible for two to be received at nearly the same time, with both attempting to make a new pbuf at the same memory location.

@earlephilhower (Owner) commented Jul 29, 2023

One other sanity check, are you doing any multicore calls to anything at all WiFi/WiFiClient related? That's definitely guaranteed to go poorly as there's IRQ protection, but not multicore protection throughout the library and the SDK.
-edit- Actually, to be clear, only core0 can do any WiFi-related tasks because that's where the CYW43 handler lives. So even if you did all your WiFi on core1 it would still eventually go boom... -edit-

Looking at your trace, I think you caught it after things went wrong. The pbuf linked list is already circular so it's gonna loop forever trying to find an end to update the end element's next ptr. What we really need to see is when the first call to pbuf_cat(a,a) happens which is where the original fault lies.

FWIW, the __wrap* calls do handle the async exclusion (that's why they're wrapped in lwip_wrap.cpp) and the async handler is how the received Ethernet packet processing is handled (as opposed to requiring the user to do polls).
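
(If you're not familiar with the mechanism, linker wrapping looks roughly like this. This is a generic sketch of the GNU ld --wrap pattern, not the actual lwip_wrap.cpp contents, and AsyncLockGuard is a hypothetical name for whatever exclusion primitive is used:)

// With -Wl,--wrap=tcp_write the linker redirects every call to tcp_write()
// into __wrap_tcp_write(), which can take a lock before forwarding to the
// real implementation, reachable as __real_tcp_write().
extern "C" err_t __real_tcp_write(struct tcp_pcb* pcb, const void* data,
                                  u16_t len, u8_t flags);

extern "C" err_t __wrap_tcp_write(struct tcp_pcb* pcb, const void* data,
                                  u16_t len, u8_t flags) {
    AsyncLockGuard lock;  // hypothetical RAII guard excluding the async handler
    return __real_tcp_write(pcb, data, len, flags);
}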

@sfe-SparkFro (Contributor, Author)

Our main project does have multi-core functionality, but all the wireless processing is done on just one core. This repo contains a stripped-down version of the project (no multi-core), and the issue exists there too. It's basically just a single main.cpp that sets up a websocket server with its own AP and does nothing with the received messages. After anywhere from 10 seconds to >15 minutes of receiving messages, it would hit that infinite loop.

I suppose it's possible that the websocket library could be causing this bug, although this is the second library we've used and both exhibit the same behavior, so we stopped investigating the websocket libraries. Links2004/arduinoWebSockets is what's used in the stripped-down repo above, and gilmaimon/ArduinoWebsockets is what we used before.

Yes, that stack trace was unfortunately from after the first call to pbuf_cat(). If it helps, I'll see if I can get a new trace from before the first call to it. It will be Monday at the earliest!

@sfe-SparkFro (Contributor, Author)

OK, I managed to catch it before calling pbuf_cat() by adding the check you suggested for when _rx_buf == pb. Here's the variables window:

[screenshot: variables window showing _rx_buf and pb at the same address]

So both _rx_buf and pb point to the same address, as discussed, and even appear to have the same payload. I'm not actually sure what the payload is:

[screenshot: payload contents]

Here's what the call stack shows:

[screenshot: call stack in the failing case]

It doesn't seem to show all the way up the call stack for some reason, and clicking "Load More Stack Frames" doesn't actually do anything. I've not done much debugging in VS Code, and none with the Pico W, so I'm not sure how to fix that.

For comparison, I also set a breakpoint just before the call to pbuf_cat() in a "normal" situation where _rx_buf != pb. Here's the variables window:

[screenshot: variables window showing _rx_buf and pb at different addresses]

_rx_buf and pb are completely different, which is what I'd expect. And here's the call stack:

[screenshot: call stack in the normal case]

It gets there through a very different path. I've tested the "normal" situation a few times, and that call stack is always the same, so it seems only the other path is problematic. Interesting!

Hope this helps! Please let me know if you need anything else, thank you!

@earlephilhower (Owner)

Thanks! For some reason I didn't see the email with your update a couple days ago, sorry.

My first hunch looking at the trace is a race condition between the ClientContext returning some data to the app and taking it out of the pbuf, and another packet coming in.

So the CC is in the middle of updating its state to say "I've read this bit," but before it can clear the existing pbuf, another packet comes in.

Because the pbuf allocator is very simple, the freed pbuf address and the newly received pbuf's address are the same (i.e. the pbuf was freed and a new one was allocated after the CC called pbuf_free but before it cleared _rx_buf).
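
Spelled out as a timeline, the suspected interleaving would be:

// consume:        pbuf_free(_rx_buf);     // pbuf goes back to the pool
// async receive:  new packet arrives; the allocator hands out the
//                 just-freed address, so the incoming pb == _rx_buf
// async receive:  pbuf_cat(_rx_buf, pb);  // == pbuf_cat(pb, pb), leaving
//                 pb->next pointing at pb itself
// consume:        _rx_buf = 0;            // too late; the chain is circular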

If that's the case, the pbuf being added really is new and your workaround is really doing the right thing by just assigning the pbuf! 👏

On the ESP8266, where I stole this from, there is no async LWIP processing (i.e. you need to call delay() or yield() or you'll never receive any Ethernet packets there), so this corner case couldn't happen. Here we use the Pico SDK's awesome async stuff to make it easier for users, but things like this can pop up.

But as of now that's only my conjecture and I need to examine the code a bit to be sure...

@sfe-SparkFro (Contributor, Author)

> I didn't see the email

No problem! We've got this implemented in our application now that it's been identified, so no rush on making any changes.

> My first hunch looking at the trace is a race condition

> CC is in the middle of updating its state to say "I've read this bit" but before it can clear the existing pbuf another packet comes in.

Oh! I think I might know what's actually happening! In every test I've done, when the code freezes, it appears to get stuck exactly here:

pbuf_free(_rx_buf);
_rx_buf = 0;

I believe it really is a race condition: another packet comes in during pbuf_free(_rx_buf), so the memory has been freed, but _rx_buf still points to that memory location. I believe a "better" solution than the one I proposed would be to instead change the above two lines to something like this:

auto head = _rx_buf;
_rx_buf = 0;
pbuf_free(head);

This would make it so _rx_buf doesn't accidentally point to data that's just been freed. I'll see if I can test this solution in the next couple of days!

@earlephilhower (Owner)

Great! Please update with any results you get from that reordering so we can fix it at the root cause. :)

If another packet comes in between freeing `_rx_buf` and setting `_rx_buf` to 0, that new packet could get put into the same memory address and get concatenated to itself, which leads to an infinite loop.
The new solution assigns a temp pointer, sets `_rx_buf` to 0, then frees the memory, which guarantees `_rx_buf` always points to valid data.
@sfe-SparkFro (Contributor, Author)

Yep, that was it! The proposed change appears to behave well, so I think this is the best solution. My latest commit implements the change, so please give it a quick read and test to verify it works on your end.

Thank you for taking the time to discuss this with me, and for all the other work you've done with this project!

@sfe-SparkFro changed the title from "Prevent concatenation of same buffer onto itself" to "Fix race condition between WiFi receive and consume" on Aug 4, 2023
@earlephilhower (Owner) left a comment


Thanks again!

@earlephilhower merged commit 1231317 into earlephilhower:master on Aug 4, 2023
12 checks passed
@woodlist commented Apr 5, 2024

> I suppose it's possible that the websocket library could be causing this bug. [...] Links2004/arduinoWebSockets is what's used in the stripped down repo above, and gilmaimon/ArduinoWebsockets is what we used before.

I have a project that runs smoothly on ESP32 using the WebSockets library installed from https://github.com/Links2004/arduinoWebSockets, but it doesn't work on the Pico W, even though it compiles error-free in the latest Arduino IDE (2.3.2). The WiFi access point works on the Pico W and is reachable from the client app, but the websocket does not work.

Linked issue (#1613): WiFi receive can concatenate buffer to itself, resulting in infinite loop