
Fix race condition between WiFi receive and consume #1614

Merged: 4 commits merged into earlephilhower:master on Aug 4, 2023

Conversation

@sfe-SparkFro (Contributor) commented Jul 29, 2023

Fixes #1613

No idea if this is actually the best solution, please feel free to suggest alternative implementations!

After the discussion below, we discovered that the root problem is actually a race condition in the WiFi ClientContext.h, in these two lines:

pbuf_free(_rx_buf);
_rx_buf = 0;

If a new packet is received after _rx_buf is freed but before the _rx_buf pointer is set to 0, that new packet can be stored at the exact same memory location that was just freed. The receive handler then tries to concatenate the new data onto the stale buffer, effectively calling pbuf_cat(_rx_buf, _rx_buf).

This assigns the next pointer of _rx_buf to itself, which sends pbuf_cat() into an infinite loop.
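
To see why that loops, this is roughly what lwIP's pbuf_cat() does (simplified from lwIP's pbuf.c, assertions elided):

// Simplified shape of lwIP's pbuf_cat() (types from lwip/pbuf.h):
// walk to the tail of chain h, then link chain t onto it.
void pbuf_cat(struct pbuf *h, struct pbuf *t) {
    struct pbuf *p;
    // Add t's total length into every link while walking to the tail of h.
    for (p = h; p->next != NULL; p = p->next) {
        p->tot_len = (u16_t)(p->tot_len + t->tot_len);
    }
    p->tot_len = (u16_t)(p->tot_len + t->tot_len);
    p->next = t;  // if h == t, the tail now points at itself
}

// The first pbuf_cat(p, p) returns, but leaves p->next == p. Every later walk
// of the chain, including the next pbuf_cat(), then spins forever.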

The solution here is to change the consume function to first assign a temporary pointer to _rx_buf, then set _rx_buf to 0, and only then free the memory through the temporary pointer. This guarantees that _rx_buf never points to memory that was just freed.
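
In other words, the change boils down to this reordering (a minimal sketch of the consume path, surrounding code elided):

// Before: _rx_buf dangles between the free and the reset.
pbuf_free(_rx_buf);  // an async receive can fire right here...
_rx_buf = 0;         // ...and find _rx_buf pointing at freed, reused memory

// After: detach first, then free, so the receive handler sees either a
// valid chain or a null pointer, never a stale one.
auto head = _rx_buf;
_rx_buf = 0;
pbuf_free(head);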

@earlephilhower (Owner)

The change seems fine, since it's essentially a noop. But I'm really puzzled how the condition could happen.

It seems like this would be caused by an issue at a higher layer in the stack, somewhere in the Ethernet packet reception path. But generally that would end up blowing things up pretty frequently. (Since, as you saw, this just makes a pbuf that's got a loop in it, so anything using that data would hang...)

Would you be able to patch ClientContext to cause a panic and get a stack dump thru SWD/GDB when this happens?

        if (_rx_buf == pb)  {
            panic("boom");
        }

Something like that would help pinpoint if there was something obvious, like a re-entrant LWIP call (i.e. illegal) in the code somewhere...
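
For context, the check would sit in ClientContext's receive callback right before the concatenation. Roughly like this (a sketch of the ESP8266-derived handler shape, with _rx_buf and _rx_buf_offset being ClientContext members; not the exact source):

err_t _recv(tcp_pcb* pcb, pbuf* pb, err_t err) {
    (void) pcb;
    (void) err;
    if (!pb) {
        return ERR_OK;  // peer closed; the real handler tears the connection down
    }
    if (_rx_buf) {
        if (_rx_buf == pb) {
            panic("boom");  // incoming pbuf landed at the address _rx_buf still holds
        }
        pbuf_cat(_rx_buf, pb);  // append the new data to the pending chain
    } else {
        _rx_buf = pb;  // first pending data: take ownership of the pbuf
        _rx_buf_offset = 0;
    }
    return ERR_OK;
}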

@sfe-SparkFro (Contributor, Author)

I'm not certain how this happens either; I haven't been able to nail that down. It would arguably be better to fix the root cause rather than implement this patch, but it at least addresses our problem and shouldn't negatively affect anyone else.

I don't have my debugger with me right now, but yesterday I did manage to stop execution once it got stuck in pbuf_cat() and captured this info:

[screenshot: debugger state while stuck in pbuf_cat()]

There are some asynchronous function calls in there, so I wonder if some kind of race condition is occurring. Since our application sends lots of small packets very quickly, it may be possible for two to be received at nearly the same time, with both attempting to make a new pbuf at the same memory location.

@earlephilhower (Owner) commented Jul 29, 2023

One other sanity check, are you doing any multicore calls to anything at all WiFi/WiFiClient related? That's definitely guaranteed to go poorly as there's IRQ protection, but not multicore protection throughout the library and the SDK.
-edit- Actually, to be clear, only core0 can do any WiFi-related tasks because that's where the CYW43 handler lives. So even if you did all your WiFi on core1 it would still eventually go boom... -edit-

Looking at your trace, I think you caught it after things went wrong. The pbuf linked list is already circular so it's gonna loop forever trying to find an end to update the end element's next ptr. What we really need to see is when the first call to pbuf_cat(a,a) happens which is where the original fault lies.

FWIW, the __wrap* calls do handle the async exclusion (that's why they're wrapped in lwip_wrap.cpp) and the async handler is how the received Ethernet packet processing is handled (as opposed to requiring the user to do polls).
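
(If you're not familiar with the mechanism, linker wrapping looks roughly like this. This is a generic sketch of the GNU ld --wrap pattern, not the actual lwip_wrap.cpp contents, and AsyncLockGuard is a hypothetical name for whatever exclusion primitive is used:)

// With -Wl,--wrap=tcp_write the linker redirects every call to tcp_write()
// into __wrap_tcp_write(), which can take a lock before forwarding to the
// real implementation, reachable as __real_tcp_write().
extern "C" err_t __real_tcp_write(struct tcp_pcb* pcb, const void* data,
                                  u16_t len, u8_t flags);

extern "C" err_t __wrap_tcp_write(struct tcp_pcb* pcb, const void* data,
                                  u16_t len, u8_t flags) {
    AsyncLockGuard lock;  // hypothetical RAII guard excluding the async handler
    return __real_tcp_write(pcb, data, len, flags);
}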

@sfe-SparkFro (Contributor, Author)

Our main project does have multi-core functionality, but all the wireless processing is done on just one core. This repo contains a stripped-down version of the project (no multi-core), and the issue exists there too. It's basically just a single main.cpp that sets up a websocket server with its own AP and does nothing with the received messages. After anywhere from 10 seconds to >15 minutes of receiving messages, it would hit that infinite loop.

I suppose it's possible that the websocket library could be causing this bug, although this is the second library we've used and both exhibit the same behavior, so we stopped investigating the websocket libraries. Links2004/arduinoWebSockets is what's used in the stripped-down repo above, and gilmaimon/ArduinoWebsockets is what we used before.

Yes, that stack trace was unfortunately from after the first call to pbuf_cat(). If it helps, I'll see if I can get a new trace from before the first call to it. It will be Monday at the earliest!

@sfe-SparkFro (Contributor, Author)

OK, I managed to catch it before calling pbuf_cat() by adding the check you suggested for when _rx_buf == pb. Here's the variables window:

[screenshot: variables window showing _rx_buf and pb at the same address]

So both _rx_buf and pb point to the same address, as discussed, and even appear to have the same payload. I'm not actually sure what the payload is:

[screenshot: payload contents]

Here's what the call stack shows:

[screenshot: call stack in the failing case]

It doesn't seem to show all the way up the call stack for some reason, and clicking "Load More Stack Frames" doesn't actually do anything. I've not done much debugging in VS Code, and none with the Pico W, so I'm not sure how to fix that.

For comparison, I also set a breakpoint just before the call to pbuf_cat() in a "normal" situation where _rx_buf != pb. Here's the variables window:

[screenshot: variables window showing _rx_buf and pb at different addresses]

_rx_buf and pb are completely different, which is what I'd expect. And here's the call stack:

[screenshot: call stack in the normal case]

It gets there through a very different path. I've tested the "normal" situation a few times, and that call stack is always the same, so it seems only the other path is problematic. Interesting!

Hope this helps! Please let me know if you need anything else, thank you!

@earlephilhower (Owner)

Thanks! For some reason I didn't see the email with your update a couple days ago, sorry.

My first hunch looking at the trace is a race condition between the ClientContext returning some data to the app and taking it out of the pbuf, and another packet coming in.

So the CC is in the middle of updating its state to say "I've read this bit," but before it can clear the existing pbuf, another packet comes in.

Because the pbuf allocator is very simple, the freed pbuf address and the newly received pbuf's address are the same (i.e. the pbuf was freed and a new one was allocated after the CC called pbuf_free but before it cleared _rx_buf).
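
Spelled out as a timeline, the suspected interleaving would be:

// consume:        pbuf_free(_rx_buf);     // pbuf goes back to the pool
// async receive:  new packet arrives; the allocator hands out the
//                 just-freed address, so the incoming pb == _rx_buf
// async receive:  pbuf_cat(_rx_buf, pb);  // == pbuf_cat(pb, pb), leaving
//                 pb->next pointing at pb itself
// consume:        _rx_buf = 0;            // too late; the chain is circular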

If that's the case, the pbuf being added really is new and your workaround is really doing the right thing by just assigning the pbuf! 👏

On the ESP8266, where I stole this from, there is no async LWIP processing (i.e. you need to call delay() or yield() or you'll never receive any Ethernet packets there), so this corner case couldn't happen. Here we use the Pico SDK's awesome async stuff to make it easier for users, but things like this can pop up.

But as of now that's only my conjecture and I need to examine the code a bit to be sure...

@sfe-SparkFro (Contributor, Author)

> I didn't see the email

No problem! We've got this implemented in our application now that it's been identified, so no rush on making any changes.

> My first hunch looking at the trace is a race condition

> CC is in the middle of updating its state to say "I've read this bit" but before it can clear the existing pbuf another packet comes in.

Oh! I think I might know what's actually happening! In every test I've done, when the code freezes, it appears to get stuck exactly here:

pbuf_free(_rx_buf);
_rx_buf = 0;

I believe it really is a race condition: another packet comes in during pbuf_free(_rx_buf), so the memory has been freed, but _rx_buf still points to that memory location. I believe a "better" solution than the one I proposed would be to instead change the above two lines to something like this:

auto head = _rx_buf;
_rx_buf = 0;
pbuf_free(head);

This would make it so _rx_buf doesn't accidentally point to data that's just been freed. I'll see if I can test this solution in the next couple of days!

@earlephilhower (Owner)

Great! Please update with any results you get from that reordering so we can fix it at the root cause. :)

If another packet comes in between freeing `_rx_buf` and setting `_rx_buf` to 0, that new packet could get put into the same memory address and get concatenated to itself, which leads to an infinite loop.
The new solution assigns a temp pointer, sets `_rx_buf` to 0, then frees the memory, which guarantees `_rx_buf` always points to valid data.
@sfe-SparkFro (Contributor, Author)

Yep, that was it! The proposed change appears to behave well, so I think this is the best solution. My latest commit implements the change, so please give it a quick read and test to verify it works on your end.

Thank you for taking the time to discuss this with me, and for all the other work you've done with this project!

@sfe-SparkFro changed the title from "Prevent concatenation of same buffer onto itself" to "Fix race condition between WiFi receive and consume" on Aug 4, 2023
@earlephilhower (Owner) left a comment


Thanks again!

@earlephilhower merged commit 1231317 into earlephilhower:master on Aug 4, 2023
12 checks passed
@woodlist commented Apr 5, 2024

> I suppose it's possible that the websocket library could be causing this bug. [...] Links2004/arduinoWebSockets is what's used in the stripped down repo above, and gilmaimon/ArduinoWebsockets is what we used before.

I have a project that runs smoothly on ESP32 using the WebSockets library installed from https://github.com/Links2004/arduinoWebSockets, but it doesn't work on the Pico W, even though it compiles error-free in the latest Arduino IDE (2.3.2). The WiFi access point works on the Pico W and is reachable from the client app, but the websocket does not work.

Linked issue (#1613): WiFi receive can concatenate buffer to itself, resulting in infinite loop