Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restricting tasks to CPU0 breaks async tls #296

Open
torkleyy opened this issue Aug 31, 2023 · 7 comments
Open

Restricting tasks to CPU0 breaks async tls #296

torkleyy opened this issue Aug 31, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@torkleyy
Copy link
Contributor

If I restrict threads to run on CPU0:

CONFIG_ESP_MAIN_TASK_AFFINITY=CPU0
CONFIG_PTHREAD_TASK_CORE_DEFAULT=0
CONFIG_PTHREAD_DEFAULT_CORE_0=y

the AsyncEspTls::negotiate won't become ready.

I would expect the main thread (async_io::block_on) to yield at some point to let the IO thread make progress.

@ivmarkov
Copy link
Collaborator

ivmarkov commented Aug 31, 2023

Might be caused by this terrible hack which is trying to mitigate this oddity in esp-tls which is masking out whether mbedtls wants us to wait on the socket to become readable, or writeable.

Can you increase the delay from 0, to - say - 20 (= 20ms), and then remove the await on the socket to become writable? Shooting in the dark here, but if that turns out to be the culprit, I'm at a loss how to implement a proper workaround.

@ivmarkov
Copy link
Collaborator

Additional info: IF the main task had higher priority than the pthread-based "poll my sockets" task which async-io silently runs in the background, that delay_ms(0) wouldn't help, as the main task gives the rest of its time-slice only to same or higher-priority tasks.

I hope you haven't changed your main task priority. It is by default with priority 1, which is only higher than the idle tasks which run at prio 0. pthread tasks on the other hand by default run with priority 5, which is a higher priority.

But if you DID change your main task prio to something higher than 5... that would explain what is going on there...

@torkleyy
Copy link
Contributor Author

I cannot currently check anything, but I'm sure my main task priority is the default one.

@ivmarkov
Copy link
Collaborator

It will take me a couple of days to check myself, but I guess, no rush.

@torkleyy
Copy link
Contributor Author

torkleyy commented Sep 6, 2023

This is the stack trace when the watchdog interrupts the thread:

0x3FFC0160 - port_IntStack
    at ??:??
0x40083CDD - _xt_lowint1
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa\xtensa_vectors.S:1236
0x3FFC0190 - port_IntStack
    at ??:??
0x4021A6AB - esp_vfs_safe_fd_isset
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:838
0x3FFDEEC0 - _bss_end
    at ??:??
0x4017322E - esp_vfs_select
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:929
0x3FFDEEE0 - _bss_end
    at ??:??
0x4016F153 - poll
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\poll.c:63
0x3FFDEF50 - _bss_end
    at ??:??
0x40125A04 - polling::poll::poll
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\poll.rs:445
0x3FFDEFA0 - _bss_end
    at ??:??
0x401270F4 - polling::Poller::wait
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\lib.rs:394
0x3FFDF180 - _bss_end
    at ??:??
0x4011DA5C - async_io::reactor::ReactorLock::react
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\reactor.rs:270
0x3FFDF1F0 - _bss_end
    at ??:??
0x4011FD61 - async_io::driver::main_loop
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:65
0x3FFDF2B0 - _bss_end
    at ??:??
0x4011EAB4 - async_io::driver::unparker::{{closure}}::{{closure}}
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:32
0x3FFDF330 - _bss_end
    at ??:??
0x4011F1E7 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
    at C:\Users\me\.rustup\toolchains\esp\lib\rustlib\src\rust\library\std\src\thread\mod.rs:529
0x3FFDF350 - _bss_end
    at ??:??

I've increased the delay in the EWOULDBLOCK_I32 case. With a delay of 20, the watchdog still triggered, but the stack trace changed:

0x3FFC0160 - port_IntStack
    at ??:??
0x40083CDD - _xt_lowint1
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa\xtensa_vectors.S:1236
0x3FFC0190 - port_IntStack
    at ??:??
0x4000BFED - _xtos_set_intlevel
    at ??:??
0x3FFDF6A0 - _heap_start
    at ??:??
0x400943D2 - vPortClearInterruptMaskFromISR
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel/portable/xtensa/include/freertos\portmacro.h:568
0x3FFDF6B0 - _heap_start
    at ??:??
0x400930EF - xQueueGenericSend
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/freertos/FreeRTOS-Kernel\queue.c:882
0x3FFDF6D0 - _heap_start
    at ??:??
0x40083F8A - lock_release_generic
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\locks.c:192
0x3FFDF710 - _heap_start
    at ??:??
0x4008403D - _lock_release
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\locks.c:198
0x3FFDF740 - _heap_start
    at ??:??
0x4017315E - esp_vfs_select
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/vfs\vfs.c:932
0x3FFDF760 - _heap_start
    at ??:??
0x4016F067 - poll
    at S:\eval-esp-rust\target\xtensa-esp32-espidf\debug\build\esp-idf-sys-fcec888798d06ca6\out\build\C:/Users/me/.espressif/esp-idf/v5.1.1/components/newlib\poll.c:63
0x3FFDF7D0 - _heap_start
    at ??:??
0x40125918 - polling::poll::poll
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\poll.rs:445
0x3FFDF820 - _heap_start
    at ??:??
0x40127008 - polling::Poller::wait
    at C:\Users\me\.cargo\git\checkouts\polling-353792340800df62\0a01608\src\lib.rs:394
0x3FFDFA00 - _heap_start
    at ??:??
0x4011D970 - async_io::reactor::ReactorLock::react
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\reactor.rs:270
0x3FFDFA70 - _heap_start
    at ??:??
0x4011FC75 - async_io::driver::main_loop
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:65
0x3FFDFB30 - _heap_start
    at ??:??
0x4011E9C8 - async_io::driver::unparker::{{closure}}::{{closure}}
    at C:\Users\me\.cargo\registry\src\index.crates.io-6f17d22bba15001f\async-io-1.12.0\src\driver.rs:32
0x3FFDFBB0 - _heap_start
    at ??:??
0x4011F0FB - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
    at C:\Users\me\.rustup\toolchains\esp\lib\rustlib\src\rust\library\std\src\thread\mod.rs:529

Interestingly, I've had one case where it did manage to do TLS negotation (and afterwards communicated normally with the server). However, it took abnormally long (7 seconds) and I wasn't able to reproduce it a second time.

@ivmarkov
Copy link
Collaborator

@torkleyy TL;DR: If you wrap your async_io::block_on call in another thread, like this:

    thread::Builder::new().stack_size(15000).spawn(move || {
        smol::block_on(<... your code driving AsyncEspTls ...>)
    })?
    .join()
    .unwrap()

... it would work.

Long story: still to get to the exact root cause, but in the meantime pretty sure the problem is caused by the combination of two factors:

  • The main task priority is too small - 1 - and you cannot raise it; on the other hand, the hidden pthread task async-io - which is created with the pthread API, has priority 5
  • The hidden async-io task does not sleep enough - this is what I need to see why is happening; most likely it is simply because the EspTls socket is ready, and it does busy looping rather than proper sleeping

As a result, both the main task, as well as the idle tasks are actually starving and never being scheduled. So - in a way - the EspTls code never gets a chance to run, even though its socket is ready!

The workaround of running your code in a pthread task works, because this way you raise the priority of your task to 5, and allow it to pre-empt the busy-looping async-io hidden task, as ESP IDF allows time-slicing across tasks with equal priorities.

I have a deja-vu feeling that I'm discovering this for the second time, and have experienced this on the single-core esp32c3 a while ago. I think there is a reason why I've put all the async code in rust-esp32-std-demo in pthread tasks, and not in the main task. :)

@torkleyy
Copy link
Contributor Author

Ha! I see. Thanks a bunch for all the information!

@Vollbrecht Vollbrecht added the bug Something isn't working label Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Todo
Development

No branches or pull requests

3 participants