Improve interrupt latency (especially on Xtensa) #1162
Comments
While it's probably still worth optimizing things, regarding
Replacing/adding this to the top of
    if level == 1 {
        if interrupt::get() & (1 << 6) != 0 {
            interrupt::clear(1 << 6);
            xtensa_lx_rt::Timer0(1, save_frame);
        }
    }
    if level == 3 {
        if interrupt::get() & (1 << 29) != 0 {
            interrupt::clear(1 << 29);
            xtensa_lx_rt::Software1(3, save_frame);
        }
    }
    let status = get_status(crate::get_core());
    if status == 0 {
        return;
    }
    if level == 1 {
        if status & (1 << Interrupt::WIFI_MAC as u128) != 0 {
            handle_interrupt(1, Interrupt::WIFI_MAC, save_frame);
        }
        if status & (1 << Interrupt::WIFI_PWR as u128) != 0 {
            handle_interrupt(1, Interrupt::WIFI_PWR, save_frame);
        }
    }
    if level == 2 {
        if status & (1 << Interrupt::TG1_T0_LEVEL as u128) != 0 {
            handle_interrupt(2, Interrupt::TG1_T0_LEVEL, save_frame);
        }
    }
    if level == 3 {
        if status & (1 << Interrupt::FROM_CPU_INTR0 as u128) != 0 {
            handle_interrupt(3, Interrupt::FROM_CPU_INTR0, save_frame);
        }
        if status & (1 << Interrupt::TG0_T0_LEVEL as u128) != 0 {
            handle_interrupt(3, Interrupt::TG0_T0_LEVEL, save_frame);
        }
    }
    return;
but using this with |
Do you see the same results on the esp32 or esp32s2? In my testing I see much worse results on the s3, which makes me think that we have a general xtensa issue and an esp32s3-specific issue. Maybe this might help the other chips? |
Tried on ESP32 and it doesn't seem to make a difference there either. Unfortunately, the results vary a lot from run to run even if I don't change anything, so minor differences in the numbers might not be caused by any code change. On S2 I currently cannot get any benchmark example to work (even without my changes). |
Maybe my results were not correct yesterday. Testing again today using my Android phone as the bench-server (https://github.com/bjoernQ/android-tcp-bench-server) clearly indicates that at least the download rates get much better (tested on S3, again).
Original
With the optimized/hacked interrupt handling
|
Just for reference (on my network), current main with an esp32:
Got IP: 192.168.0.106/24
Testing download...
connecting to Address([192, 168, 0, 24]):4321...
connected, testing...
download: 943 kB/s
Testing upload...
connecting to Address([192, 168, 0, 24]):4322...
connected, testing...
upload: 771 kB/s
Testing upload+download...
connecting to Address([192, 168, 0, 24]):4323...
connected, testing...
upload+download: 616 kB/s
ESP32S3 (powered through uart, not USB serial jtag):
Testing download...
connecting to Address([192, 168, 0, 24]):4321...
connected, testing...
download: 562 kB/s
Testing upload...
connecting to Address([192, 168, 0, 24]):4322...
connected, testing...
upload: 387 kB/s
Testing upload+download...
connecting to Address([192, 168, 0, 24]):4323...
connected, testing...
upload+download: 311 kB/s
Both with the following configuration:
I feel like there must be some kind of CPU/Flash/Cache issue with the S3 to see this much of a difference, and maybe the more optimized interrupt handling you've been testing masks this issue a bit because it's less CPU bound 🤔. |
Is it possible that one of the critical code paths is not in IRAM for the ESP32-S3? I pretty much discovered the same thing: interrupt handling (and thus embassy) is very slow |
Cache misses would definitely increase latency a lot - #1169 should make a huge difference |
I decided to look at this more in general and less in the direction of
My test setup is this: a GPIO interrupt triggers on the falling edge of the boot button. The handler just toggles GPIO 2. I attached a logic analyzer and configured it accordingly (400MHz sample rate, setting the right trigger) to measure the time from the falling edge of the boot button to the toggle of GPIO2. This measures a lot more than just the interrupt latency but should be good enough.
To get reasonable results, things called by the user-handler are moved to RAM. For critical-section this was done by cloning the repo locally. (But it could and should be done via linker-script, as outlined at the end of this text.)
I again checked how much the Xtensa register spilling changes things. In my experiments it's 0.5us. That sounds like a lot, but compared to the overall latency it's not really much.
First tested ESP32-S3
The first triggering gives 62.5us (!!!). From the second triggering on, it's down to 14.5us. Since besides the interrupts it's just running an empty
Let's verify: set flash frequency to 20 MHz and .... the first triggering is 110us, after that it's 14.5us again. So, it is flash access, definitely.
I changed HAL code to just call the user-handler directly in
We assumed RISC-V is fine, let's check ESP32-C3
First invocation is 58.3us, then it stays at 18us. Seems like the same problem, then. Double checking with 20MHz flash frequency shows 98.5us for the first invocation, then 18us. Looking at the disassembly of the vectoring code revealed it's also calling code in flash
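The cold-start cost can be read directly off the figures quoted above. The following is a minimal sketch (the `Run` type and its labels are made up for illustration, and the runs before the 20 MHz change are only labelled "unchanged flash settings" because the comment does not state the frequency used) that subtracts the steady-state latency from the first-trigger latency for each run:

```rust
/// One logic-analyzer run: latency of the first trigger vs. later triggers.
/// (Hypothetical helper type, not part of esp-hal.)
struct Run {
    label: &'static str,
    first_us: f64,
    steady_us: f64,
}

impl Run {
    /// Extra latency paid only on the first (cold) invocation.
    fn cold_penalty_us(&self) -> f64 {
        self.first_us - self.steady_us
    }
}

/// Figures quoted in the comment above.
fn runs() -> [Run; 4] {
    [
        Run { label: "ESP32-S3, unchanged flash settings", first_us: 62.5, steady_us: 14.5 },
        Run { label: "ESP32-S3, 20 MHz flash", first_us: 110.0, steady_us: 14.5 },
        Run { label: "ESP32-C3, unchanged flash settings", first_us: 58.3, steady_us: 18.0 },
        Run { label: "ESP32-C3, 20 MHz flash", first_us: 98.5, steady_us: 18.0 },
    ]
}

fn main() {
    for run in runs() {
        println!("{:<36} cold penalty: {:.1} us", run.label, run.cold_penalty_us());
    }
}
```

On both chips the penalty roughly doubles once the flash is slowed down while the steady-state latency does not move, which is consistent with the conclusion that the first invocation is waiting on flash.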
We can get rid of the panicking stuff via
What worked for ESP32-C3 was omitting compiler-builtins from
That together with panic-immediate-abort gets the first triggering on C3 down to 22.9us. Still more than the next invocations but from the disassembly there shouldn't be calls to code in flash. 🤔
On ESP32-S3 gnu-ld ignored not placing
(but that in turn wasn't working for lld)
On S3 it gets down to 25.8us for the first invocation, then down to 14.5us.
So, there is still some flash access! Probably not code, but data. Looking at the assembly again showed some access to DROM addresses. Probably having an option to place
We need to investigate what is actually accessed in DROM and what we can do about it. We could also rewrite the |
So, I looked a bit into the DROM access. One easy thing is
What worked was making it
But that is not enough. It seems that at least switch-tables are placed in
I don't see a way to completely avoid hitting flash currently. (Me not seeing it doesn't mean there isn't a way.)
Falling back to (inline) assembly would give us full control and is the only option I see right now. Besides the downside of having less maintainable code, it's quite probable that user code will hit the flash when we call it. But it might still be worth it just for the async handlers we include in the HAL (if we can make them flash-access free).
However, it's at least possible to cut down the worst-case latency with the (moderate) changes described in the comment above.
UPDATE: I changed the order of text+rodata and rwtext+rwdata in esp32s3.x and let |
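As a rough illustration of keeping both code and the data it reads out of flash, here is a minimal sketch. The section names `.rwtext` and `.data` are assumptions based on the rwtext/rwdata naming mentioned above, the function and table are hypothetical, and the attributes are wrapped in `cfg_attr` so the snippet also builds on a host:

```rust
// Assumption: `.rwtext` is the linker-script section for code kept in RAM.
// On bare-metal targets (`target_os = "none"`) the function is placed there;
// on a host the attribute is skipped. Newer Rust editions may require the
// `unsafe(link_section = "...")` spelling instead.
#[cfg_attr(target_os = "none", link_section = ".rwtext")]
#[inline(never)]
fn toggle_in_ram(state: &mut bool) {
    *state = !*state;
}

// A `static` has a single address that can be assigned to a section, whereas
// a `const` is inlined at each use and its backing data may end up in
// flash-mapped read-only data. Assumption: `.data` is initialized RAM.
#[cfg_attr(target_os = "none", link_section = ".data")]
static PRIORITY_MASKS: [u32; 4] = [0b0000, 0b0010, 0b0110, 0b1110];

/// Looks up a mask without a panicking bounds check: masking the index keeps
/// it in range, so no panic path (which could live in flash) is needed.
fn mask_for(level: usize) -> u32 {
    PRIORITY_MASKS[level & 0b11]
}

fn main() {
    let mut pin_state = false;
    toggle_in_ram(&mut pin_state);
    println!("pin_state = {pin_state}, mask(1) = {:#06b}", mask_for(1));
}
```

This only covers symbols the crate controls; compiler-generated data such as switch tables is not affected by these attributes, which is the limitation described above.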
Thank you for looking into this @bjoernQ! Really helpful!
I think we want to avoid writing this all in assembly. Later we should open up the option for "high" level interrupts written in asm, but in general I think we want to keep our normal interrupt handling in Rust.
I looked around the esp forums for interrupt latency topics and saw values from 2us to 15us. At least we're heading that way now, and going from 62us to 15us is a big win!
I think some actionable items from your findings would be: Moved the list to the top level issue.
@bjoernQ, if possible, would you be able to share your benchmarking code, either in a gist or whatever is more suitable, so we can have a reference when we begin to implement these things?
cc: @sethp, I know it's been a while since you've poked your head into esp-land, but I noticed you made this post: https://discourse.llvm.org/t/moving-jump-tables-sections-on-embedded-systems/70226 and I was wondering if you made any headway in your project. |
Here is the code I used to benchmark: https://gist.github.com/bjoernQ/4bc8236b926803e6aa22960880af70cd
It's good to use a logic analyzer which supports a reasonably high sample rate (I used a DSLogic Pro at 400MHz) |
For not emitting jump-tables there is
Currently it seems there is no support for doing that on a function level: https://github.com/rust-lang/rust/blob/master/compiler/rustc_middle/src/middle/codegen_fn_attrs.rs |
Wow, awesome work @bjoernQ ! We never got as precise as measuring latency in seconds, just by proxy (i.e. whether or not our output was "flickering" when we were blowing our real-time budget on cache misses). It's very cool to see these numbers, and thank you for sharing the benchmark! Your analysis seems like it matches more or less what I found working on #534 : flash is cache thrash, and
Yeah, that's an LLVM limitation, I believe: I'm not quite sure what optimization pass is producing those jump tables, but I couldn't find a way to configure it at a more granular level than global. The unfortunate part is that setting that in the build flags for esp-hal does nothing for any consumers of the crate, IIRC: the build flags aren't carried over. There is
There's also a very suspiciously named "jump tables in function section".... thing, down at the LLVM level; I didn't fully trace that through to identify where/how to turn it on, but this signature suggests it's also configurable per-function: https://github.com/llvm/llvm-project/blob/9a9aa41dea83039154601082b1aa2c56e35a5a17/llvm/include/llvm/CodeGen/TargetLoweringObjectFileImpl.h#L83-L84 (though maybe not at the rust source level).
The results I found were similar to yours; I thought there might be a path through LLVM, which leads me to:
Not the ELF post-processor or LLVM change [1], but I did find a way to route around it. I tried a few different rust source-level constructs to avoid the jump table, but the optimizer was too clever. That did lead me to the realization that we already have implemented a jump table at the rust source level, though:
So changing the troublesome match from:
    match code {
        1 => interrupt1(...),
        ...
    }
to something like:
    HANDLER_VECTOR[code](...) // given HANDLER_VECTOR[1] == interrupt1, etc.
ought to avoid the whole jump table placement question entirely. [2]
Populating
Footnotes
|
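The match-to-table idea from the comment above can be sketched as follows. This is a minimal, host-runnable illustration: the handler signature, the `interrupt0`..`interrupt3` bodies and the `dispatch` helper are hypothetical and only borrow the `HANDLER_VECTOR` name from the comment; it is not the esp-hal implementation.

```rust
/// Hypothetical handler signature: takes the interrupt level, returns a value
/// so the example has something observable.
type Handler = fn(u32) -> u32;

fn interrupt0(_level: u32) -> u32 { 0 }
fn interrupt1(level: u32) -> u32 { 100 + level }
fn interrupt2(level: u32) -> u32 { 200 + level }
fn interrupt3(level: u32) -> u32 { 300 + level }

/// An explicit table of function pointers. Because it is an ordinary `static`,
/// its placement is under the crate's control (for example via a
/// `link_section` attribute), unlike a compiler-generated jump table.
static HANDLER_VECTOR: [Handler; 4] = [interrupt0, interrupt1, interrupt2, interrupt3];

/// Dispatches through the table: one load plus an indirect call.
/// `get` avoids a panicking bounds check for out-of-range codes.
fn dispatch(code: usize, level: u32) -> Option<u32> {
    HANDLER_VECTOR.get(code).map(|handler| handler(level))
}

fn main() {
    println!("{:?}", dispatch(1, 3));
    println!("{:?}", dispatch(9, 0));
}
```

Whether this beats the generated code depends on where the table ends up: the lookup is only cheap if the table itself is in RAM or already cached.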
Had another look at this.
"Replace consts with statics if they're referenced in interrupt handling code" doesn't have a huge impact but increases RAM usage. We probably don't want to do this by default (it would mean a new feature, or we need the build-time configuration).
Not emitting jump tables has some impact on the initial invocation but hurts performance overall. Also, it's something only the binary crate can configure.
Moving jump tables etc. to RAM might be a good option but, again, it increases RAM usage, so it should be opt-in / opt-out rather than something we do unconditionally. |
It's a maze of tradeoffs, to be sure; for us the question was "how much RAM can we trade for latency & reduced variability," which made the approach feel a lot more crisp. A small note, though—the micro-benchmark you're using doesn't look like it'll exert any cache pressure, which is probably why your results differed from ours on the jump tables:
Binary searching through 32 possibilities will take 5 branches, whereas the table approach costs only a single load-dependent jump, which will be cheaper if (& only if) the load is cached in SRAM. The trick is that a more involved program is likely to clobber the cache in between interrupt entry, so every time looks a lot more like the initial invocation than the subsequent ones in the benchmark. |
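The "5 branches for 32 possibilities" figure can be checked with a small sketch. Both helpers below are hypothetical and written only for this illustration: one computes the ceiling of log2(n), the other counts the comparisons a halving search actually performs.

```rust
/// Number of two-way branches needed to single out one of `n` cases by
/// repeated halving, i.e. ceil(log2(n)).
fn binary_search_branches(n: u32) -> u32 {
    if n <= 1 {
        0
    } else {
        u32::BITS - (n - 1).leading_zeros()
    }
}

/// Counts the comparisons a halving search makes to locate `target` in `0..n`.
fn count_branches(target: u32, n: u32) -> u32 {
    let (mut lo, mut hi, mut branches) = (0u32, n, 0u32);
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if target < mid {
            hi = mid;
        } else {
            lo = mid;
        }
        branches += 1;
    }
    branches
}

fn main() {
    println!("branches for 32 cases: {}", binary_search_branches(32));
    println!("branches to find case 17 of 32: {}", count_branches(17, 32));
}
```

The table lookup replaces those comparisons with a single load and an indirect jump, so the comparison only favours the table when that load does not miss the cache.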
I will definitely get back to this once we have a sane way to do fine grained configuration by the user (i.e. not hundreds of Cargo features 😄) |
I wonder how hard it is to make a feature that puts the jump table in RAM? I am trying to build a Rust-based flight controller and kept bumping into unexpected latencies because I am running the two cores on the ESP32-S3 almost at full load, so, as @sethp said, I have huge cache pressure. Since we have 4 bytes for each interrupt and 128 interrupts max in total, a 512-byte RAM price is a great deal for deterministic latency. |
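The 512-byte estimate above can be written down as a compile-time constant. A minimal sketch, assuming each table entry is a 4-byte code address (modelled here as `u32` so the result is the same on a 64-bit host); the names are made up for the example:

```rust
use core::mem::size_of;

/// Maximum number of interrupt sources, as stated in the comment above.
const MAX_INTERRUPTS: usize = 128;

/// Assumption: one entry is a 4-byte address, as on the 32-bit cores discussed.
type Entry = u32;

/// RAM cost of a full table with one entry per interrupt.
const TABLE_BYTES: usize = size_of::<[Entry; MAX_INTERRUPTS]>();

fn main() {
    println!("handler table size: {TABLE_BYTES} bytes");
}
```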
Such functionality wouldn't be too hard to do via the linker script. It's just that we currently try to avoid introducing new features but want that build-time config system |
Going to close this in favor of #1747 |
Actionable items from this investigation:
Replace consts with statics if they're referenced in interrupt handling code
Original issue:
With these two non-scientific tests
Vectored
Non-vectored
I get these results
with saving floats
vectored = 1111 cycles
non-vectored = 214
w/o saving floats
vectored = 1096
non-vectored = 199
w/o saving floats, w/o spilling registers
vectored = 1043
non-vectored = 146
Latency is more than five times as high with vectoring enabled.
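The "more than five times" claim follows from the cycle counts listed above. A minimal sketch that computes both the ratio and the absolute gap for each configuration (the `Sample` type is hypothetical; the numbers are the ones from the results):

```rust
/// Cycle counts for one test configuration, taken from the results above.
/// (Hypothetical helper type for this illustration.)
struct Sample {
    config: &'static str,
    vectored: u32,
    non_vectored: u32,
}

impl Sample {
    /// How many times slower the vectored path is.
    fn ratio(&self) -> f64 {
        f64::from(self.vectored) / f64::from(self.non_vectored)
    }

    /// Absolute extra cycles spent in the vectored path.
    fn extra_cycles(&self) -> u32 {
        self.vectored - self.non_vectored
    }
}

fn samples() -> [Sample; 3] {
    [
        Sample { config: "with saving floats", vectored: 1111, non_vectored: 214 },
        Sample { config: "w/o saving floats", vectored: 1096, non_vectored: 199 },
        Sample { config: "w/o saving floats, w/o spilling registers", vectored: 1043, non_vectored: 146 },
    ]
}

fn main() {
    for s in samples() {
        println!("{:<42} ratio {:.2}x, extra {} cycles", s.config, s.ratio(), s.extra_cycles());
    }
}
```

With these numbers the absolute gap is 897 cycles in all three configurations, which suggests the overhead sits in the vectoring path itself rather than in float saving or register spilling.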
While it's more or less possible to use vectored and non-vectored interrupts together on RISC-V, that is not possible in a sane way for Xtensa currently. This hurts esp-wifi a lot but also hurts async performance.