kernel: arch: move arch_swap() declaration #82454
Conversation
Ok by me. Just a minor thing and a nit.
Seems very reasonable.
Fixed for ARM builds.
Hmmm ... something weird is going on with the arch.arm.swap.common.fpu_sharing test on the mps2/an521/cpu1 board. It fails with this patch, and passes without it. However, the non-optimized version of the test (arch.arm.swap.common.fpu_sharing.no_optimizations) passes. Both scenarios are reproducible on my dev box. I am hoping that there is something off with the test, but I am digging into this to find out.
I am leaning towards thinking that this failure is a test failure. In the optimized version of this test (built with -Os), we are getting garbage values for the v1..v8 registers, and some of the routines for initializing data are simply absent. It looks like the compiler is optimizing them away. I think we need some way to keep them in the code. Continuing to dig and experiment.
I think I see what is going on in the failing test now. The test makes a call to arch_swap(). Previously this was a true function call, and the test was written with that in mind. With arch_swap() now being inlined, the compiler's output changes just enough that when alt_thread checks the registers, r5/v2 is 0 instead of the value it had previously saved and was expecting. I have an idea about how to work around this ...
Moves the arch_swap() declaration out of kernel_arch_interface.h and into the various architectures' kernel_arch_func.h. This permits arch_swap() to be inlined on ARM, while remaining extern'd on the other architectures that still implement arch_swap(). Inlining this function on ARM has shown at least a +5% performance boost according to the thread_metric benchmark on the disco_l475_iot1 board. Signed-off-by: Peter Mitsis <[email protected]>
FYI, "v1-v8" (which are IMHO needlessly confusing aliases for r4-r11) are the caller-save registers in the ARM ABI. It's likely that the earlier swap routine was written to assume that the caller had already spilled them and that they don't have to be saved. In which case you might as well give up for ARM; the routine wasn't written to be legally inlined. I mean, one could "fix" it, but only at the cost of adding back all the spills the compiler was generating before. And that's worse, not better, as the compiler is usually really good about spill/fill logic (e.g. finding registers that don't actually need to be saved), whereas a context switch is forced to be conservative/pessimal and save everything.
Also, as far as rewriting ARM Cortex M swap: I already claim that spot as soon as I can get MTK work submitted SOF-side. Hold my beer, as it were. |
I am looking forward to your rewrite. I doubt that my proposed commit will have a long life, as I expect your work to supersede it, but should that take longer than anticipated, we at least have an interim boost.
Moves the arch_swap() declaration out of kernel_arch_interface.h and into the various architectures' kernel_arch_func.h. This permits arch_swap() to be inlined on ARM, while remaining extern'd on the other architectures that still implement arch_swap().
Inlining this function on ARM has shown at least a +5% performance boost according to the thread_metric benchmark on the disco_l475_iot1 board.
At the time of creating this PR, mainline results for thread_metric with multiq on the disco_l475_iot1 were:
Preemptive: 7051317, Cooperative: 12436712
With this PR:
Preemptive: 7417390, Cooperative: 13188390
The new preemptive numbers should put us a little ahead of ThreadX on the same hardware.