Use thread_local for loader_life_support to improve performance (#5830)
* Use thread_local for loader_life_support to improve performance
As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter, because any given function call already happens on a
single interpreter by definition.
Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```
- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
2000000 loops, best of 5: 120 nsec per loop
```
After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```
- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
2000000 loops, best of 5: 101 nsec per loop
```
A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.
Open questions:
- How do we determine whether we can safely use `thread_local`? I see
concerns about old iOS versions in #5705 (comment) and #5709; is there
anything else?
- Do we have a test that covers "function called in one interpreter
calls a C++ function that causes a function call in another
interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
multiple extensions compiled with and without this PR interoperate?
I think it's fine -- each dispatch pushes and cleans up its own
state -- but a second opinion is certainly welcome.
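The interoperability reasoning in the last question can be illustrated with a toy model. This is only a sketch under stated assumptions: `ext_frame`, its `Tag` parameter, and the `old_ext`/`new_ext` aliases are hypothetical names, and the two stacks merely stand in for two extensions that keep their frame state in separate storage. It does not reproduce pybind11's actual storage on either side of this change.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: each Tag value gets its own independent per-thread
// frame stack, standing in for one extension module's private storage.
template <int Tag>
struct ext_frame {
    ext_frame *parent;

    static ext_frame *&current() {
        static thread_local ext_frame *ptr = nullptr;
        return ptr;
    }

    // Push on construction, pop on destruction (RAII).
    ext_frame() : parent(current()) { current() = this; }
    ~ext_frame() { current() = parent; }
    ext_frame(const ext_frame &) = delete;
    ext_frame &operator=(const ext_frame &) = delete;

    // Number of live frames on this stack for the calling thread.
    static std::size_t depth() {
        std::size_t n = 0;
        for (const ext_frame *f = current(); f != nullptr; f = f->parent) {
            ++n;
        }
        return n;
    }
};

using old_ext = ext_frame<0>; // stands in for an extension built without this change
using new_ext = ext_frame<1>; // stands in for an extension built with it

// Simulates old -> new -> old nested dispatch and checks both stacks unwind.
inline bool interleaved_calls_unwind_cleanly() {
    {
        old_ext a; // call dispatched by the "old" extension
        new_ext b; // which calls into the "new" extension
        old_ext c; // which calls back into the "old" one
        if (old_ext::depth() != 2 || new_ext::depth() != 1) {
            return false;
        }
    }
    return old_ext::depth() == 0 && new_ext::depth() == 0;
}
```

Because every frame restores the pointer it saved from its own stack, neither stack observes the other's pushes in this model.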
* Remove PYBIND11_CAN_USE_THREAD_LOCAL
* clarify comment
* Simplify loader_life_support TLS storage
Replace the `fake_thread_specific_storage` struct with a direct
thread-local pointer managed via a function-local static:
    static loader_life_support *&tls_current_frame()
This retains the "stack of frames" behavior via the `parent` link. It also
reduces indirection and clarifies intent.
Note: this form is C++11-compatible; once pybind11 requires C++17, the
helper can be simplified to:
    inline static thread_local loader_life_support *tls_current_frame = nullptr;
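A minimal sketch of the function-local static pattern, assuming a hypothetical `frame_guard` class in place of `loader_life_support`. Only the frame-stack bookkeeping is modeled; the `depth()` and `nested_depth()` helpers are illustration-only additions, not part of pybind11.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for loader_life_support: models only the
// per-thread "stack of frames" kept via the parent link.
class frame_guard {
public:
    // Push: remember the previous top frame, then become the top.
    frame_guard() : parent_(tls_current_frame()) { tls_current_frame() = this; }
    // Pop: restore the previous top frame.
    ~frame_guard() { tls_current_frame() = parent_; }
    frame_guard(const frame_guard &) = delete;
    frame_guard &operator=(const frame_guard &) = delete;

    // Illustration-only helper: number of live frames on the calling thread.
    static std::size_t depth() {
        std::size_t n = 0;
        for (const frame_guard *f = tls_current_frame(); f != nullptr; f = f->parent_) {
            ++n;
        }
        return n;
    }

private:
    // C++11-compatible form: a function-local static thread_local pointer,
    // returned by reference so callers can read and update it.
    static frame_guard *&tls_current_frame() {
        static thread_local frame_guard *frame_ptr = nullptr;
        return frame_ptr;
    }

    frame_guard *parent_;
};

// Returns the depth observed while two nested frames are alive.
inline std::size_t nested_depth() {
    frame_guard outer;
    frame_guard inner;
    return frame_guard::depth();
}
```

In this sketch the only per-thread state is one pointer; the stack itself lives in the guards' own storage, linked through `parent_`.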
* loader_life_support: avoid duplicate tls_current_frame() calls
Replace repeated calls with a single local reference:
    auto &frame = tls_current_frame();
This ensures the thread_local initialization guard is checked only once
per constructor/destructor call site, avoids potential clang-tidy
complaints, and makes the code more readable. Functional behavior is
unchanged.
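A sketch of the single-reference pattern, assuming a hypothetical `frame` struct and an `accessor_calls` counter added purely for instrumentation. The LIFO check in the destructor is an assumption about how such a guard might validate itself, not a quote of pybind11's code.

```cpp
#include <cassert>

// Instrumentation for this sketch only: counts calls to the accessor.
static int accessor_calls = 0;

// Hypothetical stand-in for loader_life_support.
struct frame {
    frame *parent;

    static frame *&tls_current_frame() {
        ++accessor_calls;
        static thread_local frame *ptr = nullptr;
        return ptr;
    }

    frame() {
        auto &current = tls_current_frame(); // single accessor call
        parent = current;
        current = this;
    }

    ~frame() {
        auto &current = tls_current_frame(); // single accessor call
        assert(current == this);             // frames must unwind in LIFO order
        current = parent;
    }

    frame(const frame &) = delete;
    frame &operator=(const frame &) = delete;
};

// Returns how many accessor calls one construct/destroy cycle makes.
inline int calls_for_one_frame() {
    const int before = accessor_calls;
    { frame f; }
    return accessor_calls - before;
}
```

With the reference bound once, the constructor and destructor each reach the accessor a single time even though they read and then write the pointer.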
* Add REMINDER for next version bump in internals.h
---------
Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>