Skip to content

Commit 937552f

Browse files
swolchokrwgk
andauthored
Use thread_local for loader_life_support to improve performance (#5830)
* Use thread_local for loader_life_support to improve performance As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on #5705 (comment) and #5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome. * Remove PYBIND11_CAN_USE_THREAD_LOCAL * clarify comment * Simplify loader_life_support TLS storage Replace the `fake_thread_specific_storage` struct with a direct thread-local pointer managed via a function-local static: static loader_life_support *& tls_current_frame() This retains the "stack of frames" behavior via the `parent` link. It also reduces indirection and clarifies intent. Note: this form is C++11-compatible; once pybind11 requires C++17, the helper can be simplified to: inline static thread_local loader_life_support *tls_current_frame = nullptr; * loader_life_support: avoid duplicate tls_current_frame() calls Replace repeated calls with a single local reference: auto &frame = tls_current_frame(); This ensures the thread_local initialization guard is checked only once per constructor/destructor call site, avoids potential clang-tidy complaints, and makes the code more readable. Functional behavior is unchanged. * Add REMINDER for next version bump in internals.h --------- Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>
1 parent 68cbae6 commit 937552f

File tree

2 files changed

+25
-8
lines changed

2 files changed

+25
-8
lines changed

include/pybind11/detail/internals.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@
3939
/// further ABI-incompatible changes may be made before the ABI is officially
4040
/// changed to the new version.
4141
#ifndef PYBIND11_INTERNALS_VERSION
42+
// REMINDER for next version bump: remove loader_life_support_tls
4243
# define PYBIND11_INTERNALS_VERSION 11
4344
#endif
4445

@@ -260,7 +261,7 @@ struct internals {
260261
PyObject *instance_base = nullptr;
261262
// Unused if PYBIND11_SIMPLE_GIL_MANAGEMENT is defined:
262263
thread_specific_storage<PyThreadState> tstate;
263-
thread_specific_storage<loader_life_support> loader_life_support_tls;
264+
thread_specific_storage<loader_life_support> loader_life_support_tls; // OBSOLETE (PR #5830)
264265
// Unused if PYBIND11_SIMPLE_GIL_MANAGEMENT is defined:
265266
PyInterpreterState *istate = nullptr;
266267

include/pybind11/detail/type_caster_base.h

Lines changed: 23 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -42,24 +42,40 @@ PYBIND11_NAMESPACE_BEGIN(detail)
4242
/// Adding a patient will keep it alive up until the enclosing function returns.
4343
class loader_life_support {
4444
private:
45+
// Thread-local top-of-stack for loader_life_support frames (linked via parent).
46+
// Observation: loader_life_support needs to be thread-local,
47+
// but we don't need to go to extra effort to keep it
48+
// per-interpreter (i.e., by putting it in internals) since
49+
// individual function calls are already isolated to a single
50+
// interpreter, even though they could potentially call into a
51+
// different interpreter later in the same call chain. This
52+
// saves a significant cost per function call spent in
53+
// loader_life_support destruction.
54+
// Note for future C++17 simplification:
55+
// inline static thread_local loader_life_support *tls_current_frame = nullptr;
56+
static loader_life_support *&tls_current_frame() {
57+
static thread_local loader_life_support *frame_ptr = nullptr;
58+
return frame_ptr;
59+
}
60+
4561
loader_life_support *parent = nullptr;
4662
std::unordered_set<PyObject *> keep_alive;
4763

4864
public:
4965
/// A new patient frame is created when a function is entered
5066
loader_life_support() {
51-
auto &stack_top = get_internals().loader_life_support_tls;
52-
parent = stack_top.get();
53-
stack_top = this;
67+
auto &frame = tls_current_frame();
68+
parent = frame;
69+
frame = this;
5470
}
5571

5672
/// ... and destroyed after it returns
5773
~loader_life_support() {
58-
auto &stack_top = get_internals().loader_life_support_tls;
59-
if (stack_top.get() != this) {
74+
auto &frame = tls_current_frame();
75+
if (frame != this) {
6076
pybind11_fail("loader_life_support: internal error");
6177
}
62-
stack_top = parent;
78+
frame = parent;
6379
for (auto *item : keep_alive) {
6480
Py_DECREF(item);
6581
}
@@ -68,7 +84,7 @@ class loader_life_support {
6884
/// This can only be used inside a pybind11-bound function, either by `argument_loader`
6985
/// at argument preparation time or by `py::cast()` at execution time.
7086
PYBIND11_NOINLINE static void add_patient(handle h) {
71-
loader_life_support *frame = get_internals().loader_life_support_tls.get();
87+
loader_life_support *frame = tls_current_frame();
7288
if (!frame) {
7389
// NOTE: It would be nice to include the stack frames here, as this indicates
7490
// use of pybind11::cast<> outside the normal call framework, finding such

0 commit comments

Comments
 (0)