support 3.13t free-threaded python #471

Open
wants to merge 10 commits into
base: main

davidhewitt
Member

First pass at adding CI and tests for freethreaded Python.

Locally I get some test failures, so I will have to investigate further before this is ready even if CI is green. cc @ngoldbaum

failures:

---- borrow::shared::tests::borrow_multiple_views stdout ----
thread 'borrow::shared::tests::borrow_multiple_views' panicked at src/borrow/shared.rs:836:17:
assertion `left == right` failed
  left: 3
 right: 1

---- borrow::shared::tests::borrow_multiple_arrays stdout ----
thread 'borrow::shared::tests::borrow_multiple_arrays' panicked at src/borrow/shared.rs:786:17:
assertion `left == right` failed
  left: 2
 right: 1


failures:
    borrow::shared::tests::borrow_multiple_arrays
    borrow::shared::tests::borrow_multiple_views

test result: FAILED. 27 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.13s

@davidhewitt davidhewitt marked this pull request as draft November 22, 2024 16:33
src/strings.rs Outdated
Comment on lines 181 to 182
// FIXME probably a deadlock risk here due to the GIL? Might need MutexExt trait in PyO3
let mut dtypes = self.dtypes.lock().expect("dtype cache poisoned");
Member Author

Choice to use Mutex here makes me want to add a MutexExt trait to PyO3 similar to OnceLockExt which we already added.

Given that writes to this should be infrequent (it's a global cache, as far as I can tell), I also wonder if RwLock is appropriate here. Readers are extremely short-lived so the analysis in https://blog.nelhage.com/post/rwlock-contention/ seems to be a non-issue (cc @alex)

... in which case I want RwLockExt too 😂
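
For illustration, a minimal sketch of the read-mostly cache pattern discussed above, using `std::sync::RwLock`. The `DescrCache` type, its `get_or_create` method, and the `String` descriptors are hypothetical stand-ins, not the actual rust-numpy dtype cache, and the sketch does not address the deadlock concern from the FIXME (it never detaches from the interpreter while waiting).

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a global dtype cache: maps a string width to a
/// descriptor, represented here as a plain `String`.
struct DescrCache {
    entries: RwLock<HashMap<usize, String>>,
}

impl DescrCache {
    fn new() -> Self {
        Self { entries: RwLock::new(HashMap::new()) }
    }

    fn get_or_create(&self, width: usize) -> String {
        // Fast path: a shared read lock, released at the end of this block.
        {
            let entries = self.entries.read().expect("dtype cache poisoned");
            if let Some(descr) = entries.get(&width) {
                return descr.clone();
            }
        }
        // Slow path: an exclusive write lock. `entry` re-checks the map in
        // case another thread inserted the value between the two locks.
        let mut entries = self.entries.write().expect("dtype cache poisoned");
        entries
            .entry(width)
            .or_insert_with(|| format!("S{width}"))
            .clone()
    }
}
```

Readers only contend with each other on the lock's internal bookkeeping, and the write lock is taken once per new key, which matches the "writes should be infrequent" assumption.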

So I guess I should finally write MutexExt...

@davidhewitt
Member Author

Based on the failures here, I propose rust-numpy 0.23.0 releases without free-threading support, and we seek to follow up in 0.23.1.

@davidhewitt davidhewitt added the CI-no-fail-fast If one job fails, allow the rest to keep testing label Nov 22, 2024
@Icxolu
Contributor

Icxolu commented Nov 24, 2024

I did a bit of investigating here. I believe the fundamental problem is that the borrow checking API relies on the GIL for synchronization of exclusive access:

unsafe extern "C" fn acquire_shared(flags: *mut c_void, array: *mut PyArrayObject) -> c_int {
    // SAFETY: GIL must be held when calling `acquire_shared`.
    let py = Python::assume_gil_acquired();
    let flags = &mut *(flags as *mut BorrowFlags);

On the free-threaded build this is not true anymore, and therefore unsound.

An easy (but probably not optimal) solution is to wrap BorrowFlagsInner in a Mutex (I tested with parking_lot so as not to have to deal with poisoning) and change everything to shared access. I believe that should be enough to make it sound. The tests still don't pass, but I think that is related to their implicit assumption of running serially. The GIL forced each test to run to completion before the next one could start, but on the free-threaded build they can run interleaved, and some assertions do not hold because of that.
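
A minimal sketch of that idea, assuming a much simplified borrow table: the `BorrowFlags` type below maps an address to a single flag and is a hypothetical illustration, not the real `BorrowFlagsInner` layout. The point is only that every method takes `&self` and goes through the `Mutex`, so no `&mut` access to the table is ever handed out.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Simplified, hypothetical borrow table: address -> flag.
/// flag > 0 counts shared borrows; flag == -1 marks one exclusive borrow.
#[derive(Default)]
struct BorrowFlags(Mutex<HashMap<usize, isize>>);

impl BorrowFlags {
    fn acquire_shared(&self, address: usize) -> Result<(), ()> {
        let mut flags = self.0.lock().unwrap();
        let flag = flags.entry(address).or_insert(0);
        if *flag < 0 {
            return Err(()); // already borrowed exclusively
        }
        *flag += 1;
        Ok(())
    }

    fn acquire_exclusive(&self, address: usize) -> Result<(), ()> {
        let mut flags = self.0.lock().unwrap();
        let flag = flags.entry(address).or_insert(0);
        if *flag != 0 {
            return Err(()); // any existing borrow blocks exclusive access
        }
        *flag = -1;
        Ok(())
    }

    fn release_shared(&self, address: usize) {
        let mut flags = self.0.lock().unwrap();
        let remaining = match flags.get_mut(&address) {
            Some(flag) => {
                *flag -= 1;
                *flag
            }
            None => return,
        };
        if remaining == 0 {
            flags.remove(&address); // drop the entry once nothing borrows it
        }
    }
}
```

Because the check and the update happen under one lock acquisition, two threads cannot both observe "not borrowed" and then both take an exclusive borrow.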

@ngoldbaum

@Icxolu @davidhewitt this just showed up as a dependency for one of the libraries I'm working on and as a NumPy maintainer maybe it's natural for me to work on this. Would you two mind if I took over work in this PR?

Also to everyone linking to this issue - sorry for not getting to this sooner! For some reason I thought this was done already.

@Icxolu
Contributor

Icxolu commented Feb 11, 2025

I'm good with that, thank you very much 🙏. I think I still have the branch from my experiments above (on top of this branch); I could push it here (or to a separate branch) if you think that would be helpful.

@ngoldbaum

That would be! I can take a look at making the tests thread safe, I did a lot of that for pyo3. I also want to run the tests on the free-threaded build with TSAN enabled. I wouldn’t be surprised if that elicits some races in NumPy.

@Icxolu
Contributor

Icxolu commented Feb 11, 2025

There you go. I think this is fine, but it would be great if you could double check that. I guess it would also be nice if we could use a std Mutex instead of pulling in parking_lot, but I wasn't sure how to properly deal with the poisoning. At least locally tests pass for me with this, I just disabled the assertion about borrow flags on the free-threaded build, since it can now be higher due to concurrent tests.

@davidhewitt
Member Author

Please do take this over; I had hoped to make progress, but I expect to be a few more weeks away from doing so myself.

@ngoldbaum

ngoldbaum commented Feb 12, 2025

So good news: I ran the rust-numpy tests with TSAN using a hacked-together version of cargo stress to see the TSAN output from "successful" stress test runs. I didn't see any warnings besides ones I already found and reported yesterday in PyO3: PyO3/pyo3#4904

I want to try again with TSAN, using a version of Python compiled from the 3.14 branch, which will hopefully be less noisy and allow me to determine if any of these issues still need to be fixed on CPython main.

For now I'm going to ignore TSAN on rust-numpy until the issues seen in PyO3 are fixed.

Also incidentally I didn't see any test failures. I'll push a change that resolves the merge conflict so CI can re-run, along with a couple more fixes I found.

@ngoldbaum

Oh actually since I'm not a rust-numpy maintainer I can't just push to the PR.

ERROR: Permission to davidhewitt/rust-numpy.git denied to ngoldbaum.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@davidhewitt can you give me push access to your rust-numpy fork? Or maybe a rust-numpy commit bit.... 😜

@davidhewitt
Member Author

Or maybe a rust-numpy commit bit.... 😜

Done. Given you already have write access to both PyO3 and numpy, this seemed like a no-brainer solution :)

@ngoldbaum

ngoldbaum commented Feb 13, 2025

Looks like CI is passing.

I think we probably don't want to add parking_lot as a dependency, so I'll look at converting the use in this PR to use standard library mutexes and try to reason a little about panics and what to do if the mutex gets poisoned.

I also do want to write MutexExt and it would be nice to get that into PyO3 0.24, so I might at least take an earnest shot at writing that. If it ends up being more complicated than I expect, we can reconsider. Worst case scenario we manually write the deadlock-avoidance code using parking_lot so we don't need to worry about poisoning and then circle back once MutexExt is in PyO3.

@ngoldbaum

ngoldbaum commented Feb 13, 2025

I think all of the parking_lot uses can be replaced with the stdlib mutex and just use lock().unwrap(). In all cases if there is a panic I think we just want to propagate it.

I also "manually" added deadlock avoidance with raw FFI calls following what OnceExt does in pyo3 internals.
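
A rough sketch of the deadlock-avoidance shape described above, under stated assumptions: try the lock without blocking first, and only detach from the interpreter when the lock is contended. `DetachGuard` and `DETACH_COUNT` are hypothetical stand-ins so the example runs without Python; in real code the guard would presumably wrap raw calls along the lines of CPython's `PyEval_SaveThread` and `PyEval_RestoreThread`, but the exact calls used in this PR are not shown here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Counts how often we "detached"; only here so the sketch is observable.
static DETACH_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for detaching the current thread from the
/// interpreter while it blocks, and re-attaching on drop.
struct DetachGuard;

impl DetachGuard {
    fn new() -> Self {
        DETACH_COUNT.fetch_add(1, Ordering::SeqCst); // real code: detach here
        DetachGuard
    }
}

impl Drop for DetachGuard {
    fn drop(&mut self) {
        // real code: re-attach to the interpreter here
    }
}

fn lock_avoiding_deadlock<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    match mutex.try_lock() {
        // Uncontended: no need to detach at all.
        Ok(guard) => guard,
        // Propagate poisoning as a panic, mirroring `lock().unwrap()`.
        Err(TryLockError::Poisoned(err)) => panic!("mutex poisoned: {err}"),
        // Contended: detach first so the lock holder can make progress even
        // if it needs the interpreter, then block on the lock.
        Err(TryLockError::WouldBlock) => {
            let _detached = DetachGuard::new();
            let guard = mutex.lock().unwrap();
            guard
        }
    }
}
```

The fast path keeps the common, uncontended case as cheap as a plain `try_lock`.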

It occurs to me that we could use PyMutex as well, but the PR adding a rust wrapper for PyMutex to PyO3 got stalled on questions of whether the rust wrapper should add poisoning.

I did a once-over on the library and didn't spot anything thread-unsafe. There is some use of unsafe impl Send, but it's all for immutable types that appear to be constructed in a thread-safe way (e.g. using GILOnceCell).

Of course there are many ways to break ndarray by mutating an array simultaneously in multiple threads, but NumPy doesn't support shared mutation yet and that's not rust-numpy's fault.

@ngoldbaum ngoldbaum marked this pull request as ready for review February 13, 2025 22:23
@Icxolu
Contributor

Icxolu commented Feb 13, 2025

I think all of the parking_lot uses can be replaced with the stdlib mutex and just use lock().unwrap(). In all cases if there is a panic I think we just want to propagate it.

I thought I had that initially but ran into problems with panics crossing the ffi boundary... But maybe I misremember or did something wrong back then.

@ngoldbaum

Let me try manually inserting panics in these functions to see what happens.

@Icxolu
Contributor

Icxolu commented Feb 13, 2025

I'll try to have a look again this weekend, maybe I can reproduce what caused me to use parking_lot.

@ngoldbaum

Oh darn, it does look like it panics instead of getting converted into an error:

thread 'borrow::shared::tests::borrow_multiple_views' panicked at library/core/src/panicking.rs:218:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x104115758 - std::backtrace_rs::backtrace::libunwind::trace::h7e7c6fb9cc2aee21
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
   1:        0x104115758 - std::backtrace_rs::backtrace::trace_unsynchronized::h38e1d3319c935bb3
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x104115758 - std::sys::backtrace::_print_fmt::h8cadd2d9e5d75617
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/backtrace.rs:66:9
   3:        0x104115758 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h81a9aa16593d72f2
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/backtrace.rs:39:26
   4:        0x104130594 - core::fmt::rt::Argument::fmt::hcad2756fa35ef8a6
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/fmt/rt.rs:177:76
   5:        0x104130594 - core::fmt::write::h609394e7daf0d74e
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/fmt/mod.rs:1449:21
   6:        0x104112d80 - std::io::Write::write_fmt::h1074bc402491d7d3
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/io/mod.rs:1890:15
   7:        0x10411560c - std::sys::backtrace::BacktraceLock::print::hc4fb1b18a0c8b387
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/backtrace.rs:42:9
   8:        0x10411690c - std::panicking::default_hook::{{closure}}::hb6b08cd6a6fb20b7
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:298:22
   9:        0x104116748 - std::panicking::default_hook::hdc338d6bdae60e32
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:325:9
  10:        0x1040bbc54 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h54abf3d6ab284356
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/alloc/src/boxed.rs:2030:9
  11:        0x1040bbc54 - test::test_main::{{closure}}::h834692d4c837be8a
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:135:21
  12:        0x104117478 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h3548180eba054279
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/alloc/src/boxed.rs:2030:9
  13:        0x104117478 - std::panicking::rust_panic_with_hook::hdf41f43107580f80
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:839:13
  14:        0x104116fe0 - std::panicking::begin_panic_handler::{{closure}}::h51b40281bc5ae809
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:697:13
  15:        0x104115c0c - std::sys::backtrace::__rust_end_short_backtrace::h648e053ccec260a4
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/backtrace.rs:168:18
  16:        0x104116cc0 - rust_begin_unwind
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:695:5
  17:        0x10413c7b0 - core::panicking::panic_nounwind_fmt::runtime::h5c7a415b82efe83a
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/panicking.rs:117:22
  18:        0x10413c7b0 - core::panicking::panic_nounwind_fmt::h27dced4559c784b4
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/intrinsics/mod.rs:3886:9
  19:        0x10413c828 - core::panicking::panic_nounwind::h4415803809b32d64
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/panicking.rs:218:5
  20:        0x10413c990 - core::panicking::panic_cannot_unwind::h0f7c614a43265ccb
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/panicking.rs:323:5
  21:        0x10407f080 - numpy::borrow::shared::acquire_shared::h74dcccb00002bc48
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:43:1
  22:        0x10407f4f4 - numpy::borrow::shared::acquire::hc5aa9a4abd38486b
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:173:23
  23:        0x10407e668 - numpy::borrow::PyReadonlyArray<T,D>::try_new::h3a37eed9b060444e
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/mod.rs:253:9
  24:        0x10405d454 - <pyo3::instance::Bound<numpy::array::PyArray<T,D>> as numpy::array::PyArrayMethods<T,D>>::try_readonly::h7ce0ee3485666e51
                               at /Users/goldbaum/Documents/rust-numpy/src/array.rs:1582:9
  25:        0x10405bfac - numpy::array::PyArrayMethods::readonly::hef8fec9a85738eaf
                               at /Users/goldbaum/Documents/rust-numpy/src/array.rs:1094:9
  26:        0x10403b84c - numpy::borrow::shared::tests::borrow_multiple_views::{{closure}}::haa975291053176c5
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:851:27
  27:        0x104070ec4 - pyo3::marker::Python::with_gil::h3e7ce323c50d4d62
                               at /Users/goldbaum/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/pyo3-0.23.4/src/marker.rs:412:9
  28:        0x10403dcd8 - numpy::borrow::shared::tests::borrow_multiple_views::h7c52265a34336057
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:817:9
  29:        0x10403affc - numpy::borrow::shared::tests::borrow_multiple_views::{{closure}}::ha867acb48c71993d
                               at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:816:31
  30:        0x1040321bc - core::ops::function::FnOnce::call_once::hb5d94b92beb032e7
                               at /Users/goldbaum/.rustup/toolchains/nightly-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
  31:        0x1040c004c - core::ops::function::FnOnce::call_once::haa951a39bd20a04f
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/ops/function.rs:250:5
  32:        0x1040c004c - test::__rust_begin_short_backtrace::hb7ddaba54b385818
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:633:18
  33:        0x1040bf29c - test::run_test_in_process::{{closure}}::hdb20d330baa985a4
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:656:60
  34:        0x1040bf29c - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hd579cd404a890e01
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/panic/unwind_safe.rs:272:9
  35:        0x1040bf29c - std::panicking::try::do_call::h6916657b84d12525
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:587:40
  36:        0x1040bf29c - std::panicking::try::h613b54f79dc5e335
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:550:19
  37:        0x1040bf29c - std::panic::catch_unwind::h3ce2f0b61427f0a6
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panic.rs:358:14
  38:        0x1040bf29c - test::run_test_in_process::h67b1595e42b5f904
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:656:27
  39:        0x1040bf29c - test::run_test::{{closure}}::hadf95aa65ebdfd49
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:577:43
  40:        0x10408e7e8 - test::run_test::{{closure}}::h58715f4d1ae61e71
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/test/src/lib.rs:607:41
  41:        0x10408e7e8 - std::sys::backtrace::__rust_begin_short_backtrace::h08e0fa692ea251b5
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/backtrace.rs:152:18
  42:        0x1040919e4 - std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::hfc986a0debdb70b9
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/thread/mod.rs:559:17
  43:        0x1040919e4 - <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h90550040c3460c34
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/panic/unwind_safe.rs:272:9
  44:        0x1040919e4 - std::panicking::try::do_call::ha0b6ecb20dfefa44
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:587:40
  45:        0x1040919e4 - std::panicking::try::h9c54585a462c554d
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panicking.rs:550:19
  46:        0x1040919e4 - std::panic::catch_unwind::h89e8256c979eefe3
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/panic.rs:358:14
  47:        0x1040919e4 - std::thread::Builder::spawn_unchecked_::{{closure}}::h158e078a65f9eff1
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/thread/mod.rs:557:30
  48:        0x1040919e4 - core::ops::function::FnOnce::call_once{{vtable.shim}}::ha4ca1fedddfaddca
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/core/src/ops/function.rs:250:5
  49:        0x10411ab7c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h118da967bfe3d7e6
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/alloc/src/boxed.rs:2016:9
  50:        0x10411ab7c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hade7b2909668b303
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/alloc/src/boxed.rs:2016:9
  51:        0x10411ab7c - std::sys::pal::unix::thread::Thread::new::thread_start::h3c2a21319777cd23
                               at /rustc/124cc92199ffa924f6b4c7cc819a85b65e0c3984/library/std/src/sys/pal/unix/thread.rs:106:17
  52:        0x18f39c2e4 - __pthread_deallocate
thread caused non-unwinding panic. aborting.
error: test failed, to rerun pass `--lib`

So it is a little painful, I guess rust-numpy would need to install its own panic trampoline similar to what PyO3 has for this to work like it does with PyO3?
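
To make the trampoline idea concrete, here is a minimal sketch of catching a panic before it reaches an `extern "C"` boundary and turning it into an error code. `ffi_guard`, `acquire_guarded`, and the `-1` error convention are hypothetical names and assumptions for illustration; this is not PyO3's trampoline (which, as far as I know, surfaces the panic as a Python exception) and not what rust-numpy does.

```rust
use std::os::raw::c_int;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical helper: run `body` and map a panic to an error code instead
/// of letting it unwind into the `extern "C"` caller.
fn ffi_guard(body: impl FnOnce() -> c_int) -> c_int {
    match catch_unwind(AssertUnwindSafe(body)) {
        Ok(code) => code,
        // Assumed convention: a negative return value means "failed".
        Err(_payload) => -1,
    }
}

/// Illustrative entry point; `should_panic` only simulates an internal bug.
extern "C" fn acquire_guarded(should_panic: c_int) -> c_int {
    ffi_guard(|| {
        if should_panic != 0 {
            panic!("simulated failure inside the borrow checker");
        }
        0
    })
}
```

Note that `catch_unwind` only helps when the crate is built with unwinding panics; with `panic = "abort"` the process still aborts.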

@ngoldbaum

Does it actually make a difference if we use a parking_lot mutex and ignore panics or a standard library mutex and ignore panics? Either way a user would see a Python crash if they were using rust from Python.

A standard library mutex would just let us handle the panic if we wanted to, but I don't think anyone wants that.
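
As a small illustration of what "handling" would mean with the standard library mutex: a panic while the guard is held poisons the lock, and later callers get an `Err` that they can either unwrap (propagating the failure) or recover from with `into_inner`. The `poison_and_recover` function is a hypothetical demo, not code from this PR.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a mutex from a worker thread, then recovers the value anyway.
/// Returns whether the lock reported poisoning and the recovered value.
fn poison_and_recover() -> (bool, i32) {
    let counter = Arc::new(Mutex::new(0));
    let worker = Arc::clone(&counter);

    // The worker panics while holding the guard, which poisons the mutex.
    let result = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        *guard += 1;
        panic!("simulated panic while holding the lock");
    })
    .join();
    assert!(result.is_err());

    let was_poisoned = counter.is_poisoned();
    // `lock()` now returns Err; `into_inner` still hands back the guard.
    let value = match counter.lock() {
        Ok(guard) => *guard,
        Err(poisoned) => *poisoned.into_inner(),
    };
    (was_poisoned, value)
}
```

With `lock().unwrap()` the second caller would simply panic too, which is the "just propagate it" behavior discussed earlier in the thread.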

Sadly the stdout from intentionally crashing the rust_ext example doesn't contain the panic, unless I run pytest with -s and then it looks like:

tests/test_ext.py
thread '<unnamed>' panicked at /Users/goldbaum/Documents/rust-numpy/src/borrow/shared.rs:258:9:
Should be a python error
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at library/core/src/panicking.rs:218:5:
panic in a function that cannot unwind
stack backtrace:
   0:        0x10475734c - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h81a9aa16593d72f2
   1:        0x10476b87c - core::fmt::write::h609394e7daf0d74e
   2:        0x10475571c - std::io::Write::write_fmt::h1074bc402491d7d3
   3:        0x104757200 - std::sys::backtrace::BacktraceLock::print::hc4fb1b18a0c8b387
   4:        0x1047580e4 - std::panicking::default_hook::{{closure}}::hb6b08cd6a6fb20b7
   5:        0x104757f20 - std::panicking::default_hook::hdc338d6bdae60e32
   6:        0x104758c14 - std::panicking::rust_panic_with_hook::hdf41f43107580f80
   7:        0x1047587b8 - std::panicking::begin_panic_handler::{{closure}}::h51b40281bc5ae809
   8:        0x104757800 - std::sys::backtrace::__rust_end_short_backtrace::h648e053ccec260a4
   9:        0x104758498 - _rust_begin_unwind
  10:        0x1047754b0 - core::panicking::panic_nounwind_fmt::h27dced4559c784b4
  11:        0x104775528 - core::panicking::panic_nounwind::h4415803809b32d64
  12:        0x1047755cc - core::panicking::panic_cannot_unwind::h0f7c614a43265ccb
  13:        0x10472b3ec - numpy::borrow::shared::acquire_shared::h019840fba14ba9d3
  14:        0x10472c36c - numpy::borrow::shared::acquire::hdd5bafa90888668d
  15:        0x104727584 - <T as pyo3::conversion::FromPyObjectBound>::from_py_object_bound::h09ef71d31966f2ab
  16:        0x10471f4ac - rust_ext::rust_ext::__pyfunction_head_py::h704f4ef0d92bc693
  17:        0x10471f000 - pyo3::impl_::trampoline::trampoline::hc4710061af0d8f15
  18:        0x10471f41c - rust_ext::rust_ext::<impl rust_ext::rust_ext::head_py::MakeDef>::_PYO3_DEF::trampoline::hb2b1f8f8573724aa
  19:        0x104a4a2fc - _cfunction_vectorcall_FASTCALL_KEYWORDS
  20:        0x1049e6cb8 - _PyObject_Vectorcall
  21:        0x104b250a8 - __PyEval_EvalFrameDefault
  22:        0x1049e6308 - __PyObject_VectorcallDictTstate
  23:        0x1049e7558 - __PyObject_Call_Prepend
  24:        0x104a8b228 - _slot_tp_call
  25:        0x1049e6510 - __PyObject_MakeTpCall
  26:        0x104b26824 - __PyEval_EvalFrameDefault
  27:        0x1049e6308 - __PyObject_VectorcallDictTstate
  28:        0x1049e7558 - __PyObject_Call_Prepend
  29:        0x104a8b228 - _slot_tp_call
  30:        0x1049e6e78 - __PyObject_Call
  31:        0x104b260d4 - __PyEval_EvalFrameDefault
  32:        0x1049e6308 - __PyObject_VectorcallDictTstate
  33:        0x1049e7558 - __PyObject_Call_Prepend
  34:        0x104a8b228 - _slot_tp_call
  35:        0x1049e6510 - __PyObject_MakeTpCall
  36:        0x104b26824 - __PyEval_EvalFrameDefault
  37:        0x1049e6308 - __PyObject_VectorcallDictTstate
  38:        0x1049e7558 - __PyObject_Call_Prepend
  39:        0x104a8b228 - _slot_tp_call
  40:        0x1049e6510 - __PyObject_MakeTpCall
  41:        0x104b26824 - __PyEval_EvalFrameDefault
  42:        0x1049e6308 - __PyObject_VectorcallDictTstate
  43:        0x1049e7558 - __PyObject_Call_Prepend
  44:        0x104a8b228 - _slot_tp_call
  45:        0x1049e6510 - __PyObject_MakeTpCall
  46:        0x104b26824 - __PyEval_EvalFrameDefault
  47:        0x104b229bc - _PyEval_EvalCode
  48:        0x104b9b3bc - _run_eval_code_obj
  49:        0x104b9ae5c - _run_mod
  50:        0x104b97d20 - __PyRun_SimpleFileObject
  51:        0x104b976b8 - __PyRun_AnyFileObject
  52:        0x104bbdc80 - _Py_RunMain
  53:        0x104bbe3d0 - _pymain_main
  54:        0x104bbe470 - _Py_BytesMain
thread caused non-unwinding panic. aborting.
Fatal Python error: Aborted

Current thread 0x00000001f8d0c840 (most recent call first):
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/tests/test_ext.py", line 15 in test_head
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/python.py", line 1627 in runtest
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 242 in <lambda>
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 341 in from_call
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/main.py", line 337 in _main
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/lib/python3.13t/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/Users/goldbaum/Documents/rust-numpy/examples/simple/.nox/tests/bin/pytest", line 8 in <module>

@Icxolu
Contributor

Icxolu commented Feb 15, 2025

Does it actually make a difference if we use a parking_lot mutex and ignore panics or a standard library mutex and ignore panics? Either way a user would see a Python crash if they were using rust from Python.

Good question, I wonder whether it is possible for user code to poison the BorrowFlags lock. If I remember correctly, my problem was in the tests that use BorrowFlags directly, and it was super annoying that a failed assertion would kill the whole test-runner instead of just failing that test. With parking_lot this was not the case. If it can be triggered from user code, I think we should try to avoid failing this way. It's just not really descriptive or user-friendly for understanding what's going on. If we can only trigger it internally, I guess it would be less bad, but we should still try to avoid aborting the test-runner.

Sadly the stdout from intentionally crashing the rust_ext example doesn't contain the panic, unless I run pytest with -s and then it looks like:

Where did you insert the panic!()? In the example code or somewhere within rust-numpy?

@ngoldbaum

You should be able to trigger it with nox -f examples/simple/noxfile.py after applying this diff:

index 6154c951..f345000a 100644
--- a/examples/simple/noxfile.py
+++ b/examples/simple/noxfile.py
@@ -5,4 +5,4 @@ import nox
 def tests(session):
     session.install("pip", "numpy", "pytest")
     session.run("pip", "install", ".", "-v")
-    session.run("pytest")
+    session.run("pytest", "-s")
diff --git a/src/borrow/shared.rs b/src/borrow/shared.rs
index 52b3c70f..7b14e960 100644
--- a/src/borrow/shared.rs
+++ b/src/borrow/shared.rs
@@ -255,7 +255,7 @@ struct BorrowFlags(BorrowFlagsInner);
 impl BorrowFlags {
     fn acquire(&self, address: *mut c_void, key: BorrowKey) -> Result<(), ()> {
         let mut borrow_flags = self.0.lock().unwrap();
-
+        panic!("Should be a python error");
         match borrow_flags.entry(address) {
             Entry::Occupied(entry) => {
                 let same_base_arrays = entry.into_mut();

@Icxolu
Contributor

Icxolu commented Feb 17, 2025

Right, I looked through it again now. I believe there is currently no way for the lock to get poisoned from normal user code. It is exclusively taken in the BorrowFlags methods, which are exclusively called from the corresponding extern "C" <>_shared methods. So any panic that could poison the lock would already cause an abort, because it would unwind through the extern "C", right?

So it's just the tests that acquire the lock via get_borrow_flags for a longer time, with asserts while holding it, that would trigger it. Maybe we can build some test APIs that read out the value we want to assert and release the lock again before actually asserting?

Does that make sense to you?

@ngoldbaum

Maybe we can build some test APIs that read out the value we want to assert and release the lock again before actually asserting?

Does that make sense to you?

I refactored the tests so they grab a copy of the state and the lock isn't held while any asserts happen. While working on this I actually triggered a panic by unwrapping None in the utility function and I saw exactly the annoying behavior you saw. Definitely better to try really hard not to panic with this particular mutex held, especially if many tests are running simultaneously.
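
The snapshot pattern can be sketched roughly as follows. The `Table` type, the `snapshot` function, and the field names on `BorrowFlagsState` are hypothetical (the struct in the PR is a tuple struct whose field meanings are not spelled out here); the only point is that the guard is dropped before any assertion runs, so a failed assert cannot poison the mutex.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical table shape: base address -> (borrow key -> flag).
type Table = HashMap<usize, HashMap<u64, isize>>;

/// Plain-data copy of the values a test wants to assert on.
/// The field names are assumptions chosen for readability.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BorrowFlagsState {
    base_count: usize,
    key_count: usize,
    flag: Option<isize>,
}

fn snapshot(table: &Mutex<Table>, address: usize, key: u64) -> BorrowFlagsState {
    let guard = table.lock().unwrap();
    let same_base = guard.get(&address);
    BorrowFlagsState {
        base_count: guard.len(),
        key_count: same_base.map_or(0, |keys| keys.len()),
        flag: same_base.and_then(|keys| keys.get(&key).copied()),
    }
    // `guard` is dropped here, before the caller asserts on the copy.
}
```

A test then calls `snapshot(...)` and asserts on the returned struct, never on data reached through a live guard.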

Putting my NumPy maintainer hat on: ultimately, IMO NumPy needs a way to borrow-check ndarrays, which would make this all unnecessary. This information should live on the array itself, not in a global cache. Of course you all already know that 😄.

Contributor

@Icxolu Icxolu left a comment

Very nice, just a small suggestion for BorrowFlagsState, otherwise this looks good to me now 🚀

@@ -452,10 +447,27 @@ mod tests {
use crate::untyped_array::PyUntypedArrayMethods;
use pyo3::ffi::c_str;

fn get_borrow_flags<'py>(py: Python<'py>) -> &'py BorrowFlagsInner {
struct BorrowFlagsState(usize, usize, Option<isize>);
Contributor

Very nice solution! I think I would prefer if we gave the fields descriptive names, it looks quite ambiguous and easy to mess up currently. What do you think?

Member Author

@davidhewitt davidhewitt left a comment

Similarly looks good to me, thank you. I guess MutexExt would be great to get into 0.24 upstream so that we can solve the FIXME.

}
}

#[allow(clippy::wrong_self_convention)]
fn from_unit<'py>(&self, py: Python<'py>, unit: NPY_DATETIMEUNIT) -> Bound<'py, PyArrayDescr> {
let mut dtypes = self.dtypes.get(py).borrow_mut();
// FIXME probably a deadlock risk here due to the GIL? Might need MutexExt trait in PyO3

Missed this on my first pass, this should use the manual deadlock avoidance used in datetime.rs.

Labels
CI-no-fail-fast If one job fails, allow the rest to keep testing

3 participants