-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support 3.13t free-threaded python #471
base: main
Are you sure you want to change the base?
Conversation
src/strings.rs
Outdated
// FIXME probably a deadlock risk here due to the GIL? Might need MutexExt trait in PyO3 | ||
let mut dtypes = self.dtypes.lock().expect("dtype cache poisoned"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choice to use Mutex
here makes me want to add a MutexExt
trait to PyO3 similar to OnceLockExt
which we already added.
Given that writes to this should be infrequent (it's a global cache, as far as I can tell), I also wonder if RwLock
is appropriate here. Readers are extremely short-lived so the analysis in https://blog.nelhage.com/post/rwlock-contention/ seems to be a non-issue (cc @alex)
... in which case I want RwLockExt
too 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess I should finally write MutexExt
...
Based on the failures here, I propose rust-numpy 0.23.0 releases without free-threading support, and we seek to follow up in 0.23.1. |
I did a bit of investigating here. I believe the fundamental problem is that the borrow checking API relies on the GIL for synchronization of exclusive access rust-numpy/src/borrow/shared.rs Lines 45 to 48 in b34bc29
On the free-threaded build this is not true anymore, and therefore unsound. An easy (but probably not optimal) solution is to wrap |
@Icxolu @davidhewitt this just showed up as a dependency for one of the libraries I'm working on and as a NumPy maintainer maybe it's natural for me to work on this. Would you two mind if I took over work in this PR? Also to everyone linking to this issue - sorry for not getting to this sooner! For some reason I thought this was done already. |
I'm good with that, thank you very much 🙏. I think I still have the branch from my experiments above (on top of this branch), I could push them here (or to a separate branch) if you think that would be helpful. |
That would be! I can take a look at making the tests thread safe, I did a lot of that for pyo3. I also want to run the tests on the free-threaded build with TSAN enabled. I wouldn’t be surprised if that elicits some races in NumPy. |
There you go. I think this is fine, but it would be great if you could double check that. I guess it would also be nice if we could use a |
Please do take this over; I had hoped to make progress though expect to be a few more weeks away from doing so myself. |
So good news: I ran the rust-numpy tests with TSAN using a hacked-together version of cargo stress to see the TSAN output from "successful" stress test runs. I didn't see any warnings besides ones I already found and reported yesterday in PyO3: PyO3/pyo3#4904 I want to try again using TSAN using a version of Python compiled with the 3.14 branch, which will hopefully be less noisy and allow me to determine if any of these issues still need to be fixed on CPython For now I'm going to ignore TSAN on Also incidentally I didn't see any test failures. I'll push a change that resolves the merge conflict so CI can re-run, along with a couple more fixes I found. |
Oh actually since I'm not a rust-numpy maintainer I can't just push to the PR.
@davidhewitt can you give me push access to your rust-numpy fork? Or maybe a rust-numpy commit bit.... 😜 |
Done. Given you already have write access to both PyO3 and numpy, this seemed like a no-brainer solution :) |
Looks like CI is passing. I think we probably don't want to add I also do want to write |
I think all of the I also "manually" added deadlock avoidance with raw FFI calls following what It occurs to me that we could use I did a once-over on the library and didn't spot anything thread-unsafe. There is some use of Of course there are many ways to break ndarray by mutating an array simultaneously in multiple threads, but NumPy doesn't support shared mutation yet and that's not rust-numpy's fault. |
I thought I had that initially but ran into problems with panics crossing the ffi boundary... But maybe I misremember or did something wrong back then. |
Let me try manually inserting panics in these functions to see what happens. |
I'll try to have a look again this weekend, maybe I can reproduce what caused me to use |
Oh darn, it does look like it panics instead of getting converted into an error:
So it is a little painful, I guess rust-numpy would need to install its own panic trampoline similar to what PyO3 has for this to work like it does with PyO3? |
Does it actually make a difference if we use a A standard library mutex would just let us handle the panic if we wanted to, but I don't think anyone wants that. Sadly the stdout from intentionally crashing the
|
Good question, I wonder whether it is possible for user code to poison the
Where did you insert the |
You should be able to trigger it with
|
Right, I looked through it again now. I believe there is currently no way for the lock to get poisoned from normal user code. It is exclusively taken in the So it's just the tests that acquire the lock via Does that make sense to you? |
I refactored the tests so they grab a copy of the state and the lock isn't held while any asserts happen. While working on this I actually triggered a panic by unwrapping putting my NumPy maintainer hat on Ultimately IMO NumPy needs a way to borrow-check ndarrays, which would make this all unnecessary. This information should live on the array itself, not in a global cache. Of course you all already know that 😄. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, just a small suggestion for BorrowFlagsState
otherwise this looks good to me now 🚀
src/borrow/shared.rs
Outdated
@@ -452,10 +447,27 @@ mod tests { | |||
use crate::untyped_array::PyUntypedArrayMethods; | |||
use pyo3::ffi::c_str; | |||
|
|||
fn get_borrow_flags<'py>(py: Python<'py>) -> &'py BorrowFlagsInner { | |||
struct BorrowFlagsState(usize, usize, Option<isize>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice solution! I think I would prefer if we gave the fields descriptive names, it looks quite ambiguous and easy to mess up currently. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly looks good to me, thank you. I guess MutexExt
would be great to get into 0.24 upstream so that we can solve the FIXME.
} | ||
} | ||
|
||
#[allow(clippy::wrong_self_convention)] | ||
fn from_unit<'py>(&self, py: Python<'py>, unit: NPY_DATETIMEUNIT) -> Bound<'py, PyArrayDescr> { | ||
let mut dtypes = self.dtypes.get(py).borrow_mut(); | ||
// FIXME probably a deadlock risk here due to the GIL? Might need MutexExt trait in PyO3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this on my first pass, this should use the manual deadlock avoidance used in datetime.rs
.
First pass at adding CI and tests for freethreaded Python.
Locally I get some test failures, so I will have to investigate further before this is ready even if CI is green. cc @ngoldbaum