-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moar threadsafe moar better #101
Moar threadsafe moar better #101
Conversation
I tracked the test failures down to: JuliaLang/julia#53326 I think this should be fixed in Base, made a PR here: JuliaLang/julia#54571 |
9a8469d
to
ec8bce0
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #101 +/- ##
==========================================
+ Coverage 79.24% 79.50% +0.25%
==========================================
Files 10 10
Lines 1913 1922 +9
==========================================
+ Hits 1516 1528 +12
+ Misses 397 394 -3 ☔ View full report in Codecov by Sentry. |
Alrighty, after a force push to trigger CI with latest nightly we are back in the green 🥳 I think this is ready to be merged now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that the original code snippet from #73 (comment) is working now. 🚀
Will this be compatible to or backported, if necessary, to the next Julia LTS?
src/cluster.jl
Outdated
@async manage(w.manager, w.id, w.config, :register) | ||
# wait for rr_ntfy_join with timeout | ||
timedout = false | ||
@async (sleep($timeout); timedout = true; put!(rr_ntfy_join, 1)) | ||
@async begin | ||
sleep($timeout) | ||
timedout = true | ||
put!(rr_ntfy_join, 1) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these tasks need an errormonitor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that makes sense, added in 03d7384.
test/threads.jl
Outdated
ws = ts = product(1:2, 1:2) | ||
@testset "from worker $w1 to $w2 via 1" for (w1, w2) in ws | ||
@testset "from thread $w1.$t1 to $w2.$t2" for (t1, t2) in ts | ||
# We want (the default) lazyness, so that we wait for `Worker.c_state`! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# We want (the default) lazyness, so that we wait for `Worker.c_state`! | |
# We want (the default) laziness, so that we wait for `Worker.c_state`! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in f4576aa.
test/threads.jl
Outdated
end | ||
|
||
# Wait on the spawned tasks on the owner | ||
@sync begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, this sync point should fail fast, if necessary:
@sync begin | |
Base.Experimental.@sync begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that makes sense, I refactored the code to use timedwait()
in f4576aa.
I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib 🐙 |
One other thing I noticed is that this should probably be using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any updates on this?
I believe @JBlaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it. |
e3205d8
to
fa9d645
Compare
@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel... |
Are there any updates on this? @JamesWrigley @JBlaschke |
Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly. But in the meantime you can |
fa9d645
to
b140754
Compare
c_state::Condition # wait for state changes | ||
ct_time::Float64 # creation time | ||
conn_func::Any # used to setup connections lazily | ||
@atomic state::WorkerState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if state is always read/written from inside a lock this doesn't need to be atomic as the lock should have the correct barriers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's guaranteed? From a cursory grep through cluster.jl
I see plenty of reads outside of a lock.
Since we are making things threadsafe I would look at all |
Ok, I replaced all uses of |
After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing I'd suggest keeping it in for now, we can add support for unwrapping exceptions to |
7cac33f
to
01c33e9
Compare
test/threads.jl
Outdated
# Wait on the spawned tasks on the owner. Note that we use | ||
# timedwait() instead of @sync to avoid deadlocks. | ||
t1 = Threads.@spawn fetch_from_owner(wait, recv) | ||
t2 = Threads.@spawn fetch_from_owner(wait, send) | ||
@test timedwait(() -> istaskdone(t1), 5) == :ok | ||
@test timedwait(() -> istaskdone(t2), 5) == :ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to chime in so late after #101 (comment); I noticed because GitHub unfolded all my previous comments.
I like the timedwait
, which is what I used in JuliaLang/julia#37905. However, the timedwait
has been the main reason (I think) why my first PR was reverted (JuliaLang/julia#38112). The second attempt (https://github.com/JuliaLang/julia/pull/38134/files) didn't use timedwait
. I remain in favor of timedwait
but wanted to refresh the information, as it has been a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks. My view is that if this fails we're going to end up with a timeout somewhere no matter what, either in CI or the tests themselves. And my preference would be to have the timeout in the tests so we can have some control over it. I bumped it to 60s in abafb79 but I'm happy to increase that if people think it's too low. @vchuravy, @vtjnash, does that sound ok?
Wee progress update for those following this, after some discussion with @vchuravy on Slack I decided to:
Which I'll get to at... some point. |
Wee update on this, the Julia tests 'failing' last time means they were hanging. In DistributedNext I've fixed a couple of hangs related to lingering tasks, perhaps that will solve the hangs. About the code audit, if someone from the core team commits to reviewing this then I can try to do the audit before the 1.12 release if that's desired. |
I'm happy to review (in addition to the core devs) |
Seems like there's a small conflict |
abafb79
to
612e3f2
Compare
(should be fixed now) |
src/cluster.jl
Outdated
@@ -161,15 +163,16 @@ function check_worker_state(w::Worker) | |||
else | |||
w.ct_time = time() | |||
if myid() > w.id | |||
t = @async exec_conn_func(w) | |||
t = Threads.@spawn exec_conn_func(w) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given the tasks created by Distributed are generally orchestration and not work, I think they should stay on the same threadpool as the caller, which is likely thread 1, and may be :interactive
so I think these all should be
t = Threads.@spawn exec_conn_func(w) | |
t = Threads.@spawn Threads.threadpool() exec_conn_func(w) |
for which I've proposed shorthand of @spawn :same
here JuliaLang/julia#57109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but then why ensure that they stay on the same threadpool as the caller if they're generally lightweight orchestration tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess because the worker threads could be tied up doing blocking stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I set most tasks to use Threads.threadpool()
in 7f1d879. Some I didn't because they're intended to be executed on a worker, which I feel should 'own' the worker process and not have to be sensitive to any non-Distributed code.
Apart from my suggestions above, this seems like it's a big improvement, and given it's lingered a bit waiting for reviews, it may be reasonable to just go ahead and fix any issues in the 1.12 pre-release process |
I agree :) And as I said almost a year ago when I first started pestering people about it, I'm willing to debug any issues in Julia CI that arise. |
If this is merged I will also upstream another hang fix: JuliaParallel/DistributedNext.jl#17 (comment) (or upstream it eventually anyway when that PR is merged) |
Btw, should we rename this PR to "Thread safety improvements v2"? |
What do you mean the title makes perfect sense 😁 (I kid, feel free to change it if you like) |
7f1d879
to
50bdaeb
Compare
This is to avoid them accidentally running in another (potentially busy) threadpool.
50bdaeb
to
90041ca
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've proposed that we merge this, and @JamesWrigley mentioned a couple of other fixes from DistributedNext that can be added imminently, then bumped in time for 1.12 testing.
This is a rebased version of #4, it should be ready to merge. Fixes #73.
(made after discussing with @jpsamaroo)
CC @vchuravy, @vtjnash