Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moar threadsafe moar better #101

Merged

Conversation

JamesWrigley
Copy link
Collaborator

This is a rebased version of #4, it should be ready to merge. Fixes #73.

(made after discussing with @jpsamaroo)

CC @vchuravy, @vtjnash

@JamesWrigley
Copy link
Collaborator Author

I tracked the test failures down to: JuliaLang/julia#53326
The workers are loading the builtin version of Distributed from the sysimg while the master is using the development version, and because we changed the definition of the Worker struct in this PR we get a serialization error when doing a remotecall to another worker because the master and worker are using different types.

I think this should be fixed in Base, made a PR here: JuliaLang/julia#54571
That should be merged before this one.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from 9a8469d to ec8bce0 Compare May 25, 2024 10:57
Copy link

codecov bot commented May 25, 2024

Codecov Report

Attention: Patch coverage is 84.74576% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.50%. Comparing base (3b9e4fd) to head (90041ca).
Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
src/cluster.jl 86.95% 6 Missing ⚠️
src/managers.jl 50.00% 1 Missing ⚠️
src/messages.jl 0.00% 1 Missing ⚠️
src/process_messages.jl 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   79.24%   79.50%   +0.25%     
==========================================
  Files          10       10              
  Lines        1913     1922       +9     
==========================================
+ Hits         1516     1528      +12     
+ Misses        397      394       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JamesWrigley
Copy link
Collaborator Author

Alrighty, after a force push to trigger CI with latest nightly we are back in the green 🥳 I think this is ready to be merged now.

Copy link
Contributor

@jonas-schulze jonas-schulze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that the original code snippet from #73 (comment) is working now. 🚀

Will this be compatible to or backported, if necessary, to the next Julia LTS?

src/cluster.jl Outdated
Comment on lines 709 to 716
@async manage(w.manager, w.id, w.config, :register)
# wait for rr_ntfy_join with timeout
timedout = false
@async (sleep($timeout); timedout = true; put!(rr_ntfy_join, 1))
@async begin
sleep($timeout)
timedout = true
put!(rr_ntfy_join, 1)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these tasks need an errormonitor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that makes sense, added in 03d7384.

test/threads.jl Outdated
ws = ts = product(1:2, 1:2)
@testset "from worker $w1 to $w2 via 1" for (w1, w2) in ws
@testset "from thread $w1.$t1 to $w2.$t2" for (t1, t2) in ts
# We want (the default) lazyness, so that we wait for `Worker.c_state`!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# We want (the default) lazyness, so that we wait for `Worker.c_state`!
# We want (the default) laziness, so that we wait for `Worker.c_state`!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f4576aa.

test/threads.jl Outdated
end

# Wait on the spawned tasks on the owner
@sync begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this sync point should fail fast, if necessary:

Suggested change
@sync begin
Base.Experimental.@sync begin

See JuliaLang/julia#42239 (comment)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that makes sense, I refactored the code to use timedwait() in f4576aa.

@JamesWrigley
Copy link
Collaborator Author

Will this be compatible to or backported, if necessary, to the next Julia LTS?

I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib 🐙

@JamesWrigley
Copy link
Collaborator Author

One other thing I noticed is that this should probably be using Threads.@spawn everywhere instead of @async, but I'll leave that for another PR.

Copy link
Contributor

@jonas-schulze jonas-schulze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any updates on this?

@JamesWrigley
Copy link
Collaborator Author

I believe @JBlaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch 2 times, most recently from e3205d8 to fa9d645 Compare June 24, 2024 14:11
@JBlaschke
Copy link

@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel...

@zsz00
Copy link

zsz00 commented Jul 21, 2024

Are there any updates on this? @JamesWrigley @JBlaschke

@JamesWrigley
Copy link
Collaborator Author

JamesWrigley commented Jul 21, 2024

Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly.

But in the meantime you can ] dev https://github.com/JamesWrigley/Distributed.jl.git#jps/threadsafe_workerstate on 1.11 to use this branch.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from fa9d645 to b140754 Compare July 21, 2024 09:36
c_state::Condition # wait for state changes
ct_time::Float64 # creation time
conn_func::Any # used to setup connections lazily
@atomic state::WorkerState
Copy link
Member

@gbaraldi gbaraldi Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if state is always read/written from inside a lock this doesn't need to be atomic as the lock should have the correct barriers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's guaranteed? From a cursory grep through cluster.jl I see plenty of reads outside of a lock.

@gbaraldi
Copy link
Member

Since we are making things threadsafe I would look at all @async uses because we want to deprecate it.

@JamesWrigley
Copy link
Collaborator Author

Ok, I replaced all uses of @async with Threads.@spawn in 7cac33f. But there are still a couple places where Base.@async_unwrap is used and I can't find an easy analogue for Threads.@spawn, I'll see if I can write a replacement.

@JamesWrigley
Copy link
Collaborator Author

After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing Threads.@spawn. We could replace @async_unwrap with Threads.@spawn, but it would change the type of any exceptions thrown from whatever they currently are to a TaskFailedException, which is technically breaking.

I'd suggest keeping it in for now, we can add support for unwrapping exceptions to Threads.@spawn and switch in a future release.

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from 7cac33f to 01c33e9 Compare July 25, 2024 07:51
test/threads.jl Outdated
Comment on lines 47 to 52
# Wait on the spawned tasks on the owner. Note that we use
# timedwait() instead of @sync to avoid deadlocks.
t1 = Threads.@spawn fetch_from_owner(wait, recv)
t2 = Threads.@spawn fetch_from_owner(wait, send)
@test timedwait(() -> istaskdone(t1), 5) == :ok
@test timedwait(() -> istaskdone(t2), 5) == :ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to chime in so late after #101 (comment); I noticed because GitHub unfolded all my previous comments.

I like the timedwait, which is what I used in JuliaLang/julia#37905. However, the timedwait has been the main reason (I think) why my first PR was reverted (JuliaLang/julia#38112). The second attempt (https://github.com/JuliaLang/julia/pull/38134/files) didn't use timedwait. I remain in favor of timedwait but wanted to refresh the information, as it has been a while.

Copy link
Collaborator Author

@JamesWrigley JamesWrigley Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. My view is that if this fails we're going to end up with a timeout somewhere no matter what, either in CI or the tests themselves. And my preference would be to have the timeout in the tests so we can have some control over it. I bumped it to 60s in abafb79 but I'm happy to increase that if people think it's too low. @vchuravy, @vtjnash, does that sound ok?

@JamesWrigley
Copy link
Collaborator Author

Wee progress update for those following this, after some discussion with @vchuravy on Slack I decided to:

  1. Try to run the Julia tests in the Distributed.jl CI since that's what was failing last time.
  2. Attempt to audit the code for correctness.

Which I'll get to at... some point.

@JamesWrigley
Copy link
Collaborator Author

Wee update on this, the Julia tests 'failing' last time means they were hanging. In DistributedNext I've fixed a couple of hangs related to lingering tasks, perhaps that will solve the hangs.

About the code audit, if someone from the core team commits to reviewing this then I can try to do the audit before the 1.12 release if that's desired.

@IanButterworth
Copy link
Member

I'm happy to review (in addition to the core devs)

@IanButterworth
Copy link
Member

Seems like there's a small conflict

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from abafb79 to 612e3f2 Compare January 20, 2025 13:48
@JamesWrigley
Copy link
Collaborator Author

(should be fixed now)

src/cluster.jl Outdated
@@ -161,15 +163,16 @@ function check_worker_state(w::Worker)
else
w.ct_time = time()
if myid() > w.id
t = @async exec_conn_func(w)
t = Threads.@spawn exec_conn_func(w)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given the tasks created by Distributed are generally orchestration and not work, I think they should stay on the same threadpool as the caller, which is likely thread 1, and may be :interactive so I think these all should be

Suggested change
t = Threads.@spawn exec_conn_func(w)
t = Threads.@spawn Threads.threadpool() exec_conn_func(w)

for which I've proposed shorthand of @spawn :same here JuliaLang/julia#57109

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but then why ensure that they stay on the same threadpool as the caller if they're generally lightweight orchestration tasks?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess because the worker threads could be tied up doing blocking stuff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I set most tasks to use Threads.threadpool() in 7f1d879. Some I didn't because they're intended to be executed on a worker, which I feel should 'own' the worker process and not have to be sensitive to any non-Distributed code.

src/cluster.jl Outdated Show resolved Hide resolved
@IanButterworth
Copy link
Member

Apart from my suggestions above, this seems like it's a big improvement, and given it's lingered a bit waiting for reviews, it may be reasonable to just go ahead and fix any issues in the 1.12 pre-release process

@JamesWrigley
Copy link
Collaborator Author

JamesWrigley commented Jan 21, 2025

I agree :) And as I said almost a year ago when I first started pestering people about it, I'm willing to debug any issues in Julia CI that arise.

@JamesWrigley
Copy link
Collaborator Author

If this is merged I will also upstream another hang fix: JuliaParallel/DistributedNext.jl#17 (comment)

(or upstream it eventually anyway when that PR is merged)

@IanButterworth
Copy link
Member

Btw, should we rename this PR to "Thread safety improvements v2"?

@JamesWrigley
Copy link
Collaborator Author

What do you mean the title makes perfect sense 😁

(I kid, feel free to change it if you like)

@JamesWrigley JamesWrigley force-pushed the jps/threadsafe_workerstate branch from 7f1d879 to 50bdaeb Compare January 21, 2025 22:01
@IanButterworth IanButterworth force-pushed the jps/threadsafe_workerstate branch from 50bdaeb to 90041ca Compare January 21, 2025 23:55
Copy link
Member

@IanButterworth IanButterworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've proposed that we merge this, and @JamesWrigley mentioned a couple of other fixes from DistributedNext that can be added imminently, then bumped in time for 1.12 testing.

@IanButterworth IanButterworth merged commit 0e27cd1 into JuliaLang:master Jan 22, 2025
8 checks passed
@JamesWrigley JamesWrigley deleted the jps/threadsafe_workerstate branch January 22, 2025 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrency violation on interplay between Distributed and Base.Threads
8 participants