WIP: Multi-threaded assembly #54
Conversation
@wcwitt -- I had to write a manual "scheduler" to assemble the lsq blocks in the right order. This becomes an issue when a dataset has structures of very different sizes. I don't know how pmap works, but with @threads this strange fix was needed. I wonder whether the bad performance in the distributed assembly for me is something similar.
@tjjarvinen --- my implementation in this PR sometimes hangs at the very end of the LSQ system assembly. Would you be willing to take a quick look and see if something jumps out at you? When I interrupt it with CTRL-C, it seems it got stuck waiting for the lock to become free. (Also, is there a more elegant way to achieve this?)
Follow-up: I think I may have found the bug. Testing now ... I'll let you know if I can't fix it after all.
I tried this mt implementation and I find it much more pleasant to use than the current
I'm ok with merging as is, but alternatively we could first integrate it into the user interface. @wcwitt wasn't too keen on this in the first place, but I think it is sufficiently useful for us that we can ask him again to consider it. In the near future we should merge the distributed and MT assembly into a single framework, as discussed in other issues.
Letting you know that I've been working on this today, partly in response to @CheukHinHoJerry's experience.
I don't understand how/why this would happen - can you elaborate?
Recently, I have been seeing this on large datasets with both the distributed and the multi-threaded assembly. Not sure what changed, but it feels like a memory leak. Adding
When you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue, since that operation can take a long time.
If gc helps, then I don't think it can be a memory leak? It is possible that during multi-threading the GC doesn't turn itself on as often. I haven't done any reading on this, but this has been my impression during benchmarking. Maybe I'm completely wrong. @tjjarvinen, can you please comment? Also, in general, what are typical reasons to get OOMs in Julia? How is this even possible?
The main reason for OOM is that you allocate too much data. GC helps a little in these cases, because it clears out some unneeded data. To me this issue sounds like Julia is starting to use swap, which results in a slowdown. Once swap is used up too, it will cause an OOM. Have you looked at how much memory Julia is using (plus memory per process) and what the swap usage is? Also, what is the estimated memory usage for the assembly?
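For context, here is a minimal, self-contained sketch of the kind of explicit per-structure garbage collection being discussed in this thread. It is not the package code: `assemble_block` and the fake `structures` are placeholders standing in for the real assembly.

```julia
using Distributed
addprocs(4)

@everywhere begin
    # Stand-in for assembling the LSQ rows of one training structure.
    assemble_block(n) = randn(n, 10)
end

structures = rand(50:500, 100)     # pretend structures of very different sizes

rows = pmap(structures) do n
    block = assemble_block(n)
    GC.gc()                        # explicit collection on the worker after each structure
    block
end
A = reduce(vcat, rows)
```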
I believe you, but I still don't understand. This rarely takes more than a few seconds for me interactively (and the
This is news to me; I was unaware of this. Maybe we should just test this again more carefully. In principle, if firing up 50 workers is more or less instantaneous, then we can discuss dropping the mt again.
Instantiating is not needed if you are using the same node, or when using a Julia install that has access to the same local data (the .julia folder). If you have different nodes that do not have access to the same storage space, then you need to instantiate on each worker.
That wasn't my experience. I don't know why, but simply
So maybe one of you can just put together a script for us so we can try using multiple workers interactively. We will try it, and if it works fine we continue from there?
You've convinced me the mt is worthwhile ... as soon as I'm done experimenting with it we can merge. Otherwise, at this point I'm just making sure I understand your workflow - like whether your workers are somehow on another machine. More importantly, I now understand the timing of this OOM stuff. I used to have this line
which I removed recently during some rearranging [54b7b2e]. I'll put it back. In principle, I don't think it should be necessary, but the forums are full of people complaining of parallel-related memory issues, such that it's a little hard to figure out what is current/relevant.
I looked at the code in this PR, and this part:

```julia
while next <= length(packets)
    # retrieve the next packet
    if next > length(packets)
        break
    end
    lock(_lock)
    cur = next
    next += 1
    unlock(_lock)
    if cur > length(packets)
        break
    end
    p = packets[cur]
```

is attempting to reimplement Channels. Rather than trying to rediscover Channels, just use the existing Channels. It makes the code better in all possible ways.
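To illustrate the point, here is a minimal sketch (not the package implementation) of how the manual index-plus-lock scheduler above could be replaced by a `Channel`; `packets` and `process` are hypothetical placeholders.

```julia
using Base.Threads

function assemble_with_channel(packets, process)
    results = Vector{Any}(undef, length(packets))
    # Buffered channel of (index, packet) pairs; take! is already thread-safe,
    # so no manual lock or shared counter is needed.
    jobs = Channel{Tuple{Int,eltype(packets)}}(length(packets))
    for (i, p) in enumerate(packets)
        put!(jobs, (i, p))
    end
    close(jobs)                       # consumers stop once the channel drains
    @sync for _ in 1:nthreads()
        Threads.@spawn for (i, p) in jobs
            results[i] = process(p)   # each index is written by exactly one task
        end
    end
    return results
end
```

Because the channel hands out work items atomically, the explicit lock and the duplicated bounds checks disappear.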
On the mt, I'm close to something I'm happy with (influenced by @tjjarvinen's recommendation of Folds.jl). Can we just pause that discussion until I'm done, and then you can critique it from there?
When you spawn new processes they will get the project dir from the host. But every process needs to load all the packages separately, so you need to start with

```julia
using Distributed
addprocs(20)
@everywhere using List_of_packages
```
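For the interactive workflow discussed above, one variant (my assumption, not something stated in this thread or the docs) is to pass the host's active project to the workers explicitly, so every local worker uses the same environment; `ACEfit` is just an example package name from this thread.

```julia
using Distributed
# Start local workers that reuse the host's active project environment.
addprocs(20; exeflags = "--project=$(Base.active_project())")
# Load the fitting packages on every process (example package name).
@everywhere using ACEfit
```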
Our docs are actually relatively decent here - the "script" approach should work interactively. I don't say this to be annoying - if they aren't sufficient I will improve them. https://acesuit.github.io/ACE1pack.jl/dev/gettingstarted/parallel-fitting/
I've never seen this before. My mistake, sorry.
@tjjarvinen -- I wrote the part above that you quote. I agree with you of course. But remember this was a temporary hack and for me it is faster to write this than to learn about Channels. Let's see what Chuck comes up with and then discuss.
@wcwitt --- just to confirm that, with your instructions above, it becomes convenient enough to assemble the LSQ system distributed instead of multi-threaded. The barrier is still slightly higher, but not nearly as bad as it used to be. So from my end, we can make this low priority.
(Also, I can confirm that putting the gc() back into the distributed assembly prevented some OOMs for me just now. I tried with versions of the package from before and after...)
Thanks - I'm glad, but we're deep enough into this now that I'm going to try to finish it off.
Would someone please look at -- or possibly try -- this example? It requires the
For me, the results, after starting Julia with
indicating a huge slowdown when I garbage collect from each thread.
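The original example is elided above; the following is only a rough stand-in for the kind of comparison being described, timing the same threaded loop with and without a `GC.gc()` call on every iteration.

```julia
using Base.Threads

function work!(out; gc_each = false)
    @threads for i in eachindex(out)
        out[i] = sum(abs2, randn(10_000))   # stand-in for assembling one block
        gc_each && GC.gc()                  # collect from within each thread
    end
end

out = zeros(200)
work!(out)                          # warm-up / compilation
@time work!(out)                    # without per-iteration GC
@time work!(out; gc_each = true)    # with per-iteration GC
```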
Interesting - your script managed to crash my laptop all the way to a reboot ...
Independent of my problem - can you look at Julia 1.10? I read somewhere that it has a multi-threaded GC. I wonder whether this solves your problem.
Here are my times on Julia 1.9:
and on Julia 1.10:
So interestingly, the distributed is hands-down the faster version. The GC could be kept on as default, but it would be good to have an option to turn it off when the user wants it?
I think, all things considered, it's maybe good to keep distributed as the default. Especially as we move towards multi-threaded acceleration of model evaluation.
Thank you very much for taking the time to do this. Naturally, I like the distributed and I would be happy for it to be the default, but I remain unsettled by this: #49 (comment). Your observation that the threading performs better for very small configs seems correct, and I won't be satisfied with default-distributed until I've managed to resolve that part.
Did you rerun the test above with the latest ACEfit / ACE1pack on Julia 1.9 and 1.10?
Another thing we can do is feed training structures to processes in batches.
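A rough sketch of that batching idea (not the package code; `assemble_one` and `batchsize` are placeholders): partition the structures and let each worker assemble a whole batch at a time.

```julia
using Distributed

function assemble_in_batches(structures, assemble_one; batchsize = 16)
    batches = collect(Iterators.partition(structures, batchsize))
    batch_rows = pmap(batches) do batch
        # Each worker assembles the rows for a full batch before returning.
        reduce(vcat, [assemble_one(s) for s in batch])
    end
    return reduce(vcat, batch_rows)
end
```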
@wcwitt -- is this now obsolete - can we close it?
I'd rather keep it open for a bit longer, if that's okay.
I'm fine to close this now, if you still want to. Linking #49 for reference.
So is the MT assembly now in the package, or was it removed again because of the poor performance?
To answer your question, it's not in the package, but it does live on in a branch. Eventually I will solve the performance dilemma.
For now this is just an experiment so we can see what it might look like; the interface probably needs to be changed.