WIP: Multi-threaded assembly #54

Closed. Wants to merge 5 commits.

Conversation

@cortner (Member) commented Jun 21, 2023

For now this is just an experiment so we can see what it might look like; the interface probably needs to be changed.

@cortner (Member Author) commented Jun 23, 2023

@wcwitt -- I had to write a manual "scheduler" to assemble the lsq blocks in the right order; this becomes an issue when a dataset has structures of very different sizes. I don't know how pmap works, but with @threads this strange fix was needed. I wonder whether the poor performance I see in the distributed assembly has a similar cause.
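To illustrate the ordering idea only (this is not the code in the PR, and structures is just a placeholder name): sorting the work items by size before scheduling, largest first, is the kind of fix I mean:

# Hypothetical illustration, not the PR code: schedule the largest
# structures first so no thread is left holding a huge block at the end.
order = sortperm(length.(structures); rev = true)   # `structures` is a placeholder
packets = structures[order]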

@cortner (Member Author) commented Jun 24, 2023

@tjjarvinen --- my implementation in this PR sometimes hangs at the very end of the LSQ system assembly. Would you be willing to take a quick look and see if something jumps out at you? When I interrupt it with CTRL-C, it seems to be stuck waiting for the lock to become free.

(also is there a more elegant way to achieve this?)

@cortner (Member Author) commented Jun 24, 2023

Follow-up: I think I may have found the bug. Testing now ... I'll let you know if I can't fix it after all.

@CheukHinHoJerry (Collaborator) commented Jul 14, 2023

I tried this mt implementation and find it much more pleasant to use than the current ACEfit.assemble. It does not have the Pkg.instantiate overhead or the OOM problem. Could we have this as an experimental feature and merge it, even if it is not user-facing yet?

CC @cortner @wcwitt

@cortner (Member Author) commented Jul 14, 2023

I'm ok with merging as is, but alternatively we could first integrate it into the user interface.

@wcwitt wasn't too keen on this in the first place, but I think it is sufficiently useful for us that we can ask him again to consider it.

In the near future we should merge the distributed and MT assembly into a single framework, as discussed in other issues.

@wcwitt (Collaborator) commented Jul 18, 2023

Letting you know that I've been working on this today, partly in response to @CheukHinHoJerry's experience.

> the Pkg.instantiate overhead

I don't understand how/why this would happen - can you elaborate?

> the OOM problem

Recently, I have been seeing this on large datasets with both the distributed and the multithreaded assembly. I'm not sure what changed, but it feels like a memory leak. Adding GC.gc() after assembling each block alleviates things somewhat, although I don't know what it does to performance.

@cortner (Member Author) commented Jul 18, 2023

> I don't understand how/why this would happen - can you elaborate?

When you work interactively as we do, you first add the workers and then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue, since that operation can take a long time.
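For concreteness, a sketch of the kind of session I mean (the exact project handling here is an assumption, not our setup verbatim); the per-worker instantiate is the step that takes forever:

# Sketch of the interactive workflow being described, not our exact script.
using Distributed, Pkg
addprocs(50; exeflags = "--project=$(Base.active_project())")
@everywhere using Pkg
@everywhere Pkg.instantiate()     # the slow step, run on every worker
@everywhere using ACE1pack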

@cortner (Member Author) commented Jul 18, 2023

> Recently, I have been seeing this on large datasets with both the distributed and the multithreaded assembly. I'm not sure what changed, but it feels like a memory leak.

If gc helps, then I don't think it can be a memory leak? It is possible that during multi-threading the gc doesn't turn itself on as often. I haven't done any reading on this, but this has been my impression during benchmarking. Maybe I'm completely wrong.

@tjjarvinen can you please comment? Also, in general, what are typical reasons to get OOMs in Julia? How is this even possible?

@tjjarvinen (Collaborator) commented

The main reason for OOM is that you allocate too much data. GC helps a little in these cases because it clears out some unneeded data.

To me this issue sounds like Julia is starting to use swap, which shows up as a slowdown. Once swap is used up too, it causes OOM.

Have you looked at how much memory Julia is using (plus memory per process) and what the swap usage is?

Also, what is the estimated memory footprint of the assembly?
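For reference, memory headroom and per-process usage can be checked from within Julia with Base/Distributed functions like these (only a sketch of how to look, not tied to the assembly code; swap usage still has to be checked at the OS level):

using Distributed
# free / total memory on the host, in GB
println("free: ", round(Sys.free_memory() / 2^30; digits = 1), " GB of ",
        round(Sys.total_memory() / 2^30; digits = 1), " GB")
# peak resident set size of each process, in GB
@everywhere println("worker ", myid(), " maxrss: ",
                    round(Sys.maxrss() / 2^30; digits = 2), " GB")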

@wcwitt (Collaborator) commented Jul 18, 2023

> When you work interactively as we do, you first add the workers and then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue, since that operation can take a long time.

I believe you, but I still don't understand. This rarely takes more than a few seconds for me interactively (and the exeflags thing shouldn't be necessary since 1.9) - when/why is an instantiate required?

using Distributed
using ACE1pack
addprocs(50, exeflags="--project=$(Base.active_project())")
@everywhere using ACE1pack

@cortner (Member Author) commented Jul 18, 2023

> addprocs(50, exeflags="--project=$(Base.active_project())")

This is news to me, I was unaware of it.

Maybe we should just test this again more carefully. In principle, if firing up 50 workers is more or less instantaneous, then we can discuss dropping the mt again.

@tjjarvinen (Collaborator) commented

> when/why is an instantiate required?

Instantiate is not needed if you are using the same node, or a Julia install that has access to the same local data (the .julia folder). If you have different nodes that do not have access to the same storage space, then you need to instantiate on each worker.

@cortner (Member Author) commented Jul 18, 2023

That wasn't my experience. I don't know why, but simply calling using did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.

@cortner (Member Author) commented Jul 18, 2023

So maybe one of you can just put together a script for us so we can try using multiple workers interactively. We will try it, and if it works fine, we continue from there?

@wcwitt (Collaborator) commented Jul 18, 2023

You've convinced me the mt is worthwhile ... as soon as I'm done experimenting with it we can merge. Beyond that, at this point I'm just making sure I understand your workflow - like whether your workers are somehow on another machine.

More importantly, I now understand the timing of this OOM stuff. I used to have this line

GC.gc()  # not sure if this is necessary?

which I removed recently during some rearranging [54b7b2e]. I'll put it back. In principle I don't think it should be necessary, but the forums are full of people complaining about parallel-related memory issues, so it's a little hard to figure out what is current/relevant.
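For context, the pattern under discussion is roughly the following; assemble_block and blocks are placeholders, and do_gc just mirrors the keyword used further down, not necessarily the exact ACEfit signature:

# Rough sketch: optionally force a full collection after every block so
# the large per-block temporaries are reclaimed right away.
for (i, block) in enumerate(blocks)        # `blocks` is a placeholder
    A[i] = assemble_block(block)           # placeholder for the real per-block work
    do_gc && GC.gc()
end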

@tjjarvinen (Collaborator) commented

I looked at the code in this PR and found

while next <= length(packets)
    # retrieve the next packet
    if next > length(packets)
        break
    end
    lock(_lock)
    cur = next
    next += 1
    unlock(_lock)
    if cur > length(packets)
        break
    end
    p = packets[cur]

This part is attempting to reimplement Channels. Rather than rediscovering Channels, just use the existing Channel type. It makes the code better in every possible way.
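To make the suggestion concrete, here is a minimal sketch of the Channel-based version, assuming packets is a plain vector of work items and process_packet! stands in for the per-block assembly (both are placeholder names, not code from this PR):

using Base.Threads

# Minimal sketch: a Channel replaces the manual lock/counter scheduler.
function assemble_packets(packets)
    ch = Channel{eltype(packets)}(length(packets))
    foreach(p -> put!(ch, p), packets)
    close(ch)                          # consumers stop once the channel is drained
    @sync for _ in 1:nthreads()
        Threads.@spawn for p in ch     # iterating a Channel is safe across tasks
            process_packet!(p)
        end
    end
end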

@wcwitt (Collaborator) commented Jul 18, 2023

On the mt, I'm close to something I'm happy with (influenced by @tjjarvinen's recommendation of Folds.jl). Can we pause that discussion until I'm done, and then you can critique it from there?

@tjjarvinen (Collaborator) commented

> That wasn't my experience. I don't know why, but simply calling using did not work for me. Maybe things have changed in the past few versions of Julia and I missed those changes.

When you spawn new processes, they get the project dir from the host. But every process needs to load all the packages separately, so you need to start with

using Distributed
addprocs(20)

@everywhere using List_of_packages

@wcwitt (Collaborator) commented Jul 18, 2023

> So maybe one of you can just put together a script for us so we can try using multiple workers interactively.

Our docs are actually relatively decent here - the "script" approach should work interactively. I don't say this to be annoying - if they aren't sufficient I will improve them. https://acesuit.github.io/ACE1pack.jl/dev/gettingstarted/parallel-fitting/

@cortner (Member Author) commented Jul 18, 2023

> Our docs are actually relatively decent here ...

I've never seen this before. My mistake, sorry.

@cortner (Member Author) commented Jul 18, 2023

@tjjarvinen -- I wrote the part above that you quote. I agree with you, of course. But remember, this was a temporary hack, and for me it was faster to write it this way than to learn about Channels. Let's see what Chuck comes up with and then discuss.

@cortner (Member Author) commented Jul 19, 2023

@wcwitt --- just to confirm that with your instructions above it becomes more convenient to assemble the LSQ system distributed rather than multi-threaded. The barrier to entry is still slightly higher, but not nearly as bad as it used to be. So from my end, we can make this low priority.

@cortner (Member Author) commented Jul 19, 2023

(Also, I can confirm that putting the gc() back into the distributed assembly prevented some OOMs for me just now. I tried with versions of the package from before and after the change...)

@wcwitt (Collaborator) commented Jul 20, 2023

Thanks - I'm glad, but we're deep enough into this now that I'm going to try to finish it off.

@wcwitt (Collaborator) commented Jul 20, 2023

Would someone please look at -- or possibly try -- this example? It requires the wcw/gc-test branch of ACEfit, but I have merged this branch into that one. The Si PRX data are available here and I started Julia with julia -p 9 -t 10. See this diff for the relevant changes to the assembly.

using Distributed
@everywhere using ACE1pack

dataset = "gp_iter6_sparse9k.xml.xyz"
data_keys = [:energy_key => "dft_energy",
             :force_key => "dft_force",
             :virial_key => "dft_virial"]
model = acemodel(elements = [:Si],
                 Eref = [:Si => -158.54496821],
                 rcut = 5.5,
                 order = 4,
                 totaldegree = 16)
data = [ AtomsData(at; data_keys..., v_ref = model.Vref)
            for at in read_extxyz(dataset) ]

# distributed
ACEfit.assemble(data, model.basis)  # to force compilation
time_mp_gc = @elapsed ACEfit.assemble(data, model.basis; do_gc = true)
time_mp_no = @elapsed ACEfit.assemble(data, model.basis; do_gc = false)

# threaded
ACEfit.mt_assemble(data, model.basis)  # to force compilation
time_mt_gc = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = true)
time_mt_no = @elapsed ACEfit.mt_assemble(data, model.basis; do_gc = false)

@show time_mp_gc, time_mp_no
@show time_mt_gc, time_mt_no

For me, the results are

(time_mp_gc, time_mp_no) = (283.623841182, 255.898352052)
(time_mt_gc, time_mt_no) = (848.516291594, 253.484540468)

indicating a huge slowdown when I garbage collect from each thread.

@cortner (Member Author) commented Jul 22, 2023

Interesting - your script managed to crash my laptop all the way to a reboot ...

@cortner (Member Author) commented Jul 22, 2023

Independent of my problem - can you look at Julia 1.10? I read somewhere that it has a multi-threaded GC; I wonder whether that solves your problem.

@cortner (Member Author) commented Jul 22, 2023

Here are my times on Julia 1.9:

(time_mp_gc, time_mp_no) = (275.686350292, 279.586481542)
(time_mt_gc, time_mt_no) = (535.976733791, 236.405905)

and on Julia 1.10:

(time_mp_gc, time_mp_no) = (262.532686458, 220.518536)
(time_mt_gc, time_mt_no) = (372.88618475, 386.81745025)

@cortner (Member Author) commented Jul 22, 2023

So, interestingly, the distributed version is hands-down the faster one. The GC could be kept on by default, but it would be good to have an option to turn it off when the user wants to?

@cortner (Member Author) commented Jul 22, 2023

All things considered, I think it's best to keep distributed as the default, especially as we move towards multi-threaded acceleration of model evaluation.

@wcwitt (Collaborator) commented Jul 22, 2023

Thank you very much for taking the time to do this. Naturally, I like the distributed version and I would be happy for it to be the default, but I remain unsettled by this: #49 (comment).

Your observation that the threading performs better for very small configs seems correct, and I won't be satisfied with default-distributed until I've managed to resolve that part.

@cortner (Member Author) commented Jul 22, 2023

Did you rerun the test above with the latest ACEfit / ACE1pack on Julia 1.9 and 1.10?

@cortner (Member Author) commented Jul 22, 2023

Another thing we can do is feed training structures to processes in batches.
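A rough sketch of what that could look like with pmap (assemble_batch is a hypothetical helper that assembles a whole chunk on one worker, not an existing ACEfit function):

using Distributed

# Hypothetical batching sketch: each worker gets a whole chunk of training
# structures, reducing per-structure scheduling and communication overhead.
function assemble_batched(data, basis; batchsize = 32)
    batches = [data[i:min(i + batchsize - 1, end)] for i in 1:batchsize:length(data)]
    results = pmap(b -> assemble_batch(b, basis), batches)  # assemble_batch is a placeholder
    return reduce(vcat, results)
end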

@cortner (Member Author) commented Aug 1, 2023

@wcwitt -- is this now obsolete - can we close it?

@wcwitt (Collaborator) commented Aug 2, 2023

I'd rather keep it open for a bit longer, if that's okay.

@wcwitt (Collaborator) commented Aug 14, 2023

I'm fine to close this now, if you still want to. Linking #49 for reference.

@cortner (Member Author) commented Aug 14, 2023

So is the MT assembly now in the package, or was it removed again because of the poor performance?

@cortner closed this Aug 17, 2023
@wcwitt (Collaborator) commented Aug 18, 2023

To answer your question, it's not in the package, but it does live on in a branch. Eventually I will solve the performance dilemma.
