WIP: Multi-threaded assembly #54
Conversation
@wcwitt -- I had to write a manual "scheduler" to assemble the lsq blocks in the right order. This becomes an issue when a dataset has structures of very different sizes. I don't know how pmap works, but with @threads this strange fix was needed. I wonder whether the bad performance in the distributed assembly for me is something similar.
@tjjarvinen --- my implementation in this PR sometimes hangs at the very end of the LSQ system assembly. Would you be willing to take a quick look and see if something jumps out at you? When I interrupt it with CTRL-C, it seems it got stuck waiting for the lock to become free. (Also, is there a more elegant way to achieve this?)
Follow-up: I think I may have found the bug. Testing now ... I'll let you know if I can't fix it after all.
I tried this mt implementation and I find it much more pleasant to use than the current
I'm ok with merging as is, but alternatively we could first integrate it into the user interface. @wcwitt wasn't too keen on this in the first place, but I think it is sufficiently useful for us that we can ask him again to consider it. In the near future we should merge the distributed and MT assembly into a single framework, as discussed in other issues.
Letting you know that I've been working on this today, partly in response to @CheukHinHoJerry's experience.
I don't understand how/why this would happen - can you elaborate?
Recently, I have been seeing this on large datasets with both the distributed and the multi-threaded assembly. Not sure what changed, but it feels like a memory leak. Adding
When you work interactively as we do, you first add the workers, then have to instantiate the environment on each worker. Then you might as well cycle to the next coffee shop and take a break before you can continue, since that operation can take a long time.
If gc helps, then I don't think it can be a memory leak? It is possible that during multi-threading the GC doesn't turn itself on as often. I haven't done any reading on this, but this has been my impression during benchmarking. Maybe I'm completely wrong. @tjjarvinen, can you please comment? Also, in general, what are typical reasons to get OOMs in Julia? How is this even possible?
The main reason for OOM is that you allocate too much data. GC helps a little in these cases, because it clears out some unneeded data. To me this issue sounds like Julia is starting to use swap, which results in a slowdown. Once swap is used up too, it will cause an OOM. Have you looked at how much memory Julia is using (plus memory per process) and what the swap usage is? Also, what is the estimated memory usage for the assembly?
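For context, here is a minimal, self-contained sketch of the kind of explicit per-structure garbage collection being discussed in this thread. It is not the package code: `assemble_block` and the fake `structures` are placeholders standing in for the real assembly.

```julia
using Distributed
addprocs(4)

@everywhere begin
    # Stand-in for assembling the LSQ rows of one training structure.
    assemble_block(n) = randn(n, 10)
end

structures = rand(50:500, 100)     # pretend structures of very different sizes

rows = pmap(structures) do n
    block = assemble_block(n)
    GC.gc()                        # explicit collection on the worker after each structure
    block
end
A = reduce(vcat, rows)
```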
I believe you, but I still don't understand. This rarely takes more than a few seconds for me interactively (and the
This is news to me; I was unaware of this. Maybe we should just test this again more carefully. In principle, if firing up 50 workers is more or less instantaneous, then we can discuss dropping the mt again.
Instantiating is not needed if you are using the same node, or when using a Julia install that has access to the same local data (the .julia folder). If you have different nodes that do not have access to the same storage space, then you need to instantiate on each worker.
That wasn't my experience. I don't know why, but simply
So maybe one of you can just put together a script for us so we can try using multiple workers interactively. We will try it, and if it works fine we continue from there?
You've convinced me the mt is worthwhile ... as soon as I'm done experimenting with it we can merge. Otherwise, at this point I'm just making sure I understand your workflow - like whether your workers are somehow on another machine. More importantly, I now understand the timing of this OOM stuff. I used to have this line
which I removed recently during some rearranging [54b7b2e]. I'll put it back. In principle, I don't think it should be necessary, but the forums are full of people complaining of parallel-related memory issues, such that it's a little hard to figure out what is current/relevant.
I looked at the code in this PR, and this part:

```julia
while next <= length(packets)
    # retrieve the next packet
    if next > length(packets)
        break
    end
    lock(_lock)
    cur = next
    next += 1
    unlock(_lock)
    if cur > length(packets)
        break
    end
    p = packets[cur]
```

is attempting to reimplement Channels. Rather than trying to rediscover Channels, just use the existing Channels. It makes the code better in all possible ways.
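To illustrate the point, here is a minimal sketch (not the package implementation) of how the manual index-plus-lock scheduler above could be replaced by a `Channel`; `packets` and `process` are hypothetical placeholders.

```julia
using Base.Threads

function assemble_with_channel(packets, process)
    results = Vector{Any}(undef, length(packets))
    # Buffered channel of (index, packet) pairs; take! is already thread-safe,
    # so no manual lock or shared counter is needed.
    jobs = Channel{Tuple{Int,eltype(packets)}}(length(packets))
    for (i, p) in enumerate(packets)
        put!(jobs, (i, p))
    end
    close(jobs)                       # consumers stop once the channel drains
    @sync for _ in 1:nthreads()
        Threads.@spawn for (i, p) in jobs
            results[i] = process(p)   # each index is written by exactly one task
        end
    end
    return results
end
```

Because the channel hands out work items atomically, the explicit lock and the duplicated bounds checks disappear.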
On the mt, I'm close to something I'm happy with (influenced by @tjjarvinen's recommendation of Folds.jl). Can we just pause that discussion until I'm done, and then you can critique it from there?
When you spawn new processes they will get the project dir from the host. But every process needs to load all the packages separately, so you need to start with

```julia
using Distributed
addprocs(20)
@everywhere using List_of_packages
```
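For the interactive workflow discussed above, one variant (my assumption, not something stated in this thread or the docs) is to pass the host's active project to the workers explicitly, so every local worker uses the same environment; `ACEfit` is just an example package name from this thread.

```julia
using Distributed
# Start local workers that reuse the host's active project environment.
addprocs(20; exeflags = "--project=$(Base.active_project())")
# Load the fitting packages on every process (example package name).
@everywhere using ACEfit
```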
Our docs are actually relatively decent here - the "script" approach should work interactively. I don't say this to be annoying - if they aren't sufficient I will improve them. https://acesuit.github.io/ACE1pack.jl/dev/gettingstarted/parallel-fitting/
I've never seen this before. My mistake, sorry.
@tjjarvinen -- I wrote the part above that you quote. I agree with you of course. But remember this was a temporary hack and for me it is faster to write this than to learn about Channels. Let's see what Chuck comes up with and then discuss.
@wcwitt --- just to confirm that, with your instructions above, it becomes convenient enough to assemble the LSQ system distributed instead of multi-threaded. The barrier is still slightly higher, but not nearly as bad as it used to be. So from my end, we can make this low priority.
(Also, I can confirm that putting the gc() back into the distributed assembly prevented some OOMs for me just now. I tried with versions of the package from before and after...)
Thanks - I'm glad, but we're deep enough into this now that I'm going to try to finish it off.
Would someone please look at -- or possibly try -- this example? It requires the
For me, the results, after starting Julia with
indicating a huge slowdown when I garbage collect from each thread.
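The original example is elided above; the following is only a rough stand-in for the kind of comparison being described, timing the same threaded loop with and without a `GC.gc()` call on every iteration.

```julia
using Base.Threads

function work!(out; gc_each = false)
    @threads for i in eachindex(out)
        out[i] = sum(abs2, randn(10_000))   # stand-in for assembling one block
        gc_each && GC.gc()                  # collect from within each thread
    end
end

out = zeros(200)
work!(out)                          # warm-up / compilation
@time work!(out)                    # without per-iteration GC
@time work!(out; gc_each = true)    # with per-iteration GC
```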
Interesting - your script managed to crash my laptop all the way to a reboot ...
Independent of my problem - can you look at Julia 1.10? I read somewhere that it has a multi-threaded GC. I wonder whether this solves your problem.
Here are my times on Julia 1.9:
and on Julia 1.10:
So interestingly, the distributed is hands-down the faster version. The GC could be kept on as default, but it would be good to have an option to turn it off when the user wants it?
I think, all things considered, it's maybe good to keep distributed as the default. Especially as we move towards multi-threaded acceleration of model evaluation.
Thank you very much for taking the time to do this. Naturally, I like the distributed and I would be happy for it to be the default, but I remain unsettled by this: #49 (comment). Your observation that the threading performs better for very small configs seems correct, and I won't be satisfied with default-distributed until I've managed to resolve that part.
Did you rerun the test above with the latest ACEfit / ACE1pack on Julia 1.9 and 1.10?
Another thing we can do is feed training structures to processes in batches.
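A rough sketch of that batching idea (not the package code; `assemble_one` and `batchsize` are placeholders): partition the structures and let each worker assemble a whole batch at a time.

```julia
using Distributed

function assemble_in_batches(structures, assemble_one; batchsize = 16)
    batches = collect(Iterators.partition(structures, batchsize))
    batch_rows = pmap(batches) do batch
        # Each worker assembles the rows for a full batch before returning.
        reduce(vcat, [assemble_one(s) for s in batch])
    end
    return reduce(vcat, batch_rows)
end
```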
@wcwitt -- is this now obsolete - can we close it?
I'd rather keep it open for a bit longer, if that's okay.
I'm fine to close this now, if you still want to. Linking #49 for reference.
So is the MT assembly now in the package, or was it removed again because of the poor performance?
To answer your question, it's not in the package, but it does live on in a branch. Eventually I will solve the performance dilemma.
For now this is just an experiment so we can see what it might look like; the interface probably needs to be changed.