Acefit extension #46
Conversation
This could be wrong, but the cell matrix needs this, so I'm adding it here.
Hi Teemu - thank you for exploring this. I will need to find time to go through the code very carefully. In the meantime, can you please explain a few things?
Regarding the bug, I have some doubts about there being a bug in … CC @wcwitt - would you mind following this discussion as well?
Another question - why add the example training data to the repo when you can just get it via …?
Yes, AtomsBase is defined so that you can store whatever data in the structures; the documentation page of this PR has the details. Fitting weights are also explained on the documentation page. In short, general weights are given as keywords, as in the example above. In addition, you can give each structure a separate weight parameter that is used only for that structure.
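As a rough illustration (the constructor call below follows the AtomsBase convenience API and assumes extra keyword arguments are stored as system-level properties; the `:energy_weight` key name is made up for illustration, not necessarily what this PR uses):

```julia
using AtomsBase, Unitful

# Build a periodic system and attach training data as system-level properties.
# The :energy key matches the default key discussed below; :energy_weight is
# a hypothetical per-structure weight key.
box   = [[5.43, 0.0, 0.0], [0.0, 5.43, 0.0], [0.0, 0.0, 5.43]]u"Å"
atoms = [:Si => [0.0, 0.0, 0.0]u"Å", :Si => [1.36, 1.36, 1.36]u"Å"]

system = periodic_system(atoms, box;
    energy        = -10.2u"eV",   # reference energy for this structure
    energy_weight = 30.0)         # hypothetical per-structure weight
```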
I am not completely sure. I suspect that it has something to do with the garbage collection issue. Here is a benchmark comparison with the example posted above:

```julia
# old implementation
julia> @benchmark ACEfit.assemble($train_old, $basis)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 14.657 s (98.42% GC) to evaluate,
with a memory estimate of 445.05 MiB, over 5253010 allocations.
# new implementation
julia> @benchmark ACEfit.assemble($train_new, $basis; energy_default_weight=$5, energy_ref=$model.Vref)
BenchmarkTools.Trial: 32 samples with 1 evaluation.
Range (min … max): 136.020 ms … 180.292 ms ┊ GC (min … max): 18.63% … 16.11%
Time (median): 156.335 ms ┊ GC (median): 23.22%
Time (mean ± σ): 157.097 ms ± 9.635 ms ┊ GC (mean ± σ): 22.56% ± 2.36%
█ ▃▃▃ █ █
▇▁▁▁▁▁▁▁▁▇▁▁▇▁▇▁▇▁▁█▇███▁▁▁▇▇▇▁▁▇▇▇▁▇█▁▁▁█▇▇▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
136 ms Histogram: frequency by time 180 ms <
Memory estimate: 1.03 GiB, allocs estimate: 5453582.
```

Use of multiprocessing in the old implementation makes it hard to see where the difference comes from, but it is probably the garbage collection.
It is the first 3 structures from the ACEpotentials training data. Its purpose is to implement tests for the correctness of assembly. I just thought that it is so little data that it was better to add it here as a copy rather than load it over the network every time you run the tests.
Can someone explain the context here? Is the aim to replace JuLIP?
I am fully prepared to believe this is possible, but we should agree on some larger benchmarks. This could become one: ACEsuit/ACEfit.jl#54 (comment). The assembly, particularly in parallel, has felt confusingly sluggish to me for some time. I've been attributing it to something in ACE1/JuLIP (perhaps involving GC), where I've been reluctant to dig deeply, but not sure.
Yes, we want to kill JuLIP by moving small pieces into ACE packages or into community packages.
I also had the impression that Teemu's weights are incompatible with ours, so this may need to be fixed. But that's a minor issue.
It's really funny actually, the old assembly uses less memory but spends the majority of time in GC:
The reduction of GC from 95% to 5% more or less explains the factor-10 speedup.
@tjjarvinen -- can you also please open a separate issue for the bug you found? If we trace it to ACE1, then I need to look into it.
One small remark: to make the tests entirely fair, one should remove the cached neighbourlist after each assembly, but this doesn't actually seem to make a measurable difference.
If this speedup holds, I'm excited, because it likely also means better strong scaling. Currently, we are at least a factor of 5 worse in that dimension than we should be, perhaps more.
The numbers change a bit when you make the model bigger, to about a factor of 3. Still a very nice speedup.
I tried on some other datasets and hit errors (e.g., it still needs support for custom energy/force/virial keys). But encouraging nonetheless.
Hm, am I correct that the new assembly routine does not force garbage collection after each structure, unlike the old one? That is likely a factor - perhaps the main factor - particularly for small tests. We know forcing GC hurts performance, but it has been the only way to avoid crashes (for both distributed and threaded) during larger assemblies.
What if we add a counter and GC every once in a while?
Yeah, perhaps. Or I will merge in one of the batching ideas and we can GC after batches. Still holding out hope this PR fixes the underlying issue somehow - let's see.
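For reference, a minimal sketch of the counter idea (illustrative only, not the ACEfit implementation; `assemble_one` is a placeholder for whatever assembles a single structure):

```julia
# Trigger a garbage-collection pass every `gc_interval` structures
# instead of after every single one.
function assemble_all(structures, assemble_one; gc_interval = 50)
    blocks = Vector{Any}(undef, length(structures))
    for (i, s) in enumerate(structures)
        blocks[i] = assemble_one(s)
        i % gc_interval == 0 && GC.gc()   # periodic, not per-structure, GC
    end
    return blocks
end
```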
Just tried the small benchmark again (from the top of this thread):
So for the tiny model the forced GC explains everything. Same trend, but less definitive for the bigger model:
That suggests the issue might be with the communication. For a bigger system more time is spent on ACE evaluations, which overshadows the other parts of the code. The new code still has unused optimizations that can be added, so we should get more out of it once they are in.

The issue with the old version is that there is a distributed array to which all processes write at the same time. By definition this requires some kind of locking and broadcasting of the changes made to the array, and this is probably where the issue comes from. However, that would only matter when there is more than one process, yet the issue is present even with a single process.

One point to note here is that you could not implement the old `assemble` version in Rust, because in Rust only the owner can change the content, so the compiler would not let it compile. That is in general a huge red flag that there are possible issues in the code, and it should in general be avoided.
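A sketch of the ownership-friendly pattern described above, where each worker builds and returns its own block and only the caller writes the result (illustrative only; `assemble_one` is a placeholder):

```julia
using Distributed

# No process writes into a shared array: each worker owns the block it
# produces, and the caller, as the single owner of the final matrix,
# concatenates the blocks at the end.
function assemble_distributed(structures, assemble_one)
    blocks = pmap(assemble_one, structures)
    return reduce(vcat, blocks)
end
```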
I added support for customizing data keys:

```julia
ACEfit.assemble(data, basis;
    energy_key = :energy,
    force_key  = :force,
    virial_key = :virial
)
```

Changing those should allow you to use the other data.
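For example, a dataset whose reference data are stored under different labels could then be assembled like this (the key names below are made up for illustration):

```julia
# hypothetical labels, e.g. for DFT reference data stored under custom keys
ACEfit.assemble(data, basis;
    energy_key = :dft_energy,
    force_key  = :dft_force,
    virial_key = :dft_virial)
```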
I added a small performance improvement that applies when both force and virial are calculated. The basis here is that both need …
Smart, thanks.
I understand the reasoning, but I don't think it's the main bottleneck (yet). In the past, I've tested by eliminating the write step, such that the assembly blocks are created and then immediately discarded, but the times don't improve significantly. There is some discussion of that here (but in the threading context): ACEsuit/ACEfit.jl#49 (comment)
Yes - caching virials and forces is a long-standing issue :) - thanks for solving this here. Though we should really have this at a more general level. Until then, this is a great improvement.
Regardless of understanding the origin of the improvement, can we please agree on a few tests that need to pass before we make these changes the default for ACEfit and ACEpotentials?
We should not hurry to make this the default yet. The best way is to add it as an alternative first, for some testing period. During this period we tell people to try it, and once we are comfortable we can make it the default. I also need to fix the issue with ExtXYZ before we can go for the full AtomsBase workflow. I should be able to fix that this week, after which James needs to release a new version, but that should not be an issue.
Yes, I know; there is something that I don't understand either. But it is somehow related to the way data is moved/transformed...
Fair enough, but it still needs equivalent functionality and identical results on a range of tests.
@cortner This adds training support for `AtomsBase` input. It only overloads the function in `ACEfit`; later on we can add support for `ACEpotentials`-specific functions.

There was some issue with implementing this. I stumbled on a bug that caused `assemble` to give random results (it can be triggered by changing `pmap` in `assemble` to `Folds.map`). This is probably a bug in ACE data allocation, but it could also be in Julia in general - I don't understand how the bug works. But I am very concerned about it, as it could appear somewhere else, and possible changes in Julia could cause it to emerge in existing "working" code.

Here is an example code and a comparison to the old way:
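The original example was not preserved in this thread; a rough sketch of the comparison, based only on the calls that appear in the benchmark elsewhere in this conversation (the model construction and the `train_old`/`train_new` variables are placeholders), could look like:

```julia
using ACEfit, ACEpotentials

# Placeholder model in the ACEpotentials tutorial style.
model = acemodel(elements = [:Si], order = 3, totaldegree = 12)
basis = model.basis

# `train_old` (structures in the old JuLIP-style format) and `train_new`
# (the same data as AtomsBase systems) are assumed to be loaded already.

# old way
ACEfit.assemble(train_old, basis)

# new way
ACEfit.assemble(train_new, basis;
    energy_default_weight = 5,
    energy_ref = model.Vref)
```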
The new way is much faster than the old one, by a factor of several tens to several hundred (if you don't use parallel processing). There is also a way to make it even faster, but I will skip that for now. The main point is that assembly is now faster than solving the linear system, which changes the dynamics of fitting potentials completely.
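To make that last point concrete, the assembled system is then fed to one of the ACEfit solvers; a sketch, assuming the usual pattern where `assemble` returns the design matrix, targets, and weights:

```julia
# Assemble once, then solve the weighted linear least-squares problem.
A, Y, W = ACEfit.assemble(train_new, basis;
    energy_default_weight = 5,
    energy_ref = model.Vref)

results = ACEfit.solve(ACEfit.QR(), W .* A, W .* Y)   # coefficients in results["C"]
```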