This issue was moved to a discussion.
Problem when finalizing the training of a pure water model #581
Comments
Others will comment on the PyTorch error, but it does say "CUDA out of memory"! How big are your training data frames, and how much GPU memory do you have? One more thing: there is definitely something wrong with the training itself, since the MAEs on your validation set are thousands of eV per atom! Can you post one of your training frames, please?
And one of your validation/test frames as well.
The whole dataset is 39 MB and consists of 1000 configurations of 125 water molecules at 10 different densities, 100 configurations per density. If it really is a memory problem, I don't understand why the training completes and only fails before finalizing. These are all the files: Training
I hope the zip format is fine; I can also upload some configurations in txt format if necessary.
So the total energy of your first frame is "-1879.62268026", but you are specifying an E0 for oxygen of "-2041.8396277138045". This does not make sense. The first looks like an interaction energy already, while the second looks like a total energy including all core electrons. The E0s need to be the isolated atom energies evaluated using exactly the same electronic structure method and code as you are using for your training configs.
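To make the mismatch concrete, here is a minimal arithmetic sketch using only the two numbers quoted above. The atom counts (125 O out of 375 atoms) assume one frame of 125 water molecules, as described elsewhere in the thread; the hydrogen E0 is not quoted at this point, so it is left out, and the function name is purely illustrative.

```python
# Values quoted in the thread; atom counts assume 125 H2O molecules per frame.
E_TOTAL_FRAME = -1879.62268026    # eV, total energy of the first training frame
E0_OXYGEN = -2041.8396277138045   # eV, E0 that was supplied for oxygen
N_OXYGEN = 125
N_ATOMS = 375


def energy_minus_oxygen_e0s(e_total, e0_oxygen, n_oxygen):
    """Energy left after subtracting the oxygen E0s alone (hydrogen ignored)."""
    return e_total - n_oxygen * e0_oxygen


residual = energy_minus_oxygen_e0s(E_TOTAL_FRAME, E0_OXYGEN, N_OXYGEN)
residual_per_atom = residual / N_ATOMS
print(f"residual: {residual:.1f} eV total, {residual_per_atom:.1f} eV/atom")
```

With consistent E0s the residual would be a modest negative binding energy. Here it comes out large and positive, several hundred eV per atom, which is the same order of magnitude as the validation errors reported above.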
Thanks, you're right. A question: what do ef and universal mean in the loss parameters? Does ef stand for energy and forces?
I think you should compute the energy of isolated H and O atoms (and any others that you need); it takes very little time. Use those as E0s: you will get a more stable model than if you use "average".
"universal" (poorly named, we're working on that) was the name of the loss function used for our foundation model, which was fitted to crystals from across the whole periodic table.
Computing the E0s with the MACE "average" option gives [0.0, 0.0] for hydrogen and oxygen.
According to this VASP wiki (VASP is the software we are using to generate the dataset), the energy calculated for isolated atoms should be 0.
So, if I understand correctly, the result we get from the MACE average (least-squares regression) is the theoretically correct one. Maybe I am missing something? Thanks again.
Having zeros is very suspicious. Are you sure you used the right keys in your input script?
This is the input file syntax:
And this is the relevant part of the .yaml training file; it should be right:
By the way, we computed the isolated-atom energies with DFT and set these values:
free_energy is no longer an allowed key in ASE; you should rename it to REF_energy. Currently you are not training on energies.
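One way to do the rename is a plain text substitution on the extended-XYZ files. The sketch below assumes the energy sits on each frame's comment line as `free_energy=<value>`; the actual input files are not shown in the thread, so the sample frame, its numbers, and the helper name `rename_energy_key` are all placeholders.

```python
import re

# Assumption: the energy is stored on the comment line as `free_energy=<value>`.
_OLD_KEY = re.compile(r"\bfree_energy=")


def rename_energy_key(text: str) -> str:
    """Return the file contents with every `free_energy=` key renamed to `REF_energy=`."""
    return _OLD_KEY.sub("REF_energy=", text)


# Placeholder frame, not taken from the real dataset.
sample = (
    "3\n"
    'Lattice="10 0 0 0 10 0 0 0 10" Properties=species:S:1:pos:R:3 '
    'free_energy=-1.0 pbc="T T T"\n'
    "O 0.00 0.00 0.00\n"
    "H 0.96 0.00 0.00\n"
    "H -0.24 0.93 0.00\n"
)

converted = rename_energy_key(sample)
print(converted.splitlines()[1])
```

Applied to a real file, the same function can be run on the text read from disk and the result written to a new file, so the original data stays untouched. The energy key named in the training configuration then has to match the new name.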
Thanks, the error on the energy is definitely better now. Maybe a stupid question: why did you mention ASE? Is it used to compare the results with those from a classical potential?
I feel there may still be some confusion in the above messages. The isolated atom E0s you say you computed from DFT (i.e. -0.11941284 and -0.43787111) are inconsistent with the total energies you said you have, which are in the thousands.
Maybe the thousands was for a large system, I guess.
It was for a system of 375 atoms: 125 O and 250 H.
So, assuming your isolated atom DFT numbers are right, you give these as E0s to the MACE training. After you have done the MACE training with --model MACE, or with the model left as default (not ScaleShiftMACE by some chance), please verify that when you compute an isolated atom with MACE you get back exactly the correct DFT E0s.
ASE is used to load the configurations, so we have to follow its quirks and conventions.
Thanks. I can also check which E0 values are computed with E0s: average; they should not be exactly 0.0 now that we are using the right energy key, as @ilyes319 suggested.
I really don't recommend using the "average" option.
I know, I was just curious to see the differences between the two methods. I will use E0s explicitly provided in the training file.
Sure. If you train with "average", you should find that when you evaluate the isolated atoms with MACE you get much lower numbers (larger negative numbers) that match the average binding energies per atom.
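As a rough illustration of the scale involved, the sketch below divides the frame energy quoted earlier by the 375 atoms in that frame and compares it with the two DFT isolated-atom values from the thread. It only shows the order of magnitude; it is not what the "average" option computes internally, and the thread does not say which of the two E0 values belongs to which element.

```python
# Numbers quoted in the thread.
E_TOTAL_FRAME = -1879.62268026        # eV, one frame of 125 O + 250 H
N_ATOMS = 375
E0_DFT = (-0.11941284, -0.43787111)   # eV, isolated-atom energies (element order not stated)

# Mean energy per atom of the frame, regardless of element.
energy_per_atom = E_TOTAL_FRAME / N_ATOMS
print(f"average energy per atom: {energy_per_atom:.4f} eV")
print(f"DFT isolated-atom values: {E0_DFT}")
```

The per-atom average is about -5 eV, an order of magnitude below either DFT isolated-atom value, so a model trained with averaged E0s would not return the DFT energies for isolated atoms.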
I will turn this into a discussion so we can keep track of it.
Hi,
I'm training a model on 1000 configurations of pure water at different densities, generated from DFT calculations (VASP) at 300 K.
I am using the following parameters for training:
(I know the swa value of 8 is too low)
Apparently the training finishes, but something goes wrong after the last epoch; this is the log file.
I think the relevant lines are the following:
I am left with these two files:
MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-118_swa.pt
and MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-8_.pt
What can I do to finalize the training? The first one is the model after the 118th epoch, right? Why the swa suffix?
I don't know why it crashes after the training completes.
Thank you in advance