Problem when finalizing the training of a pure water model #581

Closed
nihil39 opened this issue Sep 3, 2024 · 23 comments

nihil39 commented Sep 3, 2024

Hi,

I'm training a model on 1000 configurations of pure water at different densities, generated from DFT calculations (VASP) at 300 K.

I am using the following parameters for training:

name: MACE_model_Rev_PBE_D3_1000_confs_v1_gpu

train_file: /leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/output.xyz
valid_fraction: 0.05

# test_file: /leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/test_set_1000_confs.xyz

config_type_weights:
 Default: 1.0

E0s:

  1: -13.62222753701504
  8: -2041.8396277138045

model: MACE 

hidden_irreps: '128x0e + 128x1o' 

#num_channels: 128
r_max: 6.0 # Because the model has two layers, atoms further than 6.0 A apart can still communicate by proxy.
#l_max: 3 # angular resolution
forces_weight: 10
energy_weight: 1

max_ell: 3 # order of local spherical harmonics ?, leave 3

num_interactions: 2 # Number of layers 

#statistics_file: /leonardo/home/userexternal/mciacchi/MACE_Training/statistics_gpu.json  
#correlation: 3 
batch_size: 2

max_num_epochs: 120 

scheduler_patience: 5 
patience: 15 
num_workers: 8 

stress_key: REF_stress
energy_key: free_energy
forces_key: REF_forces

#compute_forces: yes
#compute_stress: yes

swa: yes
start_swa: 8
ema: yes
ema_decay: 0.99
error_table: PerAtomMAE
amsgrad: yes
restart_latest: yes
seed: 1
device: cuda
default_dtype: float64
loss: stress
save_cpu: yes

(I know the start_swa value of 8 is too low.)

Apparently the training finishes, but something goes wrong after the last epoch; this is the log file.

I think the relevant lines are the following:

2024-09-03 03:05:58.030 INFO: Epoch 116: loss=0.0007, MAE_E_per_atom=690883.5 meV, MAE_F=6.3 meV / A
2024-09-03 03:12:55.521 INFO: Epoch 118: loss=0.0007, MAE_E_per_atom=690883.2 meV, MAE_F=6.2 meV / A
2024-09-03 03:16:22.454 INFO: Training complete
2024-09-03 03:16:22.455 INFO: Computing metrics for training, validation, and test sets
2024-09-03 03:16:22.457 INFO: Loading checkpoint: checkpoints/MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-8.pt
2024-09-03 03:16:22.512 INFO: Loaded model from epoch 8
2024-09-03 03:16:22.512 INFO: Evaluating train ...
2024-09-03 03:17:30.062 INFO: Evaluating valid ...
Traceback (most recent call last):
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/cli/run_train.py", line 795, in run
    table = create_error_table(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/tools/scripts_utils.py", line 496, in create_error_table
    _, metrics = evaluate(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/tools/train.py", line 362, in evaluate
    output = model(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/modules/models.py", line 366, in forward
    node_feats, sc = interaction(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/modules/blocks.py", line 635, in forward
    mji = self.conv_tp(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/e3nn/o3/_tensor_product/_tensor_product.py", line 529, in forward
    return self._compiled_main_left_right(x, y, real_weight)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.127", line 115, in forward
    einsum_23 = torch.functional.einsum('dbca,dbc->dba', tensordot_3, reshape_33);  tensordot_3 = reshape_33 = None
    reshape_34 = einsum_23.reshape(getitem_4, 896);  einsum_23 = getitem_4 = None
    cat = torch.cat([reshape_12, reshape_14, reshape_17, reshape_20, reshape_22, reshape_25, reshape_27, reshape_29, reshape_32, reshape_34], dim = 1);  reshape_12 = reshape_14 = reshape_17 = reshape_20 = reshape_22 = reshape_25 = reshape_27 = reshape_29 = reshape_32 = reshape_34 = None
          ~~~~~~~~~ <--- HERE
    reshape_35 = cat.reshape(add_3);  cat = add_3 = None
    return reshape_35
RuntimeError: CUDA out of memory. Tried to allocate 12.48 GiB. GPU 

I am left with these two files: MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-118_swa.pt and MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-8_.pt

What can I do to finalize the training? The first one is the model after the 118th epoch, right? Why the swa suffix?
I don't know why it crashes after training completes.

Thank you in advance

gabor1 (Collaborator) commented Sep 3, 2024

Others will comment on the PyTorch error, but it does say "CUDA out of memory"! How big are your training data frames? And how much GPU memory do you have?

One thing I'll also say is that there is definitely something wrong with the training itself: you have MAEs on your validation set that are thousands of eV per atom! Can you post one of your training frames, please?
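
(For reference, a quick way to check the frame sizes is to read the dataset with ASE and count atoms per frame; a minimal sketch, assuming the extended-XYZ path from the config above:)

from ase.io import read

# read every frame of the extended-XYZ training set
frames = read("/leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/output.xyz", index=":")
print(len(frames), "frames")
print("atoms per frame:", sorted({len(atoms) for atoms in frames}))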

gabor1 (Collaborator) commented Sep 3, 2024

and one of your validation/test frames as well.

nihil39 (Author) commented Sep 3, 2024

> Others will comment on the PyTorch error, but it does say "CUDA out of memory"! How big are your training data frames? And how much GPU memory do you have?
>
> One thing I'll also say is that there is definitely something wrong with the training itself: you have MAEs on your validation set that are thousands of eV per atom! Can you post one of your training frames, please?

The whole dataset is 39 MB and consists of 1000 configurations of 125 water molecules at 10 different densities, 100 configurations per density. If it is really a memory problem, I don't know why the training completes but then fails before finalizing.

These are all the files:

Training
dataset_1000_confs_clean.xyz.zip

Test
test_set_1000_confs_clean.zip

nihil39 (Author) commented Sep 3, 2024

I hope the zip format is fine; I can also upload some configurations in txt format if necessary.

gabor1 (Collaborator) commented Sep 3, 2024

So you have a total energy for your first frame that is "-1879.62268026". But you are specifying an E0 for oxygen as "-2041.8396277138045". This does not make sense. The first one looks like an interaction energy already, the second one looks like a total energy including all core electrons. The E0s need to be the isolated atom energies evaluated using exactly the same electronic structure method and code as what you are using for your training configs.
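
As a rough consistency check with the numbers quoted above: a 375-atom frame (125 O + 250 H) with those E0s has an atomic-reference sum of 125 × (−2041.84) + 250 × (−13.62) ≈ −258,600 eV, while the frame total is about −1880 eV, so the model is being asked to fit an "interaction" energy of roughly +685 eV per atom. That is the same order of magnitude as the ~690,000 meV per-atom MAE in the training log.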

nihil39 (Author) commented Sep 3, 2024

Thanks, you're right.
We don't have isolated-atom calculations, so I think we will just use the average; maybe the results will be the same. We will try an L0 training with 64x0e as hidden irreps, and then the L1 with E0s: average.

A question: what is the meaning of ef and universal among the loss options? Does ef mean energy and forces? {ef,weighted,forces_only,virials,stress,dipole,huber,universal,energy_forces_dipole}

gabor1 (Collaborator) commented Sep 3, 2024

I think you should compute the energy of isolated H and O atoms (and any others that you need); it takes very little time. Use those as E0s: you will get a more stable model than if you use "average".
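
(A minimal sketch of such an isolated-atom calculation through ASE's VASP interface; the box size, cutoff, and other settings here are placeholders and must match exactly what was used for the training frames:)

from ase import Atoms
from ase.calculators.vasp import Vasp

# one oxygen atom in a large cubic box; repeat with "H" for hydrogen
atom = Atoms("O", positions=[[7.5, 7.5, 7.5]], cell=[15.0, 15.0, 15.0], pbc=True)
atom.calc = Vasp(xc="pbe", encut=520, ispin=1, directory="E0_O")  # placeholder settings
print("E0(O) =", atom.get_potential_energy(), "eV")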

gabor1 (Collaborator) commented Sep 3, 2024

"universal" (poorly named, we're working on that) was the name of the loss function used for our foundation model that was fitted to crystals of the whole periodic table.

nihil39 (Author) commented Sep 7, 2024

> I think you should compute the energy of isolated H and O atoms (and any others that you need); it takes very little time. Use those as E0s: you will get a more stable model than if you use "average".

Computing the E0 energies with MACE's average option gives [0.0, 0.0] for both hydrogen and oxygen:

Computing average Atomic Energies using least squares regression
INFO: Atomic energies: [0.0, 0.0]

According to the VASP wiki (VASP is the software we are using to generate the dataset), the energy of an isolated atom should be close to 0. Quoting:

> In this calculation, the total energy (energy without entropy) of the single, isolated oxygen atom is close to zero. And actually, if the box size were larger and the precision of the calculation higher, it would go to zero. This is only because all pseudopotentials have been generated for isolated, nonspinpolarized atoms.

So, if I understand correctly, the result we get from MACE's least-squares average is the theoretically correct one. Maybe I am missing something?

Thanks again.

ilyes319 (Contributor) commented:

Having zeros is very suspicious. Are you sure you used the right keys in your input script?

nihil39 (Author) commented Sep 10, 2024

> Having zeros is very suspicious. Are you sure you used the right keys in your input script?

This is the input file syntax:

REF_stress="0.00061665 -0.00145595 0.00196908 -0.00145595 0.00137584 0.00065622 0.00196908 0.00065622 0.00223462" free_energy=-1879.62268026 energy=-1879.62268026 Lattice="16.07633347 0.00000000 0.00000000 0.00000000 16.07633347 0.00000000 0.00000000 0.00000000 16.07633347" pbc=[T, T, T] Properties=species:S:1:pos:R:3:REF_forces:R:3

And this is the relevant part of the .yaml training file; it should be right:

stress_key: REF_stress
energy_key: free_energy
forces_key: REF_forces

By the way, we computed the isolated-atom energies with DFT and set these values:

E0s:
   1: -0.11941284
   8: -0.43787111

ilyes319 (Contributor) commented Sep 10, 2024

free_energy is no longer an allowed key in ASE; you should rename it to REF_energy. Currently you are not training on energies.
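
(A minimal sketch of how one might do the renaming with ASE; file names are placeholders, and where exactly ASE puts free_energy depends on the ASE version:)

from ase.io import read, write

frames = read("output.xyz", index=":")
for atoms in frames:
    # copy the DFT free energy into a key that ASE will not intercept
    if "free_energy" in atoms.info:
        atoms.info["REF_energy"] = atoms.info.pop("free_energy")
    elif atoms.calc is not None and "free_energy" in atoms.calc.results:
        atoms.info["REF_energy"] = atoms.calc.results["free_energy"]
write("output_REF.xyz", frames)

The training yaml then needs energy_key: REF_energy.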

nihil39 (Author) commented Sep 10, 2024

> free_energy is no longer an allowed key in ASE; you should rename it to REF_energy. Currently you are not training on energies.

Thanks, the energy error is definitely better now.

Maybe a stupid question: why did you mention ASE? Is it used to compare the results with those from a classical potential?

gabor1 (Collaborator) commented Sep 10, 2024

I feel there may still be some confusion in the messages above. The isolated-atom E0s you say you computed from DFT (i.e. -0.11941284 and -0.43787111) are inconsistent with the total energies you said you have, in the thousands.

gabor1 (Collaborator) commented Sep 10, 2024

Maybe the thousands were for a large system, I guess.

nihil39 (Author) commented Sep 10, 2024

> Maybe the thousands were for a large system, I guess.

It was for a 375-atom system: 125 O and 250 H.

gabor1 (Collaborator) commented Sep 10, 2024

So, assuming your isolated-atom DFT numbers are right, you give these as E0s to the MACE training. After you have done the MACE training with --model MACE, or left it as the default (not ScaleShiftMACE by some chance), please verify that when you compute an isolated atom with MACE, you get back exactly the correct DFT E0s.
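
(A minimal sketch of that check through MACE's ASE calculator; the model path and box size are placeholders:)

from ase import Atoms
from mace.calculators import MACECalculator

calc = MACECalculator(model_paths="MACE_model.model", device="cpu", default_dtype="float64")
# isolated O atom in a large box; the predicted energy should reproduce the DFT E0
atom = Atoms("O", positions=[[10.0, 10.0, 10.0]], cell=[20.0, 20.0, 20.0], pbc=True)
atom.calc = calc
print("MACE E0(O) =", atom.get_potential_energy(), "eV")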

gabor1 (Collaborator) commented Sep 10, 2024

ASE is used to load the configurations, so we have to follow its quirks and conventions.
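
(For example, one can check which keys ASE actually exposes after reading a frame; the file name is a placeholder:)

from ase.io import read

atoms = read("output.xyz", index=0)
print("info keys:  ", sorted(atoms.info))
print("array keys: ", sorted(atoms.arrays))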

nihil39 (Author) commented Sep 10, 2024

> So, assuming your isolated-atom DFT numbers are right, you give these as E0s to the MACE training. After you have done the MACE training with --model MACE, or left it as the default (not ScaleShiftMACE by some chance), please verify that when you compute an isolated atom with MACE, you get back exactly the correct DFT E0s.

Thanks. I can also check what E0 values are computed with E0s: average; they should not be exactly 0.0 now that we are using the right energy key, as @ilyes319 suggested.

gabor1 (Collaborator) commented Sep 10, 2024

I really don't recommend using the "average" option.

nihil39 (Author) commented Sep 10, 2024

> I really don't recommend using the "average" option.

I know, I was just curious to check the differences between the two approaches. I will provide the E0s explicitly in the training file.

gabor1 (Collaborator) commented Sep 10, 2024

Sure. If you train with "average", you should find that when you evaluate the isolated atoms with MACE you will get much lower numbers (larger negative numbers) that match the average binding energies per atom.
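
(As a rough illustration with the numbers from this thread: the 375-atom frames sit at around −1880 eV each, so a least-squares "average" fit will absorb roughly −1880 / 375 ≈ −5 eV into the per-element E0s, and an isolated atom evaluated with such a model would come out near −5 eV instead of near the true isolated-atom values of about −0.1 and −0.4 eV.)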

ilyes319 (Contributor) commented Oct 3, 2024

I will turn this into a discussion so we can keep track of it.

ACEsuit locked and limited conversation to collaborators Oct 3, 2024
ilyes319 converted this issue into discussion #621 Oct 3, 2024
