Problem when finalizing the training of a pure water model #581

Closed
nihil39 opened this issue Sep 3, 2024 · 23 comments

nihil39 commented Sep 3, 2024

Hi,

I'm training a model on 1000 configurations of pure water at different densities, generated from DFT calculations (VASP) at 300 K.

I am using the following parameters for training:

name: MACE_model_Rev_PBE_D3_1000_confs_v1_gpu

train_file: /leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/output.xyz
valid_fraction: 0.05

# test_file: /leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/test_set_1000_confs.xyz

config_type_weights:
 Default: 1.0

E0s:

  1: -13.62222753701504
  8: -2041.8396277138045

model: MACE 

hidden_irreps: '128x0e + 128x1o' 

#num_channels: 128
r_max: 6.0 # Because the model has two layers, atoms further than 6.0 A apart can still communicate by proxy.
#l_max: 3 # angular resolution
forces_weight: 10
energy_weight: 1

max_ell: 3 # order of local spherical harmonics ?, leave 3

num_interactions: 2 # Number of layers 

#statistics_file: /leonardo/home/userexternal/mciacchi/MACE_Training/statistics_gpu.json  
#correlation: 3 
batch_size: 2

max_num_epochs: 120 

scheduler_patience: 5 
patience: 15 
num_workers: 8 

stress_key: REF_stress
energy_key: free_energy
forces_key: REF_forces

#compute_forces: yes
#compute_stress: yes

swa: yes
start_swa: 8
ema: yes
ema_decay: 0.99
error_table: PerAtomMAE
amsgrad: yes
restart_latest: yes
seed: 1
device: cuda
default_dtype: float64
loss: stress
save_cpu: yes

(I know the start_swa value of 8 is too low.)

Apparently the training finishes, but something goes wrong after the last epoch; this is the log file.

I think the relevant lines are the following:

2024-09-03 03:05:58.030 INFO: Epoch 116: loss=0.0007, MAE_E_per_atom=690883.5 meV, MAE_F=6.3 meV / A
2024-09-03 03:12:55.521 INFO: Epoch 118: loss=0.0007, MAE_E_per_atom=690883.2 meV, MAE_F=6.2 meV / A
2024-09-03 03:16:22.454 INFO: Training complete
2024-09-03 03:16:22.455 INFO: Computing metrics for training, validation, and test sets
2024-09-03 03:16:22.457 INFO: Loading checkpoint: checkpoints/MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-8.pt
2024-09-03 03:16:22.512 INFO: Loaded model from epoch 8
2024-09-03 03:16:22.512 INFO: Evaluating train ...
2024-09-03 03:17:30.062 INFO: Evaluating valid ...
Traceback (most recent call last):
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/cli/run_train.py", line 795, in run
    table = create_error_table(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/tools/scripts_utils.py", line 496, in create_error_table
    _, metrics = evaluate(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/tools/train.py", line 362, in evaluate
    output = model(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/modules/models.py", line 366, in forward
    node_feats, sc = interaction(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/mace/modules/blocks.py", line 635, in forward
    mji = self.conv_tp(
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/e3nn/o3/_tensor_product/_tensor_product.py", line 529, in forward
    return self._compiled_main_left_right(x, y, real_weight)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/leonardo/home/userexternal/mciacchi/mace_gpu_venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.127", line 115, in forward
    einsum_23 = torch.functional.einsum('dbca,dbc->dba', tensordot_3, reshape_33);  tensordot_3 = reshape_33 = None
    reshape_34 = einsum_23.reshape(getitem_4, 896);  einsum_23 = getitem_4 = None
    cat = torch.cat([reshape_12, reshape_14, reshape_17, reshape_20, reshape_22, reshape_25, reshape_27, reshape_29, reshape_32, reshape_34], dim = 1);  reshape_12 = reshape_14 = reshape_17 = reshape_20 = reshape_22 = reshape_25 = reshape_27 = reshape_29 = reshape_32 = reshape_34 = None
          ~~~~~~~~~ <--- HERE
    reshape_35 = cat.reshape(add_3);  cat = add_3 = None
    return reshape_35
RuntimeError: CUDA out of memory. Tried to allocate 12.48 GiB. GPU 

I am left with these two files: MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-118_swa.pt and MACE_model_Rev_PBE_D3_1000_confs_v1_gpu_run-1_epoch-8_.pt

What can I do to finalize the training? The first one is the model after the 118th epoch, right? Why the swa suffix?
I don't know why it crashes after training completes.

Thank you in advance

gabor1 (Collaborator) commented Sep 3, 2024

Others will comment on the PyTorch error, but it does say "CUDA out of memory"! How big are your training data frames? And how much GPU memory do you have?

One thing I'll also say is that there is definitely something wrong with the training itself: you have MAEs on your validation set that are thousands of eV per atom! Can you post one of your training frames, please?
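
(For reference, a quick way to check the frame sizes is to read the dataset with ASE and count atoms per frame; a minimal sketch, assuming the extended-XYZ path from the config above:)

from ase.io import read

# read every frame of the extended-XYZ training set
frames = read("/leonardo/home/userexternal/mciacchi/MACE_Training/Datasets/output.xyz", index=":")
print(len(frames), "frames")
print("atoms per frame:", sorted({len(atoms) for atoms in frames}))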

gabor1 (Collaborator) commented Sep 3, 2024

and one of your validation/test frames as well.

nihil39 (Author) commented Sep 3, 2024

> Others will comment on the PyTorch error, but it does say "CUDA out of memory"! How big are your training data frames? And how much GPU memory do you have?
>
> One thing I'll also say is that there is definitely something wrong with the training itself: you have MAEs on your validation set that are thousands of eV per atom! Can you post one of your training frames, please?

The whole dataset is 39 MB and consists of 1000 configurations of 125 water molecules at 10 different densities, 100 configurations per density. If it is really a memory problem, I don't know why the training completes but then fails before finalizing.

These are all the files:

Training
dataset_1000_confs_clean.xyz.zip

Test
test_set_1000_confs_clean.zip

nihil39 (Author) commented Sep 3, 2024

I hope the zip format is fine; I can also upload some configurations in txt format if necessary.

gabor1 (Collaborator) commented Sep 3, 2024

So you have a total energy for your first frame that is "-1879.62268026". But you are specifying an E0 for oxygen as "-2041.8396277138045". This does not make sense. The first one looks like an interaction energy already, the second one looks like a total energy including all core electrons. The E0s need to be the isolated atom energies evaluated using exactly the same electronic structure method and code as what you are using for your training configs.
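
As a rough consistency check with the numbers quoted above: a 375-atom frame (125 O + 250 H) with those E0s has an atomic-reference sum of 125 × (−2041.84) + 250 × (−13.62) ≈ −258,600 eV, while the frame total is about −1880 eV, so the model is being asked to fit an "interaction" energy of roughly +685 eV per atom. That is the same order of magnitude as the ~690,000 meV per-atom MAE in the training log.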

nihil39 (Author) commented Sep 3, 2024

Thanks, you're right.
We don't have isolated-atom calculations, so I think we will just use the average; maybe the results will be the same. We will try an L0 training with 64x0e as hidden irreps, and then the L1 with E0s: average.

A question: what is the meaning of ef and universal among the loss options? Does ef mean energy and forces? {ef,weighted,forces_only,virials,stress,dipole,huber,universal,energy_forces_dipole}

gabor1 (Collaborator) commented Sep 3, 2024

I think you should compute the energy of isolated H and O atoms (and any others that you need); it takes very little time. Use those as E0s: you will get a more stable model than if you use "average".
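
(A minimal sketch of such an isolated-atom calculation through ASE's VASP interface; the box size, cutoff, and other settings here are placeholders and must match exactly what was used for the training frames:)

from ase import Atoms
from ase.calculators.vasp import Vasp

# one oxygen atom in a large cubic box; repeat with "H" for hydrogen
atom = Atoms("O", positions=[[7.5, 7.5, 7.5]], cell=[15.0, 15.0, 15.0], pbc=True)
atom.calc = Vasp(xc="pbe", encut=520, ispin=1, directory="E0_O")  # placeholder settings
print("E0(O) =", atom.get_potential_energy(), "eV")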

gabor1 (Collaborator) commented Sep 3, 2024

"universal" (poorly named, we're working on that) was the name of the loss function used for our foundation model that was fitted to crystals of the whole periodic table.

nihil39 (Author) commented Sep 7, 2024

> I think you should compute the energy of isolated H and O atoms (and any others that you need); it takes very little time. Use those as E0s: you will get a more stable model than if you use "average".

Computing the E0 energies with MACE's average option gives [0.0, 0.0] for both hydrogen and oxygen:

Computing average Atomic Energies using least squares regression
INFO: Atomic energies: [0.0, 0.0]

According to the VASP wiki (VASP is the software we are using to generate the dataset), the energy of an isolated atom should be close to 0. Quoting:

> In this calculation, the total energy (energy without entropy) of the single, isolated oxygen atom is close to zero. And actually, if the box size were larger and the precision of the calculation higher, it would go to zero. This is only because all pseudopotentials have been generated for isolated, nonspinpolarized atoms.

So, if I understand correctly, the result we get from MACE's least-squares average is the theoretically correct one. Maybe I am missing something?

Thanks again.

ilyes319 (Contributor) commented:

Having zeros is very suspicious. Are you sure you used the right keys in your input script?

nihil39 (Author) commented Sep 10, 2024

> Having zeros is very suspicious. Are you sure you used the right keys in your input script?

This is the input file syntax:

REF_stress="0.00061665 -0.00145595 0.00196908 -0.00145595 0.00137584 0.00065622 0.00196908 0.00065622 0.00223462" free_energy=-1879.62268026 energy=-1879.62268026 Lattice="16.07633347 0.00000000 0.00000000 0.00000000 16.07633347 0.00000000 0.00000000 0.00000000 16.07633347" pbc=[T, T, T] Properties=species:S:1:pos:R:3:REF_forces:R:3

And this is the relevant part of the .yaml training file; it should be right:

stress_key: REF_stress
energy_key: free_energy
forces_key: REF_forces

By the way, we computed the isolated-atom energies with DFT and set these values:

E0s:
   1: -0.11941284
   8: -0.43787111

ilyes319 (Contributor) commented Sep 10, 2024

free_energy is no longer an allowed key in ASE; you should rename it to REF_energy. Currently you are not training on energies.
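
(A minimal sketch of how one might do the renaming with ASE; file names are placeholders, and where exactly ASE puts free_energy depends on the ASE version:)

from ase.io import read, write

frames = read("output.xyz", index=":")
for atoms in frames:
    # copy the DFT free energy into a key that ASE will not intercept
    if "free_energy" in atoms.info:
        atoms.info["REF_energy"] = atoms.info.pop("free_energy")
    elif atoms.calc is not None and "free_energy" in atoms.calc.results:
        atoms.info["REF_energy"] = atoms.calc.results["free_energy"]
write("output_REF.xyz", frames)

The training yaml then needs energy_key: REF_energy.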

nihil39 (Author) commented Sep 10, 2024

> free_energy is no longer an allowed key in ASE; you should rename it to REF_energy. Currently you are not training on energies.

Thanks, the energy error is definitely better now.

Maybe a stupid question: why did you mention ASE? Is it used to compare the results with those from a classical potential?

gabor1 (Collaborator) commented Sep 10, 2024

I feel there may still be some confusion in the messages above. The isolated-atom E0s you say you computed from DFT (i.e. -0.11941284 and -0.43787111) are inconsistent with the total energies you said you have, in the thousands.

gabor1 (Collaborator) commented Sep 10, 2024

Maybe the thousands were for a large system, I guess.

nihil39 (Author) commented Sep 10, 2024

> Maybe the thousands were for a large system, I guess.

It was for a 375-atom system: 125 O and 250 H.

gabor1 (Collaborator) commented Sep 10, 2024

So, assuming your isolated-atom DFT numbers are right, you give these as E0s to the MACE training. After you have done the MACE training with --model MACE, or left it as the default (not ScaleShiftMACE by some chance), please verify that when you compute an isolated atom with MACE, you get back exactly the correct DFT E0s.
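
(A minimal sketch of that check through MACE's ASE calculator; the model path and box size are placeholders:)

from ase import Atoms
from mace.calculators import MACECalculator

calc = MACECalculator(model_paths="MACE_model.model", device="cpu", default_dtype="float64")
# isolated O atom in a large box; the predicted energy should reproduce the DFT E0
atom = Atoms("O", positions=[[10.0, 10.0, 10.0]], cell=[20.0, 20.0, 20.0], pbc=True)
atom.calc = calc
print("MACE E0(O) =", atom.get_potential_energy(), "eV")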

gabor1 (Collaborator) commented Sep 10, 2024

ASE is used to load the configurations, so we have to follow its quirks and conventions.
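
(For example, one can check which keys ASE actually exposes after reading a frame; the file name is a placeholder:)

from ase.io import read

atoms = read("output.xyz", index=0)
print("info keys:  ", sorted(atoms.info))
print("array keys: ", sorted(atoms.arrays))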

nihil39 (Author) commented Sep 10, 2024

> So, assuming your isolated-atom DFT numbers are right, you give these as E0s to the MACE training. After you have done the MACE training with --model MACE, or left it as the default (not ScaleShiftMACE by some chance), please verify that when you compute an isolated atom with MACE, you get back exactly the correct DFT E0s.

Thanks. I can also check what E0 values are computed with E0s: average; they should not be exactly 0.0 now that we are using the right energy key, as @ilyes319 suggested.

gabor1 (Collaborator) commented Sep 10, 2024

I really don't recommend using the "average" option.

nihil39 (Author) commented Sep 10, 2024

> I really don't recommend using the "average" option.

I know, I was just curious to check the differences between the two approaches. I will provide the E0s explicitly in the training file.

gabor1 (Collaborator) commented Sep 10, 2024

Sure. If you train with "average", you should find that when you evaluate the isolated atoms with MACE you will get much lower numbers (larger negative numbers) that match the average binding energies per atom.
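
(As a rough illustration with the numbers from this thread: the 375-atom frames sit at around −1880 eV each, so a least-squares "average" fit will absorb roughly −1880 / 375 ≈ −5 eV into the per-element E0s, and an isolated atom evaluated with such a model would come out near −5 eV instead of near the true isolated-atom values of about −0.1 and −0.4 eV.)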

ilyes319 (Contributor) commented Oct 3, 2024

I will turn this into a discussion so we can keep track of it.

ACEsuit locked and limited conversation to collaborators Oct 3, 2024
ilyes319 converted this issue into discussion #621 Oct 3, 2024
