I am using dpgen to train an ensemble of 4 models with 3,000,000 steps, but the hpc cluster queue only gives me time for 2 days how can I restart the training from the last step 1,000,000 #1645

marcog2020460 · 2024-09-29T18:45:07Z

Summary

I am using dpgen to train an ensemble of 4 models for 3,000,000 steps ("stop_batch": 3000000), but the queues on our HPC cluster limit every run to two days. How can I restart the training from the last completed step (1,000,000) to finish the remaining 2,000,000 steps?
I do not want the training to start from zero again.

-------------------------iter.000000 task 03--------------------------
: -------------------------iter.000000 task 04--

Please help me; I searched the internet for answers before submitting this request.
How should I modify the param.json file?

DP-GEN Version

v0.12.0

Platform, Python Version, etc

Slurm HPC cluster

Details

"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}

njzjz · 2024-09-30T19:31:48Z

Restarting is supported by default.

chenggoj · 2024-10-05T14:36:07Z

I have the same question. By default, DP-GEN restarts from the beginning of the last iteration recorded in record.dpgen, which is not what I want.
I would like to restart from a checkpoint instead. I noticed that the documentation says:

"If the process of DP-GEN stops for some reasons, DP-GEN will automatically recover the main process by record.dpgen. You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."

But it does not provide an example. I am confused by the statement "such as removing the last iterations and recovering from one checkpoint" and have no idea how to do this. Could you please help? Thanks a lot!

Wanwan-Laang · 2024-11-13T18:53:55Z

Question from @marcog2020460: "How can I restart the training from the last step (1,000,000) to complete the remaining 2,000,000 steps?"

You can simply resubmit the job as you did initially. DP-GEN will resume training from the last saved checkpoint and continue from where it left off.
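To confirm that a resubmitted job really continued instead of starting over, one option is to look at the last step written to each model's lcurve.out (the "disp_file" in the config above). The following is a minimal sketch, not part of DP-GEN. It assumes that lcurve.out uses "#" for header lines and puts the step number in the first column, and that the training tasks live in numbered subdirectories such as iter.000000/00.train/000; both the layout and the helper names are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional


def last_step_from_lcurve_text(text: str) -> Optional[int]:
    """Return the step number on the last data row, or None if there is none."""
    last = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "#" header lines (assumed lcurve.out format).
        if not line or line.startswith("#"):
            continue
        try:
            last = int(line.split()[0])  # first column is assumed to be the step
        except ValueError:
            continue  # ignore rows that do not start with an integer
    return last


def report_progress(train_dir: str, stop_batch: int) -> None:
    """Print the last recorded step for every numbered task directory."""
    # Hypothetical layout: <train_dir>/000, <train_dir>/001, ...
    for task in sorted(Path(train_dir).glob("[0-9][0-9][0-9]")):
        lcurve = task / "lcurve.out"
        if not lcurve.is_file():
            print(f"{task.name}: no lcurve.out yet")
            continue
        step = last_step_from_lcurve_text(lcurve.read_text())
        print(f"{task.name}: last step {step}, target {stop_batch}")
```

Calling report_progress("iter.000000/00.train", 3000000) before and after resubmitting should show the last step growing past 1,000,000 if the checkpoint was picked up.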

Question from @chenggoj, about the statement: "You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."

DP-GEN manages recovery automatically through the record.dpgen file, which keeps track of the main process. If you want to restart from a particular step, you can edit this file by removing the line for that step and every line after it. For example, to redo a DFT calculation in iter.000000, delete the corresponding 0 6 line (and any lines that follow it) in record.dpgen. DP-GEN will then start fresh from that step of the iteration.
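The manual edit described above can also be scripted. This is a minimal sketch, not a DP-GEN tool: it assumes each line of record.dpgen holds two integers, the iteration index and the step index, as in the "0 6" example, and the function names are made up for illustration. It keeps a backup copy so the original record can be restored.

```python
import shutil
from pathlib import Path
from typing import List


def trim_record(lines: List[str], iter_idx: int, step_idx: int) -> List[str]:
    """Keep only the entries that come strictly before (iter_idx, step_idx)."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        entry = (int(fields[0]), int(fields[1]))
        # Tuple comparison orders entries by iteration first, then by step.
        if entry < (iter_idx, step_idx):
            kept.append(f"{entry[0]} {entry[1]}")
    return kept


def trim_record_file(path: str, iter_idx: int, step_idx: int) -> None:
    """Rewrite record.dpgen in place, saving the original as <path>.bak."""
    record = Path(path)
    shutil.copy(record, record.with_name(record.name + ".bak"))
    kept = trim_record(record.read_text().splitlines(), iter_idx, step_idx)
    record.write_text("".join(f"{entry}\n" for entry in kept))
```

For the example in this reply, trim_record_file("record.dpgen", 0, 6) would drop the 0 6 line and everything after it. It is safest to run this only while no DP-GEN process is active.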
