I am using dpgen to train an ensemble of 4 models with 3,000,000 steps, but the HPC cluster queue only gives me 2 days per run. How can I restart the training from the last step (1,000,000)?
#1645
Open
marcog2020460 opened this issue
Sep 29, 2024
· 3 comments
I am using dpgen to train an ensemble of 4 models with 3,000,000 steps ("stop_batch": 3000000), but the HPC cluster queues have a time limit of two days for every run. How can I restart the training from the last step (1,000,000) in order to finish the remaining 2,000,000 steps?
I do not want my training to start from zero again.
I have the same question. By default, DP-GEN restarts from the beginning of the last iteration recorded in record.dpgen, which is not what I want.
I would like to restart from a checkpoint. I noticed that the documentation says:
"If the process of DP-GEN stops for some reasons, DP-GEN will automatically recover the main process by record.dpgen. You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."
But it does not provide any example. I am confused by the statement "such as removing the last iterations and recovering from one checkpoint" and have no idea how to do this. Could you please help? Thanks a lot!
Question from @marcog2020460: "How can I restart the training from the last step (1,000,000) to complete the remaining 2,000,000 steps?"
You can resubmit the job exactly as you did initially. DP-GEN will resume training from the last saved checkpoint, so it continues from where it left off instead of starting from step zero.
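The resume behaviour described above can be pictured with a minimal sketch. It is not DP-GEN's actual code. It assumes the training directory holds a TensorFlow checkpoint named model.ckpt (matching "save_ckpt" in the question), that the training input file is called input.json, and it uses the DeePMD-kit `dp train --restart` option. The helper name `build_train_command` is hypothetical.

```python
from pathlib import Path


def build_train_command(work_dir, input_file="input.json", ckpt="model.ckpt"):
    """Return a DeePMD-kit training command, restarting if a checkpoint exists.

    Hypothetical helper for illustration; the file names are assumptions.
    """
    # A TensorFlow checkpoint is written as several files; the ".index" file
    # is used here as the marker that a checkpoint has been saved.
    if (Path(work_dir) / f"{ckpt}.index").is_file():
        return ["dp", "train", input_file, "--restart", ckpt]
    # No checkpoint yet: start training from step zero.
    return ["dp", "train", input_file]
```

Under these assumptions, a training directory that already contains model.ckpt.index yields a command with `--restart model.ckpt`, so the step counter continues from the saved value rather than from zero.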
Question from @chenggoj: "You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."
DP-GEN manages recovery automatically through the record.dpgen file, which keeps track of the steps the main process has completed. If you want to restart from a particular step, you can edit this file by hand, for instance by removing specific entries. For example, to restart a DFT calculation within iter.000000, delete the corresponding "0 6" line (and any lines that follow it) in record.dpgen. DP-GEN will then start again from that step of the iteration.
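The manual edit described above can be sketched as a small script. This is an illustration only, assuming each line of record.dpgen holds two integers, the iteration index and the step index, in the order the steps were completed. The function name `truncate_record` is hypothetical and not part of DP-GEN.

```python
def truncate_record(lines, iter_idx, step_idx):
    """Keep only the record entries that come before (iter_idx, step_idx).

    Assumes each entry is "<iteration> <step>"; blank or malformed lines
    are dropped.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        entry = (int(parts[0]), int(parts[1]))
        # Stop at the first entry at or after the step to be redone.
        if entry >= (iter_idx, step_idx):
            break
        kept.append(f"{entry[0]} {entry[1]}")
    return kept


# Example: drop "0 6" and everything after it.
records = ["0 0", "0 1", "0 2", "0 3", "0 4", "0 5", "0 6", "0 7", "0 8", "1 0"]
trimmed = truncate_record(records, 0, 6)
```

To apply this to a real run, read record.dpgen into a list of lines, write the trimmed list back, and keep a backup copy of the original file before resubmitting.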
Summary
-------------------------iter.000000 task 03--------------------------
: -------------------------iter.000000 task 04--
Please help me. I looked for answers on the internet before submitting this request.
How should I modify the param.json file?
DP-GEN Version
v0.12.0
Platform, Python Version, etc
slurm hpc cluster
Details
"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}
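As a rough worked example of what these settings imply for the two-day queue limit, the sketch below estimates how many further submissions are needed and where a restart would pick up. It assumes a checkpoint is written every "save_freq" steps and that each two-day run completes about 1,000,000 steps, as the question suggests. Both function names are hypothetical.

```python
import math


def remaining_runs(stop_batch, steps_done, steps_per_run):
    """Number of further job submissions needed to reach stop_batch."""
    return math.ceil((stop_batch - steps_done) / steps_per_run)


def last_checkpoint_step(steps_done, save_freq):
    """Step of the most recent checkpoint, assuming one every save_freq steps."""
    return (steps_done // save_freq) * save_freq


# Values taken from the question: 3,000,000 total steps, 1,000,000 done per run.
runs_left = remaining_runs(3_000_000, 1_000_000, 1_000_000)
# A job killed at step 1,000,500 would resume from the checkpoint at 1,000,000.
resume_step = last_checkpoint_step(1_000_500, 1000)
```

With "save_freq": 1000, a job killed by the scheduler loses fewer than 1000 steps of work under these assumptions.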