I am using dpgen to train an ensemble of 4 models with 3,000,000 steps, but the HPC cluster queue only gives me 2 days per job. How can I restart the training from the last step (1,000,000)? #1645

Open
marcog2020460 opened this issue Sep 29, 2024 · 3 comments

Comments

@marcog2020460

Summary

I am using dpgen to train an ensemble of 4 models with 3,000,000 steps ("stop_batch": 3000000), but the queues on our HPC cluster limit every job to two days. How can I restart the training from the last step reached (1,000,000) so that it finishes the remaining 2,000,000 steps?
I do not want the training to start from zero again.

-------------------------iter.000000 task 03--------------------------
: -------------------------iter.000000 task 04--

Please help. I searched the internet for answers before submitting this request.
How should I modify the param.json file?

DP-GEN Version

v0.12.0

Platform, Python Version, etc

slurm hpc cluster

Details

"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}

@njzjz
Member

njzjz commented Sep 30, 2024

Restarting is supported by default.

@chenggoj

chenggoj commented Oct 5, 2024

I have the same question. By default, DP-GEN restarts from the beginning of the last iteration recorded in record.dpgen, which is not what I want.
I would like to restart from a checkpoint instead. I noticed that the documentation says:

"If the process of DP-GEN stops for some reasons, DP-GEN will automatically recover the main process by record.dpgen. You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."

But it does not provide an example. I am confused by the phrase "such as removing the last iterations and recovering from one checkpoint" and do not know how to do this. Could you please help? Thanks a lot!

@Wanwan-Laang

Question from @marcog2020460: "How can I restart the training from the last step (1,000,000) to complete the remaining 2,000,000 steps?"

  • Submit the job again exactly as you did the first time. DP-GEN resumes training from the last saved checkpoint, so it continues from where it left off.
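
For a single training task, the resume behaviour described above can be sketched as follows. This is a minimal illustration, not DP-GEN's actual code. It assumes the task directory contains an `input.json` and, after an interrupted run, the TensorFlow checkpoint index file `model.ckpt.index` (matching `"save_ckpt": "model.ckpt"` in the settings posted above). The helper name `build_train_command` is hypothetical; `dp train --restart` is the DeePMD-kit command-line option for continuing from a checkpoint.

```python
from pathlib import Path


def build_train_command(task_dir, ckpt_prefix="model.ckpt", input_file="input.json"):
    """Return the `dp train` argument list for one training task.

    Assumption: once at least one checkpoint has been saved, the file
    `<ckpt_prefix>.index` exists in the task directory.
    """
    command = ["dp", "train", input_file]
    if (Path(task_dir) / f"{ckpt_prefix}.index").is_file():
        # A checkpoint exists: continue from it instead of starting at step 0.
        command += ["--restart", ckpt_prefix]
    return command
```

Because the step counter is stored in the checkpoint, a run restarted this way should pick up near step 1,000,000 rather than at 0, and `"save_freq": 1000` bounds how many steps can be lost when the queue kills a job.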

Question from @chenggoj about the documentation statement: "You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."

  • DP-GEN handles recovery automatically through the record.dpgen file, which tracks the progress of the main process. To restart from a particular step, edit this file by hand and remove the entries you want to redo. For example, to restart a DFT calculation within iter.000000, delete the corresponding 0 6 line (and any lines that follow it) in record.dpgen. DP-GEN will then redo the workflow from that step of the iteration.
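
The manual edit described above can be sketched as a small script. It assumes that record.dpgen holds one whitespace-separated `iteration task` pair per line (for example `0 6`), written in the order the steps finished. The function name `truncate_record` and the `.bak` backup copy are illustrative choices, not part of DP-GEN.

```python
import shutil
from pathlib import Path


def truncate_record(record_path, iteration, task):
    """Drop the `iteration task` line and every line after it.

    Returns the lines that were kept. A backup copy is written next to the
    original file before it is modified.
    """
    path = Path(record_path)
    lines = path.read_text().splitlines()
    target = [str(iteration), str(task)]

    kept = []
    for line in lines:
        if line.split() == target:
            break  # first matching entry: it and everything after it are dropped
        kept.append(line)
    else:
        # The loop finished without finding the entry; leave the file untouched.
        raise ValueError(f"no '{iteration} {task}' entry in {path}")

    shutil.copy(path, path.with_name(path.name + ".bak"))
    path.write_text("".join(line + "\n" for line in kept))
    return kept
```

Calling `truncate_record("record.dpgen", 0, 6)` and then rerunning the usual `dpgen run param.json machine.json` command would correspond to the example given above. Check the backup against the edited file before resubmitting.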
