Better restart capability for interrupted runs #5232

camelto2 · 2024-11-15T18:12:28Z

Is your feature request related to a problem? Please describe.

It looks like the code has batched restart support, enabled by the *.cont.xml and *.config.h5 files. The cont.xml file is basically a copy of the original input file, except that it now includes an mcwalkerset for restart.
The cont.xml is also only written for the last series, and only if the run actually finishes.

The way the restarts and cont.xml are currently written, they appear to be designed for adding more statistics to a fully completed run. If you are running and you hit the wallclock limit, the cont.xml file is never written, so restarting and continuing from the current series isn't as straightforward.

Describe the solution you'd like
Instead of only writing cont.xml at the end, with an exact copy of all the VMC/DMC runs, it would be nice if each series wrote its own cont.xml at the beginning and included only the driver that the series corresponds to.
That would enable both adding more data to each series if you need it for better statistics, and it would enable restarting if the run is interrupted by wallclock limits.

For example, I tend to have
< vmc > (s000)
< dmc tstep1 > (s001)
< dmc tstep2 > (s002)
< dmc tstep3 > (s003)
< dmc tstep4 > (s004)

where vmc is a fully converged VMC run, tstep1 is a large timestep for equilibration, and tsteps 2-4 are subsequently smaller timesteps used for extrapolation.

At the start of each driver, it could write s000.cont.xml with the corresponding < mcwalkerset fileroot="s000" >, and s000.cont.xml would contain ONLY the < vmc > section. The s001.cont.xml would be written once we start the first DMC, and it would have the < mcwalkerset fileroot="s001" > and only the < dmc tstep1 > driver in it. And so on and so forth.

This way each series would have a *.cont.xml file that continues only its own driver from its current walkerset. My current issue is that I had a run that finished all of my VMC and series 001, 002, and 003, but s004 only got through 2-3 blocks before hitting the wallclock limit. If s004.cont.xml had been written appropriately, I would have a file to continue just that series from. As it currently stands, I had to do a lot of scripting to get what I want.

Also, if we have a run where all of them finished successfully, but we need to add more statistics, you could just restart each of them and they would continue on from their own respective walkersets.
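
To make the proposal concrete, here is a minimal Python sketch of how a per-series continuation file could be derived from the original input. It is not QMCPACK code. The simplified input layout (drivers as `<qmc method="...">` children of the root), the bare `s000`-style fileroot, and the function name `make_series_cont` are all assumptions made for illustration; a real continuation file would likely need more attributes on the mcwalkerset element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified input. Real QMCPACK inputs contain more sections
# and attributes than shown here.
EXAMPLE_INPUT = """<simulation>
  <project id="run" series="0"/>
  <qmc method="vmc"/>
  <qmc method="dmc"><parameter name="timestep">0.02</parameter></qmc>
  <qmc method="dmc"><parameter name="timestep">0.01</parameter></qmc>
</simulation>"""


def make_series_cont(input_xml: str, series: int) -> str:
    """Return a continuation input that keeps only the driver for `series`."""
    cont = ET.fromstring(input_xml)
    drivers = cont.findall("qmc")  # direct children only, in series order
    if not 0 <= series < len(drivers):
        raise IndexError(f"no driver for series {series}")

    # Drop every driver except the one belonging to this series.
    for index, driver in enumerate(drivers):
        if index != series:
            cont.remove(driver)

    # Place a walker set in front of the remaining driver so it continues
    # from its own configurations (fileroot naming is illustrative).
    keep = drivers[series]
    walkers = ET.Element("mcwalkerset", {"fileroot": f"s{series:03d}"})
    cont.insert(list(cont).index(keep), walkers)
    return ET.tostring(cont, encoding="unicode")


if __name__ == "__main__":
    print(make_series_cont(EXAMPLE_INPUT, 2))
```

Calling this once per series would give one small continuation file per driver, which is the layout described above.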

Describe alternatives you've considered

Maybe Nexus could do something like this as well.

ye-luo · 2024-11-15T18:36:48Z

I like this direction. Some cleanup is definitely needed. We need to define what purpose the continuation file serves:

1. Run more statistics.
2. Continue running QMCPACK by rerunning the incomplete series. I feel the current cont.xml serves more like this failure recovery mode.

I see one issue in the proposed scheme:

> The s001.cont.xml would be written once we start the first DMC, and it would have the < mcwalkerset fileroot="s001" > and only the < dmc tstep1 > driver in it. And so on and so forth.

If the DMC run gets killed, we won't have any RNG seed file or configuration file, so there is no way to continue. Thus cont.xml should be written when a series completes, not at the beginning.

camelto2 · 2024-11-15T18:50:19Z

The case I currently care about is if one of the series gets killed by wallclock time. If I'm understanding you correctly, you need both the random.h5 and the config.h5 to properly continue a run. So if s004 got killed by wallclock, there isn't a clean way to pick up where I left off on that series?

jtkrogel · 2024-11-15T19:48:18Z

While we are wishing, I would like for it to be even simpler: just modify the original input file by setting a single parameter

<parameter name="restart_at_series"> 2 </parameter>

QMCPACK would simply know which files to look for based on this request.

This is similar in spirit to the ease of use offered by Quantum Espresso, where one just states restart = .true..
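
As a rough illustration of what such a parameter could drive, the Python sketch below reads a `restart_at_series` value from a simplified input and works out which drivers would run and which earlier series would supply the walkers. The parameter does not exist in QMCPACK today. Its placement in the input, the reading that series N restarts from the files of series N-1, and the function name `plan_restart` are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical, simplified input carrying the proposed parameter.
EXAMPLE_INPUT = """<simulation>
  <parameter name="restart_at_series"> 2 </parameter>
  <qmc method="vmc"/>
  <qmc method="dmc"/>
  <qmc method="dmc"/>
</simulation>"""


def plan_restart(input_xml: str) -> Tuple[int, Optional[str], List[str]]:
    """Return (first series to run, fileroot to load walkers from, drivers to run)."""
    root = ET.fromstring(input_xml)
    drivers = root.findall("qmc")

    # The proposed parameter; when absent, run everything from series 0.
    node = root.find(".//parameter[@name='restart_at_series']")
    start = int(node.text) if node is not None else 0
    if not 0 <= start < len(drivers):
        raise ValueError(f"restart_at_series={start} is out of range")

    # Assumed convention: series N continues from the state saved by N-1.
    load_from = f"s{start - 1:03d}" if start > 0 else None
    return start, load_from, [d.get("method") for d in drivers[start:]]


if __name__ == "__main__":
    print(plan_restart(EXAMPLE_INPUT))
```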

prckent · 2024-11-15T22:31:53Z

Thanks for bringing this up again Cody - we have discussed this a couple of times over the last few years (!). There might be another issue or two on this topic already; a new one does not hurt. It clearly would be a very useful and friendly feature to have in some form. I would go further than Jaron and prefer to have the code automatically figure out how far it got, without any modified input files, by checking which files already exist (simpler). Eventually it could also do restarts within a section, but that is getting far ahead of ourselves.

I think the low-hanging fruit here is to enable restarting a run at a series "m" that was not completed by rerunning that section from the state saved by the previous series "m-1". As you point out, provided we have the configs and random state, we can in principle do a clean restart. For simplicity, and to get something useful working fast, I would also suggest we do not restart if there is a previous optimization series (since we would need to load the updated optimized coefficients). All that really needs to be implemented is writing the cont files at the end of every series and the restart logic.
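
The restart logic described here could look roughly like the Python sketch below. It is not QMCPACK code. It assumes that a series counts as finished only when both its config and random files exist, that the files are named `<prefix>.sNNN.config.h5` and `<prefix>.sNNN.random.h5`, and that optimization drivers carry the label "opt". All three are illustrative assumptions, as are the function names.

```python
from typing import Optional, Sequence, Set

OPT_LABEL = "opt"  # illustrative label for an optimization series


def checkpoint_files(prefix: str, series: int) -> Set[str]:
    """File names assumed to mark a completed series (naming is a guess)."""
    root = f"{prefix}.s{series:03d}"
    return {f"{root}.config.h5", f"{root}.random.h5"}


def find_restart_series(
    existing: Set[str], prefix: str, methods: Sequence[str]
) -> Optional[int]:
    """Return the first series m without a full checkpoint, or None if all finished.

    Series m would then be rerun from the state saved by series m-1.
    """
    for m in range(len(methods)):
        if checkpoint_files(prefix, m) <= existing:
            continue  # both files present: treat series m as completed
        if OPT_LABEL in methods[:m]:
            # Restarting past an optimization would need the updated
            # coefficients, so the simple scheme refuses to do it.
            raise RuntimeError("restart after an optimization series is not supported")
        return m
    return None


if __name__ == "__main__":
    methods = ["vmc", "dmc", "dmc", "dmc"]
    # Series 0 and 1 finished; series 2 was killed before its random file was written.
    found = checkpoint_files("run", 0) | checkpoint_files("run", 1) | {"run.s002.config.h5"}
    print(find_restart_series(found, "run", methods))  # 2
```

Taking the set of file names as an argument, rather than scanning a directory inside the function, keeps the decision logic easy to test. A caller could build the set from a directory listing.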

The step after this would be to allow restarts within a series and not throw away any existing progress from earlier runs, but for this to be useful with DMC we probably need to write a bit more status info. I have other thoughts but prefer to keep the discussion on the simplest thing(s) that would be useful in the short term.

camelto2 added enhancement discussion labels Nov 15, 2024
