Issue with Second Restart with PESTPP-GLM #231

jrwbell · 2023-01-28T22:38:56Z

Hi PESTPP Community.

I've been using PEST since 2006, and I'm trying to overcome a limitation with BEOPEST64: any millisecond 'hiccup' in network communications seems to cause it to fall over. I've optimised my home network for my cluster as much as I can, but this issue keeps causing grief.

For BEOPEST64, I suspect that it doesn't have sufficient (or any) time delay in its attempts to communicate between Manager and Agent, and hence 'times out' instantly, even though whatever is causing that millisecond 'hiccup' in the network isn't enough for Windows to report a problem. I have static IPs on the Manager and most of the Agents, but not all of them. I run a Gigabit LAN at home, and I wonder whether what's happening has to do with how fast network communication has become compared to the 10/100 Mbps of the past. I'm speculating, as I'm not a network engineer.

Anyhow, I've been experimenting with PESTPP-GLM to see whether I could use it instead of BEOPEST64, since PANTHER seems solid.

The purpose of the experiment is to emulate a situation where the Manager crashes due to a power outage or the network properly dies, i.e. a major failure. With 5000 runs, reliably restarting a Jacobian run is important.

In my experiment, during an active run, I copy across the .RNJ file (presuming this is the equivalent to .PRF from BEOPEST64) plus everything else to a save folder, just like what I have done for years with BEOPEST64.

After killing the Manager (either through Ctrl+C or just "X" to kill the DOS window; representative of a power failure on the Manager), if I copy the .RNJ back from the save folder into the Manager's folder, I seem to be able to restart the Manager where I left off the first time. i.e. it recognises the runs already completed, as I would hope and all is good.

I then continued my experiment, where I let the Manager progress further from that restart, copying the .RNJ and the other files across to a save folder during the active run.

I then killed the Manager again (either through Ctrl+C or just "X" to kill the DOS window; I've tried both), then copied the .RNJ back from the save folder as before. The problem, though, is that PESTPP-GLM won't restart the second time. It just reports that all the runs need to be redone.

I repeated the experiment several times, using Ctrl+C or "X", copying just the .RNJ or all of the files across from the save folder back to the Manager. Each time it works for the first restart but not for the second restart.
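
For anyone wanting to reproduce this, the save-and-restore step can be sketched roughly as below. This is only an illustration of the manual copying described above, not anything PESTPP-GLM provides: the function names `snapshot` and `restore`, the folder paths, and the default `*.rnj` pattern are all assumptions. It only automates the file copying; it does not change how PESTPP-GLM reads the file on restart.

```python
import shutil
from pathlib import Path


def snapshot(run_dir, save_dir, patterns=("*.rnj",)):
    """Copy files matching `patterns` from run_dir into save_dir.

    Hypothetical helper: names and defaults are illustrative only.
    Returns the names of the files that were copied.
    """
    run_dir, save_dir = Path(run_dir), Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(run_dir.glob(pattern)):
            if not src.is_file():
                continue
            # Copy to a temporary name first, then rename, so an interrupted
            # copy never overwrites the last good saved file.
            tmp = save_dir / (src.name + ".tmp")
            shutil.copy2(src, tmp)
            tmp.replace(save_dir / src.name)
            copied.append(src.name)
    return copied


def restore(save_dir, run_dir, patterns=("*.rnj",)):
    """Copy the saved files back into the Manager's folder before a restart."""
    return snapshot(save_dir, run_dir, patterns)


# Example usage (paths are placeholders):
# snapshot("C:/models/manager", "C:/models/save")   # during the active run
# restore("C:/models/save", "C:/models/manager")    # after killing the Manager
```

Pattern matching follows the file system's case rules, so on a case-sensitive system `*.rnj` will not match a file named with an upper-case `.RNJ` extension. Passing a broader tuple of patterns covers the "copy everything else as well" variant of the experiment.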

Anyhow, I hope someone could shed some light on how to get this to work reliably, as I think PANTHER looks solid and I'd like to use it for these big Jacobian runs (as it'll forgive these millisecond 'hiccups' in the network, I feel).

regards
Justin

jtwhite79 (Maintainer) · 2023-01-30T06:16:25Z

Hey @jrwbell - I checked this out and can confirm the multiple-jco restart has a bug and I think I've got it fixed (at least for the base-parameter (not SVDA) use case). So this should be updated in the next release - hopefully later this week.

However, full disclosure: PEST_HP is likely going to be a stronger tool for Tikhonov-regularized parameter estimation. There are so many dirty tricks and learnings that are baked into PEST_HP for this use case that pestpp-glm does not have. I'm not trying to dissuade you from using pestpp-glm - it should still minimize the objective function (and it does the FOSM uncertainty thing automatically), just trying to manage expectations...

jrwbell (Author) · Jan 30, 2023

Thanks Jeremy. Tracking that down is appreciated. I'll cool my heels for a bit. No worries re: pestpp-glm, your efforts are greatly appreciated, along with the rest of the development team.
