Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider fail-safe for checking on terminated models #688

Open
seth127 opened this issue May 9, 2024 · 4 comments
Open

Consider fail-safe for checking on terminated models #688

seth127 opened this issue May 9, 2024 · 4 comments

Comments

@seth127
Copy link
Collaborator

seth127 commented May 9, 2024

In #685 (comment) @kyleam pointed out numerous situations where a model process can be terminated, but not write out either bbi_config.json or write "Stop Time:" to the .lst file.

To address these, we could potentially build in a fail-safe where we hash OUTPUT files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in <N> iterations. This might be kinda hacky and extreme, but we may want to add it if we’re worried about enough cases where a model stops without writing a bbi_config.json (which could lead to endlessly checking back, thinking it's still running). Related thoughts:

  • Would we consider writing our own bbi_config.json (from bbr) that “pronounced it dead”? I kind of like this idea, but probably needs more thought.
  • if we did this ^ and then we were wrong (i.e. the model was still running), and then bbi tried to write its own bbi_config.json later… what would happen? If it just overwrites the previous one, is that actually just ok and reasonable?
@kyleam
Copy link
Collaborator

kyleam commented May 9, 2024

My initial thought is that I see the motivation, but I'm worried that "every N iterations" may not be calibrated right to account for complicated models that take more time than expected to move (perhaps not as much of a worry in the context of bootstrap?). I suppose conceptually it's just inherently racy, which makes me nervous, but I know you're saying it might be good enough and do more good than harm in practice.

Anyway, my main comment is that...

fail-safe where we hash OUTPUT files every 10 iterations or so, and “consider dead” if a process hasn’t written to it in <N> iterations

... for this use case (a file that's just being appended to), I think you could just stat the file and look at whether the size has changed (i.e. file.size() from R). That'll be sufficient for the check and more efficient than hashing the content.

@kyleam
Copy link
Collaborator

kyleam commented May 9, 2024

and then bbi tried to write its own bbi_config.json later… what would happen?

Based on this test, bbi would just overwrite the file:

bbi nonmem run local 1.ctl &
sleep 0.5
echo alien >1/bbi_config.json
echo 'alien placed; exiting'

@seth127
Copy link
Collaborator Author

seth127 commented May 9, 2024

Based on this test, bbi would just overwrite the file

And... do we think this is fine? I'm leaning towards yes, because it would basically be

  • We thought it was dead, so we marked it dead
  • Any checks will not wait for it, and model_summary() calls (while it's still running) will fail (which is what they'd do anyway while it's still running)
    • If we add in some custom handling of our new "it's dead" bbi_config.json then any intermediate run of config_log() or something might give misleading results, but this is still a hypothetical edge case.
  • The model finishes and overwrite the bbi_config.json
  • Future model_summary(), etc. calls will reflect the model finishing successfully
  • I don't think there will be any evidence left behind that this was once "considered dead", which is probably fine.

Do you agree with that assessment?

@kyleam
Copy link
Collaborator

kyleam commented May 9, 2024

@seth127 Yes, that sounds reasonable to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants