Consider fail-safe for checking on terminated models #688
Comments
My initial thought is that I see the motivation, but I'm worried that "every N iterations" may not be calibrated right to account for complicated models that take more time than expected to move (perhaps not as much of a worry in the context of bootstrap?). I suppose conceptually it's just inherently racy, which makes me nervous, but I know you're saying it might be good enough and do more good than harm in practice. Anyway, my main comment is that...
... for this use case (a file that's just being appended to), I think you could just stat the file and look at whether the size has changed (i.e. |
Based on this test,
And... do we think this is fine? I'm leaning towards yes, because it would basically be
Do you agree with that assessment?
@seth127 Yes, that sounds reasonable to me.
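The stat-based check suggested in the comments could look roughly like the sketch below. It is only an illustration in Python, not bbr code: the helper names `file_size` and `size_changed` are hypothetical, and it assumes the file is only ever appended to, so a change in size is enough to show the process is still writing.

```python
import os


def file_size(path):
    """Return the file size in bytes, or None if the file does not exist."""
    try:
        return os.stat(path).st_size
    except FileNotFoundError:
        return None


def size_changed(path, previous_size):
    """Return (changed, current_size) for a file that is only appended to.

    `previous_size` is whatever `file_size` returned at the last check
    (hypothetical bookkeeping kept by the caller).
    """
    current_size = file_size(path)
    return current_size != previous_size, current_size
```

Compared with hashing, this never reads the file contents, which matters if the output file gets large.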
In #685 (comment) @kyleam pointed out numerous situations where a model process can be terminated, but not write out either `bbi_config.json` or write "Stop Time:" to the `.lst` file.

To address these, we could potentially build in a fail-safe where we hash `OUTPUT` files every 10 iterations or so, and "consider dead" if a process hasn't written to it in `<N>` iterations. This might be kinda hacky and extreme, but we may want to add it if we're worried about enough cases where a model stops without writing a `bbi_config.json` (which could lead to endlessly checking back, thinking it's still running).

Related thoughts:
- [...] `bbi_config.json` (from `bbr`) that "pronounced it dead"? I kind of like this idea, but probably needs more thought.
- [...] `bbi` tried to write its own `bbi_config.json` later… what would happen? If it just overwrites the previous one, is that actually just ok and reasonable?
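
A minimal sketch of the proposed fail-safe, written in Python purely for illustration (bbr itself is an R package, and none of these names come from it). `StalenessTracker`, `file_digest`, and `max_unchanged_checks` are hypothetical; `max_unchanged_checks` stands in for the `<N>` above. The idea is to hash the file at each check and "consider dead" once the hash has stayed the same for that many consecutive checks.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return a SHA-256 hex digest of the file contents, or None if it is missing."""
    p = Path(path)
    if not p.exists():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


class StalenessTracker:
    """Count consecutive checks in which a file's contents did not change."""

    def __init__(self, path, max_unchanged_checks):
        self.path = path
        self.max_unchanged_checks = max_unchanged_checks  # hypothetical stand-in for <N>
        self.last_digest = file_digest(path)
        self.unchanged_checks = 0

    def check(self):
        """Record one check; return True if the process should be considered dead."""
        digest = file_digest(self.path)
        if digest == self.last_digest:
            self.unchanged_checks += 1
        else:
            # The file changed since the last check, so reset the counter.
            self.last_digest = digest
            self.unchanged_checks = 0
        return self.unchanged_checks >= self.max_unchanged_checks
```

This is still inherently racy: a slow model that writes nothing between checks looks the same as a dead one, so the threshold would need to be chosen conservatively.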