feature request: More robust error recovery #110
Comments
Hey @hectorpal, thanks for the suggestion! I'm not sure how to improve the status quo on this, however. It's easy to detect runs that have not been started, but as you say, it's tricky to check whether a run was successful. I don't see a general way of doing so. Do you? The main problem is that we need to count running out of time or memory as a successful run. Those are "expected errors", so to speak.
The problem we were having was that a job could be interrupted for external reasons (e.g. the cluster preempting it). I was thinking that the run.py script could do something at the very end, after closing the run.log and run.err files: write an extra property such as 'job_finished'. That way it would be easy to tell which jobs terminated in the "normal" way and which were killed externally while run.py was still running.
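A minimal sketch of this idea, assuming the per-run metadata lives in a JSON file called `properties` inside the run directory; the helper and the `job_finished` property name are purely illustrative, not lab's actual API:

```python
import json
from pathlib import Path


def mark_job_finished(run_dir):
    """Record a sentinel property after run.log and run.err are closed.

    Illustrative helper: assumes the run directory contains a JSON file
    named "properties" and that "job_finished" is the agreed-upon sentinel.
    """
    props_path = Path(run_dir) / "properties"
    properties = json.loads(props_path.read_text()) if props_path.exists() else {}
    properties["job_finished"] = True
    props_path.write_text(json.dumps(properties, indent=2))


# At the very end of run.py, after all output files are closed:
# mark_job_finished(".")
```

A run directory without the sentinel would then be one that was killed externally (e.g. preempted) rather than one that merely failed or hit a resource limit.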
I agree with Alvaro about this general idea. How?

Idea 1

I wonder if the right place is the end of the block at lines 190 to 213 in dfa67fa.
That's waiting for the return of the Popen call. It should return something no matter what happens with the process. One idea would be to create the …

Idea 2

…
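For Idea 1, a hedged sketch of what "write the marker once the Popen call returns" could look like; `run_and_mark` and the `mark_finished` callback are made-up names and this is not the actual code referenced above:

```python
import subprocess


def run_and_mark(cmd, log_path, err_path, mark_finished):
    """Run the command and record that run.py itself reached the end.

    wait() returns an exit code no matter how the child terminates
    (success, time/memory limit, or crash), so the sentinel is only
    missing when run.py itself was killed externally, e.g. by preemption.
    """
    with open(log_path, "w") as stdout, open(err_path, "w") as stderr:
        process = subprocess.Popen(cmd, stdout=stdout, stderr=stderr)
        returncode = process.wait()
    # Reaching this line means the run terminated "normally" from lab's
    # point of view, even if the planner failed or hit a resource limit.
    mark_finished(returncode)
    return returncode
```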
Your second proposal sounds like it could work. I'll think more about this after the break.
Good! I was wondering about race conditions when using multiple CPU cores. I guess the iteration over runs is centralized, so there isn't much to coordinate. Otherwise, I was wondering whether a per-run directory lock is necessary or already used. If that's happening, it would interact with the idea I proposed. Happy holidays!
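If a per-run lock ever turned out to be necessary, one way to picture it is an atomically created lock file inside the run directory. This is only a sketch of the idea, not how lab currently coordinates runs:

```python
import os
from contextlib import contextmanager


@contextmanager
def run_dir_lock(run_dir):
    """Hold an exclusive lock on a run directory via an atomic lock file.

    os.open with O_CREAT | O_EXCL fails if the file already exists, so
    only one process can acquire the lock at a time. Illustrative only.
    """
    lock_path = os.path.join(run_dir, ".lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```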
Hi there!
The FAQ (latest version) says:
It would be nice to have an option that makes restarting an experiment idempotent.
That is, restarting a failed run should protect the integrity of the experiment without requiring manual deletion of files like driver.log. That would be useful when using lab on computing infrastructure where jobs can be preempted to run another task with higher priority. (This is typical in settings where many of the other tasks are training jobs that are themselves idempotent.)
If that is not convenient as the default behaviour, perhaps it could be enabled by an additional option.
I understand a potential issue is that some runs can just keep failing, so perhaps reaching idempotence is more subtle, but it'd be a great feature.
/cc @matgreco @alvaro-torralba
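To make the idempotence request concrete, here is a hedged sketch that builds on the job_finished idea discussed in the comments above: before resubmitting, delete any run directory that lacks the sentinel, so rerunning the experiment only redoes incomplete runs. The glob pattern, file names, and property name are illustrative assumptions, not lab's actual layout or API:

```python
import json
import shutil
from pathlib import Path


def prepare_for_restart(exp_dir):
    """Delete partial run directories so resubmitting is idempotent.

    Illustrative only: assumes each run directory under runs-*/* holds a
    JSON "properties" file and that finished runs carry the hypothetical
    "job_finished" property.
    """
    for run_dir in Path(exp_dir).glob("runs-*/*"):
        props_path = run_dir / "properties"
        try:
            finished = json.loads(props_path.read_text()).get("job_finished", False)
        except (FileNotFoundError, json.JSONDecodeError):
            finished = False
        if not finished:
            # Remove the partial run so a restarted job recreates it from scratch.
            shutil.rmtree(run_dir)
```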