feature request: More robust error recovery #110

Open
hectorpal opened this issue Dec 16, 2022 · 5 comments

@hectorpal

Hi there!

The FAQ (latest version) says:

Some runs failed. How can I rerun them?

If the failed runs were never started, for example, due to grid node failures, you can simply run the “start” experiment step again. It will skip all runs that have already been started. Afterwards, run “fetch” and make reports as usual.
Lab detects which runs have already been started by checking if the driver.log file exists. So if you have failed runs that were already started, but you want to rerun them anyway, go to their run directories, remove the driver.log files and then run the “start” experiment step again as above.
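For concreteness, that manual procedure amounts to something like the following sketch (the experiment path and run directories are hypothetical, and you still have to identify the failed runs yourself, e.g. from your reports or error logs):

```python
from pathlib import Path

# Hypothetical paths: run directories of the failed runs you want to rerun.
failed_run_dirs = [
    Path("data/my-exp/runs-00001-00100/00042"),
    Path("data/my-exp/runs-00001-00100/00077"),
]

for run_dir in failed_run_dirs:
    driver_log = run_dir / "driver.log"
    if driver_log.exists():
        # Without driver.log, the next "start" step treats the run as
        # never started and executes it again.
        driver_log.unlink()
```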

It would be nice to have the option of making experiment restarts idempotent.
That is, restarting failed runs should preserve the integrity of the experiment automatically, without manually deleting files like driver.log.
This would be useful when running lab on computing infrastructure where jobs can be preempted by tasks with higher priority. (This is typical in settings where many of the other tasks are training jobs, which are themselves idempotent.)

If that is not convenient as the default behaviour, perhaps it could be enabled via an additional option.

I understand a potential issue is that some runs can just keep failing, so perhaps reaching idempotence is more subtle, but it'd be a great feature.

/cc @matgreco @alvaro-torralba

@jendrikseipp
Collaborator

Hey @hectorpal, thanks for the suggestion! I'm not sure how to improve the status quo on this, however. It's easy to detect runs that have not been started. But as you say, it's tricky to check whether a run was successful, and I don't see a general way of doing so. Do you? The main problem is that we need to count running out of time or memory as a successful run. Those are "expected errors", so to speak.

@alvaro-torralba

The problem we were having was jobs being interrupted for external reasons (e.g. the cluster preempting the job). I was thinking that perhaps the run.py script could do something at the very end, after closing the run.log and run.err files, such as writing an extra property, for example 'job_finished'. That way it would be easy to check which jobs terminated in the "normal" way and which ones were killed externally.
But I'm not sure whether this works with the current way of setting the time and memory limits.
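Purely as an illustration (the marker name "job_finished", the atexit hook, and the assumption that run.py's working directory is the run directory are all my own, not existing lab behaviour), the end of run.py could look like:

```python
import atexit
from pathlib import Path


def _mark_run_finished():
    # An empty file is enough; its mere existence means run.py reached a
    # normal interpreter exit. File creation is atomic on POSIX systems.
    Path("job_finished").touch()


# atexit handlers do not run if the process is killed by SIGKILL (typical
# for preemption), so a missing marker distinguishes externally killed jobs
# from runs that ended "normally", including expected time/memory failures
# of the solver subprocess.
atexit.register(_mark_run_finished)
```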

@hectorpal
Author

hectorpal commented Dec 19, 2022

I agree with Alvaro's general idea:
at the very end, no matter what happened during the run, idempotency could be achieved by creating a file that marks the run as done. File creation is atomic.
Even if the run failed because of memory, it should be possible to create a 'job_finished' file before releasing the parallel worker. In this scheme, work can be lost if a run is interrupted after it finishes but before 'job_finished' is created. That's the price to pay for idempotency.
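The check on restart could then look like this sketch (again, "job_finished" and the helper are hypothetical, not lab API):

```python
from pathlib import Path


def needs_restart(run_dir):
    """Hypothetical check the "start" step could use instead of driver.log."""
    # A run is executed (again) unless it left the completion marker behind,
    # so interrupted runs are picked up automatically on the next "start".
    return not (Path(run_dir) / "job_finished").exists()
```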

How?

Idea 1

I wonder if the right place is the end of wait() here:

lab/lab/calls/call.py

Lines 190 to 213 in dfa67fa

```python
def wait(self):
    wall_clock_start_time = time.time()
    self._redirect_streams()
    retcode = self.process.wait()
    for stream, _ in self.redirected_streams_and_limits.values():
        # Write output to disk before the next Call starts.
        stream.flush()
        os.fsync(stream.fileno())
    # Close files that were opened in the constructor.
    for file in self.opened_files:
        file.close()
    wall_clock_time = time.time() - wall_clock_start_time
    logging.info(f"{self.name} wall-clock time: {wall_clock_time:.2f}s")
    if (
        self.wall_clock_time_limit is not None
        and wall_clock_time > self.wall_clock_time_limit
    ):
        logging.error(
            f"wall-clock time for {self.name} too high: "
            f"{wall_clock_time:.2f} > {self.wall_clock_time_limit}"
        )
    logging.info(f"{self.name} exit code: {retcode}")
    return retcode
```

That waits for the Popen process to return, and it returns an exit code no matter what happened to the process. One idea would be to create the job_finished file right before returning; a sketch is below. Perhaps that's not enough, though, as the return value still has to be stored later. So here is another idea.
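A minimal sketch of Idea 1, assuming a hypothetical wrapper around an existing Call object (the marker name "job_finished" and the wrapper itself are not part of lab):

```python
from pathlib import Path


def wait_and_mark(call, run_dir):
    """Hypothetical wrapper around Call.wait(), for illustration only."""
    retcode = call.wait()
    # Written only after wait() has returned, so a preempted or killed job
    # never leaves a marker behind.
    (Path(run_dir) / "job_finished").touch()
    return retcode
```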

Idea 2

Create job_finished at the highest-level point, right before switching to the next run (see the sketch below).
This is compatible with the run producing a plan, or failing because of the bounded memory or time, or even an execution failure.
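A minimal sketch of Idea 2, assuming a hypothetical top-level run loop (`execute_run` and the marker name are placeholders, not lab API):

```python
from pathlib import Path


def execute_runs(run_dirs, execute_run):
    """Hypothetical top-level loop over runs, for illustration only."""
    for run_dir in map(Path, run_dirs):
        if (run_dir / "job_finished").exists():
            # Idempotent restart: this run already completed in a previous
            # invocation, so skip it.
            continue
        # The run may "fail" due to time or memory limits; that still counts
        # as done.
        execute_run(run_dir)
        # Last action for this run, right before switching to the next one.
        (run_dir / "job_finished").touch()
```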

@jendrikseipp
Collaborator

Your second proposal sounds like it could work. I'll think more about this after the break.

@hectorpal
Author

Good!

I was wondering about race conditions when using multiple CPU cores.

I guess the iteration over runs is centralized, so there isn't much to coordinate. Otherwise, I was wondering whether a per-run lock on the run directory is necessary or already used. If that is the case, it would interact with the idea I proposed (see the sketch below).
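In case such a lock turns out to be needed, here is a minimal sketch of a per-run-directory lock. Everything here (the ".lock" file name, the flock-based scheme) is an assumption for illustration, not something lab currently does:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_dir_lock(run_dir):
    # One lock file per run directory; flock gives an exclusive advisory
    # lock that is released automatically if the holder dies.
    lock_file = (Path(run_dir) / ".lock").open("w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until exclusive
        yield
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```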

Happy holidays!
