feature request: More robust error recovery #110

Open
hectorpal opened this issue Dec 16, 2022 · 5 comments

@hectorpal

Hi there!

The FAQ (latest version) says:

Some runs failed. How can I rerun them?

If the failed runs were never started, for example, due to grid node failures, you can simply run the “start” experiment step again. It will skip all runs that have already been started. Afterwards, run “fetch” and make reports as usual.
Lab detects which runs have already been started by checking if the driver.log file exists. So if you have failed runs that were already started, but you want to rerun them anyway, go to their run directories, remove the driver.log files and then run the “start” experiment step again as above.
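For concreteness, that manual procedure amounts to something like the following sketch (the experiment path and run directories are hypothetical, and you still have to identify the failed runs yourself, e.g. from your reports or error logs):

```python
from pathlib import Path

# Hypothetical paths: run directories of the failed runs you want to rerun.
failed_run_dirs = [
    Path("data/my-exp/runs-00001-00100/00042"),
    Path("data/my-exp/runs-00001-00100/00077"),
]

for run_dir in failed_run_dirs:
    driver_log = run_dir / "driver.log"
    if driver_log.exists():
        # Without driver.log, the next "start" step treats the run as
        # never started and executes it again.
        driver_log.unlink()
```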

It would be nice to have the option of making experiment restarts idempotent.
That is, restarting failed runs should preserve the integrity of the experiment automatically, without manually deleting files like driver.log.
This would be useful when running lab on computing infrastructure where jobs can be preempted by tasks with higher priority. (This is typical in settings where many of the other tasks are training jobs, which are themselves idempotent.)

If that is not convenient as the default behaviour, perhaps it could be enabled via an additional option.

I understand a potential issue is that some runs can just keep failing, so perhaps reaching idempotence is more subtle, but it'd be a great feature.

/cc @matgreco @alvaro-torralba

@jendrikseipp
Collaborator

Hey @hectorpal, thanks for the suggestion! I'm not sure how to improve the status quo on this, however. It's easy to detect runs that have not been started. But as you say, it's tricky to check whether a run was successful, and I don't see a general way of doing so. Do you? The main problem is that we need to count running out of time or memory as a successful run. Those are "expected errors", so to speak.

@alvaro-torralba

The problem we were having was jobs being interrupted for external reasons (e.g. the cluster preempting the job). I was thinking that perhaps the run.py script could do something at the very end, after closing the run.log and run.err files, such as writing an extra property, for example 'job_finished'. That way it would be easy to check which jobs terminated in the "normal" way and which ones were killed externally.
But I'm not sure whether this works with the current way of setting the time and memory limits.
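Purely as an illustration (the marker name "job_finished", the atexit hook, and the assumption that run.py's working directory is the run directory are all my own, not existing lab behaviour), the end of run.py could look like:

```python
import atexit
from pathlib import Path


def _mark_run_finished():
    # An empty file is enough; its mere existence means run.py reached a
    # normal interpreter exit. File creation is atomic on POSIX systems.
    Path("job_finished").touch()


# atexit handlers do not run if the process is killed by SIGKILL (typical
# for preemption), so a missing marker distinguishes externally killed jobs
# from runs that ended "normally", including expected time/memory failures
# of the solver subprocess.
atexit.register(_mark_run_finished)
```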

@hectorpal
Author

hectorpal commented Dec 19, 2022

I agree with Alvaro's general idea:
at the very end, no matter what happened during the run, idempotency could be achieved by creating a file that marks the run as done. File creation is atomic.
Even if the run failed because of memory, it should be possible to create a 'job_finished' file before releasing the parallel worker. In this scheme, work can be lost if a run is interrupted after it finishes but before 'job_finished' is created. That's the price to pay for idempotency.
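The check on restart could then look like this sketch (again, "job_finished" and the helper are hypothetical, not lab API):

```python
from pathlib import Path


def needs_restart(run_dir):
    """Hypothetical check the "start" step could use instead of driver.log."""
    # A run is executed (again) unless it left the completion marker behind,
    # so interrupted runs are picked up automatically on the next "start".
    return not (Path(run_dir) / "job_finished").exists()
```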

How?

Idea 1

I wonder if the right place is the end of wait() here:

lab/lab/calls/call.py

Lines 190 to 213 in dfa67fa

```python
def wait(self):
    wall_clock_start_time = time.time()
    self._redirect_streams()
    retcode = self.process.wait()
    for stream, _ in self.redirected_streams_and_limits.values():
        # Write output to disk before the next Call starts.
        stream.flush()
        os.fsync(stream.fileno())
    # Close files that were opened in the constructor.
    for file in self.opened_files:
        file.close()
    wall_clock_time = time.time() - wall_clock_start_time
    logging.info(f"{self.name} wall-clock time: {wall_clock_time:.2f}s")
    if (
        self.wall_clock_time_limit is not None
        and wall_clock_time > self.wall_clock_time_limit
    ):
        logging.error(
            f"wall-clock time for {self.name} too high: "
            f"{wall_clock_time:.2f} > {self.wall_clock_time_limit}"
        )
    logging.info(f"{self.name} exit code: {retcode}")
    return retcode
```

That waits for the Popen process to return, and it returns an exit code no matter what happened to the process. One idea would be to create the job_finished file right before returning; a sketch is below. Perhaps that's not enough, though, as the return value still has to be stored later. So here is another idea.
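A minimal sketch of Idea 1, assuming a hypothetical wrapper around an existing Call object (the marker name "job_finished" and the wrapper itself are not part of lab):

```python
from pathlib import Path


def wait_and_mark(call, run_dir):
    """Hypothetical wrapper around Call.wait(), for illustration only."""
    retcode = call.wait()
    # Written only after wait() has returned, so a preempted or killed job
    # never leaves a marker behind.
    (Path(run_dir) / "job_finished").touch()
    return retcode
```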

Idea 2

Create job_finished at the highest-level point, right before switching to the next run (see the sketch below).
This is compatible with the run producing a plan, or failing because of the bounded memory or time, or even an execution failure.
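A minimal sketch of Idea 2, assuming a hypothetical top-level run loop (`execute_run` and the marker name are placeholders, not lab API):

```python
from pathlib import Path


def execute_runs(run_dirs, execute_run):
    """Hypothetical top-level loop over runs, for illustration only."""
    for run_dir in map(Path, run_dirs):
        if (run_dir / "job_finished").exists():
            # Idempotent restart: this run already completed in a previous
            # invocation, so skip it.
            continue
        # The run may "fail" due to time or memory limits; that still counts
        # as done.
        execute_run(run_dir)
        # Last action for this run, right before switching to the next one.
        (run_dir / "job_finished").touch()
```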

@jendrikseipp
Collaborator

Your second proposal sounds like it could work. I'll think more about this after the break.

@hectorpal
Author

Good!

I was wondering about race conditions when using multiple CPU cores.

I guess the iteration over runs is centralized, so there isn't much to coordinate. Otherwise, I was wondering whether a per-run lock on the run directory is necessary or already used. If that is the case, it would interact with the idea I proposed (see the sketch below).
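In case such a lock turns out to be needed, here is a minimal sketch of a per-run-directory lock. Everything here (the ".lock" file name, the flock-based scheme) is an assumption for illustration, not something lab currently does:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_dir_lock(run_dir):
    # One lock file per run directory; flock gives an exclusive advisory
    # lock that is released automatically if the holder dies.
    lock_file = (Path(run_dir) / ".lock").open("w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until exclusive
        yield
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```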

Happy holidays!
