Add cross-night dependencies for cumulative redshift jobs #2321
Conversation
Thanks for this update, which will prevent the most common case of avoidable job failures. Your tests seem to cover most of the gotcha cases, but here is another one that I don't think is working correctly:
When night B is submitted, desi_submit_night knows that ztile T also depends upon the job from night A and correctly adds it to the dependency list. However, it then tries to submit the job with that dependency. Testing empirically, two things can happen:

In desispec.workflow.processing.submit_batch_script we already check for dependencies that have COMPLETED and remove them, but we don't handle the case where a dependency has already failed. I think we have always had this fragility within a night, but it was rare for a night to be submitted and already start running and failing while desi_proc_night was still submitting jobs for later in the night. This will become much more common with cross-night dependencies, since the previous nights may have been submitted long before the current night. In this situation I think we want to get an entry into the processing table so that future resubmits know about it, but otherwise not submit it now, because sbatch will crash.

How I tested this case with tile 6810, observed on both 20240115 (expid 219391) and 20240116 (expid 219523):

When I did that last step immediately after killing the tilenight-20240115-6810 job, sbatch accepted the ztile-6810-thru20240116.slurm job and only later killed it. When I waited a bit before submitting 20240116, sbatch failed on ztile-6810-thru20240116.slurm and desi_proc_night crashed without submitting the later tiles.

I think this PR handles the resubmission of 20240116 correctly if the original submissions of 20240115 and 20240116 completed and then the jobs failed; but I don't think it handles the case of dependent jobs failing mid-stream while other nights are being submitted later for the first time.

Example prod in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/xnight. The submit-*.log files record the submit logs. For the multiple submit-20240116.log files, I removed the 20240116 processing table in between.
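The behavior described above suggests logic along these lines; this is only a minimal sketch of the idea under discussion, and the function and dictionary names are hypothetical rather than the actual desispec.workflow.processing code:

```python
# Sketch: decide what to pass to sbatch given the Slurm states of a job's
# dependencies. COMPLETED dependencies can simply be dropped; a dependency
# that already failed means the job should still get a processing-table
# entry but should not be handed to sbatch, which would reject it.
GOOD_TERMINAL = {'COMPLETED'}
BAD_TERMINAL = {'FAILED', 'CANCELLED', 'TIMEOUT', 'OUT_OF_MEMORY'}

def filter_dependencies(dep_qids, qid_to_state):
    """Return (qids to pass to sbatch, whether it is safe to submit now)."""
    remaining = []
    for qid in dep_qids:
        state = qid_to_state.get(qid, 'PENDING')
        if state in GOOD_TERMINAL:
            continue               # finished successfully; no dependency needed
        if state in BAD_TERMINAL:
            return [], False       # dependency already failed: record, don't submit
        remaining.append(qid)      # still pending or running: keep the dependency
    return remaining, True
```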
See comments about handling the cross-night failure case, and questions inline about cache usage.
Thank you, @sbailey. I have updated the code so that if it fails to submit a job 3 times, it reports the state as UNSUBMITTED and gives it a default QID of 1. This should allow us to handle the earlier night and then resubmit the later job once the earlier one is handled. I ran a random night to make sure I didn't break the standard submission, but I have NOT tested that it solves the original concern. If you are set up to repeat the test, please do so and let me know if further changes are needed. Otherwise I will try to find some time tomorrow to get to it.
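As a rough illustration of that fallback (not the actual desispec implementation; the helper below and its names are hypothetical):

```python
# Sketch: try sbatch a few times; if it keeps failing, record the job as
# UNSUBMITTED with a sentinel QID of 1 rather than crashing the whole
# night submission, so a later resubmission pass can pick it up.
import subprocess

MAX_ATTEMPTS = 3
UNSUBMITTED_QID = 1  # sentinel queue id for jobs never accepted by Slurm

def submit_with_fallback(script, dep_qids):
    for _ in range(MAX_ATTEMPTS):
        cmd = ['sbatch', '--parsable']
        if dep_qids:
            cmd.append('--dependency=afterok:' + ':'.join(str(q) for q in dep_qids))
        cmd.append(script)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            qid = int(result.stdout.strip().split(';')[0])
            return qid, 'SUBMITTED'
    return UNSUBMITTED_QID, 'UNSUBMITTED'
```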
Updates look good, thanks. I re-tested the "night A fails before night B is submitted" case and it properly held the night B job in an UNSUBMITTED state, and trying to resubmit it before the failing dependency was fixed resulted in a DEP_NOT_SUBD state. Once the dependency was fixed, desi_resubmit_queue_failures was able to submit it. I also updated desi_job_graph to special case QID=1 to recognize that as an unsubmitted job.
Overview
This PR adds cross-night dependencies for cumulative redshift jobs if any exposures were acquired on previous days. It also updates the resubmission code to understand INTID's (internal ID's) from previous nights and how to use them to open the relevant processing table and determine whether the dependency succeeded or not.

Implementation details

To improve I/O performance, two new caches of processing table rows were implemented, both in desispec.workflow.proctable. The first, _full_ptab_cache, is a dictionary with nights as keys and entire processing tables (as astropy.table.Table's) as values. The second, _tilenight_ptab_cache, only caches tilenight jobs with LASTSTEP='all' and is one large stacked table of all nights. The caches are not meant to be accessed directly; functions in the same file allow the user to query specific nights or tiles and to load, update, or reset each cache. To parallel this, the previous read_minimal_exposure_tables() was moved to desispec.workflow.exptable and renamed to read_minimal_science_exptab_cols(), which works with the cache _science_etab_cache. Similar functions are available to query, load, reset, and update that cache.

The new dependency code loads all previous processing tables and identifies any previous tilenight jobs for a given tile. If any exist, they are included in the dependencies for the cumulative redshift job.
The resubmission code now checks across nights when a job's dependencies span nights. However, if anything is wrong with a dependency on a previous night, an ERROR is logged and submission of that job is skipped. It does not raise an exception, so other jobs without cross-night issues can still be submitted. We don't resubmit jobs from previous nights because they may depend on other failed jobs, and I didn't want to recurse too far back into another night's processing table. The recommended usage is to resubmit nights chronologically to minimize such issues.
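The "log an ERROR and skip, rather than raise" behavior could look roughly like the sketch below; the helper names and table columns are illustrative, not the actual desispec resubmission code:

```python
# Sketch: before resubmitting a job, verify its cross-night dependencies.
# On a bad dependency, log an ERROR and skip this job so the caller can
# continue with other jobs that have no cross-night problems.
import logging

log = logging.getLogger(__name__)

def maybe_resubmit(job_row, dependency_rows, submit_func):
    """Resubmit job_row only if every dependency from earlier nights succeeded."""
    for dep in dependency_rows:
        if dep['STATUS'] != 'COMPLETED':
            log.error(
                "Dependency INTID=%s from night %s is %s; skipping resubmission "
                "of INTID=%s. Resubmit nights chronologically to avoid this.",
                dep['INTID'], dep['NIGHT'], dep['STATUS'], job_row['INTID'],
            )
            return False
    submit_func(job_row)
    return True
```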
Tests
I have added some unit tests to check whether cross-night dependencies are created when an earlier night contains the same tile. Those are committed and pass in this branch.
I have run multiple tests with both dry-run outputs and real-world tests that included processing data. The current commit has passed all my tests. The prods, with logs and processing tables, can be viewed at:

/global/cfs/cdirs/desi/users/kremin/PRs/crossnight_zcumulative (tests with actual submissions)
/global/cfs/cdirs/desi/users/kremin/PRs/crossnight_zcumulative2 (tests in dry_run_level=1 mode)

Example of tile 25627, which was observed on 4 nights: I submitted all four nights, then manually killed the third night's tilenight job. This caused the cumulative redshift job on the 4th night to be canceled due to a failed dependency, as shown in the processing table:
Attempting to resubmit just the 4th night while the 3rd night was still incomplete led to an error and no resubmission.
After processing the 3rd night successfully, the 4th night was allowed to be resubmitted and succeeded:
There are more such cases in the test prod. All appear to behave as intended.