Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cross-night dependencies for cumulative redshift jobs #2321

Merged
merged 13 commits into from
Aug 16, 2024

Conversation

akremin
Copy link
Member

@akremin akremin commented Aug 15, 2024

Overview

This PR adds cross-night dependencies for cumulative redshift jobs if any exposures were acquired on previous days. It also updates the resubmission code to understand INTID's (internal id's) from previous nights and how to use them to open the relevant processing table and determine whether the dependency succeeded or not.

Implementation details

To improve I/O performance two new caches were implemented to cache processing table rows. Both were put in desispec.workflow.proctable. One, _full_ptab_cache, is a dictionary with keys as nights and values as entire processing tables (as astropy.table.Table's). The second, _tilenight_ptab_cache, only caches tilenight jobs with LASTSTEP='all' and is one large stacked table of all nights. The caches are not meant to be accessed directly, functions are in the same file that allow the user to query specific nights or tiles, and to load each cache, update each cache, or reset each cache. To parallel this, the previous read_minimal_exposure_tables() was moved to desispec.workflow.exptable and renamed to read_minimal_science_exptab_cols() which works with cache _science_etab_cache. Similar functions are available to query, load, reset, and update that cache.

The new dependency code loads all previous processing tables and identifies any previous tilenight jobs for a given tile. If any exist, they are included in the dependencies for the cumulative redshift job.

The resubmissions now check across nights if the dependencies are across nights. However, if anything is wrong with the dependency on a previous night an ERROR log is generated, and submission for that job is skipped. It does not raise an Exception or Python Error such that other jobs can still be submitted if they don't have cross-night issues. We don't resubmit jobs from previous nights because they may depend on other failed jobs and I didn't want to recursively move too far back into another night's processing table. The recommended usage is to resubmit nights chronologically to minimize such issues.

Tests

I have added some unit tests to check whether cross-night dependencies are created when an earlier night contains the same tile. Those are committed and pass in this branch.

I have run multiple tests with both dry-run outputs and real-world tests that included processing data. The current commit has passed all my tests. The prods, with logs and processing tables, can be view at:
/global/cfs/cdirs/desi/users/kremin/PRs/crossnight_zcumulative which tests with actual submissions
/global/cfs/cdirs/desi/users/kremin/PRs/crossnight_zcumulative2 which tests in dry_run_level=1 mode

Example of tile 25627 that was observed on 4 nights. I submitted all four nights, then manually killed the third night's tilenight job. This caused the cumulative redshift job on the 4th night to be canceled due to failed dependency as shown in the processing table:

EXPID,OBSTYPE,TILEID,NIGHT,BADAMPS,LASTSTEP,EXPFLAG,PROCCAMWORD,CALIBRATOR,INTID,OBSDESC,JOBDESC,LATEST_QID,SUBMIT_DATE,STATUS,SCRIPTNAME,INT_DEP_IDS,LATEST_DEP_QID,ALL_QIDS
176243|,science,25627,20230413,,all,|,a0123456789,0,230413001,,tilenight,29404230,1723692304,COMPLETED,,|,|,29404230|
176243|,science,25627,20230413,,all,|,a0123456789,0,230413002,,cumulative,29404232,1723692309,CANCELLED,,230413001|220925001|221002001|230409001|,29404230|29404221|29404225|29404227|,29404232|

Attempting to resubmit just the 4th night while the 3rd night was still incomplete led to an error and no resubmission.
After processing the 3rd night successfully, the 4th night was allowed to be resubmitted and succeeded:

EXPID,OBSTYPE,TILEID,NIGHT,BADAMPS,LASTSTEP,EXPFLAG,PROCCAMWORD,CALIBRATOR,INTID,OBSDESC,JOBDESC,LATEST_QID,SUBMIT_DATE,STATUS,SCRIPTNAME,INT_DEP_IDS,LATEST_DEP_QID,ALL_QIDS
176243|,science,25627,20230413,,all,|,a0123456789,0,230413001,,tilenight,29404230,1723692304,COMPLETED,,|,|,29404230|
176243|,science,25627,20230413,,all,|,a0123456789,0,230413002,,cumulative,29406069,1723698600,SUBMITTED,,230413001|220925001|221002001|230409001|,29404221|29404225|29406039|29404230|,29404232|29406069|

There are more such cases in the test prod. All appear to behave as intended.

@akremin akremin requested a review from sbailey August 15, 2024 17:53
@sbailey
Copy link
Contributor

sbailey commented Aug 15, 2024

Thanks for this update, which will prevent the most common case of avoidable job failures. Your tests seem to cover most of the gotcha cases, but here is another one that I don't think is working correctly:

  • tile T was observed on nights A and B
  • night A is submitted and fails before night B is submitted

When night B is submitted, desi_submit_night knows that ztile T also depends upon the job from night A and correctly adds it to the dependency list. However, it then tries to submit the job with that dependency. Empirically testing, two things could happen:

  • If the tilenight T job on night A failed immediately before the ztile T job on night B is submitted, then slurm doesn't yet really know that the dependency has failed and it accepts the ztile T job, and then later cancels it when it realizes that the dependency on night A failed. This leaves an entry in the night B processing table that desi_resubmit_queue_failures can find and resubmit if the night A dependency has been fixed.
  • However, if the tilenight T job on night A failed several minutes before night B is submitted, then desi_proc_night still tries to submit the ztile T job on night B, sbatch fails to accept it due to the already-failed dependency, and desi_proc_night crashes out without submitting the rest of the tiles in the night.

In desispec.workflow.processing.submit_batch_script we already check for dependencies that have COMPLETED and remove them, but we don't handle the case where a dependency has already failed. I think we always had this fragility within a night, but it was rare that a night would get submitted and already start running and failing while desi_proc_night was still submitting jobs for later in the night. This will become much more common with cross night dependencies when the previous nights may have been submitted long before the current night.

I think in this situation we want to get an entry into the processing table so that future resubmits know about it, but otherwise not submit it now because sbatch will crash.

How I tested this case with tile 6810 observed on both 20240115 expid 219391 and 20240116 expid 219523:

# submit night 20240115
desi_proc_night -n 20240115 |& tee submit-20240115.log

# put the ccdcalib job on hold because I don't want to actually run stuff in the queue
scontrol hold 29433874

# purposefully kill the tilenight-20240115-6810 job
scancel 29433988

# now submit 20240116
desi_proc_night -n 20240116 |& tee submit-20240116.log

When I did that last step immediately after killing the tilenight-20240115-6810 job, sbatch accepted the ztile-6810-thru20240116.slurm job and only later killed it.

When I waited a bit before submitting 20240116, then sbatch failed on ztile-6810-thru20240116.slurm and desi_proc_night crashed without submitting the later tiles.

I think this PR is handling the resubmission of 20240116 correctly if the original submissions of 20240115 and 20240116 completed and then the jobs failed; but I don't think it is handling the case of dependent jobs failing mid-stream while other nights are being submitted later for the first time.

Example prod in /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/xnight. The submit-.log files record the submit logs. For the multiple submit-20240116.log files, I removed the 20240116 processing table in between.

Copy link
Contributor

@sbailey sbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments about handling the cross-night failure case, and questions inline about cache usage.

py/desispec/scripts/proc_night.py Show resolved Hide resolved
py/desispec/scripts/tile_redshifts.py Show resolved Hide resolved
@akremin
Copy link
Member Author

akremin commented Aug 16, 2024

Thank you, @sbailey , I have updated the code so that if it fails to submit a job 3 times it reports the state as UNSUBMITTED and gives it a default QID of 1. This should allow us to handle the earlier night and then resubmit the later job once the earlier one is handled. I ran a random night to make sure I didn't break the standard submission, but I have NOT tested that it solves the original concern. If you are setup to repeat the test please do so and let me know if further changes are needed. Otherwise I will try to find some time tomorrow to get to it.

Copy link
Contributor

@sbailey sbailey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates look good, thanks. I re-tested the "night A fails before night B is submitted" case and it properly held the night B job in an UNSUBMITTED state, and trying to resubmit it before the failing dependency was fixed resulted in a DEP_NOT_SUBD state. Once the dependency was fixed, desi_resubmit_queue_failures was able to submit it. I also updated desi_job_graph to special case QID=1 to recognize that as an unsubmitted job.

@sbailey sbailey merged commit ad71259 into main Aug 16, 2024
26 checks passed
@sbailey sbailey deleted the crossnight_zcumulative branch August 16, 2024 20:30
sbailey pushed a commit that referenced this pull request Aug 16, 2024
@sbailey sbailey mentioned this pull request Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants