The incorrect nightlyflat dependencies reported in this issue were fixed in PR #2348. I will purge the affected nights and resubmit them with the updated code.
Note that this corner case occurred because of a NERSC issue that crashed the job launcher while it was submitting calibrations. When the launcher restarted, it tried to pick up where it left off, but that logic had a bug that hadn't been seen before. This does not occur in daily operations because we wait until all calibrations are available before submitting, so all cals are submitted at once. Similarly, in a production they should all be submitted together.
Kibo had 6 nights where the scronjob submitting the night was killed while the night was still being submitted, leaving a partially submitted night. The resubmission of 4 of those nights ended up with incorrect nightlyflat dependencies: 20210926, 20211005, 20211220, 20220126. Even though all 12 flats had run, the nightlyflat job was given only a subset of them, so it concluded that there were too few flats to make the nightly flat and exited. For example, the 20210926 jobgraph:
And the nightlyflat-20210926-00101851-a0123456789.slurm script has
Note that it lists only 8 expids instead of 12; 101845, 101846, 101847, and 101850 are missing.
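As a rough illustration of the mismatch described above, the sketch below compares the expids a nightlyflat job was given against the flats that actually ran and reports the ones left out. The function name `missing_flat_expids` and the contiguous list of 12 expids are hypothetical and are not taken from the pipeline code or the real 20210926 exposure table; only the count of 12 flats and the four missing expids come from this report.

```python
def missing_flat_expids(ran_expids, job_expids):
    """Return the sorted expids that ran but were not given to the nightlyflat job."""
    return sorted(set(ran_expids) - set(job_expids))


# Hypothetical list of 12 flat expids, for illustration only; the real
# exposure list for 20210926 is not reproduced in this issue.
ran = list(range(101840, 101852))

# The four expids reported missing in this issue.
reported_missing = [101845, 101846, 101847, 101850]

# Simulate the subset that the nightlyflat job was actually given.
given = [expid for expid in ran if expid not in reported_missing]

missing = missing_flat_expids(ran, given)
print(f"{len(given)} of {len(ran)} flats given; missing: {missing}")
```

A check along these lines, run before submitting the nightlyflat job, would flag a partial dependency list instead of letting the job start and exit for lack of flats.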
Nights 20211129 and 20220202 were also resubmitted but appear to be fine; perhaps their nightlyflats had already run at the time of resubmission?