dependency bug for kibo resubmitted nights #2349

Closed
sbailey opened this issue Aug 28, 2024 · 2 comments

sbailey (Contributor) commented Aug 28, 2024

Kibo had 6 nights where the scronjob submitting the night was killed while the night was still being submitted, leaving a partially submitted night. The resubmission of 4 of those nights ended up with incorrect nightlyflat dependencies: 20210926, 20211005, 20211220, 20220126. Even though all 12 flats had run, the nightlyflat job was given only a subset of them, so it concluded that there were too few flats to make the nightly flat and exited. For example, the 20210926 jobgraph:

[image: jobgraph for night 20210926]

And the nightlyflat-20210926-00101851-a0123456789.slurm script has

... desi_proc_joint_fit   --obstype flat --cameras a0123456789 -n 20210926 -e 101851,101852,101855,101856,101857,101860,101861,101862 ...

Note that it lists only 8 expids instead of 12; 101845, 101846, 101847, and 101850 are missing.
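The comparison above can be checked mechanically. The following is a minimal sketch, not part of desispec: `parse_expids` is a hypothetical helper, the command string is the excerpt from the slurm script with its elided parts left out, and the expected set is assumed to be the 8 listed expids plus the 4 reported missing.

```python
import shlex


def parse_expids(command: str) -> set[int]:
    """Return the exposure IDs passed after -e in a command-line string.

    Hypothetical helper for illustration; not a desispec function.
    """
    tokens = shlex.split(command)
    idx = tokens.index("-e")
    return {int(e) for e in tokens[idx + 1].split(",")}


# Excerpt of the command from the nightlyflat slurm script (elided parts omitted).
command = (
    "desi_proc_joint_fit --obstype flat --cameras a0123456789 -n 20210926 "
    "-e 101851,101852,101855,101856,101857,101860,101861,101862"
)

# Assumption: the 12 flats for 20210926 are the 8 listed plus the 4 reported missing.
expected = {
    101845, 101846, 101847, 101850,
    101851, 101852, 101855, 101856, 101857, 101860, 101861, 101862,
}

missing = sorted(expected - parse_expids(command))
print(missing)  # [101845, 101846, 101847, 101850]
```

The same check could be run over the other affected nights by substituting their commands and expected expids.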

Nights 20211129 and 20220202 were also resubmitted but appear to be fine; perhaps their nightlyflats had already run at the time of resubmission?

akremin (Member) commented Aug 28, 2024

This was solved in PR #2348. I will purge these nights and resubmit them with the updated code.

Note that this corner case occurred because of a NERSC issue that crashed the job launcher while it was submitting calibrations. When the launcher restarted, it tried to pick up where it left off, but there was a bug in that logic that hadn't been seen before. This never occurs in daily operations because we wait until all calibrations are available before submitting, so all cals are submitted at once successfully. Similarly, in a production they should all be submitted together.
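As a purely hypothetical illustration of how a resumed submission can end up with a partial dependency list, the sketch below contrasts building the joint-fit dependencies from only the jobs submitted in the current session with merging them with jobs recorded before the crash. It does not reproduce the fix in PR #2348 or any desispec code: the `Job` class, `joint_fit_dependencies`, the queue IDs, and the split of 4 flats before the crash and 8 after are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    expid: int
    jobdesc: str  # job type, e.g. "flat"
    qid: int      # hypothetical queue ID


def joint_fit_dependencies(previous: list[Job], current: list[Job], jobdesc: str) -> list[int]:
    """Collect queue IDs of all jobs of one type from earlier and current sessions."""
    by_expid: dict[int, int] = {}
    for job in previous + current:
        if job.jobdesc == jobdesc:
            by_expid[job.expid] = job.qid  # a later record for the same expid wins
    return [by_expid[expid] for expid in sorted(by_expid)]


# Assumption: 4 flats were submitted before the crash and 8 after the restart.
before_crash = [101845, 101846, 101847, 101850]
after_restart = [101851, 101852, 101855, 101856, 101857, 101860, 101861, 101862]
previous = [Job(e, "flat", 1000 + i) for i, e in enumerate(before_crash)]
current = [Job(e, "flat", 2000 + i) for i, e in enumerate(after_restart)]

only_current = joint_fit_dependencies([], current, "flat")   # partial list
merged = joint_fit_dependencies(previous, current, "flat")   # complete list
print(len(only_current), len(merged))  # 8 12
```

In this sketch, ignoring the pre-crash records yields 8 dependencies, matching the symptom in the slurm script, while merging both sessions yields all 12.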

akremin (Member) commented Aug 28, 2024

Those nights have been purged and resubmitted. As mentioned before, the code was already fixed in PR #2348.

akremin closed this as completed Aug 28, 2024