tweaks for running jura #2242

Merged
akremin merged 1 commit into main from jura-tweaks
May 10, 2024

Contributor

@sbailey sbailey commented May 9, 2024

This PR includes 3 updates to make running Jura just a bit easier:

  • write_traces_in_psf uses an intermediate temporary filename
    • We tripped on this when a job hit an I/O error while writing PSF traces and left behind a 0-length PSF file with the expected name, which tricked later re-submissions into skipping that step.
  • Increase the tilenight job runtime by 5 minutes (10 minutes * 0.5 perlmutter-gpu speed factor).
  • Update queue_info_from_qids to work in batches of 100 qids at a time when calling sacct. I don't know what the upper limit is, but empirically 8 calls of 100 qids each work while a single call with 800 qids does not.
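
The temporary-filename idea in the first bullet can be sketched as follows. This is a minimal illustration, not the actual write_traces_in_psf code: the helper name `write_atomically`, the `.tmp` suffix, and the `write_func` callback are all hypothetical.

```python
import os

def write_atomically(final_path, write_func):
    """Write via a temporary name, then rename onto final_path only on success.

    write_func is a hypothetical callback that takes a path and writes the file.
    """
    tmp_path = final_path + ".tmp"  # hypothetical suffix, assumed for illustration
    try:
        write_func(tmp_path)
        # os.replace is atomic when both paths are on the same filesystem,
        # so final_path either does not exist or is a complete file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file so it cannot be mistaken for real output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, an I/O error mid-write leaves no file under the final name, so a re-submission that checks for the output file will redo the step.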

Details:

write_traces_in_psf tested with

desi_compute_trace_shifts -i /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/preproc/20220104/00116769/preproc-b4-00116769.fits.gz --psf /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/calibnight/20220104/psfnight-b4-20220104.fits --degxx 2 --degxy 0 --continuum --outpsf $SCRATCH/psf.fits

This command also worked before the change, but running it again checks that I didn't introduce typos.

tilenight job runtimes from jura so far:
image
The orange line gives the current job runtime limit, which is why the dots don't exceed that line. I wanted to give a little more time while still keeping the nexp=1 case under the 30 minute debug queue limit (now 26 minutes).
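
The arithmetic behind the new limit, as a small sketch. The 10 minutes, the 0.5 speed factor, the 26 minute result, and the 30 minute debug queue limit come from the text above; the previous 21 minute value is only inferred by subtraction, and the variable names are illustrative.

```python
DEBUG_QUEUE_LIMIT_MIN = 30      # debug queue limit quoted above
SPEED_FACTOR = 0.5              # perlmutter-gpu speed factor quoted above
EXTRA_NOMINAL_MIN = 10          # nominal increase before scaling

# Scaled increase actually added to the job runtime.
increase_min = EXTRA_NOMINAL_MIN * SPEED_FACTOR

new_limit_nexp1_min = 26        # stated new limit for the nexp=1 case
# Inferred (not stated in the PR): the limit before this change.
previous_limit_nexp1_min = new_limit_nexp1_min - increase_min

# The nexp=1 case must still fit in the debug queue.
fits_debug_queue = new_limit_nexp1_min < DEBUG_QUEUE_LIMIT_MIN
```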

queue_info_from_qids also tested with real-life usage parsing jura jobs, e.g. $CFS/desi/users/sjbailey/dev/jura/ccdcalib_runtime.py
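
A rough sketch of the batching approach, assuming a simplified sacct call. This is not the queue_info_from_qids implementation: the function names, the chosen `--format` fields, and the injectable `run` argument are assumptions made for illustration.

```python
import subprocess

def batched(items, size=100):
    """Yield consecutive chunks of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]

def query_sacct_in_batches(qids, batch_size=100, run=subprocess.run):
    """Call sacct once per batch of qids and return all output lines.

    `run` is injectable so the batching can be exercised without Slurm.
    """
    lines = []
    for batch in batched(qids, batch_size):
        cmd = ["sacct", "-X", "--parsable2", "--noheader",
               "--format=JobID,State",  # illustrative field choice
               "-j", ",".join(str(q) for q in batch)]
        result = run(cmd, capture_output=True, text=True, check=True)
        lines.extend(result.stdout.splitlines())
    return lines
```

For 800 qids this issues 8 sacct calls of 100 qids each, matching the case that was observed to work.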

Speaking of ccdcalib runtimes, I also considered increasing those job runtimes since they sometimes time out. However, the current limit of 15 minutes is already pretty far into the tail of the regular distribution, so I left it as is:
image

@akremin after review, I suggest that we merge this, create a new incremental tag, and continue with Jura with this version.

@sbailey sbailey requested a review from akremin May 9, 2024 23:40
Contributor Author

sbailey commented May 10, 2024

I also looked at flat job runtimes since those sometimes time out. Our current limit of 20 minutes is already pretty far out on the tail, so I think it is better to let them occasionally time out than to extend the limit further and waste even more compute time when a job hangs.
image

Member

@akremin akremin left a comment

Changes look good, and thank you for the careful timing analysis. I agree we should leave the flats even if they do occasionally time out.

@akremin akremin merged commit 238a4a0 into main May 10, 2024
26 checks passed
@akremin akremin deleted the jura-tweaks branch May 10, 2024 05:36