
desi_use_reservation for prods #2346

Merged: 5 commits, Aug 27, 2024

Conversation

@sbailey (Contributor) commented Aug 27, 2024

This PR adds a new script, desi_use_reservation, to assist in moving jobs from the regular queue into a batch reservation while running productions. It moves "just enough" jobs to fill the reservation, then pauses so that we don't overfill it, because:

  1. once a job is in a reservation, you can't move it back into the regular queue (you have to move it to a different reservation, or otherwise cancel and resubmit)
  2. reservations are handy for catching up on high priority reruns of specific jobs and we don't want thousands of other jobs backed up in the reservation queue.
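The "just enough" fill step can be sketched roughly as follows. This is a minimal illustration with hypothetical helper names (`nodes_available`, `pick_jobs_to_move`) and a hypothetical job-dict shape, not the script's actual code:

```python
def nodes_available(reservation_nodes, jobs_in_reservation):
    """Nodes in the reservation not yet claimed by jobs already moved into it."""
    used = sum(job["nodes"] for job in jobs_in_reservation)
    return max(reservation_nodes - used, 0)

def pick_jobs_to_move(eligible_jobs, capacity):
    """Greedily pick eligible jobs until the reservation's free nodes are
    filled, then stop so the reservation is not overfilled."""
    picked = []
    for job in eligible_jobs:
        if capacity <= 0:
            break
        if job["nodes"] <= capacity:
            picked.append(job)
            capacity -= job["nodes"]
    return picked
```

With a 30-node reservation already running jobs on 13 nodes (as in the dry-run log below), this would select jobs totaling at most 17 nodes.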

The script auto-derives the size of the reservation, whether it is a CPU or GPU partition, and which regular-queue jobs are eligible to be run. It prioritizes "bottleneck" jobs like ccdcalib, nightlyflat, and psfnight over regular jobs like arc, flat, tilenight, and ztile. It only moves jobs into the reservation if they are not waiting on a dependency, so that we don't fill the reservation backlog with jobs that can't run anyway. As a consequence, the script should be run fairly frequently so that it can pick up jobs that have become newly eligible.
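The prioritization and eligibility rules described above can be sketched like this; the field names (`jobtype`, `state`, `reason`) and the exact priority ordering are assumptions for illustration, not the script's actual internals:

```python
# Bottleneck job types (per the description above) rank ahead of regular ones.
JOB_PRIORITY = ["ccdcalib", "nightlyflat", "psfnight",
                "arc", "flat", "tilenight", "ztile"]

def sort_by_priority(jobs):
    """Order eligible jobs so bottleneck job types are moved first."""
    rank = {name: i for i, name in enumerate(JOB_PRIORITY)}
    return sorted(jobs, key=lambda job: rank.get(job["jobtype"], len(JOB_PRIORITY)))

def is_eligible(job):
    """Skip jobs still blocked on a dependency; they can't run in the
    reservation yet, so moving them would only clog the reservation queue."""
    return job["state"] == "PD" and job.get("reason") != "Dependency"
```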

I have been testing and refining this today with the Kibo run using reservations kibo26_cpu and kibo26_gpu. I'm not done with ideas for additional improvements, but I'll cut myself off from "one more thing" and get this PR out for review. Note: it is safe to run this in a different environment from the production itself, since it is just moving jobs around, not actually submitting jobs that need the right environment.

I updated desispec.workflow.queue.get_jobs_in_queue to include a RESERVATION column. Otherwise the functionality is currently contained inside bin/desi_use_reservation, though pieces are broken out into functions that could be moved into desispec.workflow.queue and desispec.scripts as needed.
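For context, the squeue format string visible in the log output below (`%i,%P,%v,%j,%u,%t,%M,%D,%R`, where `%v` is the reservation) lends itself to a simple CSV parse. A minimal sketch, with assumed column names, of how such output could be turned into per-job records:

```python
import csv
import io

# Column names matching squeue -o "%i,%P,%v,%j,%u,%t,%M,%D,%R"
SQUEUE_FIELDS = ["JOBID", "PARTITION", "RESERVATION", "NAME",
                 "USER", "STATE", "TIME", "NODES", "NODELIST(REASON)"]

def parse_squeue(output):
    """Parse comma-separated squeue output into one dict per job."""
    reader = csv.reader(io.StringIO(output))
    return [dict(zip(SQUEUE_FIELDS, row)) for row in reader]
```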

Example usage

Dry run, run once and exit

Check status of reservation and recommend what to do, but don't actually do anything:

$> desi_use_reservation -r kibo26_cpu --dry-run
INFO:desi_use_reservation:23:get_reservation_info: Getting reservation info with: scontrol show res kibo26_cpu --json
INFO:queue.py:533:get_jobs_in_queue: Querying jobs in queue with: squeue -u desi -o "%i,%P,%v,%j,%u,%t,%M,%D,%R"
INFO:desi_use_reservation:118:use_reservation: At Mon Aug 26 16:57:46 2024, kibo26_cpu (30 nodes) has 7 jobs using 13 nodes
INFO:desi_use_reservation:119:use_reservation: 4 CPU jobs using 6 nodes are eligible to be moved into the reservation
INFO:desi_use_reservation:125:use_reservation: Adding jobs to use 6 additional nodes
INFO:desi_use_reservation:132:use_reservation: Move ccdcalib-20220315-00126226-a0123456789 to kibo26_cpu
INFO:desi_use_reservation:132:use_reservation: Move ccdcalib-20220314-00126112-a0123456789 to kibo26_cpu
INFO:desi_use_reservation:132:use_reservation: Move psfnight-20220312-00125887-a0123456789 to kibo26_cpu
INFO:desi_use_reservation:132:use_reservation: Move psfnight-20220309-00125501-a0123456789 to kibo26_cpu
INFO:desi_use_reservation:144:use_reservation: Dry run mode; will print what to do but not actually run the commands
scontrol update ReservationName=kibo26_cpu JobID=29829831,29829728,29828642,29827788
INFO:desi_use_reservation:199:main: Done checking at Mon Aug 26 16:57:46 2024

Update reservation in a loop

Actually move jobs from the regular queue into the reservation, entering a loop checking every 5 minutes until 2024-08-26T17:30. Include 10 extra nodes worth of jobs so that there is a little buffer of jobs finishing and new ones starting before the next check:

$> desi_use_reservation -r kibo26_cpu --sleep 5 --until 2024-08-26T17:30 -n 10
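The loop behavior of `--sleep`/`--until` amounts to check, sleep, repeat until the deadline. A minimal sketch (the function name `run_until` is hypothetical; the real loop lives in bin/desi_use_reservation):

```python
import time
from datetime import datetime

def run_until(check_once, until, sleep_minutes):
    """Run one reservation check, then sleep and repeat until the deadline
    (an ISO timestamp like 2024-08-26T17:30) has passed."""
    deadline = datetime.fromisoformat(until)
    while True:
        check_once()
        if datetime.now() >= deadline:
            break
        time.sleep(sleep_minutes * 60)
```

Running checks on a cadence like this matters because, as noted above, jobs become newly eligible as their dependencies finish.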

@akremin (Member) left a comment

Thanks for this incredibly useful script. I have added a few comments inline: one records something we discussed in person that should be written down for the future, and two others request very minor changes to make the code more robust. If this merge is time-critical we can proceed without the corrections, since the script will be fine >99% of the time, but the additional robustness would be nice to have.

In bin/desi_use_reservation (resolved):
#- which regular queue partition is eligible for this reservation?
regular_partition = resinfo['partition']

#- Determine CPU vs. GPU reservation

As discussed in person, this should be improved in the future to "future proof" it for later systems. For now it works fine on Perlmutter and we don't yet know how things may change for NERSC 10, so this is fine to leave as-is for now.
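For reference, the kind of Perlmutter-specific shortcut being discussed could look like the sketch below. The function name, the `resinfo` field, and the substring test are all assumptions for illustration; the actual detection logic is in bin/desi_use_reservation:

```python
def reservation_uses_gpu(resinfo):
    """Guess CPU vs. GPU from the partition name. On Perlmutter the GPU
    partition names contain 'gpu'; this is exactly the sort of shortcut
    that would need future-proofing for later systems like NERSC 10."""
    return "gpu" in resinfo["partition"].lower()
```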

In bin/desi_use_reservation (outdated, resolved):
@sbailey (Contributor, Author) commented Aug 27, 2024

@akremin thanks for the comments. I addressed your two comments that needed updates; please re-review.

In the meantime I also added logic to try to prevent a major imbalance of pending tilenight/ztile vs. flat jobs. In the end I'm not actually sure that helps our situation today, where a backlog of GPU jobs is preventing us from submitting more CPU jobs, because either way we have to run those GPU jobs (flat or tilenight or ztile). Thoughts?

@akremin (Member) commented Aug 27, 2024

I see the additions for job balancing but I don't see either of the requested changes. Is it possible you forgot to push?

As we discussed earlier, the load balancing isn't a bad thing, since it allows us to complete earlier nights before moving on to processing flats + science exposures on later nights. But all GPU jobs need to run at some point for all nights to be complete, so it won't change the total time to finish the production. From a human perspective, though, it is nice to complete nights before moving on, in a depth-first strategy, so I like this change.

I might even advocate for 10x instead of 20x. We have 12 flats in a night and ~10-30 tilenights, so 20x is a major imbalance.

@sbailey (Contributor, Author) commented Aug 27, 2024

Missing changes pushed. I left the imbalance checking code, but also left it at only correcting the imbalance if it gets way out of whack with a factor of 20. The original reason of wanting to get the flats processed sooner is that they unblock the nightlyflat, which then unblocks the science tiles making them eligible to run in the regular queue even without a reservation. Running an equivalent number of tilenight/ztile jobs doesn't unblock as much.
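The factor-of-20 guard described above amounts to intervening only when pending science jobs vastly outnumber pending flats. A minimal sketch (the function name and threshold handling are hypothetical; the real check is in the script):

```python
def needs_rebalance(n_science, n_flat, factor=20):
    """Flag a major imbalance of pending tilenight/ztile jobs vs. flat jobs.
    Only intervene when science jobs outnumber flats by more than `factor`,
    since moderate imbalance is harmless and flats unblock downstream work."""
    if n_flat == 0:
        return n_science > 0
    return n_science > factor * n_flat
```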

@akremin (Member) left a comment

Thanks, the changes look good and my requested changes have been implemented.

@akremin merged commit 9af08d6 into main on Aug 27, 2024
26 checks passed
@akremin deleted the use_reservation branch on August 27, 2024 at 23:19