Updated Slurm Launcher #741

Open
wants to merge 8 commits into
base: v1.0

Conversation

MattToast
Member

Implement a Slurm Launcher for fine-grained status tracking of Slurm jobs (so far just srun, but it can be expanded to include sbatch) with the V1 API.

@MattToast MattToast added the type: feature and area: launcher labels Oct 10, 2024
@MattToast MattToast self-assigned this Oct 10, 2024
Member Author

@MattToast MattToast left a comment

One small question for reviewers about a departure from the original impl: neither it nor this new one seems "correct" in my opinion.

Comment on lines +327 to +342
@pytest.mark.xfail(reason=r"Slurm launcher cannot parse `CANCELLED by \d+` syntax")
@requires_slurm
@requires_alloc_size(1)
def test_srun_sleep_for_two_min_with_cancel(make_srun_command):
    launcher = SlurmLauncher()
    srun = make_srun_command(
        ["-N", "1", "-n", "1"], ["sleep", "120"], use_current_alloc=True
    )
    id_ = launcher.start(srun)
    time.sleep(1)
    assert launcher.get_status(id_)[id_] == JobStatus.RUNNING
    launcher.stop_jobs(id_)
    time.sleep(1)
    assert launcher.get_status(id_)[id_] == JobStatus.CANCELLED

Member Author

Question for reviewers: Is this something we want to be able to handle? SmartSim V0.8 also fails this test. Its impl sets the status to JobStatus.FAILED, and this impl currently sets the status to JobStatus.UNKNOWN.

If so, should this be handled in this ticket or in a follow-on?

Contributor

Do we care about the distinction between unknown and cancelled? If not, ignore it.

If yes, new ticket. I wouldn't add more to your existing ticket since it's technically a breaking change and non-critical.

Member Author

Sounds like a plan! I think we care about the difference (or at least we cared enough to put it in the original API), so I'll add a note to my "tickets to write for slurm launcher" list!
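
For illustration, one way the `CANCELLED by <uid>` state could be normalized is to match on the leading state keyword instead of the full string. This is only a minimal sketch, not the parser in this PR: the `JobStatus` enum is a local stand-in, and `status_from_slurm_state` and its mapping table are hypothetical names.

```python
import enum
import re


class JobStatus(enum.Enum):
    # Local stand-in for SmartSim's JobStatus; only the members needed here.
    RUNNING = "Running"
    COMPLETED = "Completed"
    CANCELLED = "Cancelled"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


# Hypothetical mapping from Slurm state keywords to statuses (not exhaustive).
_STATE_TO_STATUS = {
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "CANCELLED": JobStatus.CANCELLED,
    "FAILED": JobStatus.FAILED,
}

# Leading keyword, optionally followed by " by <uid>" or a "+" truncation mark.
_STATE_RE = re.compile(r"^\s*([A-Z_]+)(?:\s+by\s+\d+|\+)?\s*$")


def status_from_slurm_state(state: str) -> JobStatus:
    """Map a raw Slurm state string to a status, tolerating `CANCELLED by <uid>`."""
    match = _STATE_RE.match(state)
    if match is None:
        return JobStatus.UNKNOWN
    # Keywords missing from the table fall back to UNKNOWN rather than raising.
    return _STATE_TO_STATUS.get(match.group(1), JobStatus.UNKNOWN)
```

With something like this, a cancelled step would map to `JobStatus.CANCELLED` whether Slurm prints the bare keyword or appends the cancelling user's ID.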

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (v1.0@879b96e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| smartsim/_core/launcher/launcher.py | 40.00% | 3 Missing ⚠️ |
| smartsim/entity/dbnode.py | 66.66% | 1 Missing ⚠️ |
| smartsim/launchable/mpmd_job.py | 0.00% | 1 Missing ⚠️ |
| smartsim/launchable/mpmd_pair.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             v1.0     #741   +/-   ##
=======================================
  Coverage        ?   59.42%           
=======================================
  Files           ?      142           
  Lines           ?     9146           
  Branches        ?        0           
=======================================
  Hits            ?     5435           
  Misses          ?     3711           
  Partials        ?        0           

| Files with missing lines | Coverage Δ |
|---|---|
| smartsim/_core/launcher_/shell/shell_launcher.py | 95.06% <ø> (ø) |
| smartsim/_core/utils/helpers.py | 56.42% <100.00%> (ø) |
| smartsim/settings/arguments/launch/local.py | 87.50% <100.00%> (ø) |
| smartsim/settings/arguments/launch/mpi.py | 95.34% <100.00%> (ø) |
| smartsim/entity/dbnode.py | 36.29% <66.66%> (ø) |
| smartsim/launchable/mpmd_job.py | 87.75% <0.00%> (ø) |
| smartsim/launchable/mpmd_pair.py | 80.00% <0.00%> (ø) |
| smartsim/_core/launcher/launcher.py | 32.07% <40.00%> (ø) |

@MattToast MattToast force-pushed the updated-slurm-launcher branch from 34334fa to 2679e30 Compare October 10, 2024 20:31
Collaborator

@al-rigazzi al-rigazzi left a comment

Just a few comments, otherwise looks great! LGTM

pyproject.toml (outdated, resolved)
@@ -121,7 +121,10 @@ def parse_sstat_nodes(output: str, job_id: str) -> t.List[str]:
return list(set(nodes))


def parse_step_id_from_sacct(output: str, step_name: str) -> t.Optional[str]:
StepID = t.NewType("StepID", str)

Collaborator

Are there plans on making this something more structured?

Member Author

@MattToast MattToast Oct 17, 2024

I would LOVE to, but seeing as we are kinda shifting focus away from using launchers directly and more toward dragon, I wasn't planning on tackling it immediately. The NewType here is just so that the type checker could help me keep my sanity, lol.

I could definitely wrap the two other tickets requested into a "revamp slurm utilities" style ticket for when we have more time to work on this!!
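
To illustrate what the `NewType` buys: a type checker treats `StepID` as distinct from `str`, so a plain string cannot be passed where a step ID is expected, while at runtime the value is still an ordinary `str`. A minimal sketch follows; `make_step_id` and `describe_step` are hypothetical helpers, not functions from this PR.

```python
import typing as t

StepID = t.NewType("StepID", str)


def make_step_id(job_id: str, step_index: int) -> StepID:
    # Hypothetical helper: the one place a raw str is promoted to a StepID.
    return StepID(f"{job_id}.{step_index}")


def describe_step(step_id: StepID) -> str:
    # A type checker such as mypy flags describe_step("12345.0") because the
    # argument is a plain str, even though it would run without error.
    return f"step {step_id}"


step = make_step_id("12345", 0)
message = describe_step(step)
```

There is no runtime wrapper class involved, so the distinction costs nothing at execution time; it only exists for the type checker.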

step_id = None
while step_id is None and trials > 0:
    # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    # TODO: Really don't like the implied sacct format and required

Collaborator

Do we have a ticket for this?

Member Author

Not yet, but I will write one!
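
For readers following along, the `while step_id is None and trials > 0` loop quoted in this thread is a bounded retry: the step may not be visible to the scheduler's accounting immediately after launch, so the lookup is repeated a limited number of times. A rough sketch of that pattern, assuming a hypothetical `lookup` callable that stands in for the query-and-parse step (the `delay` parameter is also an assumption, not something shown in the diff):

```python
import time
import typing as t


def poll_for_step_id(
    lookup: t.Callable[[], t.Optional[str]],
    trials: int = 5,
    delay: float = 0.0,
) -> t.Optional[str]:
    """Call `lookup` until it returns a step ID or `trials` attempts are used."""
    step_id = None
    while step_id is None and trials > 0:
        step_id = lookup()
        trials -= 1
        # Only wait if another attempt is actually going to happen.
        if step_id is None and trials > 0:
            time.sleep(delay)
    return step_id
```

Returning `None` after the last trial leaves it to the caller to decide whether a missing step ID is an error.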

"""
ids = (id_,) + ids
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# TODO: Really don't like the implied sacct format and required

Collaborator

Do we have a ticket for this?

Member Author

Adding to my list of tickets to write!
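
One possible direction for the follow-up ticket, sketched under assumptions: request the `sacct` columns explicitly and in a delimited form, so the parser no longer depends on the default layout. The flags used (`--noheader`, `--parsable2`, `--format`, `--jobs`) are standard `sacct` options; `SACCT_FIELDS`, `build_sacct_command`, and `parse_sacct_rows` are hypothetical names and not part of this PR.

```python
import typing as t

# Fields requested explicitly so the parser does not rely on sacct defaults.
SACCT_FIELDS = ("JobID", "JobName", "State")


def build_sacct_command(job_id: str) -> t.List[str]:
    """Build a sacct invocation with an explicit, pipe-delimited format."""
    return [
        "sacct",
        "--noheader",
        "--parsable2",
        f"--format={','.join(SACCT_FIELDS)}",
        "--jobs",
        job_id,
    ]


def parse_sacct_rows(output: str) -> t.List[t.Dict[str, str]]:
    """Turn `--parsable2` output into one dict per row, keyed by field name."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        values = line.split("|")
        if len(values) != len(SACCT_FIELDS):
            continue  # Skip lines that do not match the requested fields.
        rows.append(dict(zip(SACCT_FIELDS, values)))
    return rows
```

Keeping the field list in one constant means the command and the parser cannot drift apart.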

" canceled.\n"
"SmartSim will consider it canceled.\n"
)
job_info.status_override = JobStatus.CANCELLED

Contributor

I wonder if this could be a recursive structure (like an exception)? Does this keep the API fixed (e.g. you always ask for job_info.status but have the root cause info if you really need it)? At first glance, having status_override be public feels like it forces a client to look at both.
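
A rough sketch of what such a recursive structure might look like, loosely modeled on exception chaining. Everything here is hypothetical (`StatusRecord`, `JobInfo.override`, `JobInfo.history`, and the stand-in `JobStatus` enum); it is not the `job_info` type in this PR.

```python
import dataclasses
import enum
import typing as t


class JobStatus(enum.Enum):
    # Local stand-in for SmartSim's JobStatus; only the members needed here.
    RUNNING = "Running"
    CANCELLED = "Cancelled"


@dataclasses.dataclass(frozen=True)
class StatusRecord:
    status: JobStatus
    reason: str
    # The record this one replaced, similar to an exception's __cause__.
    cause: t.Optional["StatusRecord"] = None


@dataclasses.dataclass
class JobInfo:
    _record: StatusRecord

    @property
    def status(self) -> JobStatus:
        # Clients only ever read this; any override is already applied.
        return self._record.status

    def override(self, status: JobStatus, reason: str) -> None:
        # Chain the previous record instead of exposing a second public field.
        self._record = StatusRecord(status, reason, cause=self._record)

    def history(self) -> t.Iterator[StatusRecord]:
        # Walk from the current record back to the originally reported one.
        record: t.Optional[StatusRecord] = self._record
        while record is not None:
            yield record
            record = record.cause
```

In this shape a client reads `status` in the common case and walks `history()` only when the root cause matters.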

@MattToast MattToast changed the base branch from smartsim-refactor to v1.0 October 29, 2024 16:39
@MattToast MattToast force-pushed the updated-slurm-launcher branch from 61da33c to 98aaba7 Compare October 29, 2024 16:43
Labels
area: launcher, ignore-for-release, type: feature
3 participants