-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated Slurm Launcher #741
base: v1.0
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small question for reviewers in a departure from the original impl, and neither it nor this new one seem "correct" in my opinion.
@pytest.mark.xfail(reason=r"Slurm launcher cannout parse `CANCELLED by \d+` syntax") | ||
@requires_slurm | ||
@requires_alloc_size(1) | ||
def test_srun_sleep_for_two_min_with_cancel(make_srun_command): | ||
launcher = SlurmLauncher() | ||
srun = make_srun_command( | ||
["-N", "1", "-n", "1"], ["sleep", "120"], use_current_alloc=True | ||
) | ||
id_ = launcher.start(srun) | ||
time.sleep(1) | ||
assert launcher.get_status(id_)[id_] == JobStatus.RUNNING | ||
launcher.stop_jobs(id_) | ||
time.sleep(1) | ||
assert launcher.get_status(id_)[id_] == JobStatus.CANCELLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for reviewers: Is this something we want to be able to handle? SmartSim V0.8 also fails this test. It's impl will set the status to JobStatus.FAILED
and this impl currently sets the status to JobStatus.UNKOWN
.
If so, should this be handled in this ticket or a follow on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about the distinction between unknown and cancelled? If not, ignore it.
If yes, new ticket. I wouldn't add more to your existing ticket since it's technically a breaking change and non-critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a plan! I think we care about the difference (or at least we cared enough to put it in the original API), so I'll add a note to my "tickets to write for slurm launcher" list!
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## v1.0 #741 +/- ##
=======================================
Coverage ? 59.42%
=======================================
Files ? 142
Lines ? 9146
Branches ? 0
=======================================
Hits ? 5435
Misses ? 3711
Partials ? 0
|
34334fa
to
2679e30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments, otherwise looks great! LGTM
@@ -121,7 +121,10 @@ def parse_sstat_nodes(output: str, job_id: str) -> t.List[str]: | |||
return list(set(nodes)) | |||
|
|||
|
|||
def parse_step_id_from_sacct(output: str, step_name: str) -> t.Optional[str]: | |||
StepID = t.NewType("StepID", str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there plans on making this something more structured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would LOVE to, but seeing as we are kinda shifting focus away from using launchers directly and more toward dragon, I wasn't planning on tackling it immediately. The NewType
here is just so that the type checker could help me keep my sanity, lol.
I could definitely wrap the two other tickets requested into a "revamp slurm utilities" style ticket for when we have more time to work on this!!
step_id = None | ||
while step_id is None and trials > 0: | ||
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> | ||
# TODO: Really don't like the implied sacct format and required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a ticket for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but I will write one!
""" | ||
ids = (id_,) + ids | ||
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> | ||
# TODO: Really don't like the implied sacct format and required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a ticket for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to my list of tickets to write!
" canceled.\n" | ||
"SmartSim will consider it canceled.\n" | ||
) | ||
job_info.status_override = JobStatus.CANCELLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this could be a recursive structure (like an exception)? Does this keep the API fixed (e.g. you always ask for job_info.status
but have the root cause info if you really need it. At first glance, having status_override be public feels like it forces a client to look at both.
61da33c
to
98aaba7
Compare
Implement a Slurm Launcher for fine grain status tracking of Slurm jobs (so far just
srun
, but can be expanded to includesbatch
) with the V1 API.