fix: logfile quoting and scancel error handling #140

Merged · 6 commits · Sep 7, 2024
Changes from 2 commits
9 changes: 8 additions & 1 deletion snakemake_executor_plugin_slurm/__init__.py
@@ -137,7 +137,7 @@ def run_job(self, job: JobExecutorInterface):
 f"sbatch "
 f"--parsable "
 f"--job-name {self.run_uuid} "
-f"--output {slurm_logfile} "
+f"--output '{slurm_logfile}' "
 f"--export=ALL "
 f"--comment {comment_str}"
 )
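
For context on why the quoting matters, here is a minimal standalone sketch (not the plugin's code) that builds the same command with shlex.quote, which escapes spaces and shell metacharacters in the log path; the path and UUID values are made up:

import shlex

# Made-up stand-ins for the plugin's self.run_uuid and slurm_logfile.
run_uuid = "a1b2c3d4"
slurm_logfile = "/home/user/my project/.snakemake/slurm_logs/rule_x/%j.log"

# shlex.quote() escapes whitespace and shell metacharacters, which is what the
# hard-coded single quotes around {slurm_logfile} protect against in this PR.
call = (
    f"sbatch "
    f"--parsable "
    f"--job-name {run_uuid} "
    f"--output {shlex.quote(slurm_logfile)} "
    f"--export=ALL"
)
print(call)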
@@ -408,6 +408,13 @@ def cancel_jobs(self, active_jobs: List[SubmittedJobInfo]):
 )
 except subprocess.TimeoutExpired:
     self.logger.warning("Unable to cancel jobs within a minute.")
+except subprocess.CalledProcessError as e:
+    msg = e.stderr.decode().strip()
+    if msg:
+        msg = f": {msg}"
+    raise WorkflowError(
+        f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
+    )
Contributor
Enhance exception handling by using raise ... from err.

The new exception handling block improves error reporting. However, it is recommended to use raise ... from err so that the deliberately raised exception is distinguished from an error that occurs during exception handling.

Apply this diff to enhance exception handling:

-raise WorkflowError(
-    f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
-)
+raise WorkflowError(
+    f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
+) from e

Suggested change
-except subprocess.CalledProcessError as e:
-    msg = e.stderr.decode().strip()
-    if msg:
-        msg = f": {msg}"
-    raise WorkflowError(
-        f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
-    )
+except subprocess.CalledProcessError as e:
+    msg = e.stderr.decode().strip()
+    if msg:
+        msg = f": {msg}"
+    raise WorkflowError(
+        f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
+    ) from e
Tools: Ruff

415-417: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
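
For illustration, a minimal, self-contained sketch of what raise ... from e changes: WorkflowError here is a local stand-in class rather than snakemake's import, and the subprocess call is a dummy command that simply exits non-zero:

import subprocess
import sys

class WorkflowError(Exception):
    """Stand-in for snakemake's WorkflowError; only here to keep the sketch self-contained."""

try:
    # Deliberately run a command that exits with a non-zero status.
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(8)"],
        check=True,
        capture_output=True,
    )
except subprocess.CalledProcessError as e:
    # 'from e' records the CalledProcessError as __cause__, so the traceback
    # reads "The above exception was the direct cause of the following exception"
    # instead of the implicit "During handling of the above exception, ...".
    raise WorkflowError(f"command failed (exit code {e.returncode})") from e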

Contributor Author
I am unsure about the exit codes of SLURM's scancel. Do you know more? This here would fail with a WorkflowError for anything except 0. The docs say nothing about exit codes, and I recently received an exit code of 8 from scancel. Gemini says the following, without being able to give me a source for it:

0: The job was successfully canceled.
1: The job could not be found.
2: The job is already in the completed state.
3: The job is already in the canceled state.
4: The job is already in the suspended state.
5: The job is already in the pending state.
6: The job is already running.
7: The job is already in the requeued state.
8: The job was successfully canceled.
9: The job was successfully suspended.
10: The job was successfully requeued.
11: The job was successfully resumed.
12: The job was successfully held.
13: The job was successfully released.
14: The job was successfully signaled.
15: The job was successfully terminated.
16: The job was successfully restarted.
17: The job was successfully updated.
18: The job was successfully requeued to a different partition.
19: The job was successfully requeued to a different queue.
20: The job was successfully requeued to a different priority.
21: The job was successfully requeued to a different account.
22: The job was successfully requeued to a different user.
23: The job was successfully requeued to a different group.
24: The job was successfully requeued to a different association.
25: The job was successfully requeued to a different node.
26: The job was successfully requeued to a different nodelist.
27: The job was successfully requeued to a different feature.
28: The job was successfully requeued to a different constraint.
29: The job was successfully requeued to a different time.
30: The job was successfully requeued to a different deadline.
31: The job was successfully requeued to a different dependency.
32: The job was successfully requeued to a different requeue priority.
33: The job was successfully requeued to a different requeue time.
34: The job was successfully requeued to a different requeue deadline.
35: The job was successfully requeued to a different requeue dependency.
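
To illustrate the concern above: if certain scancel exit codes turn out to be benign, one hypothetical way to tolerate them would be an allow-list check before raising. The codes below are purely illustrative, since scancel's exit codes are not documented, and RuntimeError stands in for WorkflowError to keep the snippet self-contained.

import subprocess

# Purely illustrative: scancel's exit codes are undocumented, so this set
# would have to be established empirically (e.g. 8 was observed in practice).
BENIGN_SCANCEL_CODES = {0, 8}

def cancel_with_tolerance(scancel_cmd: list[str]) -> None:
    # Run scancel without check=True so we can inspect the exit code ourselves.
    result = subprocess.run(scancel_cmd, capture_output=True, text=True)
    if result.returncode not in BENIGN_SCANCEL_CODES:
        msg = result.stderr.strip()
        raise RuntimeError(
            f"scancel failed (exit code {result.returncode})"
            + (f": {msg}" if msg else "")
        )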

async def job_stati(self, command):
"""Obtain SLURM job status of all submitted jobs with sacct