fix: logfile quoting and scancel error handling #140

Merged
johanneskoester merged 6 commits into main from fix/logfile-quoting on Sep 7, 2024

Conversation

johanneskoester
Contributor

@johanneskoester johanneskoester commented Aug 27, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced handling of output paths in SLURM job commands to better accommodate spaces and special characters.
    • Improved error reporting for job cancellations, providing clearer feedback on failures.
  • Bug Fixes

    • Added robust exception handling for job cancellation processes to capture and report errors effectively.

Contributor

coderabbitai bot commented Aug 27, 2024

Walkthrough

The changes involve modifications to the run_job and cancel_jobs methods in the SLURM executor plugin. The run_job method now encloses the slurm_logfile variable in single quotes to handle file paths correctly. Additionally, the cancel_jobs method includes new exception handling for subprocess.CalledProcessError, improving error reporting during job cancellation.
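To illustrate why the quoting matters, here is a minimal sketch rather than the plugin's actual code. The log path and the command string are hypothetical, and `shlex.quote` from the standard library is shown only as an alternative to the hand-written single quotes used in this PR.

```python
import shlex

# Hypothetical log path containing a space; not taken from the plugin.
slurm_logfile = "/tmp/my project/logs/%j.log"

# Unquoted: a POSIX shell would split the path into two words.
unquoted = f"sbatch --output {slurm_logfile}"
# Single-quoted by hand, which is the approach this PR takes for slurm_logfile.
single_quoted = f"sbatch --output '{slurm_logfile}'"
# shlex.quote additionally copes with paths that themselves contain a single quote.
shlex_quoted = f"sbatch --output {shlex.quote(slurm_logfile)}"

# shlex.split applies POSIX shell word-splitting rules, so it shows what
# the shell would hand over as separate arguments.
unquoted_args = shlex.split(unquoted)
single_quoted_args = shlex.split(single_quoted)
shlex_quoted_args = shlex.split(shlex_quoted)

print(unquoted_args)
print(single_quoted_args)
print(shlex_quoted_args)
```

Only the quoted variants keep the path together as one argument. Hand-written single quotes would still break on a path that contains a single quote itself, which is the case `shlex.quote` handles.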

Changes

| Files | Change Summary |
| --- | --- |
| snakemake_executor_plugin_slurm/__init__.py | run_job: Enclosed slurm_logfile in single quotes. cancel_jobs: Added exception handling for subprocess.CalledProcessError. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SLURMExecutor
    participant Subprocess

    User->>SLURMExecutor: run_job()
    SLURMExecutor->>Subprocess: Execute SLURM command with 'slurm_logfile'
    Subprocess-->>SLURMExecutor: Return success or error
    SLURMExecutor-->>User: Return job status

    User->>SLURMExecutor: cancel_jobs()
    SLURMExecutor->>Subprocess: Execute scancel command
    alt Error Occurred
        Subprocess-->>SLURMExecutor: Raise CalledProcessError
        SLURMExecutor-->>User: Return error message with exit code
    else Success
        Subprocess-->>SLURMExecutor: Return success
        SLURMExecutor-->>User: Confirm cancellation
    end

Poem

🐇 In the land of code, where jobs do run,
A tweak to the logs brings joy and fun.
Errors now caught, with messages clear,
SLURM's dance grows smoother, let’s all cheer!
Hops of success, let the workflows flow,
With every change, our spirits grow! 🎉


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c414f65 and a0cccb4.

Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (2 hunks)
Additional context used
Ruff
snakemake_executor_plugin_slurm/__init__.py

415-417: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Additional comments not posted (1)
snakemake_executor_plugin_slurm/__init__.py (1)

140-140: LGTM!

The change to enclose slurm_logfile in single quotes ensures that file paths with spaces or special characters are correctly handled when passed to the SLURM command.

The code changes are approved.

Comment on lines 411 to 417
except subprocess.CalledProcessError as e:
    msg = e.stderr.decode().strip()
    if msg:
        msg = f": {msg}"
    raise WorkflowError(
        f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
    )
Contributor

Enhance exception handling by using raise ... from err.

The new exception handling block improves error reporting. However, it is recommended to use raise ... from err to distinguish exceptions from errors in exception handling.

Apply this diff to enhance exception handling:

-raise WorkflowError(
-    f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
-)
+raise WorkflowError(
+    f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
+) from e
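The suggested pattern can be tried outside the plugin with the following sketch. It makes two assumptions: `WorkflowError` is redefined locally as a stand-in for Snakemake's exception class, and a small Python child process that exits with status 8 stands in for a failing `scancel` call.

```python
import subprocess
import sys


class WorkflowError(Exception):
    """Stand-in for Snakemake's WorkflowError so the sketch runs on its own."""


def cancel(cmd):
    try:
        # check=True raises CalledProcessError on any non-zero exit status.
        subprocess.run(cmd, check=True, capture_output=True)
    except subprocess.CalledProcessError as e:
        msg = e.stderr.decode().strip()
        if msg:
            msg = f": {msg}"
        # "from e" records the original error as the explicit cause.
        raise WorkflowError(
            f"Unable to cancel jobs with scancel (exit code {e.returncode}){msg}"
        ) from e


# Hypothetical stand-in for scancel: writes to stderr and exits with status 8.
fake_scancel = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('boom'); sys.exit(8)",
]

try:
    cancel(fake_scancel)
except WorkflowError as err:
    caught = err

print(caught)
print(type(caught.__cause__).__name__)
```

The original `CalledProcessError` stays reachable through `__cause__`, so the exit code and stderr can still be inspected by whoever catches the `WorkflowError`.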

Contributor Author

I am unsure about the exit codes of slurm scancel. Do you know more? This would fail with a WorkflowError for any exit code other than 0. The docs say nothing about exit codes. I recently received an exit code of 8 from scancel. And Gemini says the following, without being able to give me a source for it:

0: The job was successfully canceled.
1: The job could not be found.
2: The job is already in the completed state.
3: The job is already in the canceled state.
4: The job is already in the suspended state.
5: The job is already in the pending state.
6: The job is already running.
7: The job is already in the requeued state.
8: The job was successfully canceled.
9: The job was successfully suspended.
10: The job was successfully requeued.
11: The job was successfully resumed.
12: The job was successfully held.
13: The job was successfully released.
14: The job was successfully signaled.
15: The job was successfully terminated.
16: The job was successfully restarted.
17: The job was successfully updated.
18: The job was successfully requeued to a different partition.
19: The job was successfully requeued to a different queue.
20: The job was successfully requeued to a different priority.
21: The job was successfully requeued to a different account.
22: The job was successfully requeued to a different user.
23: The job was successfully requeued to a different group.
24: The job was successfully requeued to a different association.
25: The job was successfully requeued to a different node.
26: The job was successfully requeued to a different nodelist.
27: The job was successfully requeued to a different feature.
28: The job was successfully requeued to a different constraint.
29: The job was successfully requeued to a different time.
30: The job was successfully requeued to a different deadline.
31: The job was successfully requeued to a different dependency.
32: The job was successfully requeued to a different requeue priority.
33: The job was successfully requeued to a different requeue time.
34: The job was successfully requeued to a different requeue deadline.
35: The job was successfully requeued to a different requeue dependency.

Contributor Author

@cmeesters
Member

Will take a look in the late afternoon - right now I have meeting after meeting.

@cmeesters
Member

@johanneskoester I tried to read the source code. Didn't help. Consider this:

$ scancel 
scancel: error: No job identification provided
$ echo $?
1
$ sacct -j 16161523 -o state -X
     State 
---------- 
 COMPLETED 
$ scancel  16161523
$  echo $?
0

According to your list, the last command ought to have returned 2. Also, your list states that exit codes 8 and 0 mean the same thing!

Some of the listed codes do not make sense at all: scancel is there to cancel (obviously) or signal jobs (steps). A job return code cannot indicate that it was requeued, and scancel only returns its own exit codes (see the main function after line 106) and distinguishes internally between its exit code and job codes. So, if anything, these codes refer to job exit codes, which (except for general ones) are software-specific.

BTW, I like your PR; might have a look into #136 (the feature works for me, but the tests fail because of an apparent None type?).
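The shell session above reads the exit status with `echo $?`. The same observation can be made from Python, sketched here under one assumption: a small Python child process stands in for `scancel` called without arguments, so that the snippet runs on machines without SLURM.

```python
import subprocess
import sys

# Hypothetical stand-in for `scancel` without arguments: it writes an error
# to stderr and exits with status 1, mirroring the first command above.
stand_in = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stderr.write('scancel: error: No job identification provided'); "
    "sys.exit(1)",
]

# Without check=True, subprocess.run returns the result instead of raising,
# so the exit status can be inspected the way `echo $?` does in the shell.
result = subprocess.run(stand_in, capture_output=True, text=True)
exit_code = result.returncode
stderr_text = result.stderr.strip()

print(exit_code, stderr_text)
```

Reporting the raw exit status together with stderr, without attaching a meaning to individual codes, matches the thread's finding that no documented mapping for scancel exit codes was found.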

@cmeesters
Member

cmeesters commented Aug 28, 2024

PS: black is OK with the length of line 489, but the CI is not. That is an issue in itself, don't you think?

After the post about the signals: if we drop the error code from the message (given its questionable purpose), the line gets shorter.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a0cccb4 and d48c726.

Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (2 hunks)
Additional context used
Ruff
snakemake_executor_plugin_slurm/__init__.py

415-417: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Additional comments not posted (1)
snakemake_executor_plugin_slurm/__init__.py (1)

140-140: LGTM!

The change to enclose slurm_logfile in single quotes ensures that file paths with spaces or special characters are correctly handled.

@johanneskoester
Contributor Author

PS: black is OK with the length of line 489, but the CI is not. That is an issue in itself, don't you think?

After the post about the signals: if we drop the error code from the message (given its questionable purpose), the line gets shorter.

Thanks for checking this (I really only stupidly pasted the AI output on the error codes; this again shows how useless it can be ATM). Still, since I observed error code 8, let us keep it in the exception message for now; maybe it is useful for some people.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d48c726 and 1ede6fa.

Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (2 hunks)
Additional context used
Ruff
snakemake_executor_plugin_slurm/__init__.py

415-418: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Additional comments not posted (1)
snakemake_executor_plugin_slurm/__init__.py (1)

140-140: Approved change to logfile quoting.

The change to enclose the slurm_logfile variable in single quotes is a good practice to ensure that paths with spaces or special characters are correctly handled. This is crucial for maintaining the integrity of file paths in a UNIX environment.

introduced "raise Error from exception" as suggested by coderabbit - while not necessary, it can make future error handling a bit more traceable.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

415-418: LGTM! Nitpick: Consider removing from e.

Using from e is a good practice to distinguish the exception from errors in exception handling. However, in this specific case, it may not be necessary as there are no additional exception handling steps. Consider removing it for simplicity.

-raise WorkflowError(
-    "Unable to cancel jobs with scancel "
-    f"(exit code {e.returncode}){msg}"
-) from e
+raise WorkflowError(
+    "Unable to cancel jobs with scancel "
+    f"(exit code {e.returncode}){msg}"
+)
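For context on what removing `from e` would change: Python records the original exception in `__context__` whenever a new exception is raised inside an `except` block, while `from e` additionally sets `__cause__`. The sketch below shows the difference; `WorkflowError` is again a local stand-in and the `ValueError` is a hypothetical original failure.

```python
class WorkflowError(Exception):
    """Stand-in for Snakemake's WorkflowError so the sketch is self-contained."""


def reraise(explicit):
    try:
        raise ValueError("original failure")
    except ValueError as e:
        if explicit:
            # Explicit chaining: sets __cause__ (and __suppress_context__).
            raise WorkflowError("wrapped") from e
        # Implicit chaining: only __context__ is set.
        raise WorkflowError("wrapped")


def catch(explicit):
    try:
        reraise(explicit)
    except WorkflowError as err:
        return err


with_from = catch(True)
without_from = catch(False)

print(repr(with_from.__cause__), repr(with_from.__context__))
print(repr(without_from.__cause__), repr(without_from.__context__))
```

In both cases the original error remains attached; the visible difference is the traceback wording, which reads "The above exception was the direct cause of the following exception" with `from e` and "During handling of the above exception, another exception occurred" without it.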
Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1ede6fa and 6f0e3e1.

Files selected for processing (1)
  • snakemake_executor_plugin_slurm/__init__.py (2 hunks)
Additional comments not posted (2)
snakemake_executor_plugin_slurm/__init__.py (2)

140-140: LGTM!

Enclosing the slurm_logfile in single quotes is a good practice to handle file paths with spaces or special characters correctly.


411-418: Improved error handling for job cancellation.

The new exception handling block for subprocess.CalledProcessError enhances error reporting when canceling jobs with scancel. It captures the exit code and standard error output to provide more context about the failure. This improves the robustness of the job cancellation process.

@cmeesters
Member

Gnarf, now the tests fail because of missing test files. Wonderful. Not something I will check on a Sunday morning, though. Tomorrow I will have lectures till noon; only then might I have time to investigate.

@johanneskoester
Contributor Author

Should be fixed now.

@johanneskoester johanneskoester merged commit cb5d656 into main Sep 7, 2024
4 checks passed
@johanneskoester johanneskoester deleted the fix/logfile-quoting branch September 7, 2024 17:58
johanneskoester pushed a commit that referenced this pull request Sep 11, 2024
🤖 I have created a release *beep* *boop*
---


## [0.10.1](v0.10.0...v0.10.1) (2024-09-07)


### Bug Fixes

* logfile quoting and scancel error handling ([#140](#140)) ([cb5d656](cb5d656))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>