Crash at "start job" step #2

Open
sebherbert opened this issue Oct 21, 2021 · 2 comments
@sebherbert

Hello, I'm trying to follow your guide but failed with an error at the start step.
My local machine runs Linux Mint 20 and I'm trying to use our university's cluster. Up until the start point everything goes according to your documentation, but when I click on "start job" (with the user.ijm macro), this error appears:
crash_at_job_start.txt
As far as I understand, the batch partition is invalid, so I've been trying to find where it is set but couldn't figure it out. I've reached out to our IT department, but they think the job script is not created correctly and fails, so they can't see anything on the SLURM side since the job is never accepted.

Could you help me figure out how I could debug the job submission?
Please let me know if I can add something more.

Thanks!
Sebastien

velissarious self-assigned this Oct 22, 2021
velissarious (Collaborator) commented Oct 22, 2021

Thank you for your feedback!

You need to provide the correct partition name for your cluster. This must be specified in the SLURM Workflow Manager partition text field in the Node Configuration section of the Create job dialog.

The batch partition is provided by HPC Workflow Manager only as a sensible default, as it is the most commonly used default partition name for SLURM.
A partition called batch may not exist on your cluster; the name of your cluster's default partition may be different.
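
For context, assuming the generated job script is submitted with sbatch (an assumption on my part), SLURM rejects the job up front when the requested partition does not exist, which matches the "invalid partition" error in your log. The exact script HPC Workflow Manager generates will differ, but a minimal SLURM submission script with an explicit partition looks roughly like this (a sketch with placeholder values, not the tool's actual output):

#!/bin/bash
# Request a partition that actually exists on your cluster;
# "batch" here is only a placeholder default.
#SBATCH --partition=batch
#SBATCH --job-name=hpc-wm-test
#SBATCH --ntasks=1

# Trivial payload just to confirm the job starts.
srun hostname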

How to find the name of the default partition

If you can connect to the cluster using a terminal, you can run the command:
sinfo --all
On my test system I get the following results:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*       up  infinite     1  idle osboxes

The suffix "*" identifies the default partition. Copy this name, without the asterisk (batch in this case), to the SLURM Workflow Manager partition text field of the New job dialog.
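
If it is more convenient, the following one-liner should print just the default partition name, relying on sinfo marking the default with a trailing asterisk as shown above (-h suppresses the header, -o "%P" prints only the partition column):

sinfo -h -o "%P" | grep '\*' | tr -d '*'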

Alternatively, you may need to look for the default partition in the documentation of your cluster.

Unfortunately, HPC Workflow Manager does not detect the cluster's actual default partition name; it must be set by the user. You may also want or need to use a different partition for different jobs, depending on your needs.
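
If you do end up choosing a non-default partition, you can inspect a candidate's limits first, for example with (replace <partition_name> with one of the names listed by sinfo):

scontrol show partition <partition_name>

This prints the partition's time limit, node list and related settings, which should help you pick a suitable one for a given job.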

[Screenshot: New job dialog with the SLURM partition text field highlighted (NewJobHighlightedPartitionSLURM)]

@sebherbert (Author)

Thanks for the fast and detailed answer!
That is indeed better, but the job then fails.
I have this error message in the console:
FijiConsole_failedJob.txt
I checked manually, and the .scijava-parallel folder indeed doesn't exist. Maybe our cluster structure is different? Is there another place I could look for it or copy it from?

If it's of any use, there is a bash script created on the cluster side, but I can't attach it here (unsupported file type).

Thanks!
