Crash at "start job" step #2

Open
sebherbert opened this issue Oct 21, 2021 · 2 comments
@sebherbert

Hello, I'm trying to follow your guide but failed with an error at the start step.
My local machine runs Linux Mint 20 and I'm trying to use our university's cluster. Up until the start point everything goes according to your documentation, but when I click on "start job" (with the user.ijm macro), this error appears:
crash_at_job_start.txt
As far as I understand, the batch partition is invalid, so I've been trying to find where it is set but couldn't figure it out. I've reached out to our IT department, but they think the job script is not created correctly and fails, so they can't see anything on the SLURM side since the job is never accepted.

Could you help me figure out how I could debug the job submission?
Please let me know if I can add something more.

Thanks!
Sebastien

velissarious self-assigned this Oct 22, 2021
velissarious (Collaborator) commented Oct 22, 2021

Thank you for your feedback!

You need to provide the correct partition name for your cluster. This must be specified in the SLURM Workflow Manager partition text field in the Node Configuration section of the Create job dialog.

The batch partition is provided by HPC Workflow Manager only as a sensible default, as it is the most commonly used default partition name for SLURM.
A partition called batch may not exist on your cluster; the name of your cluster's default partition may be different.
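
For context, assuming the generated job script is submitted with sbatch (an assumption on my part), SLURM rejects the job up front when the requested partition does not exist, which matches the "invalid partition" error in your log. The exact script HPC Workflow Manager generates will differ, but a minimal SLURM submission script with an explicit partition looks roughly like this (a sketch with placeholder values, not the tool's actual output):

#!/bin/bash
# Request a partition that actually exists on your cluster;
# "batch" here is only a placeholder default.
#SBATCH --partition=batch
#SBATCH --job-name=hpc-wm-test
#SBATCH --ntasks=1

# Trivial payload just to confirm the job starts.
srun hostname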

How to find the name of the default partition

If you can connect to the cluster using a terminal, you can run the command:
sinfo --all
On my test system I get the following results:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*       up  infinite     1  idle osboxes

The suffix "*" identifies the default partition. Copy this name, without the asterisk (batch in this case), to the SLURM Workflow Manager partition text field of the New job dialog.
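
If it is more convenient, the following one-liner should print just the default partition name, relying on sinfo marking the default with a trailing asterisk as shown above (-h suppresses the header, -o "%P" prints only the partition column):

sinfo -h -o "%P" | grep '\*' | tr -d '*'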

Alternatively, you may need to look for the default partition in the documentation of your cluster.

Unfortunately, HPC Workflow Manager does not detect the cluster's actual default partition name; it must be set by the user. You may also want or need to use a different partition for different jobs, depending on your needs.
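
If you do end up choosing a non-default partition, you can inspect a candidate's limits first, for example with (replace <partition_name> with one of the names listed by sinfo):

scontrol show partition <partition_name>

This prints the partition's time limit, node list and related settings, which should help you pick a suitable one for a given job.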

[Screenshot: New job dialog with the SLURM partition text field highlighted (NewJobHighlightedPartitionSLURM)]

@sebherbert (Author)

Thanks for the fast and detailed answer!
That is indeed better, but the job then fails.
I have this error message in the console:
FijiConsole_failedJob.txt
I checked manually, and the .scijava-parallel folder indeed doesn't exist. Maybe our cluster structure is different? Is there another place I could look for it or copy it from?

If it's of any use, there is a bash script created on the cluster side, but I can't attach it here (unsupported file type).

Thanks!
