# SLURM

### Config

Show the start and elapsed time of jobs in `sacct` output:

```bash
export SACCT_FORMAT="jobid,jobname,partition,account,alloccpus,state,start,elapsed,exitcode"
```
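
To make this the default for every session, the export can be appended to the shell startup file; a minimal sketch, assuming bash and `~/.bashrc`:

```bash
# persist the format, then show only the allocation line of each job (-X)
echo 'export SACCT_FORMAT="jobid,jobname,partition,account,alloccpus,state,start,elapsed,exitcode"' >> ~/.bashrc
source ~/.bashrc
sacct -X
```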

### Parallelization

#### Parallelize multi-threaded jobs

For a multi-threaded job _example.sh_:

```bash
#!/bin/bash

seq 1 100 | xargs -P 20 -I {} echo "No. {}"
```

Suppose we have a cluster with 40 cores per node and we want to run 4 copies of _example.sh_ in parallel on SLURM:

```bash
#!/bin/bash

#SBATCH -p partition
#SBATCH -N 2 -n 4 -c 20

# feed one dummy index per task; without this, parallel would read the batch
# script's stdin (/dev/null by default) and launch no job steps at all
seq $SLURM_NTASKS | parallel -j $SLURM_NTASKS '
task={}   # consume the index so parallel does not append it to the command below
srun -N 1 -n 1 -c 20 bash example.sh
'
```

The above script may run only one job step at a time, due to the default configuration of the cluster. This can be solved by specifying additional parameters:

```bash
#!/bin/bash

#SBATCH -p partition
#SBATCH -N 2 -n 4 -c 20
#SBATCH --mem-per-cpu 4G

# limit the memory and specify --exclusive so that no single job step consumes
# all of the allocated resources
seq $SLURM_NTASKS | parallel -j $SLURM_NTASKS '
task={}   # consume the index so parallel does not append it to the command below
srun -N 1 -n 1 --exclusive --mem-per-cpu 4G -c 20 bash example.sh
'
```
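
To confirm that the job steps actually run in parallel, the step list can be inspected while the job is running; a minimal sketch (replace `<jobid>` with the ID printed by `sbatch`):

```bash
# one row per job step, using the SACCT_FORMAT from the Config section above
sacct -j <jobid>
```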

#### Notes for workflow tools

Workflow tools (e.g., Snakemake) are not suitable for parallelizing multiple job steps within one SLURM job, because the workflow itself cannot usefully be submitted with sbatch. For example, if we run:

```bash
#!/bin/bash

#SBATCH -p partition -N 2 -n 4 -c 20

snakemake -s workflow.smk
```

SLURM will allocate the snakemake process to a single node, and all jobs within workflow.smk will run on that node. Such workflow tools should instead submit each job to SLURM themselves through an sbatch wrapper, as sketched below.
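
A minimal sketch of that submission mode, assuming Snakemake 7.x (newer releases replace `--cluster` with executor plugins) and the partition/resource values used above:

```bash
# run snakemake on the login node (or in a lightweight job); each rule is then
# submitted as its own sbatch job, with at most 4 jobs in flight
snakemake -s workflow.smk --jobs 4 \
    --cluster "sbatch -p partition -N 1 -n 1 -c 20 --mem-per-cpu 4G"
```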

### Conditional job submission

- Create a flag file for each finished job and test the flag when submitting jobs:

  run.sh:

  ```bash
  COMMAND TO RUN
  [[ $? -eq 0 ]] && touch finish/JOBNAME.done
  ```

  submit.sh:

  ```bash
  [[ ! -e finish/JOBNAME.done ]] && srun -N 1 -n 1 bash run.sh
  ```
- Label both finished and failed jobs:

  run.sh:

  ```bash
  COMMAND TO RUN
  if [[ $? -eq 0 ]]; then
      # clear the error flag before creating the finished flag; if this test were
      # the last command and no error file existed, the script would exit with code 1
      [[ -e "error/JOBNAME.error" ]] && mv "error/JOBNAME.error" "error/JOBNAME.error_fixed"
      touch "finish/JOBNAME.done"
  else
      touch "error/JOBNAME.error"
  fi
  ```

  submit.sh:

  ```bash
  [[ ! -e finish/JOBNAME.done ]] && srun -N 1 -n 1 bash run.sh
  ```
- Submit a job when two files differ:

  ```bash
  ! cmp -s FILE1 FILE2 && srun -N 1 -n 1 bash run.sh
  ```
- Update an analysis: submit a job when the input file is more recent than the output file:

  ```bash
  [[ INPUT -nt OUTPUT ]] && srun -N 1 -n 1 bash run.sh
  ```
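
These checks can also be combined into a small helper in submit.sh; a minimal sketch (the sample names, file paths, and the `submit_if_needed` helper are illustrative, not part of the notes above):

```bash
#!/bin/bash
# submit a sample only if it has no .done flag or its input is newer than its output
submit_if_needed () {
    local name=$1 input=$2 output=$3
    if [[ ! -e "finish/${name}.done" || "$input" -nt "$output" ]]; then
        srun -N 1 -n 1 bash run.sh "$name" &
    fi
}

submit_if_needed sampleA input/sampleA.txt output/sampleA.txt
submit_if_needed sampleB input/sampleB.txt output/sampleB.txt
wait   # wait for all submitted job steps to finish
```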

### Template

The input files of each sample are saved in a subfolder under the input path. Here, we pass the sample folder, the input path, and the output path to test.sh, which creates a ${sample}.done flag to label finished samples. The log file of each sample is saved in the log path. (A sketch of test.sh follows the script below.)

```bash
#!/bin/bash

#SBATCH -p cpu
#SBATCH -N 10
#SBATCH -n 40
#SBATCH -c 10

inpath=/example/input/
outpath=/example/output/
finish=/example/finish/
logpath=/example/log/

[[ ! -d $logpath ]] && mkdir -p $logpath
[[ ! -d $outpath ]] && mkdir -p $outpath
[[ ! -d $finish ]] && mkdir -p $finish

# :::: - reads the sample folders from stdin (the ls output) as {1};
# the additional ::: sources pass the paths into the single-quoted command as {2}-{4}
ls $inpath | parallel -j $SLURM_NTASKS '
sample={1}
inpath={2}
outpath={3}
logpath={4}
sample=$(basename $sample)
if [[ ! -e /example/finish/${sample}.done ]]; then
    srun -N 1 -n 1 -c 10 -o ${logpath}${sample}.log bash test.sh $sample $inpath $outpath
fi
' :::: - ::: $inpath ::: $outpath ::: $logpath
```
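
test.sh itself is not shown in these notes; a minimal sketch of what it might look like, given the arguments passed above (the analysis command `some_analysis` is a placeholder, not a real tool):

```bash
#!/bin/bash
# test.sh -- process one sample and mark it as finished
sample=$1
inpath=$2
outpath=$3

# placeholder for the real multi-threaded analysis of this sample
some_analysis --threads 10 --input "${inpath}${sample}" --output "${outpath}${sample}"

# create the flag only if the analysis above succeeded
[[ $? -eq 0 ]] && touch "/example/finish/${sample}.done"
```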

### VS Code

#### VS Code remote to Jupyter Notebook on SLURM

This is based on Running Jupyter on Slurm GPU Nodes (stanford.edu) and the VS Code docs.

In VS Code, SSH to the server and start a Jupyter notebook on a compute node:

```bash
# request a compute node on the cluster
srun -p cpu -N 1 -n 40 --pty bash
# start the Jupyter notebook
jupyter notebook --ip 0.0.0.0 --port 8888
```

You will get output like:

```
[I 15:37:41.813 NotebookApp] Jupyter Notebook 6.4.0 is running at:
[I 15:37:41.813 NotebookApp] http://NODENAME:8888/?token=xxx
[I 15:37:41.813 NotebookApp]  or http://127.0.0.1:8888/?token=xxx
[I 15:37:41.813 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

Press Ctrl+Shift+P and run "Jupyter: Create Interactive Window". Select the kernel from an existing Jupyter server and use the URL http://NODENAME:8888/?token=xxx (not the 127.0.0.1 one) to connect to the server.
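
If keeping the interactive srun session open is inconvenient, the notebook can also be started from a batch job and the URL read from its log file; a minimal sketch (partition name, resources, and the log file name are assumptions):

```bash
#!/bin/bash
#SBATCH -p cpu
#SBATCH -N 1 -n 1 -c 8
#SBATCH -o jupyter-%j.log

# the URL with the node name and token will appear in jupyter-<jobid>.log
jupyter notebook --no-browser --ip 0.0.0.0 --port 8888
```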