
Local Cluster

Our cluster (using Grid Engine) is made up of our workstations: about 15 nodes and roughly 160 cores.

Submitting Jobs

  • qsub is your friend. Use it like so:

    qsub -V -cwd <command>     # where <command> is your command line to run

    There are many, many options for qsub; check the man page for details.

  • A more friendly way of submitting jobs is available from the SGE-extras module:

     sge_batch <command>

sge_batch is a tool for submitting jobs through qsub, the job submission command for the Sun Grid Engine (SGE) scheduler. It lives in the SGE-extras module.

module load SGE-extras/1.0

In the simplest case (each job uses less than 4 GB of RAM and you don't care about logs):

sge_batch some_command -l option1 -o option2

Logging

sge_batch can combine STDOUT and STDERR logs with the -k flag. Specify the output with the -o flag.

    sge_batch -k -o somelog.log some_command -l option1 -o option2
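
If you are submitting with plain qsub instead, a rough equivalent is to merge the two streams and name the output file yourself. This is a sketch using standard qsub options (-j y merges STDERR into STDOUT, -o names the log), not necessarily what sge_batch does internally:

    # sketch: standard qsub options; somelog.log is just an example name
    qsub -V -cwd -j y -o somelog.log some_command -l option1 -o option2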

Requesting Resources

To keep your super important science from freezing everyone else's computers, we've set a 4 GB per-job memory limit on the queue. This means that unless you specifically ask for more than 4 GB of RAM, your job will be killed automatically if it exceeds that limit.

You can overcome this with the -l flag and some options (-l is used to pass resource requests through to qsub; see -l in the qsub man page). The options to set are mem_free, h_vmem, and virtual_free. As an example, let's say we want 12 GB of RAM for a particular job. The correct command would be:

sge_batch -l mem_free=12G,h_vmem=12G,virtual_free=12G some_command -l option1 -o option2
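
The same resource requests can be passed directly to qsub if you are not using sge_batch. A minimal sketch, assuming your job is wrapped in a script called myjob.sh (a placeholder name):

    # mem_free/h_vmem/virtual_free are the resource names mentioned above;
    # myjob.sh is a placeholder for your own job script
    qsub -V -cwd -l mem_free=12G,h_vmem=12G,virtual_free=12G myjob.sh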

Managing your jobs

  • See the status of your jobs: qstat
  • See detailed info about your job (including error messages): qstat -j <jobid>
  • See an updating display of job status: watch qstat
  • Delete a job: qdel 193295 to delete the job with ID 193295 (or qdel '*' to delete all jobs)
  • Clear a job's error state (so that it is re-run): qmod -c <jobid>
  • Look at the following log file on srv-sge for error messages: /var/spool/gridengine/qmaster/messages.
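
For example, if you have SSH access to srv-sge (an assumption), you can peek at the end of that log without logging in interactively:

    # assumes SSH access to srv-sge; the path is the qmaster log mentioned above
    ssh srv-sge tail -n 50 /var/spool/gridengine/qmaster/messages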

Job State Codes

  • qw: pending
  • hqw: pending (system/user+system hold)
  • hRwq: pending, user/system hold, re-queue
  • r: running
  • t: transferring
  • Rr: running, re-submit
  • Rt: transferring, re-submit
  • s, ts: job suspended
  • S, tS: queue suspended
  • T, tT: queue suspended by alarm
  • Rs, Rts, RS, RtS, RT, RtT: all suspended with re-submit
  • Eqw, Ehqw, EhRqw: all pending states with error
  • dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT: all running and suspended states with deletion
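
A common case is a job stuck in Eqw (pending with an error). A sketch of how to deal with it, reusing the commands above (the job ID 193295 is just an example):

    qstat                # the state column shows Eqw for the stuck job
    qstat -j 193295      # read the error reason for that job
    qmod -c 193295       # clear the error state so the job is rescheduled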

Disabling/enabling computers from the queue

If you need your workstation for local, intensive processing, you can remove it from the queue so that jobs are not scheduled on it:

$ qmod -d main.q@computername

And enable it like so:

$ qmod -e main.q@computername
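
To confirm the change took effect, list the queue instances; a disabled instance shows a d in its state column. A sketch, filtering on the workstation's hostname:

    # "computername" is a placeholder for the workstation's hostname
    qstat -f | grep computername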

Submitting many jobs at once

  • The boring way: use a for-loop to submit many individual jobs. e.g.:

    for i in {0..100}; do 
       qsub ./process_subject.sh $i
    done
  • The better way: Use https://github.com/pipitone/qbatch, which is also found in the SGE-extras module, to submit your jobs:

    1. Create a file with all of your commands in it, one per line:

      for i in {0..100}; do 
         echo process_subject $i       # echo the command
      done > commands.txt              # save all commands in a file
    2. Use qbatch to submit those jobs in one go:

      qbatch commands.txt

See the https://github.com/pipitone/qbatch homepage for more info on how to use it. It works on the local cluster as well as on SciNet and the SCC.
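
As a variation, here is a hedged sketch that builds the commands file from subject directories instead of an index range; the data/subject_* layout and the process_subject command are placeholders:

    # placeholders: data/subject_* directories, process_subject command
    for subj in data/subject_*; do
       echo process_subject "$(basename "$subj")"
    done > commands.txt
    qbatch commands.txt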

Submitting jobs that use more than one process

When using qsub, sge_batch, or sge_submit_array, use the -pe option, like so:

qsub -V -cwd -pe simple 8 <command>

In the example above, we've requested 8 cores per job.
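
Inside the job, Grid Engine exports the number of granted slots as $NSLOTS, so your script can match its thread count to the request. A minimal sketch (the tool name and its --threads flag are hypothetical):

    #!/bin/bash
    # submit with: qsub -V -cwd -pe simple 8 this_script.sh
    # NSLOTS is set by Grid Engine to the number of slots granted via -pe
    my_multithreaded_tool --threads "${NSLOTS:-1}" input_file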

Other Clusters

We have access to some external clusters you might be interested in using:

  • The CAMH Specialised Computing Centre (SCC): a cluster of roughly 20 nodes and ~400 cores.
  • U of T's SciNet: a massive cluster of upwards of 3000 nodes, each with 16 cores. There are also a GPU cluster, a high-memory cluster, and others. For more info, see the SciNet wiki.

CAMH Specialised Computing Centre (SCC)

For more info and to get a user account, visit: http://info2.camh.net/scc/index.php/Users_Guide

To access the SCC from the Research Imaging Centre network:

  • login via: login.scc.camh.net
  • file transfer via: ftp.scc.camh.net
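
For example, a sketch of logging in and transferring data (your SCC username is a placeholder, and this assumes the transfer host speaks SFTP; if it is a plain FTP server, use an FTP client instead):

    ssh your_scc_username@login.scc.camh.net     # interactive login
    sftp your_scc_username@ftp.scc.camh.net      # file transfer (assumes SFTP)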