
Local Cluster

Our cluster (using Grid Engine) is made up of our workstations: about 15 nodes and roughly 160 cores.

Submitting Jobs

  • qsub is your friend. Use it like so:

    qsub -V -cwd <command>     # where <command> is your command line to run

    There are many, many options for qsub; check the man page for details.

  • A more friendly way of submitting jobs is available from the SGE-extras module:

     sge_batch <command>

sge_batch is a tool for submitting jobs through qsub, the job submission command for the Sun Grid Engine (SGE) scheduler. It lives in the SGE-extras module.

module load SGE-extras/1.0

In the simplest case (each job uses less than 4 GB of RAM and you don't care about logs):

sge_batch some_command -l option1 -o option2

Logging

sge_batch can combine STDOUT and STDERR logs with the -k flag. Specify the output with the -o flag.

    sge_batch -k -o somelog.log some_command -l option1 -o option2
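
If you are submitting with plain qsub instead, a rough equivalent is to merge the two streams and name the output file yourself. This is a sketch using standard qsub options (-j y merges STDERR into STDOUT, -o names the log), not necessarily what sge_batch does internally:

    # sketch: standard qsub options; somelog.log is just an example name
    qsub -V -cwd -j y -o somelog.log some_command -l option1 -o option2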

Requesting Resources

To keep your super important science from freezing everyone else's computers, we've set a 4 GB per-job memory limit on the queue. This means that unless you specifically ask for more than 4 GB of RAM, your job will be killed automatically if it exceeds that limit.

You can overcome this with the -l flag and some options (-l is used to pass resource requests through to qsub; see -l in the qsub man page). The options to set are mem_free, h_vmem, and virtual_free. As an example, let's say we want 12 GB of RAM for a particular job. The correct command would be:

sge_batch -l mem_free=12G,h_vmem=12G,virtual_free=12G some_command -l option1 -o option2
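
The same resource requests can be passed directly to qsub if you are not using sge_batch. A minimal sketch, assuming your job is wrapped in a script called myjob.sh (a placeholder name):

    # mem_free/h_vmem/virtual_free are the resource names mentioned above;
    # myjob.sh is a placeholder for your own job script
    qsub -V -cwd -l mem_free=12G,h_vmem=12G,virtual_free=12G myjob.sh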

Managing your jobs

  • See the status of your jobs: qstat
  • See detailed info about your job (including error messages): qstat -j <jobid>
  • See an updating display of job status: watch qstat
  • Delete a job: qdel 193295 to delete the job with ID 193295 (or qdel '*' to delete all jobs)
  • Clear a job's error state (so that it is re-run): qmod -c <jobid>
  • Look at the following log file on srv-sge for error messages: /var/spool/gridengine/qmaster/messages.
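
For example, if you have SSH access to srv-sge (an assumption), you can peek at the end of that log without logging in interactively:

    # assumes SSH access to srv-sge; the path is the qmaster log mentioned above
    ssh srv-sge tail -n 50 /var/spool/gridengine/qmaster/messages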

Job State Codes

  • qw: pending
  • hqw: pending (system/user+system hold)
  • hRwq: pending, user/system hold, re-queue
  • r: running
  • t: transferring
  • Rr: running, re-submit
  • Rt: transferring, re-submit
  • s, ts: job suspended
  • S, tS: queue suspended
  • T, tT: queue suspended by alarm
  • Rs, Rts, RS, RtS, RT, RtT: all suspended with re-submit
  • Eqw, Ehqw, EhRqw: all pending states with error
  • dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT: all running and suspended states with deletion
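
A common case is a job stuck in Eqw (pending with an error). A sketch of how to deal with it, reusing the commands above (the job ID 193295 is just an example):

    qstat                # the state column shows Eqw for the stuck job
    qstat -j 193295      # read the error reason for that job
    qmod -c 193295       # clear the error state so the job is rescheduled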

Disabling/enabling computers from the queue

If you need your workstation for local, intensive processing, you can remove it from the queue so that jobs are not scheduled on it:

$ qmod -d main.q@computername

And enable it like so:

$ qmod -e main.q@computername
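
To confirm the change took effect, list the queue instances; a disabled instance shows a d in its state column. A sketch, filtering on the workstation's hostname:

    # "computername" is a placeholder for the workstation's hostname
    qstat -f | grep computername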

Submitting many jobs at once

  • The boring way: use a for-loop to submit many individual jobs. e.g.:

    for i in {0..100}; do 
       qsub ./process_subject.sh $i
    done
  • The better way: Use https://github.com/pipitone/qbatch, which is also found in the SGE-extras module, to submit your jobs:

    1. Create a file with all of your commands in it, one per line:

      for i in {0..100}; do 
         echo process_subject $i       # echo the command
      done > commands.txt              # save all commands in a file
    2. Use qbatch to submit those jobs in one go:

      qbatch commands.txt

See the https://github.com/pipitone/qbatch homepage for more info on how to use it. It works on the local cluster as well as on SciNet and the SCC.
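
As a variation, here is a hedged sketch that builds the commands file from subject directories instead of an index range; the data/subject_* layout and the process_subject command are placeholders:

    # placeholders: data/subject_* directories, process_subject command
    for subj in data/subject_*; do
       echo process_subject "$(basename "$subj")"
    done > commands.txt
    qbatch commands.txt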

Submitting jobs that use more than one process

When using qsub, sge_batch, or sge_submit_array, use the -pe option, like so:

qsub -V -cwd -pe simple 8 <command>

In the example above, we've requested 8 cores per job.
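
Inside the job, Grid Engine exports the number of granted slots as $NSLOTS, so your script can match its thread count to the request. A minimal sketch (the tool name and its --threads flag are hypothetical):

    #!/bin/bash
    # submit with: qsub -V -cwd -pe simple 8 this_script.sh
    # NSLOTS is set by Grid Engine to the number of slots granted via -pe
    my_multithreaded_tool --threads "${NSLOTS:-1}" input_file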

Other Clusters

We have access to some external clusters you might be interested in using:

  • The CAMH Specialised Computing Centre (SCC): a cluster of roughly 20 nodes and ~400 cores.
  • U of T's SciNet: a massive cluster of upwards of 3000 nodes, each with 16 cores. There are also a GPU cluster, a high-memory cluster, and others. For more info, see the SciNet wiki.

CAMH Specialised Computing Centre (SCC)

For more info and to get a user account, visit: http://info2.camh.net/scc/index.php/Users_Guide

To access the SCC from the Research Imaging Centre network:

  • login via: login.scc.camh.net
  • file transfer via: ftp.scc.camh.net
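
For example, a sketch of logging in and transferring data (your SCC username is a placeholder, and this assumes the transfer host speaks SFTP; if it is a plain FTP server, use an FTP client instead):

    ssh your_scc_username@login.scc.camh.net     # interactive login
    sftp your_scc_username@ftp.scc.camh.net      # file transfer (assumes SFTP)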