
Up to date guide of running genome aligner CACTUS (last edit - 01/09/2020)




DNAZoo's goal is to create one alignment containing all DNAZoo genomes plus all mammals on NCBI (excluding those from the 200 mammals project)

  • there are ~120 mammals assembled by DNAZoo, ~150 genomes overall
  • there are 458 mammals on the NCBI DB (note that there will be overlap; where a species is in both, we will use the DNAZoo genome)

-> annotations to form tree can be found here:

Starting Point (for testing) - DNA Zoo's 9 primates

  • Macaca fuscata

  • Allenopithecus nigroviridis

  • Mandrillus sphinx

  • Saimiri boliviensis

  • Cebuella pygmaea

  • Propithecus coquereli

  • Microcebus murinus

  • Eulemur flavifrons

  • Eulemur mongoz


review tree here:

Pairwise alignment start:


Eulemur flavifrons (2.1G):

Eulemur mongoz (2.5G):

PAWSEY notes

-Zeus 96GB and 128GB nodes:

-Magnus 64GB nodes:

-Topaz 192 GB nodes:

  • Zeus benefits over Nimbus due to higher RAM at expense of walltime

  • Toil does not work with Slurm - limits to using 1 node on Zeus or Magnus (Toil is developed for cloud)

  • GPU acceleration has been introduced in the most recent versions of cactus - there are GPUs on the Topaz cluster - the issue is that CPU-only code cannot be run on Topaz

For monitoring jobs:

    export SACCT_FORMAT=jobid%-20,jobname,partition,user,account,submit,start,end,elapsed,nnodes,ncpus,reqmem,maxrss,maxvmsize,state,exitcode,nodelist%10
    sacct -a -u ashling_charles -S 2020-08-01

-MaxRSS and MaxVMSize give an idea of maximum RAM usage
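To compare the reported MaxRSS against the node sizes listed above, it can help to convert the value to gigabytes. This is a minimal sketch, assuming sacct prints the value with a K, M or G suffix (for example 96023984K); rss_to_gb is a hypothetical helper name, not part of sacct.

```shell
# Convert a sacct memory value such as "96023984K" to gigabytes.
# Assumption: the value ends in K, M or G (binary multiples);
# a bare number is treated as bytes.
rss_to_gb() {
    echo "$1" | awk '{
        unit = substr($0, length($0), 1)   # last character is the unit
        num = $0 + 0                       # awk keeps the leading number
        if (unit == "K")      gb = num / 1024 / 1024
        else if (unit == "M") gb = num / 1024
        else if (unit == "G") gb = num
        else                  gb = num / 1024 / 1024 / 1024
        printf "%.2f\n", gb
    }'
}

# Example usage:
# rss_to_gb 96023984K
```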

Important notes:

-typical size of a mammalian genome is 2-4GB

-on cluster set up, VMs do not talk to each other i.e. memory (RAM) and core requirements are specified per machine not per cluster

-cactus job is a workflow made up of several tasks - can break these down

-when downloading genomes from ncbi, need to remove spaces in the names of the chromosomes, e.g.

NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly to NC_000001.11
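One possible way to do this renaming is to keep only the first whitespace-delimited token of every header line. The sketch below uses sed; clean_fasta_headers is a hypothetical helper name and the file names in the usage comment are placeholders.

```shell
# Keep only the first whitespace-delimited token of each FASTA header.
# Lines that do not start with ">" (the sequence) pass through unchanged.
clean_fasta_headers() {
    sed -e '/^>/ s/[[:space:]].*$//' "$@"
}

# Example usage (placeholder file names):
# clean_fasta_headers human_ncbi.fasta > human.fasta
```

With no file argument the function reads from standard input, so it can also sit at the end of a download pipeline.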

-guide to AWS instructions for cactus:

main points:

  • mammal-size genomes require (N/2)x20 c4.8xlarge instances on the spot market and (N/2) r3.8xlarge instances on-demand, i.e. for a pairwise alignment: 20 c4.8xlarge instances and 1 r3.8xlarge instance
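The rule of thumb can be turned into a quick calculation for any number of genomes N. This is only a sketch: cactus_aws_instances is a hypothetical helper, and rounding N/2 up for odd N is an assumption that the AWS guide does not state.

```shell
# Estimate AWS instance counts for N mammal-size genomes using the
# (N/2) x 20 spot / (N/2) on-demand rule of thumb.
# Assumption: for odd N, N/2 is rounded up.
cactus_aws_instances() {
    n=$1
    half=$(( (n + 1) / 2 ))
    echo "c4.8xlarge (spot): $(( half * 20 ))"
    echo "r3.8xlarge (on-demand): $half"
}

# Pairwise alignment (N = 2):
cactus_aws_instances 2
# c4.8xlarge (spot): 20
# r3.8xlarge (on-demand): 1
```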

Set-up (on pawsey)

    dir="$MYSCRATCH/test-cactus"
    mkdir -p "$dir"
    cd "$dir"

    module load singularity
    singularity pull docker://

    git clone
    cp -p cactus/examples/evolverMammals.txt .

Cactus-prepare command (used to list chunks of the workflow, i.e. steps that can be run separately; commands in the same block can be executed concurrently)

    singularity exec cactus_v1.0.0.sif cactus-prepare --jobStore jobstore evolverMammals.txt steps-output steps-output/evolverMammals.txt steps-output/evolverMammals.hal &>list_commands

    #inspect the output file
    cat list_commands

Test run (example given in github repository)

    salloc -c 28 -t 1:00:00
    srun singularity exec cactus_v1.0.0.sif cactus jobStore evolverMammals.txt evolverMammals.hal --root mr --binariesMode local

When it runs out of walltime, can continue with command:

    srun singularity exec cactus_v1.0.0.sif cactus jobStore evolverMammals.txt evolverMammals.hal --root mr --binariesMode local --restart
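Since the first submission and the resubmission differ only by --restart, one option is to add the flag only when a job store already exists. This is a sketch under the assumption that the job store is a directory on disk (here called jobStore, as in the commands above); restart_flag is a hypothetical helper.

```shell
# Print "--restart" if the given job store directory already exists,
# otherwise print nothing. Assumes a directory-based job store.
restart_flag() {
    if [ -d "$1" ]; then
        echo "--restart"
    fi
}

# Example usage (same command for the first run and for restarts):
# srun singularity exec cactus_v1.0.0.sif cactus jobStore evolverMammals.txt \
#     evolverMammals.hal --root mr --binariesMode local $(restart_flag jobStore)
```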

Test run on PAWSEY's Zeus (Mammalian pairwise alignment)

SeqFile - testalignment.txt:

    (human,greykangaroo);

    *human /group/pawsey0263/ashling_charles/human.fasta (size:3.1G)

    greykangaroo /group/pawsey0263/ashling_charles/mg-2k.fasta.masked (size:3.4G)
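Before submitting a long job it may be worth checking that every path in the SeqFile exists. The sketch below assumes the file holds the tree on one line followed by "name path" lines with exactly two fields and no spaces in the paths (the size notes above are annotations in this guide, not part of the file); check_seqfile_paths is a hypothetical helper.

```shell
# Report genome paths listed in a SeqFile that do not exist on disk.
# Assumption: genome lines have exactly two fields, "name path".
check_seqfile_paths() {
    awk 'NF == 2 { print $2 }' "$1" | while read -r path; do
        [ -e "$path" ] || echo "missing: $path"
    done
}

# Example usage:
# check_seqfile_paths /group/pawsey0263/ashling_charles/testalignment.txt
```

No output means every listed file was found.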

Cactus script:

    #!/bin/bash -l
    #SBATCH --job-name="myjob"
    #SBATCH --nodes=1
    #SBATCH --account=pawsey0263
    #SBATCH --export=NONE
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=28
    #SBATCH --output=cactusexample.%j.o
    #SBATCH --error=cactusexample.%j.e

    module load singularity
    CONTAINER_PATH=/scratch/pawsey0263/ashling_charles/test-cactus

    srun singularity exec cactus_v1.0.0.sif cactus jobStore /group/pawsey0263/ashling_charles/testalignment.txt humankangaroo.hal --binariesMode local --restart

    #TOIL_SLURM_ARGS="--nodes 8 --export=ALL" srun singularity exec cactus_v1.0.0.sif cactus --binariesMode local --batchSystem slurm --workDir=tmp --defaultCores 28 --maxCores 28 --maxNodes 8 --logLevel=debug --defaultDisk 96G --defaultMemory 96G jobStore evolverMammals.txt evolverMammals.hal --disableCaching --clean=always

-> issue with script - always gets stuck in LastzRepeatMasking step

