Before your workflow can be executed, some preparation is needed:
- upload input data
- create the stack
- prepare the workflow
Pick a new prefix: a prefix can be an S3 bucket or a subdirectory of an S3 bucket, preferably empty to avoid conflicts with existing files. All input/output files of the workflow will be relative to this prefix.
There will be 3 kinds of files stored on this prefix:
- input files
- workflow directory
- output files
Input files are any files required, but not created, by the workflow; examples are `.fastq` files, genomes, annotations.
Upload your input data somewhere inside the prefix, such as `<prefix>/input`.
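For example, with the AWS CLI (the bucket and prefix names below are placeholders):

```bash
# Sync a local data directory into the input area under the prefix
# (replace the bucket/prefix with your own).
aws s3 sync ./data s3://my-bucket/my-prefix/input/
```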
The workflow directory is the working directory where you run snakemake; it includes the `Snakefile`, configs, extra rule files, `envs.yaml`, and scripts.
The workflow directory will be synced automatically to `<prefix>/_workflow`; this is for convenience and to always keep the workflow current.
This directory will be downloaded by all jobs, so don't put big files here.
Go to CloudFormation on the AWS console and create a stack with the included template.
The template automates the creation of the AWS resources required for hyperdrive to work. Creating the stack has no cost; instances are only created once jobs start being submitted.
The stack only needs to be created once per AWS region and can be reused many times, even by multiple workflows at the same time; just make sure each workflow runs on its own prefix.
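If you prefer the command line over the console, the stack can also be created with the AWS CLI; the stack name and template path below are assumptions, and the template may require IAM capabilities:

```bash
# Create the hyperdrive stack from the bundled CloudFormation template.
# Stack name and template file path are examples, not fixed by hyperdrive.
aws cloudformation create-stack \
    --stack-name hyperdrive \
    --template-body file://path/to/hyperdrive-template.yaml \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
```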
A custom instance image (AMI) is required to run the jobs.
A packer template is included to build the AMI automatically:
`cd share; packer build packer-ami-template.json`
After the build is complete, the AMI id will be printed on screen.
The AMI only needs to be built once per AWS region.
In your snakemake workflow directory, create a config file:
`hyperdrive config --stack-name <stack> --prefix <prefix> --ami <ami_id>`
Arguably the hardest part is preparing the workflow to run in the cloud environment, especially if you have an existing workflow: it probably contains assumptions about file paths, number of cores, memory, installed software, etc., and all of these assumptions need to be declared explicitly.
- use `hyperdrive snakemake --dry-run` to test whether the rules were formed correctly.
- inputs: all files inside the prefix can be used as if they were local files. For example, a file at `<prefix>/input/data.txt` can be used in a rule as `input: "input/data.txt"`. The same applies to outputs.
- there is no shared filesystem. This means every rule needs to explicitly declare all inputs and outputs required for the job to run; for example, an alignment job needs as input not only the sequence read files but also the genome fasta and all index files required by the aligner (see the example rule after this list). When the job starts, all declared input files are downloaded to the instance, and when the job ends, all declared output files are uploaded to the prefix. Any other, undeclared file is lost.
- define cpus, memory and disk requirements:
  - use `threads` to allocate vcpus/cores/threads.
  - use `resources.mem_gb` or `mem_mb` to allocate memory.
  - use `resources.disk_gb` or `disk_mb` to allocate disk space; this space needs to accommodate all input + output + temporary files of the job. `resources` can be a function of the inputs: `resources: disk_gb=lambda wc, input: round(0.5+2*input.size/1e6)`
  - use `--default-resources` to provide defaults for all rules.
- files outside the prefix or in other buckets can be used as inputs with `S3.remote("<bucket>/path/to/file")` (see the second sketch after this list).
- local rules run on the local machine, and their input files will be downloaded. use `S3.remote()` on `rule all` to avoid downloading the final results.
- `group` jobs execute together; use it to minimize the overhead of short-lived jobs. the `resources` of all rules in a group will be combined with `max()`.
- use `conda` or `singularity` to deploy software; the instances only have very basic linux tools.
- avoid `run:`, it will not be executed inside the conda/singularity env.
- `temp()` and `directory()` do not work:
  - jobs that generate many output files can use `tar` to bundle a directory into a single output file.
  - use a common `tmp` path for temporary output files to make them easier to delete later.
- the prefix can be accessed inside the workflow with `config['DEFAULT_REMOTE_PREFIX']`; it can be used to avoid hardcoding the prefix in the workflow.
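To tie several of these points together, here is a sketch of what a rule can look like; the aligner, file names and resource numbers are illustrative only and not part of hyperdrive:

```python
rule align:
    input:
        reads="input/sample1.fastq.gz",     # lives at <prefix>/input/sample1.fastq.gz
        genome="input/genome.fa",           # the reference must be declared too...
        index=expand("input/genome.fa.{ext}",
                     ext=["amb", "ann", "bwt", "pac", "sa"]),  # ...and every index file the aligner reads
    output:
        "output/sample1.bam"                # uploaded to <prefix>/output/sample1.bam when the job ends
    threads: 4                              # powers of two match instance sizes best
    resources:
        mem_gb=8,
        disk_gb=lambda wc, input: round(0.5 + 2 * input.size / 1e9),  # room for inputs + outputs + temp files
    conda:
        "envs/bwa.yaml"                     # the bare instance has no aligner installed
    shell:
        "bwa mem -t {threads} {input.genome} {input.reads} | samtools sort -o {output} -"
```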
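And a sketch of the two `S3.remote()` cases above, with made-up bucket, file and env names; the counting rule is only an illustration:

```python
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

S3 = S3RemoteProvider()

# rule all is a local rule; keeping the final result remote avoids downloading it.
rule all:
    input:
        S3.remote(config["DEFAULT_REMOTE_PREFIX"] + "/output/sample1.counts.txt")

rule count:
    input:
        bam="output/sample1.bam",                                    # a file under the prefix, used as if local
        genes=S3.remote("some-other-bucket/annotations/genes.gtf"),  # a file outside the prefix
    output:
        "output/sample1.counts.txt"
    threads: 2
    resources:
        mem_gb=4,
        disk_gb=5,
    conda:
        "envs/subread.yaml"
    shell:
        "featureCounts -T {threads} -a {input.genes} -o {output} {input.bam}"
```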
In the workflow directory, run:
`hyperdrive snakemake [[extra arguments for snakemake]]`
This command will sync the workflow dir and run snakemake with all the options required for "cloud" mode.
You can add extra snakemake arguments as usual, such as `--dry-run`.
As snakemake submits jobs, instances will be started to process the queue and terminated when they are no longer needed.
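For example, assuming extra arguments are passed straight through to snakemake as described above (the numbers here are arbitrary):

```bash
# check the DAG first without submitting anything
hyperdrive snakemake --dry-run

# then run for real, with defaults for rules that declare no resources
hyperdrive snakemake --jobs 50 --default-resources mem_gb=4 disk_gb=10
```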
- `hyperdrive status` shows the status of submitted jobs.
- `hyperdrive log <jobid>` prints the log of a job as it runs.
- `hyperdrive kill <jobid>` terminates a job if something goes wrong.
All instances and volumes used are tagged with the following tags:
- `Name`, a unique per-workflow name built from the snakemake rule name and a number
- `hyperdrive.jobid`, the jobid running on the instance
- `hyperdrive.prefix`, the S3 prefix being used
- `hyperdrive.stack`, the name of the stack being used
- `hyperdrive.rule`, the snakemake rule that created the job
- `hyperdrive.wildcards.<x>`, the wildcards of the job
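These tags make it easy to find a workflow's instances from the AWS CLI, for example (the prefix value is a placeholder):

```bash
# List the ids of all instances currently working on a given prefix.
aws ec2 describe-instances \
    --filters "Name=tag:hyperdrive.prefix,Values=my-bucket/my-prefix" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text
```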
Besides threads/memory/disk, you can further narrow the instance types by requesting extra features with `resources: avx=N`:
- `avx=1` for instances with AVX
- `avx=2` for instances with AVX2
- `avx=3` for instances with AVX512
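For example, a hypothetical rule that relies on AVX2 instructions (the rule name, files and script are made up) could request:

```python
rule pca:
    input:  "output/expression_matrix.tsv"
    output: "output/pca.tsv"
    threads: 8
    resources:
        mem_gb=32,
        avx=2          # only run on instance types that support AVX2
    conda: "envs/scikit.yaml"
    script: "scripts/pca.py"
```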
- AWS instance-type sizes follow a power-of-2 law: if your job requests 5 threads you will get an 8-core instance, so it is better to use either 4 or 8 threads; the same idea applies to memory.
- Be careful with short-lived jobs: if you have many of them, try to run them all as a single job to minimize the overhead of launching a new EC2 instance.
- AWS has quotas/limits on how much of a resource you can use; you might need to contact support to increase these limits if you want to run big workflows.
- Set a retention policy on the CloudWatch logs; by default the logs are kept for a month (a CLI example follows this list).
- Track workflow costs by activating cost allocation tags on the AWS billing page; note however that S3 usage can only be tracked at the bucket level, not per prefix.
- Check that you have a default subnet in all AZs of your region; no instances will be launched in an AZ without a subnet.
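For the log-retention tip, the policy can be changed with the AWS CLI; the log group name below is a placeholder, use whichever group the stack created:

```bash
# Keep job logs for 90 days (CloudWatch only accepts fixed values, e.g. 30, 60, 90, ...).
aws logs put-retention-policy \
    --log-group-name <hyperdrive-log-group> \
    --retention-in-days 90
```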