Case 1
This example runs through SLURM on a Linux-based compute cluster.
If you want to follow this example, make sure you have the following software installed: the SRA Toolkit (for fasterq-dump), seqtk, ABySS, Fastaq, seqstats, Python, and doepipeline itself.
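One possible way to set most of this up is sketched below; the conda package names and repository URLs are assumptions rather than part of the original instructions, so adapt them to your environment (many clusters instead provide these tools as environment modules):
# Illustrative setup sketch -- package names and repository URLs are assumptions, adjust to your cluster
conda create -n case_1 -c bioconda -c conda-forge sra-tools seqtk abyss
conda activate case_1
pip install doepipeline
# Fastaq and seqstats are referenced from local checkouts under ~/case_1/bin in case_1.yaml
mkdir -p ~/case_1/bin && cd ~/case_1/bin
git clone https://github.com/sanger-pathogens/Fastaq.git
git clone --recursive https://github.com/clwgg/seqstats.git
cd seqstats && make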
The sequence data used for this case is available under BioProject accession PRJNA16087 (SRA run accession SRR518502).
Download sequence data from the SRA using the SRA-toolkit:
fasterq-dump --outdir ~/case_1/data SRR518502
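SRR518502 is a paired-end run, so after the download the data directory should contain the forward and reverse read files used in the next step (possibly alongside a file of unpaired reads):
ls ~/case_1/data
# SRR518502_1.fastq  SRR518502_2.fastq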
Subsample the read set to approximately 100X coverage using seqtk:
cd ~/case_1/data
seqtk sample -s100 SRR518502_1.fastq 950000 > SRR518502_sample_1.fastq
seqtk sample -s100 SRR518502_2.fastq 950000 > SRR518502_sample_2.fastq
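If you want to check the coverage estimate yourself, the arithmetic is coverage ≈ read pairs × 2 × read length / genome size. Assuming reads of roughly 100 bp (an assumption, not stated in the original) and the reference genome size used later in the YAML (1,894,157 bp):
# 950,000 pairs x 2 reads x ~100 bp per read / 1,894,157 bp genome
echo "950000 * 2 * 100 / 1894157" | bc -l
# ~100.3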
Please make sure that the file paths and SLURM details in the YAML file are in line with your own setup.
case_1.yaml
# Specify the experimental design
design:
    type: ccf
    # Each factor to vary is specified here.
    # min/max: the smallest/largest values allowed
    # low_init/high_init: the initial design space minimum and maximum. Only used when the initial GSD screening step is not run.
    # type: ordinal (integer), qualitative (categorical), quantitative (float)
    factors:
        # Factor 'KMER' is varied between 20 and 90
        KMER:
            min: 20
            max: 90
            low_init: 10
            high_init: 50
            type: ordinal
        # Factor 'MIKC' is varied between 2 and 15
        MIKC:
            min: 2
            max: 15
            low_init: 4
            high_init: 8
            type: quantitative
        # Factor 'MIAL' is varied between 20 and 60
        MIAL:
            min: 20
            max: 60
            low_init: 30
            high_init: 50
            type: ordinal
        # Factor 'MIPA' is varied between 5 and 15
        MIPA:
            min: 5
            max: 15
            low_init: 8
            high_init: 12
            type: ordinal
    # Define the responses
    responses:
        # Response 'tSeq' represents the total sequence in the assembly
        # Should be maximized
        # Min accepted value is 1,850,000 bp
        # The goal is the reference genome size (1,894,157 bp)
        tSeq:
            criterion: maximize
            low_limit: 1850000
            target: 1894157
        # Response 'nSeq' represents how many contigs are in the assembly
        # Should be minimized
        # Max accepted value is 100
        # Goal is to have 90 contigs
        nSeq:
            criterion: minimize
            high_limit: 100
            target: 90
        # Response 'N50' represents the assembly quality in terms of contiguity
        # Should be maximized
        # Should be at least 30,000 bp
        # Goal is 35,000 bp
        N50:
            criterion: maximize
            low_limit: 30000
            target: 35000
# Set the working directory (all output goes here)
working_directory: "~/case_1/doepipeline_output"
# The name of the results file (containing the response values). This file should be structured like:
# RESPONSE1,VALUE
# RESPONSE2,VALUE
results_file: "results.txt"
# Specify the names of the different pipeline steps to be run
pipeline:
    - ABYSS
    - CalculateResponse
# The first pipeline step
# Here, abyss-pe is executed whilst varying the four parameters k, c, l, and n.
ABYSS:
    # Define the script that should be run for this step
    script: >
        abyss-pe \
        k={% KMER %} \
        c={% MIKC %} \
        l={% MIAL %} \
        n={% MIPA %} \
        j=14 \
        name=case_1 \
in="~/case_1/data/SRR518502_sample_1.fastq ~/case_1/data/SRR518502_sample_1.fastq"
    # Specify the factors that are used in this script.
    # We substitute these factors in the script with the values given by the experimental design.
    factors:
        KMER:
            substitute: true
        MIKC:
            substitute: true
        MIAL:
            substitute: true
        MIPA:
            substitute: true
    # If running through SLURM, specify the SLURM details you need.
    SLURM:
        A: PROJECT
        n: 14
        t: 00:15:00
        o: case_1_slurm.out
        e: case_1_slurm.error
# The second pipeline step
CalculateResponse:
    # Define the script that should be run for this step
    script: >
        ~/case_1/bin/Fastaq/scripts/fastaq filter --min_length 1000 case_1-scaffolds.fa case_1_l1000-scaffolds.fasta;
        ~/case_1/bin/seqstats/seqstats case_1_l1000-scaffolds.fasta > case_1_l1000-scaffolds_seqstats.fasta;
        python ~/case_1/bin/case_1_scoring.py \
        -f case_1_l1000-scaffolds_seqstats.fasta \
        -o {% results_file %};
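The keys under SLURM in the ABYSS step are the short sbatch option names. To make the YAML above easier to read, this is roughly what they correspond to if the step were submitted by hand (purely illustrative, doepipeline takes care of the actual submission; abyss_step.sh is just a placeholder name):
# -A PROJECT        account/project to charge
# -n 14             number of tasks (matches j=14 given to abyss-pe)
# -t 00:15:00       wall-clock time limit
# -o / -e           files for stdout and stderr
sbatch -A PROJECT -n 14 -t 00:15:00 -o case_1_slurm.out -e case_1_slurm.error abyss_step.sh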
The following script (case_1_scoring.py) is used in the pipeline step CalculateResponse (see case_1.yaml) to create the result file.
case_1_scoring.py
#!/usr/bin/env python
import argparse
parser = argparse.ArgumentParser(description='summarize responses from assembly')
parser.add_argument('-f', '--infile', help='result file from seqstats', required=True)
parser.add_argument('-o', '--outfile', help='doepipeline compatible result file', required=True)
args = parser.parse_args()
def read_result(result_file):
    """put each row of the result file as a key:value pair of a dictionary"""
    result_dict = {}
    with open(result_file, 'r') as f:
        for row in f:
            row = row.strip()
            row = row.rstrip(" bp")
            key, value = row.split(":")
            value = value.strip()
            result_dict[key] = value
    return result_dict
def write_result(result_dict, outfile):
    with open(outfile, 'w') as f:
        f.write('tSeq,{}\n'.format(result_dict['Total seq']))
        f.write('nSeq,{}\n'.format(result_dict['Total n']))
        f.write('N50,{}\n'.format(result_dict['N 50']))
result = read_result(args.infile)
write_result(result, args.outfile)
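To see what the script expects as input and what it writes to the results file, here is a small dry run on a mocked-up seqstats-style output; all numbers are illustrative only:
# Mock seqstats-style input (values are made up for illustration)
cat > example_seqstats.txt << 'EOF'
Total n:        95
Total seq:      1894157 bp
Avg. seq:       19938 bp
N 50:           35000 bp
Min seq:        1000 bp
Max seq:        120000 bp
EOF
python ~/case_1/bin/case_1_scoring.py -f example_seqstats.txt -o results.txt
cat results.txt
# tSeq,1894157
# nSeq,95
# N50,35000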
The following script (case_1.sbatch) is used to run doepipeline in a SLURM environment. Please make sure that the file paths and SLURM details are in line with your own setup.
case_1.sbatch
#!/bin/bash -l
#SBATCH -A PROJECT
#SBATCH -n 1
#SBATCH -t 05:00:00
#SBATCH -J case_1_doepipeline
#SBATCH -o case_1_doepipeline.out
#SBATCH -e case_1_doepipeline.error
set -e
set -x
# Adjust these paths and settings to match your own setup
# (note: use ${HOME} rather than ~ here, since ~ is not expanded inside quotes)
_WORKDIR="${HOME}/case_1"
_OUTDIR="${_WORKDIR}/doepipeline_output"
_YAML="${_WORKDIR}/yaml/case_1.yaml"
_LOG="${_OUTDIR}/case_1.log"
_SCREENING_REDUCTION="8"
_SHRINKAGE="0.9"
# Create output directory if needed
if [ ! -d "${_OUTDIR}" ]; then
    echo "Creating working directory: ${_OUTDIR}"
    mkdir "${_OUTDIR}"
fi
# Run doepipeline
doepipeline -e slurm -s ${_SHRINKAGE} --screening_reduction ${_SCREENING_REDUCTION} -m greedy -l ${_LOG} ${_YAML}
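Submit the batch script to SLURM in the usual way:
sbatch case_1.sbatch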