A bash and biobash library for workflow and pipeline design within, but not restricted to, the scope of Next Generation Sequencing (NGS) data analyses.
- Bashbone
- License
- Download
- Bash library usage (without installation)
- Installation
- Biobash library usage (requires installation)
- Third-party software
- Supplementary information
- Closing remarks
- Write command line code in your favorite programming language via Here-documents for later orchestrated execution
- Add object-oriented programming (oop) like syntactic sugar to bash variables and arrays to avoid complex parameter-expansions, variable-expansions and brace-expansions
- Execute commands in parallel on your machine or submit them as jobs to a workflow manager like sun grid engine (SGE) and log stdout, stderr and exit codes per job
- Benchmark runtime and memory usage
- Infer number of parallel instances according to targeted memory consumption or targeted threads per instance
- Get a full bash error stack trace in interactive shells or within scripts
- Log execution of bash functions at different verbosity levels
- Extend the library by custom bash functions which will inherit
  - Stack trace
  - Termination of all function related (sub-)processes, including asynchronous background jobs upon error/exit or when reaching prompt-command (interactive shell)
  - Removal of temporary files created via `mktemp` and execution of custom cleanup commands upon error/exit or when reaching prompt-command (interactive shell)
- Profit from helper functions that implement
- Joining of multiple files
- Multi-threaded sorting
- Multi-threaded compression plus indexing for random access by byte offset or line number without noticeable overhead
- Easily design multi-threaded pipelines to perform NGS related tasks
- Use many best-practice parameterized and heavily run-time tweaked software wrappers
- Most software related parameters will be inferred directly from input data, so that all functions require just a minimal set of input arguments
- Benefit from a non-root stand-alone installer without need for any prerequisites
- Get genomes, annotations from Ensembl, variants from GATK resource bundle and RAW sequencing data from NCBI Sequence Read Archive (SRA)
- For paired-end and single-end derived raw sequencing or prior mapped read data
- RNA-Seq protocols (RNA, RIP, m6A, ..)
- DNA-Seq protocols (WGS, ChIP, ChIP-exo, ATAC, CAGE, Quant, Cut&Tag, ..)
- Bisulfite converted DNA-Seq protocols (WGBS, RRBS)
- Data quality analysis and preprocessing
- adapter and poly-mono/di-nucleotide clipping
- quality trimming
- error correction
- artificial rRNA depletion
- Read alignment and post-processing
- knapsack problem based slicing of alignment files for parallel task execution
- sorting, filtering
- UMI based de-duplication or removal of optical and PCR duplicates
- generation of pools and pseudo-replicates
- read group modification, split N-cigar reads, left-alignment and base quality score recalibration
- Gene fusion detection
- Methyl-C calling and prediction of differentially methylated regions
- Expression analysis
- Read quantification (also from quasi-mappings), TPM and Z-score normalization and heatmap plotting
- Inference of strand specific library preparation methods
- Inference of differential expression as well as clusters of co-expression
- Detection of differential splice junctions and differential exon usage
- Gene ontology (GO) gene set enrichment and over representation analysis plus semantic similarity based clustering and visualizations
- Implementation of ENCODE v3 best-practice ChIP-Seq Peak calling
- Peak calling from RIP-Seq, MeRIP-Seq, m6A-Seq and other related IP-Seq data
- Inference of effective genome sizes
- Variant detection from DNA or RNA sequencing experiments
- Integration of multiple solutions for germline and somatic calling
- VCF normalization
- Tree reconstruction from homozygous sites
- ssGSEA and survival analysis from TCGA cancer expression data
- Genome and SRA data retrieval
- Genome to transcriptome conversion
- Data visualization via IGV batch processing
The whole project is licensed under the GPL v3 (see LICENSE file for details), except the third-party tools set up during installation. Please refer to their corresponding licenses.
Copyleft (C) 2020, Konstantin Riege
This will download a copy which includes the latest developments
git clone --recursive https://github.com/Hoffmann-Lab/bashbone
To check out the latest release (tagged irregularly), do
cd bashbone
git checkout $(git describe --tags)
When used in a script, bashbone is meant to be sourced at the very top in order to handle positional arguments (`-a "$@"`) and to re-execute the script (`-r true`) under its own process group id, which takes care of proper termination. It will enable error stack tracing and sub-process handling globally by setting traps for `EXIT`, `ERR`, `RETURN` and `INT`, so don't override them. In case your script intends to spawn daemons, use `setsid` or disable bashbone first.
#!/usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"
# do stuff
# now spawn daemons
setsid daemon1 &
bashbone -x
daemon2 &
If the bashbone library is not required in your script, except for error tracing, proper sub-process termination upon exit or error, and cleanup of temporary files, `bashbone_lite` functions as a drop-in replacement.
#!/usr/bin/env bash
source <path/to/bashbone>/bashbone_lite.sh -a "$@"
Please note that error tracing in bash is circumvented by `||` and `&&` constructs. Therefore, avoid them in any context of function calls.
#!/usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"
function myfun(){
cat file_not_found
echo "error ignored. starting time consuming calculation now."
}
# DON'T !
myfun || echo "failed with code $?"
To get all third-party tools set up and subsequently all biobash bashbone functions to work properly, see also the Installation section.
Load the bashbone library in an interactive terminal session. Note that none of your environment settings will be modified. Each function, which can be auto-completed when using bash v5 or later, comes with its own usage.
# shows activate usage
source ./activate.sh -h
# activates bashbone
source ./activate.sh
# shows bashbone usage
bashbone -h
# list developer functions
bashbone -d
commander::makecmd commander::print commander::printcmd commander::printerr
commander::printinfo commander::qalter commander::qstat commander::qsubcmd
commander::runalter commander::runcmd commander::runstat commander::warn
configure::instances_by_memory configure::instances_by_threads configure::jvm configure::memory_by_instances
helper::addmemberfunctions helper::basename helper::capply helper::cat
helper::index helper::isarray helper::ishash helper::join
helper::lapply helper::makecatcmd helper::multijoin helper::pgzip
helper::ps2pdf helper::sort helper::vcfsort progress::log
# shows function usage
To unset bashbone functions in your interactive shell or to revert changes made to the environment when used in a script, do
bashbone -x
When bashbone is used within a script, which makes use of positional arguments, hand them over to bashbone activation script.
#! /usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"
To create commands, print, and to execute them in parallel (with optional benchmarking of runtime and memory consumption), further inspect and modify task concurrency, utilize
commander::makecmd
commander::printcmd
commander::runcmd
commander::runstat
commander::runalter
# truncated usage of commander::makecmd
easily compose and store a command in an array for later execution
-a <cmds> | mandatory. array to append the command onto
-v <variable> | optional. env variable(s) necessary for command. can be used multiple times
-s <separator> | optional. character to separate command chunks, when file descriptor array COMMANDER is used. default: '\n'
-m | force multi-line support in case of activated bashbone legacy mode, which otherwise removes line breaks
-o <outfile> | optional. redirect stdout to output file
-O <outfile> | optional. append stdout to output file
-c <cmd|{fd[0]}..> | ALWAYS LAST OPTION. command line string(s) and/or here-documents or file descriptor array COMMANDER
# usage commander::runcmd
orchestrates the execution of commands stored within an array by utilizing GNU parallel
-v | optional. toggle on verbose mode which will print the commands prior to their execution
-b | optional. toggle on benchmarking run time and peak memory consumption
-c <env> | optional. activate a bashbone conda environment prior to command execution
-i <instances> | optional. number of parallel instances. see also commander::runalter. default: 1
-t <instances> | optional. obsolete synonym for -i
-s <idx[:idx]> | optional. restrict the execution of commands by array index start or range. default: 1
-n <name> | optional. prefix for logs and job scripts
-o <path> | optional. output directory for logs, job scripts and exit codes
-r | optional. override existing logs. default: append
-a <cmds> | mandatory. array of commands. see also commander::makecmd
example 1:
cmds=("echo job1" "echo job2" "echo job3")
cmds+=("sleep 2; echo job4")
commander::runcmd -v -b -i 2 -a cmds
example 2:
commander::runcmd -i 2 -n current -o ~/jobs -r -a cmds
Parallel task execution is also possible by submitting commands as an array job to the workload manager Sun Grid Engine or its fork Son of Grid Engine (SGE) via the drop-in replacement functions
commander::qsubcmd
commander::qstat
commander::qalter
To monitor any bash function and redirect all stdout and stderr into a log file, while defining what is printed to the terminal via verbosity levels (`-v [1..3]`), use
progress::log -v 1 -o <logfile> -f commander::runcmd
To infer a suitable number of parallel instances for local task execution given the current state of CPU and memory resources, call the bashbone configuration functions. They return settings matching the current machine (unless the `-d` switch is used to perform a dry run): memory and threads per instance, number of instances, as well as garbage collection threads and concurrency for the java JVM.
The execution of applications known to have a large memory footprint should be parameterized according to
configure::instances_by_memory
configure::memory_by_instances
configure::jvm
Task parallelization settings for lightweight applications can be inferred via
configure::instances_by_threads
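The underlying arithmetic can be illustrated in plain bash. This is only a sketch with hypothetical memory and thread values, not the actual `configure::*` implementation, which detects the machine's resources itself:

```shell
# hypothetical resources; configure::* would detect these from the machine
total_memory_mb=64000       # assumed total available memory in MB
memory_per_instance=16000   # targeted memory consumption per instance in MB
threads=16                  # assumed available CPU threads

# the number of instances is bounded by memory and by available threads
instances=$(( total_memory_mb / memory_per_instance ))
if (( instances > threads )); then
    instances=$threads
fi

echo "$instances" # → 4
```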
For a later, orchestrated parallel execution of commands, they need to be stored beforehand, which typically calls for escaping special characters so they are not interpreted by the shell. `commander::makecmd` offers two methods to solve this by using interpreted (`EOF`) or non-interpreted (`'EOF'`) Here-documents and file descriptor arrays.
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
echo "$i" | awk '{print "hello "$1}'
done
# store
declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
cmds+=("echo \"$i\" | awk '{print \"hello \"\$1}'")
done
Solution 1 relies on variable reference(s) via `-v <var>` and a non-interpreted Here-document. The content of each Here-document will be stored as a single command string in a de-referenced array (`-a cmds`).
declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
commander::makecmd -a cmds -v i -c <<-'EOF'
echo "$i" | awk '{print "hello "$1}'
EOF
done
Solution 2 showcases the builtin file descriptor array COMMANDER, from which the content of interpreted or non-interpreted Here-documents will be read and concatenated using the pipe symbol as separator via `-s '|'`. Each concatenation will be stored as a single command string in a de-referenced array (`-a cmds`). Optionally, an output file can be defined (`-o <path/to/file>`).
declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
commander::makecmd -a cmds -s '|' -c {COMMANDER[0]}<<-EOF {COMMANDER[1]}<<-'EOF'
echo $i
EOF
awk '{
print "hello "$1
}'
EOF
done
The stored commands can now be printed and executed in parallel instances. Optionally, stdout and stderr of each instance, as well as its exit code, can be redirected into log files via `-o <logdir> -n <prefix>`.
commander::printcmd -a cmds
instances=4
commander::runcmd -v -b -i $instances -a cmds
When commands are crafted for later execution, bashbone allows multiple cleanup strategies to be applied on exit (success or failure): a reversely executed, temporary script, which is accessible through the `$BASHBONE_CLEANUP` variable; an overloaded `mktemp` function, which automatically adds removal commands to the cleanup script; and a finally called `_on_exit` function, which receives the job script's exit code as its first positional argument.
source <path/to/bashbone>/activate.sh
commander::makecmd -a cmds -c <<-'EOF'
function _on_exit(){
rm -r /tmp/tmptest3
echo "exit code is $1"
}
mkdir -p /tmp/tmptest1
echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
mktemp -p /tmp/tmptest2
mkdir -p /tmp/tmptest3
ls /tmp/tmptest*
EOF
commander::runcmd -a cmds -i 1
ls /tmp/tmptest* # should fail
A small excerpt of possible member(-like) functions, which help to avoid complex parameter expansions. See `bashbone -d` for a full list of helper underscore functions.
declare -a arr
helper::addmemberfunctions -v arr
arr.push "hello" "world" "and" "moon"
arr.get 1 # world
arr.get -1 # moon
arr.get 0 1 # hello world
arr.get 1 -1 # world and
arr.print # hello world and moon
arr.shift # world and moon
arr.pop # world and
arr.print # world and
arr.substring 2 4 # rld d
arr.sort # d rld
arr.uc # D RLD
arr.print # D RLD
arr.length # 2
# see more available arr.* member functions via auto-completion
In order to speed up gzip compression (including indexing for random access via byte offset or line number) or to sort a (vcf) file from stdin, utilize
helper::pgzip
helper::sort
helper::vcfsort
When a gzip file was not created via `helper::pgzip`, or if a plain text flat file should be indexed, utilize `helper::index`. The index can be used for random access in order to apply a command or pipeline in parallel on stdin of equally sized data chunks. Each process, chunk respectively, can be addressed by its `JOB_ID`.
helper::index
helper::capply
# example
helper::capply -i 10 -f <file[.gz]> -c 'cat > $JOB_ID.out'
Another method, which does not require any index, allows a command or pipeline to be applied in parallel on stdin line records or seekable temporary files. The total number of processes executed is the number of lines divided by the (adjustable) number of records, or can be set to a fixed number of instances; the latter will operate on line records in round-robin mode. Each process can be addressed by its `JOB_ID`, and temporary file names via the `FILE` variable.
helper::lapply
# example
cat <file> | helper::lapply -f -t 10 -c 'cat $FILE > $JOB_ID.out'
Determining the file mime-type for ultra fast printing of its content, thanks to multi-threaded decompression of gzip and bzip files, can be achieved using
helper::makecatcmd
helper::cat
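The idea can be sketched in plain bash with the `file` utility: pick a suitable cat/decompression command by mime-type. This is a simplified stand-in, not the actual `helper::makecatcmd` implementation, which additionally leverages multi-threaded decompressors:

```shell
# create a small gzip-compressed test file
printf 'hello\n' | gzip -c > sample.txt.gz

# choose a cat command by mime-type (simplified sketch)
case "$(file -b --mime-type sample.txt.gz)" in
    application/gzip|application/x-gzip) catcmd="gzip -dc" ;;
    application/x-bzip2)                 catcmd="bzip2 -dc" ;;
    *)                                   catcmd="cat" ;;
esac

$catcmd sample.txt.gz # → hello
```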
Left outer join multiple files by unique ids in the first column, given the separator (`-s '\t'`) and NA character (`-e '.'`), via
helper::multijoin
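For intuition, a left outer join of two id-keyed files can be sketched with GNU coreutils `join`. This is a simplified stand-in; `helper::multijoin` handles more than two files at once:

```shell
# two sorted, tab-separated files keyed by a unique id in the first column
printf 'id1\tA\nid2\tB\n' > a.tsv
printf 'id1\tX\n'         > b.tsv

# left outer join, filling missing fields with the NA character '.'
result=$(join -t $'\t' -a 1 -e '.' -o auto a.tsv b.tsv)
echo "$result"
# → id1	A	X
# → id2	B	.
```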
To return the basename, stripping off a single or, in case of compressed data, two file extensions, and optionally report the suffix as well, use
helper::basename
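In plain bash, stripping up to two extensions comes down to parameter expansion, which is essentially what such a helper saves you from writing. A sketch with a hypothetical file name:

```shell
f=/path/to/sample.fastq.gz   # hypothetical input

b=$(basename "$f")  # sample.fastq.gz
b=${b%.*}           # strip the compression suffix -> sample.fastq
b=${b%.*}           # strip the file extension     -> sample

echo "$b" # → sample
```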
To get a full bash error stack trace for your interactively used bash functions, to add proper handling of sub-process termination, and to run cleanup procedures upon error or when reaching the script end (prompt-command in an interactive shell), simply extend the bashbone functions by your own library: store a set of bash functions, which require the `function` keyword, in files with `.sh` suffix inside a `lib` directory, to be sourced along with the bashbone library afterwards.
cd <path/to/custom>
mkdir lib
cat <<-'EOF' > lib/fun.sh
function world(){
echo "hello world"
mars
}
function mars(){
echo "hello mars"
cmd_that_fails
}
EOF
Now hand over the parent directory to `activate.sh`.
source <path/to/bashbone>/activate.sh -s "$PWD"
world
# will print
hello world
hello mars
cmd_that_fails: command not found
:ERROR: in <path/to/custom>/lib/fun.sh (function: mars) @ line 8: cmd_that_fails
:ERROR: in <path/to/custom>/lib/fun.sh (function: world) @ line 3: mars
Each bashbone procedure, function from a custom extension respectively, gets a local, temporary cleanup script assigned. This bash script will be reversely executed upon return (success or failure) of the function. The path to the script is kept in the `$BASHBONE_CLEANUP` variable. When `mktemp` is used within the function, a removal command for this path will be automatically added to the cleanup script. Of course, custom commands can be added, too.
cd <path/to/custom>
mkdir lib
cat <<-'EOF' > lib/fun.sh
function tmptest(){
mkdir -p /tmp/tmptest1
echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
mktemp -p /tmp/tmptest2
ls /tmp/tmptest*
}
EOF
source <path/to/bashbone>/activate.sh -s "$PWD"
tmptest
ls /tmp/tmptest* # should fail
As an alternative, functions can be wrapped manually. To do so, define an alias before the function's actual definition.
alias myfun="_bashbone_wrapper myfun"
function myfun(){
# do stuff
}
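The mechanism behind this pattern can be demonstrated standalone with a hypothetical logging wrapper (`_my_wrapper` is an illustration, not the actual `_bashbone_wrapper`): the alias redirects any call to the wrapper, while the wrapper invokes the real function through a variable, which is not subject to alias expansion.

```shell
shopt -s expand_aliases # aliases are off by default in non-interactive shells

function _my_wrapper(){
    local f=$1; shift
    echo "wrapped: $f"
    "$f" "$@" # variable expansion bypasses the alias, so the real function runs
}

# the alias must be defined before the function definition is parsed;
# the 'function' keyword prevents the alias from expanding in the definition
alias myfun="_my_wrapper myfun"
function myfun(){
    echo "hello from myfun"
}

myfun
# → wrapped: myfun
# → hello from myfun
```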
Upon script exit, due to failure or upon success, a custom cleanup function can be handed over to bashbone and executed at the very end. The script's exit code will be supplied as the first positional argument. Analogous to the local cleanup, temporary files created by `mktemp` will be automatically nuked, and a cleanup script, kept in the `$BASHBONE_CLEANUP` variable, is executed.
#! /usr/bin/env bash
source <path/to/bashbone>/activate.sh -x cleanup -r true -a "$@"
function cleanup(){
rm -r /tmp/tmptest3
ls /tmp/tmptest* # should fail
echo "exit code is $1"
}
mkdir -p /tmp/tmptest1
echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
mktemp -p /tmp/tmptest2
mkdir -p /tmp/tmptest3
ls /tmp/tmptest*
When using the `-g` switch (recommended), the setup routine will create conda environments or set up software from source according to the enclosed configuration files, URLs respectively. Without the `-g` switch, software is installed in its latest available version, which may lead to unexpected behavior and errors. During setup, the current configuration files will be written to `<path/of/installation/config>`.
# shows usage
scripts/setup.sh -h
scripts/setup.sh -g -i all -d <path/to/installation>
When bashbone was successfully installed, it is recommended to activate it along with its conda environment to ensure all biobash functions operate as expected. The conda environment can be stopped and (re-)started via the `bashbone -c` switch.
source <path/of/installation>/latest/bashbone/activate.sh -c true
Use the `-g` switch in order to also upgrade biobash conda environments that fail the comparison with the supplied configuration files. Attention: this switch will downgrade tools if the initial installation was done for cutting edge tools, i.e. without `-g`.
scripts/setup.sh -g -i upgrade -d <path/of/installation>
Trimmomatic, segemehl, STAR-Fusion, GEM, mdless and gztool will be installed next to the biobash conda environments. Their latest versions and download URLs will be automatically inferred.
scripts/setup.sh -i trimmomatic,segemehl,starfusion,gem,mdless,gztool -d <path/of/installation>
To get all third-party tools set up and subsequently all biobash bashbone functions to work properly, see also the Third-party software section.
Load the library and list available functions.
source <path/of/installation>/latest/bashbone/activate.sh
# list biobash user functions
bashbone -f
# will print
alignment::add4stats alignment::addreadgroup alignment::bamqc alignment::bqsr
alignment::bulkindex alignment::bwa alignment::clip alignment::clipmateoverlaps
alignment::downsample alignment::inferstrandness alignment::leftalign alignment::mkreplicates
alignment::postprocess alignment::qcstats alignment::reorder alignment::rmduplicates
alignment::segemehl alignment::slice alignment::soft2hardclip alignment::splitncigar
alignment::star alignment::strandsplit alignment::tn5clip alignment::tobed
bisulfite::bwa bisulfite::haarz bisulfite::join bisulfite::mecall
bisulfite::methyldackel bisulfite::metilene bisulfite::mspicut bisulfite::rmduplicates
bisulfite::segemehl cluster::coexpression cluster::coexpression_deseq cluster::wgcna
cluster::wgcna_deseq enrichment::go expression::deseq expression::diego
expression::join expression::join_deseq fusions::arriba fusions::join
fusions::starfusion genome::indexgtf genome::mkdict genome::mkgodb
genome::view peaks::gem peaks::gem_idr peaks::genrich
peaks::genrich_idr peaks::gopeaks peaks::gopeaks_idr peaks::m6aviewer
peaks::m6aviewer_idr peaks::macs peaks::macs_idr peaks::matk
peaks::matk_idr peaks::peakachu peaks::peakachu_idr peaks::seacr
peaks::seacr_idr preprocess::add4stats preprocess::cutadapt preprocess::dedup
preprocess::fastqc preprocess::qcstats preprocess::rcorrector preprocess::rmpolynt
preprocess::sortmerna preprocess::trimmomatic quantify::bamcoverage quantify::featurecounts
quantify::normalize quantify::profiles quantify::salmon quantify::tpm
survival::gettcga survival::ssgsea variants::bcftools variants::freebayes
variants::haplotypecaller variants::makepondb variants::mutect variants::panelofnormals
variants::platypus variants::tree variants::vardict variants::vardict_threads
variants::varscan variants::vcfnorm visualize::venn
In order to make use of the biobash conda environments, which ensure all supplied scripts work as expected, activate bashbone with conda enabled
source <path/of/installation>/latest/bashbone/activate.sh -c true
Or activate conda at a later time point
source <path/of/installation>/latest/bashbone/activate.sh
# enable/disable conda
bashbone -c
Or use commander functions with conda enabled (see Developers centerpiece).
commander::runcmd -c bashbone -a cmds
Bashbone is shipped with a couple of scripts to be used stand-alone (experimental when bashbone is not installed and activated) or as part of the biobash functions. They can be listed via
bashbone -s
# will print
annotate.pl canonicals.pl deseq2.R dlgenome.sh
fftool.sh fpkm.pl genome2transcriptome.pl heatmap.R
id2length.pl mergefq.sh mergexons.pl mergexons.sh
pca.R pca_deseq.R pileup2fastq.pl revigo.R
rrbsMspIselection.sh shufnsplitfq.sh sra-dump.sh survival.R
tpm.pl vcfix.pl vcfixuniq.pl vizco.R
volcano.R wgcna.R
Use the enclosed script to fetch sequencing data and meta information from NCBI SRA in a parallelized fashion, given an SRA or GEO accession number, or to convert local sra files.
sra-dump.sh -h
Use the enclosed script to fetch human hg19/hg38 or mouse mm10/mm11 genomes, gene and ontology annotations plus dbSNP and MSigDB. The Plug-n-play CTAT genome resource, made for gene fusion detection and shipped with a STAR index, can also be selected optionally.
dlgenome.sh -h
Using annotation information, the genome can be converted into a transcriptome, transcript-genome or transcript-chromosome.
genome2transcriptome.pl -h
Use the enclosed script to merge e.g. exons of multiple transcripts of a gene with optional offsets for elongation or shrinkage.
mergexons.sh -h
To shuffle and split NGS raw data in fastq format into two pseudo-replicates, use
shufnsplitfq.sh -h
In order to perform desired comparative tasks, some functions require a sample design info file.
Assume this input:
Treatment | Replicate 1 | Replicate 2 |
---|---|---|
wild-type | path/to/wt1.fq | path/to/wt2.fq |
treatment A | path/to/trA_1.fq | path/to/trA_2.fq |
treatment B | path/to/trB.n1.fq | path/to/trB.n2.fq |
And this desired output (N=2 vs N=2 each):
- wt_vs_A
- wt_vs_B
- A_vs_B
Then the info file should consist of:
- At least 4 tab-separated columns (`<name>`, `<main-factor>`, `NA`, `<replicate>`)
- Optionally, additional factors
- First column needs to consist of unique prefixes of input fastq basenames which can be expanded to full file names
<name> | <main-factor> | NA | <replicate> | <additional factor> |
---|---|---|---|---|
wt1 | wt | NA | N1 | female |
wt2 | wt | NA | N2 | male |
trA_1 | A | NA | N1 | female |
trA_2 | A | NA | N2 | male |
trB.n1 | B | NA | N1 | female |
trB.n2 | B | NA | N2 | male |
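Written out as a plain file, the design above could be created like this (tab-separated; the file name `sample.info` is arbitrary):

```shell
# one record per sample: <name> <main-factor> NA <replicate> [additional factors]
printf '%s\t%s\tNA\t%s\t%s\n' \
    wt1    wt N1 female \
    wt2    wt N2 male \
    trA_1  A  N1 female \
    trA_2  A  N2 male \
    trB.n1 B  N1 female \
    trB.n2 B  N2 male > sample.info

# each line holds 5 tab-separated columns
awk -F'\t' '{print NF}' sample.info | sort -u # → 5
```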
Adapter sequences listed below will be tested by FastQC, extracted from the reports and stored as arrays. In case of paired-end data, unknown adapter sequences will be extracted from mate overlaps utilizing BBMap.
source <path/of/installation>/latest/bashbone/activate.sh
declare -a fastq_R1=(<path/to/file> [..<path/to/file>])
declare -a fastq_R2=(<path/to/file> [..<path/to/file>])
declare -a adapter_R1 adapter_R2
preprocess::fastqc -t <threads> -o <outdir> -1 fastq_R1 [-2 fastq_R2] -a1 adapter_R1 [-a2 adapter_R2]
Further adapter sequences can be found in the Illumina Adapter Sequences Document (<https://www.illumina.com/search.html?q=Illumina Adapter Sequences Document>) or Illumina Adapter Sequences HTML (https://support-docs.illumina.com/SHARE/adapter-sequences.htm) and the resource of Trimmomatic (https://github.com/usadellab/Trimmomatic/tree/main/adapters), FastQC respectively (https://github.com/s-andrews/FastQC/blob/master/Configuration).
The following excerpt is independent of the indexing type, i.e. single, unique dual (UD) or combinatorial dual (CD).
Nextera (Transposase Sequence), TruSight, AmpliSeq, stranded total/mRNA Prep, Ribo-Zero Plus: CTGTCTCTTATACACATCT
TruSeq (Universal) Adapter with A prefix due to 3' primer A-tailing: AGATCGGAAGAGC
TruSeq full length DNA & RNA R1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA R2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
TruSeq full length DNA MethC R1: AGATCGGAAGAGCACACGTCTGAAC R2: AGATCGGAAGAGCGTCGTGTAGGGA
TruSeq Small RNA 3': TGGAATTCTCGGGTGCCAAGG
TruSeq Small RNA 5': GTTCAGAGTTCTACAGTCCGACGATC
Ovation Methyl-Seq R1: AGATCGGAAGAGC R2: AAATCAAAAAAAC
Tiny example pipeline to perform gene fusion detection, differential expression and co-expression analyses plus gene ontology enrichment. Check out further pre-compiled pipelines for peak calling from *IP-Seq experiments and differential expression and methylation analyses from RNA-Seq data, RRBS/WGBS respectively (rippchen), or for multiple variant calling options from Exome-Seq/WGS/RNA-Seq data including GATK best-practices in an optimized, parallelized fashion (muvac).
source <path/of/installation>/latest/bashbone/activate.sh -c true
genome=<path/to/fasta>
genomeidx=<path/to/segemehl.idx>
gtf=<path/to/gtf>
declare -a fastqs=(<path/to/fastq/files> [..<path/to/fastq/files>]) # can be compressed
threads=16
memory=64000
declare -a adapter
preprocess::fastqc -t $threads -M $memory -o results/qualities/raw -a adapter -1 fastqs
preprocess::trimmomatic -t $threads -M $memory -o results/trimmed -1 fastqs
preprocess::cutadapt -t $threads -o results/adapterclipped -a adapter -1 fastqs
preprocess::rcorrector -t $threads -o results/corrected -1 fastqs
preprocess::sortmerna -t $threads -o results/rrnafiltered -1 fastqs
fragmentsize=200
fusions::arriba -t $threads -g $genome -a $gtf -o results/fusions -f $fragmentsize -1 fastqs
fusions::starfusion -t $threads -g $genome -a $gtf -o results/fusions -1 fastqs
declare -a mapped
alignment::segemehl -t $threads -g $genome -x $genomeidx -o results/mapped -1 fastqs -r mapped
alignment::postprocess -j uniqify -t $threads -o results/mapped -r mapped
alignment::postprocess -j sort -t $threads -o results/mapped -r mapped
declare -A strandness
alignment::inferstrandness -t $threads -g $gtf -r mapped -x strandness
quantify::featurecounts -t $threads -M $memory -g $gtf -o results/counted -r mapped -x strandness
quantify::tpm -t $threads -g $gtf -o results/counted -r mapped
declare -a comparisons=(<path/to/sample-info/files> [..<path/to/sample-info/files>])
expression::diego -t $threads -g $gtf -c comparisons -i results/counted -o results/diffexonjunctions -r mapped -x strandness
expression::deseq -t $threads -g $gtf -c comparisons -i results/counted -o results/diffgenes -r mapped
expression::join_deseq -t $threads -g $gtf -c comparisons -i results/counted -j results/diffgenes -o results/counted
cluster::coexpression -t $threads -M $memory -g $gtf -i results/counted -o results/coexpressed -r mapped
go=<path/to/gtf.go>
enrichment::go -t $threads -r mapped -c comparisons -l coexpressions -g $go -i results/deseq
In some rare cases a glibc pthreads bug (https://sourceware.org/bugzilla/show_bug.cgi?id=23275) may cause pigz failures (`internal threads error`) and premature termination of tools leveraging it, e.g. Cutadapt. One can circumvent this by using a newer glibc version, an alternative pthreads library, e.g. compiled without lock elision, via `LD_PRELOAD`, or by using the statically compiled pigz binary shipped with bashbone.
LD_PRELOAD=</path/to/no-elision/libpthread.so.0> <command>
Bashbone is a continuously developed library, actively used in my daily work. As a single developer, it may take me a while to fix errors and issues. Feature requests cannot be handled so far, but I am happy to receive pull requests.