
Bashbone

A bash and biobash library for workflow and pipeline design within but not restricted to the scope of Next Generation Sequencing (NGS) data analyses.

For developers - bash library

▲ back to top

  • Write command line code in your favorite programming language via Here-documents for later orchestrated execution
  • Add object-oriented programming (OOP) style syntactic sugar to bash variables and arrays to avoid complex parameter, variable and brace expansions
  • Execute commands in parallel on your machine or submit them as jobs to a workflow manager like sun grid engine (SGE) and log stdout, stderr and exit codes per job
  • Benchmark runtime and memory usage
  • Infer number of parallel instances according to targeted memory consumption or targeted threads per instance
  • Get a full bash error stack trace in interactive shells or within scripts
  • Log execution of bash functions at different verbosity levels
  • Extend the library by custom bash functions which will inherit
    • Stack trace
    • Termination of all function related (sub-)processes, including asynchronous background jobs upon error/exit or when reaching prompt-command (interactive shell)
    • Removal of temporary files created via mktemp and execution of custom cleanup commands upon error/exit or when reaching prompt-command (interactive shell)
  • Profit from helper functions that implement
    • Joining of multiple files
    • Multi-threaded sorting
    • Multi-threaded compression plus indexing for random access by byte offset or line number without noticeable overhead

For users - biobash library

▲ back to top

  • Easily design multi-threaded pipelines to perform NGS related tasks
  • Use many best-practice parameterized and heavily run-time tweaked software wrappers
  • Most software related parameters will be inferred directly from input data, so that all functions require just a minimal set of input arguments
  • Benefit from a non-root stand-alone installer without need for any prerequisites
  • Get genomes, annotations from Ensembl, variants from GATK resource bundle and RAW sequencing data from NCBI Sequence Read Archive (SRA)

Covered tasks

▲ back to top

  • For paired-end and single-end derived raw sequencing or prior mapped read data
    • RNA-Seq protocols (RNA, RIP, m6A, ..)
    • DNA-Seq protocols (WGS, ChIP, ChIP-exo, ATAC, CAGE, Quant, Cut&Tag, ..)
    • Bisulfite converted DNA-Seq protocols (WGBS, RRBS)
  • Data quality analysis and preprocessing
    • adapter and poly-mono/di-nucleotide clipping
    • quality trimming
    • error correction
    • artificial rRNA depletion
  • Read alignment and post-processing
    • knapsack problem based slicing of alignment files for parallel task execution
    • sorting, filtering
    • UMI based de-duplication or removal of optical and PCR duplicates
    • generation of pools and pseudo-replicates
    • read group modification, split N-cigar reads, left-alignment and base quality score recalibration
  • Gene fusion detection
  • Methyl-C calling and prediction of differentially methylated regions
  • Expression analysis
    • Read quantification (also from quasi-mappings), TPM and Z-score normalization and heatmap plotting
    • Inference of strand specific library preparation methods
    • Inference of differential expression as well as clusters of co-expression
    • Detection of differential splice junctions and differential exon usage
    • Gene ontology (GO) gene set enrichment and over representation analysis plus semantic similarity based clustering and visualizations
  • Implementation of ENCODE v3 best-practice ChIP-Seq Peak calling
    • Peak calling from RIP-Seq, MeRIP-Seq, m6A-Seq and other related IP-Seq data
    • Inference of effective genome sizes
  • Variant detection from DNA or RNA sequencing experiments
    • Integration of multiple solutions for germline and somatic calling
    • VCF normalization
    • Tree reconstruction from homozygous sites
  • ssGSEA and survival analysis from TCGA cancer expression data
  • Genome and SRA data retrieval
    • Genome to transcriptome conversion
    • Data visualization via IGV batch processing

License

▲ back to top

The whole project is licensed under the GPL v3 (see LICENSE file for details), except for the third-party tools set up during installation. Please refer to their corresponding licenses.

Copyleft (C) 2020, Konstantin Riege

Download

▲ back to top

This will download a copy which includes the latest developments

git clone --recursive https://github.com/Hoffmann-Lab/bashbone

To check out the latest release (compiled irregularly), do

cd bashbone
git checkout $(git describe --tags)

Bash library usage (without installation)

▲ back to top

Do's and don'ts

▲ back to top

When used in a script, bashbone is meant to be sourced at the very top to handle positional arguments and to re-execute (-r true) the script under its own process group id in order to take care of proper termination (-a "$@"). It enables error stack tracing and sub-process handling globally by setting traps for EXIT, ERR, RETURN and INT. So, don't override them. In case your script intends to spawn daemons, use setsid or disable bashbone first.

#!/usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"
# do stuff
# now spawn daemons
setsid daemon1 &
bashbone -x
daemon2 &

If the bashbone library is not required in your script, except for error tracing, proper sub-process termination upon exit or error and cleanup of temporary files, bashbone_lite serves as a drop-in replacement.

#!/usr/bin/env bash
source <path/to/bashbone>/bashbone_lite.sh -a "$@"

Please note that bash suppresses error traps within '||' and '&&' constructs. Therefore, avoid them in any context of function calls.

#!/usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"

function myfun(){
  cat file_not_found
  echo "error ignored. starting time consuming calculation now."
}
# DON'T !
myfun || echo "failed with code $?"
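
This effect can be reproduced in plain bash, independent of bashbone: with an ERR trap set, commands guarded by '||' or '&&' do not trigger the trap.

```shell
#!/usr/bin/env bash
# plain bash, no bashbone involved
set -E                               # let functions inherit the ERR trap
trap 'echo "ERR trap fired"' ERR

false || echo "failure swallowed"    # ERR trap does NOT fire here
false                                # ERR trap fires
true
```

The same applies when a whole function call is guarded: 'myfun || handler' prevents the ERR trap from firing for failures inside myfun.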

Quick start

▲ back to top

To get all third-party tools set up and subsequently all biobash bashbone functions to work properly, see the Installation section.

Load the bashbone library in an interactive terminal session. Note that none of your environment settings will be modified. Each function, which can be auto-completed when using bash v5 or later, comes with its own usage.

# shows activate usage
source ./activate.sh -h

# activates bashbone
source ./activate.sh

# shows bashbone usage
bashbone -h

# list developer functions
bashbone -d

commander::makecmd              commander::print                 commander::printcmd  commander::printerr
commander::printinfo            commander::qalter                commander::qstat     commander::qsubcmd
commander::runalter             commander::runcmd                commander::runstat   commander::warn
configure::instances_by_memory  configure::instances_by_threads  configure::jvm       configure::memory_by_instances
helper::addmemberfunctions      helper::basename                 helper::capply       helper::cat
helper::index                   helper::isarray                  helper::ishash       helper::join
helper::lapply                  helper::makecatcmd               helper::multijoin    helper::pgzip
helper::ps2pdf                  helper::sort                     helper::vcfsort      progress::log

# shows function usage

To unset bashbone functions in your interactive shell or to revert changes made to the environment when used in a script, do

bashbone -x

Positional arguments

▲ back to top

When bashbone is used within a script which makes use of positional arguments, hand them over to the bashbone activation script.

#! /usr/bin/env bash
source <path/to/bashbone>/activate.sh -r true -a "$@"

Developers centerpiece

▲ back to top

To create commands, print them and execute them in parallel (with optional benchmarking of runtime and memory consumption), and to further inspect and modify task concurrency, utilize

commander::makecmd
commander::printcmd
commander::runcmd
commander::runstat
commander::runalter

# truncated usage of commander::makecmd

easily compose and store a command in an array for later execution

-a <cmds>          | mandatory. array to append the command onto
-v <variable>      | optional. env variable(s) necessary for command. can be used multiple times
-s <separator>     | optional. character to separate command chunks, when file descriptor array COMMANDER is used. default: '\n'
-m                 | force multi line support in case of activated bashbone legacy mode, which otherwise strips line breaks
-o <outfile>       | optional. redirect stdout to output file
-O <outfile>       | optional. append stdout to output file
-c <cmd|{fd[0]}..> | ALWAYS LAST OPTION. command line string(s) and or here-documents or file descriptor array COMMANDER


# usage commander::runcmd

orchestrates the execution of commands stored within an array by utilizing GNU parallel

-v             | optional. toggle on verbose mode which will print the commands prior to their execution
-b             | optional. toggle on benchmarking run time and peak memory consumption
-c <env>       | optional. activate a bashbone conda environment prior to command execution
-i <instances> | optional. number of parallel instances. see also commander::runalter. default: 1
-t <instances> | optional. obsolete synonym for -i
-s <idx[:idx]> | optional. restrict the execution of commands by array index start or range. default: 1
-n <name>      | optional. prefix for logs and job scripts
-o <path>      | optional. output directory for logs, job scripts and exit codes
-r             | optional. override existing logs. default: append
-a <cmds>      | mandatory. array of commands. see also commander::makecmd

example 1:
cmds=("echo job1" "echo job2" "echo job3")
cmds+=("sleep 2; echo job4")
commander::runcmd -v -b -i 2 -a cmds

example 2:
commander::runcmd -i 2 -n current -o ~/jobs -r -a cmds

Parallel task execution is also possible by submitting commands as an array job to the workload manager Sun Grid Engine or its Son of Grid Engine fork (SGE) via the drop-in replacement functions

commander::qsubcmd
commander::qstat
commander::qalter

To monitor any bash function and redirect all stdout and stderr into a log file while defining what is going to be printed to the terminal via verbosity levels (-v [1..3]), use

progress::log -v 1 -o <logfile> -f commander::runcmd

To infer a suitable number of parallel instances for local execution of tasks given the current state of CPU and memory resources, call the bashbone configuration functions. They return settings for memory and threads per instance, the number of instances, as well as garbage collection threads and concurrency for the Java JVM, matching the current machine (unless the -d switch is used to perform a dry run).

The execution of applications known to have a large memory footprint should be parameterized according to

configure::instances_by_memory
configure::memory_by_instances
configure::jvm

Task parallelization settings for lightweight applications can be inferred via

configure::instances_by_threads
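
The gist of the inference is integer arithmetic over the machine's resources. A minimal plain-bash sketch with hypothetical numbers (not the actual configure:: implementation):

```shell
# hypothetical values: 64 GB of free memory, 16 CPU threads
memory_mb=64000
threads=16
per_instance_mb=16000                          # targeted memory per instance

instances=$(( memory_mb / per_instance_mb ))   # memory-bound upper limit
(( instances > threads )) && instances=$threads
(( instances < 1 )) && instances=1
threads_per_instance=$(( threads / instances ))

echo "$instances instances with $threads_per_instance threads each"
```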

Example

▲ back to top

For a later orchestrated, parallel execution of commands, they need to be stored beforehand, which typically calls for escaping special characters so they are not interpreted by the shell. commander::makecmd offers two methods to solve this: interpreted (EOF) or non-interpreted ('EOF') Here-documents and file descriptor arrays.

# run directly
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
  echo "$i" | awk '{print "hello "$1}'
done
# store
declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
  cmds+=("echo \"$i\" | awk '{print \"hello \"\$1}'")
done

Solution 1 relies on variable reference(s) via -v <var> and a non-interpreted Here-document. The content of each Here-document will be stored as a single command string in the de-referenced array (-a cmds).

declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
  commander::makecmd -a cmds -v i -c <<-'EOF'
    echo "$i" | awk '{print "hello "$1}'
  EOF
done

Solution 2 showcases the built-in file descriptor array COMMANDER, from which the content of interpreted or non-interpreted Here-documents will be read and concatenated using the pipe symbol as separator via -s '|'. Each concatenation will be stored as a single command string in the de-referenced array (-a cmds). Optionally, an output file can be defined (-o <path/to/file>).

declare -a cmds
for i in sun mercury venus earth mars jupiter saturn uranus neptune pluto; do
	commander::makecmd -a cmds -s '|' -c {COMMANDER[0]}<<-EOF {COMMANDER[1]}<<-'EOF'
		echo $i
	EOF
		awk '{
			print "hello "$1
		}'
	EOF
done

The stored commands can now be printed and executed in parallel instances. Optionally, stdout and stderr of each instance as well as its exit code can be redirected into log files via -o <logdir> -n <prefix>.

commander::printcmd -a cmds
instances=4
commander::runcmd -v -b -i $instances -a cmds

Cleanup

When commands are crafted for later execution, bashbone allows multiple cleanup strategies to be applied on exit (success or failure): a reversely executed, temporary script, which is accessible through the $BASHBONE_CLEANUP variable; an overloaded mktemp function, which automatically adds removal commands to the cleanup script; and an _on_exit function, called after all, which receives the job script's exit code as its first positional argument.

source <path/to/bashbone>/activate.sh

commander::makecmd -a cmds -c <<-'EOF'
  function _on_exit(){
    rm -r /tmp/tmptest3
    echo "exit code is $1"
  }
  mkdir -p /tmp/tmptest1
  echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
  mktemp -p /tmp/tmptest2
  mkdir -p /tmp/tmptest3
  ls /tmp/tmptest*
EOF

commander::runcmd -a cmds -i 1
ls /tmp/tmptest* # should fail

Helper functions

▲ back to top

OOP bash

▲ back to top

A small excerpt of possible member(-like) functions, which help to avoid complex parameter expansions. See the helper underscore functions listed via bashbone -d for a full list.

declare -a arr
helper::addmemberfunctions -v arr

arr.push "hello" "world" "and" "moon"
arr.get 1 # world
arr.get -1 # moon
arr.get 0 1 # hello world
arr.get 1 -1 # world and
arr.print # hello world and moon
arr.shift # world and moon
arr.pop # world and
arr.print # world and
arr.substring 2 4 # rld d
arr.sort # d rld
arr.uc # D RLD
arr.print # D RLD
arr.length # 2
# see more available arr.* member functions via auto-completion
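
For comparison, the plain parameter expansions some of these member functions wrap — a sketch of equivalent native bash, not bashbone's actual implementation:

```shell
declare -a arr=("hello" "world" "and" "moon")

echo "${arr[1]}"        # world              (cf. arr.get 1)
echo "${arr[@]: -1}"    # moon               (cf. arr.get -1)
echo "${arr[@]:0:2}"    # hello world        (cf. arr.get 0 1)
echo "${#arr[@]}"       # 4                  (cf. arr.length)
arr=("${arr[@]:1}")     # drop first element (cf. arr.shift)
echo "${arr[*]^^}"      # WORLD AND MOON     (cf. arr.uc)
```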

Multithreaded implementations

▲ back to top

In order to speed up gzip compression (including indexing for random access via byte offset or line number) or sorting a (VCF) file from stdin, utilize

helper::pgzip
helper::sort
helper::vcfsort

When a gzip file was not created via helper::pgzip, or if a plain text file should be indexed, utilize helper::index. The index can be used for random access in order to apply a command or pipeline in parallel on stdin of equally sized data chunks. Each process, i.e. chunk, can be addressed by its JOB_ID.

helper::index
helper::capply

# example
helper::capply -i 10 -f <file[.gz]> -c 'cat > $JOB_ID.out'
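
The chunking idea itself can be mimicked with plain GNU coreutils: split a file into line-preserving chunks, then process each chunk independently. This is only a sketch, not bashbone's index-based random access:

```shell
seq 1 100 > /tmp/data.txt
split -d -n l/4 /tmp/data.txt /tmp/chunk.   # 4 line-preserving chunks: chunk.00 .. chunk.03
for f in /tmp/chunk.*; do
  wc -l < "$f"                              # process each chunk independently
done
```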

Another method, which does not require any index, allows to apply a command or pipeline in parallel on stdin line records or seekable temporary files. The total number of processes executed is the number of lines divided by the (adjustable) number of records, or it can be set to a fixed number of instances. The latter operates on line records in round-robin mode. Each process can be addressed by its JOB_ID. Temporary file names can be addressed via the FILE variable.

helper::lapply

# example
cat <file> | helper::lapply -f -t 10 -c 'cat $FILE > $JOB_ID.out'

Determining a file's MIME type for ultra-fast printing of its content, thanks to multi-threaded decompression of gzip and bzip2 files, can be achieved using

helper::makecatcmd
helper::cat

Misc

▲ back to top

Left outer join multiple files by unique ids in the first column, given a separator (-s '\t') and NA character (-e '.'), via

helper::multijoin
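
For just two sorted files, a comparable left outer join can be expressed with GNU join — a coreutils sketch, not the multi-file helper itself:

```shell
printf 'id1\ta\nid2\tb\n' > /tmp/left.tsv
printf 'id1\tx\nid3\ty\n' > /tmp/right.tsv

# -a 1: keep unpairable lines of file 1, -e '.': NA character, -o auto: pad columns
join -t $'\t' -a 1 -e '.' -o auto /tmp/left.tsv /tmp/right.tsv
# prints (tab-separated): id1 a x / id2 b .
```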

To return the basename, stripping off a single file extension, or two in case of compressed data, and optionally report the suffix as well, use

helper::basename
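
The underlying idea in plain bash, stripping a second extension only for known compression suffixes — a hypothetical sketch, not helper::basename's interface:

```shell
f=/path/to/sample.fastq.gz
b=$(basename "$f")
case "$b" in
  *.gz|*.bz2|*.xz) b=${b%.*} ;;   # strip the compression suffix first
esac
suffix=${b##*.}                   # fastq
b=${b%.*}                         # sample
echo "$b $suffix"
```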

Extending the library

▲ back to top

To get a full bash error stack trace for your interactively used bash functions, to add proper handling of sub-process termination and to run cleanup procedures upon error, at script end or when reaching the prompt command (interactive shell), simply extend bashbone by your own function library.

I.e. store a set of bash functions, defined with the function keyword, in files with an .sh suffix inside a lib directory, to be sourced along with the bashbone library afterwards.

cd <path/to/custom>
mkdir lib
cat <<-'EOF' > lib/fun.sh
  function world(){
    echo "hello world"
    mars
  }

  function mars(){
    echo "hello mars"
    cmd_that_fails
  }
EOF

Now hand over the parent directory to activate.sh.

source <path/to/bashbone>/activate.sh -s "$PWD"
world

# will print
hello world
hello mars
cmd_that_fails: command not found
:ERROR: in <path/to/custom>/lib/fun.sh (function: mars) @ line 8: cmd_that_fails
:ERROR: in <path/to/custom>/lib/fun.sh (function: world) @ line 3: mars

Local cleanup

▲ back to top

Each bashbone function, including functions from custom extensions, gets a local, temporary cleanup script assigned. This bash script will be reversely executed upon return (success or failure) of the function. The path to the script is kept in the $BASHBONE_CLEANUP variable. When mktemp is used within the function, a removal command for this path will be automatically added to the cleanup script. Of course, custom commands can be added, too.

cd <path/to/custom>
mkdir lib
cat <<-'EOF' > lib/fun.sh
  function tmptest(){
    mkdir -p /tmp/tmptest1
    echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
    mktemp -p /tmp/tmptest2
    ls /tmp/tmptest*
  }
EOF

source <path/to/bashbone>/activate.sh -s "$PWD"
tmptest
ls /tmp/tmptest* # should fail

As an alternative, functions can be wrapped manually. To do so, define an alias before the function's actual definition.

alias myfun="_bashbone_wrapper myfun"

function myfun(){
  # do stuff
}
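
The pattern works because aliases are expanded for the first word of a command, whereas a definition using the function keyword is not affected by alias expansion — which is why bashbone requires that keyword. A self-contained sketch with a hypothetical _wrapper, not bashbone's _bashbone_wrapper:

```shell
#!/usr/bin/env bash
shopt -s expand_aliases            # aliases are off by default in scripts

function _wrapper(){
  local fun=$1; shift
  echo "enter $fun"                # e.g. set up traps, a cleanup script, ...
  "$fun" "$@"
  echo "leave $fun"
}

alias greet="_wrapper greet"       # must come before the definition below
function greet(){                  # 'function' keyword prevents alias expansion here
  echo "hello $1"
}

greet moon                         # expands to: _wrapper greet moon
```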

Global cleanup

▲ back to top

Upon script exit, due to failure or upon success, a custom cleanup function can be handed over to bashbone and executed at the very end. The script's exit code will be supplied as positional argument. Analogous to Local cleanup, temporary files created by mktemp will be automatically removed and a cleanup script, kept in the $BASHBONE_CLEANUP variable, is executed.

#! /usr/bin/env bash
source <path/to/bashbone>/activate.sh -x cleanup -r true -a "$@"

function cleanup(){
  rm -r /tmp/tmptest3
  ls /tmp/tmptest* # should fail
  echo "exit code is $1"
}

mkdir -p /tmp/tmptest1
echo "rm -r /tmp/tmptest1" >> "$BASHBONE_CLEANUP"
mktemp -p /tmp/tmptest2
mkdir -p /tmp/tmptest3
ls /tmp/tmptest*

Installation

▲ back to top

Full installation of all third-party tools used in bashbone's biobash library

▲ back to top

When using the -g switch (recommended), the setup routine will create conda environments or set up software from source according to enclosed configuration files and URLs. Without the -g switch, software is installed in its latest available version, which may lead to unexpected behavior and errors. During setup, the current configuration files will be written to <path/of/installation/config>.

# shows usage
scripts/setup.sh -h

scripts/setup.sh -g -i all -d <path/to/installation>

Once bashbone is successfully installed, it is recommended to activate it along with its conda environment to ensure all biobash functions operate as expected. The conda environment can be stopped and (re-)started via the bashbone -c switch.

source <path/of/installation>/latest/bashbone/activate.sh -c true

Upgrade to a newer release

▲ back to top

Use the -g switch in order to also upgrade biobash conda environments that fail the comparison with the supplied configuration files. Attention: this switch will downgrade tools if the initial installation was done for cutting-edge tools, i.e. without -g.

scripts/setup.sh -g -i upgrade -d <path/of/installation>

Update tools

▲ back to top

Trimmomatic, segemehl, STAR-Fusion, GEM, mdless and gztool will be installed next to the biobash conda environments. Their latest versions and download URLs will be automatically inferred.

scripts/setup.sh -i trimmomatic,segemehl,starfusion,gem,mdless,gztool -d <path/of/installation>

Biobash library usage (requires installation)

▲ back to top

To get all third-party tools set up and subsequently all biobash bashbone functions to work properly, see the Installation section.

Load the library and list available functions.

source <path/of/installation>/latest/bashbone/activate.sh

# list biobash user functions
bashbone -f

# will print
alignment::add4stats       alignment::addreadgroup     alignment::bamqc             alignment::bqsr
alignment::bulkindex       alignment::bwa              alignment::clip              alignment::clipmateoverlaps
alignment::downsample      alignment::inferstrandness  alignment::leftalign         alignment::mkreplicates
alignment::postprocess     alignment::qcstats          alignment::reorder           alignment::rmduplicates
alignment::segemehl        alignment::slice            alignment::soft2hardclip     alignment::splitncigar
alignment::star            alignment::strandsplit      alignment::tn5clip           alignment::tobed
bisulfite::bwa             bisulfite::haarz            bisulfite::join              bisulfite::mecall
bisulfite::methyldackel    bisulfite::metilene         bisulfite::mspicut           bisulfite::rmduplicates
bisulfite::segemehl        cluster::coexpression       cluster::coexpression_deseq  cluster::wgcna
cluster::wgcna_deseq       enrichment::go              expression::deseq            expression::diego
expression::join           expression::join_deseq      fusions::arriba              fusions::join
fusions::starfusion        genome::indexgtf            genome::mkdict               genome::mkgodb
genome::view               peaks::gem                  peaks::gem_idr               peaks::genrich
peaks::genrich_idr         peaks::gopeaks              peaks::gopeaks_idr           peaks::m6aviewer
peaks::m6aviewer_idr       peaks::macs                 peaks::macs_idr              peaks::matk
peaks::matk_idr            peaks::peakachu             peaks::peakachu_idr          peaks::seacr
peaks::seacr_idr           preprocess::add4stats       preprocess::cutadapt         preprocess::dedup
preprocess::fastqc         preprocess::qcstats         preprocess::rcorrector       preprocess::rmpolynt
preprocess::sortmerna      preprocess::trimmomatic     quantify::bamcoverage        quantify::featurecounts
quantify::normalize        quantify::profiles          quantify::salmon             quantify::tpm
survival::gettcga          survival::ssgsea            variants::bcftools           variants::freebayes
variants::haplotypecaller  variants::makepondb         variants::mutect             variants::panelofnormals
variants::platypus         variants::tree              variants::vardict            variants::vardict_threads
variants::varscan          variants::vcfnorm           visualize::venn

In order to make use of the biobash conda environments, which ensure all supplied scripts work as expected, activate bashbone with conda enabled

source <path/of/installation>/latest/bashbone/activate.sh -c true

Or activate conda at a later point in time

source <path/of/installation>/latest/bashbone/activate.sh
# enable/disable conda
bashbone -c

Or use commander functions with conda enabled (see Developers centerpiece).

commander::runcmd -c bashbone -a cmds

Enclosed scripts

▲ back to top

Bashbone is shipped with a couple of scripts to be used stand-alone (experimental when bashbone is not installed and activated) or as part of the biobash functions. They can be listed via

bashbone -s

# will print
annotate.pl           canonicals.pl    deseq2.R                 dlgenome.sh
fftool.sh             fpkm.pl          genome2transcriptome.pl  heatmap.R
id2length.pl          mergefq.sh       mergexons.pl             mergexons.sh
pca.R                 pca_deseq.R      pileup2fastq.pl          revigo.R
rrbsMspIselection.sh  shufnsplitfq.sh  sra-dump.sh              survival.R
tpm.pl                vcfix.pl         vcfixuniq.pl             vizco.R
volcano.R             wgcna.R

Retrieve SRA datasets

▲ back to top

Use the enclosed script to fetch sequencing data and meta information from NCBI SRA in a parallelized fashion, given an SRA or GEO accession number, or to convert local sra files.

sra-dump.sh -h

Retrieve genomes

▲ back to top

Use the enclosed script to fetch human hg19/hg38 or mouse mm10/mm11 genomes, gene and ontology annotations plus dbSNP and MSigDB. The Plug-n-play CTAT genome resource, made for gene fusion detection and shipped with a STAR index, can optionally be selected as well.

dlgenome.sh -h

Using annotation information, the genome can be converted into a transcriptome, transcript-genome or transcript-chromosome.

genome2transcriptome.pl -h

Merge/collate features

▲ back to top

Use the enclosed script to merge e.g. exons of multiple transcripts of a gene with optional offsets for elongation or shrinkage.

mergexons.sh -h

Generate pseudo replicates

▲ back to top

To shuffle and split NGS raw data in fastq format into two pseudo-replicates, use

shufnsplitfq.sh -h

Sample info file

▲ back to top

In order to perform desired comparative tasks, some functions require a sample design info file.

Assume this input:

Treatment    Replicate 1        Replicate 2
wild-type    path/to/wt1.fq     path/to/wt2.fq
treatment A  path/to/trA_1.fq   path/to/trA_2.fq
treatment B  path/to/trB.n1.fq  path/to/trB.n2.fq

And this desired output (N=2 vs N=2 each):

  • wt_vs_A
  • wt_vs_B
  • A_vs_B

Then the info file should consist of:

  • At least 4 tab-separated columns (<name>, <main-factor>, NA, <replicate>)
  • Optionally, additional factors
  • First column needs to consist of unique prefixes of input fastq basenames which can be expanded to full file names
wt1 wt NA N1 female
wt2 wt NA N2 male
trA_1 A NA N1 female
trA_2 A NA N2 male
trB.n1 B NA N1 female
trB.n2 B NA N2 male
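
Such an info file is plain TSV and can be written e.g. via a quoted Here-document, converting the single spaces shown above into tabs:

```shell
# write the sample design above as a tab-separated info file
cat <<'EOF' | tr ' ' '\t' > samples.info
wt1 wt NA N1 female
wt2 wt NA N2 male
trA_1 A NA N1 female
trA_2 A NA N2 male
trB.n1 B NA N1 female
trB.n2 B NA N2 male
EOF
```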

Adapter sequences

▲ back to top

Adapter sequences listed below will be tested by FastQC, extracted from the reports and stored as arrays. In case of paired-end data, unknown adapter sequences will be extracted from mate overlaps utilizing BBMap.

source <path/of/installation>/latest/bashbone/activate.sh
declare -a fastq_R1=(<path/to/file> [..<path/to/file>])
declare -a fastq_R2=(<path/to/file> [..<path/to/file>])
declare -a adapter_R1 adapter_R2
preprocess::fastqc -t <threads> -o <outdir> -1 fastq_R1 [-2 fastq_R2] -a1 adapter_R1 [-a2 adapter_R2]

Further adapter sequences can be found in the Illumina Adapter Sequences Document (<https://www.illumina.com/search.html?q=Illumina Adapter Sequences Document>) or Illumina Adapter Sequences HTML (https://support-docs.illumina.com/SHARE/adapter-sequences.htm), and in the resources of Trimmomatic (https://github.com/usadellab/Trimmomatic/tree/main/adapters) and FastQC (https://github.com/s-andrews/FastQC/blob/master/Configuration).

The following excerpt is independent of the indexing type, i.e. single, unique dual (UD) or combinatorial dual (CD).

Nextera (Transposase Sequence), TruSight, AmpliSeq, stranded total/mRNA Prep, Ribo-Zero Plus: CTGTCTCTTATACACATCT

TruSeq (Universal) Adapter with A prefix due to 3' primer A-tailing: AGATCGGAAGAGC

TruSeq full length DNA & RNA R1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA R2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

TruSeq full length DNA MethC R1: AGATCGGAAGAGCACACGTCTGAAC R2: AGATCGGAAGAGCGTCGTGTAGGGA

TruSeq Small RNA 3': TGGAATTCTCGGGTGCCAAGG

TruSeq Small RNA 5': GTTCAGAGTTCTACAGTCCGACGATC

Ovation Methyl-Seq R1: AGATCGGAAGAGC R2: AAATCAAAAAAAC

Pipeline example

▲ back to top

Tiny example pipeline to perform gene fusion detection, differential expression and co-expression analyses plus gene ontology enrichment. Check out further pre-compiled pipelines for peak calling from *IP-Seq experiments and differential expression and methylation analyses from RNA-Seq and RRBS/WGBS data (rippchen), or for multiple variant calling options from Exome-Seq/WGS/RNA-Seq data including GATK best-practices in an optimized, parallelized fashion (muvac).

source <path/of/installation>/latest/bashbone/activate.sh -c true
genome=<path/to/fasta>
genomeidx=<path/to/segemehl.idx>
gtf=<path/to/gtf>
declare -a fastqs=(<path/to/fastq/files> [..<path/to/fastq/files>]) # can be compressed
threads=16
memory=64000

declare -a adapter
preprocess::fastqc -t $threads -M $memory -o results/qualities/raw -a adapter -1 fastqs
preprocess::trimmomatic -t $threads -M $memory -o results/trimmed -1 fastqs
preprocess::cutadapt -t $threads -o results/adapterclipped -a adapter -1 fastqs

preprocess::rcorrector -t $threads -o results/corrected -1 fastqs
preprocess::sortmerna -t $threads -o results/rrnafiltered -1 fastqs

fragmentsize=200
fusions::arriba -t $threads -g $genome -a $gtf -o results/fusions -f $fragmentsize -1 fastqs
fusions::starfusion -t $threads -g $genome -a $gtf -o results/fusions -1 fastqs

declare -a mapped
alignment::segemehl -t $threads -g $genome -x $genomeidx -o results/mapped -1 fastqs -r mapped

alignment::postprocess -j uniqify -t $threads -o results/mapped -r mapped
alignment::postprocess -j sort -t $threads -o results/mapped -r mapped

declare -A strandness
alignment::inferstrandness -t $threads -g $gtf -r mapped -x strandness
quantify::featurecounts -t $threads -M $memory -g $gtf -o results/counted -r mapped -x strandness
quantify::tpm -t $threads -g $gtf -o results/counted -r mapped

declare -a comparisons=(<path/to/sample-info/files> [..<path/to/sample-info/files>])
expression::diego -t $threads -g $gtf -c comparisons -i results/counted -o results/diffexonjunctions -r mapped -x strandness
expression::deseq -t $threads -g $gtf -c comparisons -i results/counted -o results/diffgenes -r mapped
expression::join_deseq -t $threads -g $gtf -c comparisons -i results/counted -j results/diffgenes -o results/counted
cluster::coexpression -t $threads -M $memory -g $gtf -i results/counted -o results/coexpressed -r mapped

go=<path/to/gtf.go>
enrichment::go -t $threads -r mapped -c comparisons -l coexpressions -g $go -i results/deseq

Third-party software

▲ back to top

Tool Source DOI
Arriba https://github.com/suhrig/arriba/ NA
BamUtil https://genome.sph.umich.edu/wiki/BamUtil 10.1101/gr.176552.114
BBTools https://jgi.doe.gov/data-and-tools/software-tools/bbtools 10.1371/journal.pone.0185056
BWA https://github.com/lh3/bwa 10.1093/bioinformatics/btp324
BWA-mem2 https://github.com/bwa-mem2/bwa-mem2 10.1109/IPDPS.2019.00041
BWA-meth https://github.com/brentp/bwa-meth arXiv:1401.1129
BCFtools http://www.htslib.org/doc/bcftools.html 10.1093/bioinformatics/btr509
BEDTools https://bedtools.readthedocs.io 10.1093/bioinformatics/btq033
gztool https://github.com/circulosmeos/gztool NA
cgpBigWig https://github.com/cancerit/cgpBigWig NA
clusterProfiler https://guangchuangyu.github.io/software/clusterProfiler 10.1089/omi.2011.0118
Cutadapt https://cutadapt.readthedocs.io/en/stable 10.14806/ej.17.1.200
DANPOS3 https://github.com/sklasfeld/DANPOS3 10.1101/gr.142067.112
deepTools2 https://deeptools.readthedocs.io/en/latest/index.html 10.1093/nar/gkw257
DESeq2 https://bioconductor.org/packages/release/bioc/html/DESeq2.html 10.1186/s13059-014-0550-8, 10.1093/biostatistics/kxw041
DEXSeq https://bioconductor.org/packages/release/bioc/html/DEXSeq.html 10.1101/gr.133744.111
DIEGO http://www.bioinf.uni-leipzig.de/Software/DIEGO 10.1093/bioinformatics/btx690
DGCA https://github.com/andymckenzie/DGCA 10.1186/s12918-016-0349-1
dupsifter https://github.com/huishenlab/dupsifter 10.1093/bioinformatics/btad729
fastqc https://www.bioinformatics.babraham.ac.uk/projects/fastqc NA
featureCounts http://subread.sourceforge.net 10.1093/bioinformatics/btt656
fgbio http://fulcrumgenomics.github.io/fgbio/ NA
freebayes https://github.com/ekg/freebayes arXiv:1207.3907
GATK4 https://github.com/broadinstitute/gatk 10.1101/gr.107524.110, 10.1038/ng.806
GEM https://groups.csail.mit.edu/cgs/gem 10.1371/journal.pcbi.1002638
GNU Parallel https://www.gnu.org/software/parallel/ 10.5281/zenodo.1146014
GoPeaks https://github.com/maxsonBraunLab/gopeaks 10.1186/s13059-022-02707-w
GoSemSim http://bioconductor.org/packages/release/bioc/html/GOSemSim.html 10.1093/bioinformatics/btq064
GSEABase https://bioconductor.org/packages/release/bioc/html/GSEABase.html NA
HOMER http://homer.ucsd.edu/homer 10.1016/j.molcel.2010.05.004
HTSeq https://htseq.readthedocs.io 10.1093/bioinformatics/btu638
IDR https://github.com/nboley/idr 10.1214/11-AOAS466
IGV http://software.broadinstitute.org/software/igv 10.1038/nbt.1754
Intervene https://github.com/asntech/intervene 10.1186/s12859-017-1708-7
kent/UCSC utilities https://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads 10.1093/bioinformatics/btq351
khmer https://khmer.readthedocs.io 10.12688/f1000research.6924.1
KMC https://github.com/refresh-bio/KMC 10.1186/1471-2105-14-160
m6aViewer http://dna2.leeds.ac.uk/m6a/ 10.1261/rna.058206.116
Macs2 https://github.com/macs3-project/MACS 10.1186/gb-2008-9-9-r137
MethylDackel https://github.com/dpryan79/MethylDackel NA
metilene https://www.bioinf.uni-leipzig.de/Software/metilene/ 10.1101/gr.196394.115
moose2 http://grabherr.github.io/moose2/ 10.1186/s13040-017-0150-8
PEAKachu https://github.com/tbischler/PEAKachu NA
Picard http://broadinstitute.github.io/picard NA
Platypus https://rahmanteamdevelopment.github.io/Platypus 10.1038/ng.3036
pugz https://github.com/Piezoid/pugz 10.1109/IPDPSW.2019.00042
rapidgzip https://github.com/mxmlnkn/rapidgzip 10.1145/3588195.3592992
RAxML https://cme.h-its.org/exelixis/web/software/raxml/index.html 10.1093/bioinformatics/btl446
Rcorrector https://github.com/mourisl/Rcorrector 10.1186/s13742-015-0089-y
RSeQC http://rseqc.sourceforge.net 10.1093/bioinformatics/bts356
REVIGO https://code.google.com/archive/p/revigo-standalone 10.1371/journal.pone.0021800
RRVGO https://ssayols.github.io/rrvgo 10.17912/micropub.biology.000811
Salmon https://combine-lab.github.io/salmon/ 10.1038/nmeth.4197
SalmonTE https://github.com/hyunhwan-jeong/SalmonTE 10.1142/9789813235533_0016
SAMtools http://www.htslib.org/doc/samtools.html 10.1093/bioinformatics/btp352
SEACR https://github.com/FredHutch/SEACR 10.1186/s13072-019-0287-4
segemehl http://www.bioinf.uni-leipzig.de/Software/segemehl 10.1186/gb-2014-15-2-r34, 10.1371/journal.pcbi.1000502
SnpEff https://pcingola.github.io/SnpEff 10.4161/fly.19695
SortMeRNA https://bioinfo.lifl.fr/RNA/sortmerna 10.1093/bioinformatics/bts611
STAR https://github.com/alexdobin/STAR 10.1093/bioinformatics/bts635
STAR-Fusion https://github.com/STAR-Fusion/STAR-Fusion/wiki 10.1101/120295
Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic 10.1093/bioinformatics/btu170
UMI-tools https://github.com/CGATOxford/UMI-tools 10.1101/gr.209601.116
VarDict https://github.com/AstraZeneca-NGS/VarDict 10.1093/nar/gkw227
VarScan http://dkoboldt.github.io/varscan 10.1101/gr.129684.111
vcflib https://github.com/vcflib/vcflib 10.1371/journal.pcbi.1009123
Vt https://genome.sph.umich.edu/wiki/Vt 10.1093/bioinformatics/btv112
WGCNA https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA 10.1186/1471-2105-9-559

Supplementary information

▲ back to top

In some rare cases, a glibc pthreads bug (https://sourceware.org/bugzilla/show_bug.cgi?id=23275) may cause pigz failures (internal threads error) and thereby premature termination of tools that rely on it, e.g. Cutadapt. One can circumvent this by using a newer glibc version, by pre-loading an alternative pthreads library, e.g. one compiled without lock elision, via LD_PRELOAD, or by using the statically compiled pigz binary shipped with bashbone.

LD_PRELOAD=</path/to/no-elision/libpthread.so.0> <command>
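Whether a system is exposed depends on its glibc version and on whether lock elision was enabled at build time. The following is a minimal sketch for inspecting the environment; the exact output format and installed tools vary by distribution:

```shell
# print the glibc version this system runs (e.g. "glibc 2.31");
# GNU_LIBC_VERSION is a glibc-specific getconf variable
getconf GNU_LIBC_VERSION

# optionally inspect which shared libraries a dynamically linked pigz
# binary resolves to; skipped silently if pigz is not installed
if command -v pigz >/dev/null; then
    ldd "$(command -v pigz)" | grep -E 'pthread|libc'
fi
```

If the resolved glibc is known to be affected, fall back to the statically compiled pigz binary shipped with bashbone or use the LD_PRELOAD workaround shown above.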

Closing remarks

▲ back to top

Bashbone is a continuously developed library that I actively use in my daily work. As a single developer, I may take a while to fix errors and issues. Feature requests cannot be handled so far, but I am happy to receive pull requests.
