RNApipeline

A pipeline to automate RNA sequence aligning and quality control workflows via list-based batch submission and parallel processing

What is RNApipeline for?

RNApipeline is a series of scripts, or handlers, that automate and speed up RNA sequence alignment and quality control. Currently, RNApipeline is designed to work with Illumina paired-end and single-end bulk RNAseq data. The workflow processes samples in batches and in parallel, and is intended to be easy for users to configure. The handlers take as input a list of sequence files, each given as a full path. RNApipeline uses GNU Parallel to speed up analysis. The handlers are designed to run as jobs submitted to a job scheduler.

Configuration File

The included configuration file, proj.conf, holds the information needed to run each of the handlers. No other information is needed, as RNApipeline pulls everything it requires from proj.conf. Please read proj.conf for more usage information.

proj.conf is broken up into several sections. The first section, at the top of proj.conf, contains variables that are used by more than one handler. Each section below it is headed by a block of hash (#) marks and contains variables for one specific handler only. Software definitions come last.

For example, the section headed by

############################################
##########    Adapter_Trimming    ##########
############################################

contains variables for Adapter_Trimming only. These variables are completely ignored by other handlers.

Please note that some of the variables are pre-defined in proj.conf. These have been set for using the entirety of RNApipeline, and follow naming conventions used by all of the handlers. If you choose not to use some of the handlers in your analyses (see Do I have to use the entire workflow as is? below), please modify the variables as needed.
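The sectioned layout described above can be sketched with a miniature configuration file. This is only an illustration: the variable names PROJECT, OUT_DIR, and AT_SAMPLES are hypothetical and are not taken from the real proj.conf, and loading the file by sourcing it is an assumption about how a handler might read it.

```shell
#!/bin/bash
set -eu

# Hypothetical miniature config; every variable name here is invented.
conf="$(mktemp)"
cat > "${conf}" <<'EOF'
# Shared variables, used by more than one handler
PROJECT=demo
OUT_DIR=/tmp/demo_out

############################################
##########    Adapter_Trimming    ##########
############################################
# Read only by the Adapter_Trimming handler
AT_SAMPLES="${OUT_DIR}/samples.txt"
EOF

# Assumption: a handler loads its settings by sourcing the file.
. "${conf}"

echo "Project: ${PROJECT}"
echo "Adapter_Trimming input list: ${AT_SAMPLES}"
```

Because the shared variables sit at the top, a handler-specific variable such as the hypothetical AT_SAMPLES can be built from them.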

Why use list-based batch submission?

Piping a single sample through this workflow can take over 12 hours to run to completion. Most sequence handling jobs involve more than one sample, so the total run time increases drastically. List-based batch submission reduces the amount of typing required and enables parallel processing, which cuts the time spent waiting for samples to finish.

An example list is shown below:

/home/path_to_sample/sample_001_R1.fastq.gz
/home/path_to_sample/sample_001_R2.fastq.gz
/home/path_to_sample/sample_003_R1.fastq.gz
/home/path_to_sample/sample_003_R2.fastq.gz
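A list like this does not have to be typed by hand. The following is a minimal sketch of one way to generate it with find. The scratch directory, the placeholder files, and the samples.txt file name are assumptions made so the example runs anywhere; they are not part of RNApipeline.

```shell
#!/bin/bash
set -eu

# Hypothetical location: a scratch directory stands in for a real project.
sample_dir="$(mktemp -d)"
sample_list="${sample_dir}/samples.txt"

# Create empty placeholder files so the sketch is runnable as-is.
touch "${sample_dir}/sample_001_R1.fastq.gz" "${sample_dir}/sample_001_R2.fastq.gz" \
      "${sample_dir}/sample_003_R1.fastq.gz" "${sample_dir}/sample_003_R2.fastq.gz"

# Write one full path per line, sorted so R1 and R2 of a sample stay adjacent.
find "${sample_dir}" -name '*.fastq.gz' | sort > "${sample_list}"

cat "${sample_list}"
```

With real data, point sample_dir at the directory that holds your FASTQ files and drop the touch command.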

Why use parallel processing?

Parallel processing decreases total run time by running multiple jobs at once and keeping track of which are done, which are running, and which have yet to be run. Together, list-based batch submission and parallel processing make sequence handling both simpler and faster.
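The idea of working through a sample list with several jobs at once can be sketched as follows. Note that this sketch uses xargs -P as a stand-in, not GNU Parallel, so that it runs on systems where GNU Parallel is not installed; it is not how the handlers themselves are written. The placeholder job only prints each sample name, and the file names are hypothetical.

```shell
#!/bin/bash
set -eu

# Hypothetical scratch area and sample list.
work_dir="$(mktemp -d)"
sample_list="${work_dir}/samples.txt"
printf '%s\n' \
    /home/path_to_sample/sample_001_R1.fastq.gz \
    /home/path_to_sample/sample_001_R2.fastq.gz \
    /home/path_to_sample/sample_003_R1.fastq.gz \
    /home/path_to_sample/sample_003_R2.fastq.gz > "${sample_list}"

# Run up to two jobs at a time, one list entry per job.
# The job is a placeholder that strips the directory and extension.
xargs -P 2 -n 1 sh -c 'basename "$1" .fastq.gz' _ < "${sample_list}" \
    | sort > "${work_dir}/done.txt"

cat "${work_dir}/done.txt"
```

Jobs may finish in any order, which is why the results are sorted before they are written out.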

Do I have to use the entire workflow as is?

No; no two handlers are entirely dependent on one another. While all these handlers are designed to easily use the output from one to the next, these handlers are not required to achieve the end result of RNApipeline. If you prefer tools other than the ones used within this workflow, you can modify or replace any or all of the handlers offered in RNApipeline. This creates a pseudo-modularity for the entire workflow that allows for customization for each user.

Dependencies

Due to the pseudo-modularity of this workflow, dependencies are listed separately for each handler below. The pipeline as a whole depends on BASH and a compute cluster that uses Grid Engine as its scheduler.

Basic Usage

To run RNApipeline, use the following command, assuming you are in the RNApipeline directory:

./main.sh <handler> proj.conf

Where <handler> is one of the handlers listed below and proj.conf is the full file path to the configuration file. A brief usage message can be viewed by passing no arguments to RNApipeline:

./main.sh
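As a rough sketch, the commands for a full run can be previewed with a dry-run loop. The configuration path is a hypothetical placeholder, and running the handlers in the order they are listed in this README is an assumption; the loop only prints the commands and does not execute main.sh.

```shell
#!/bin/bash
set -eu

# Hypothetical path; replace with the full path to your own proj.conf.
conf="/full/path/to/proj.conf"

# Dry run: collect one command per handler instead of executing it.
plan=""
for handler in Quality_Assessment Sequence_Trimming Read_Mapping \
               SAM_Processing Quantify_Summarize; do
    plan="${plan}./main.sh ${handler} ${conf}
"
done

printf '%s' "${plan}"
```

Since no two handlers are entirely dependent on one another, any handler you do not need can be removed from the loop.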

Handlers

Quality_Assessment

The Quality_Assessment handler runs FastQC on a series of samples organized in a project directory for quality control. The Quality_Assessment handler depends on:

FastQC
GNU Parallel

Sequence_Trimming

The Sequence_Trimming handler runs Trimmomatic on a series of samples, removing adapters and trimming based on quality. This handler supports both paired-end and single-end samples. A list of all trimmed samples is written once all runs have finished. The Sequence_Trimming handler depends on:

Java
Trimmomatic

Read_Mapping

The Read_Mapping handler maps reads to a reference genome using STAR. This handler supports both paired-end and single-end samples. A list of all mapped samples is written once all runs have finished. The Read_Mapping handler depends on:

STAR

SAM_Processing

The SAM_Processing handler converts the SAM files from read mapping to the BAM format using SAMTools and Picard. During conversion, it sorts the reads and marks duplicates in the finished BAM file. The SAM_Processing handler depends on:

SAMTools
Java
Picard

Quantify_Summarize

The Quantify_Summarize handler depends on:

featureCounts
SAMTools
Bedtools
bc
GNU Parallel
GNU datamash
Rscript
dstat

Acknowledgements

RNApipeline was inspired by the sequence_handling pipeline written by the Morrell Lab.
