Pipeline for Reference based Transcriptomics.
PiReT is installed using conda, so please make sure that conda is installed and in your PATH. The installation can take up to 2 hours, depending on your internet speed.
Coming soon!
For the installation to work, conda must be installed. See the conda documentation for instructions on how to install it. Use the following commands to create the conda environment and then install the corresponding packages. Make sure that there is no existing environment named piret_env before attempting the installation; delete the environment if it is already present. If you are Python savvy, I recommend using these instructions, because you control every step of the installation and, if something fails, you won't have to start from the beginning.
git clone https://github.com/mshakya/piret.git
cd piret
conda create -n piret_env python=3.6.6 --yes
conda install -c bioconda faqcs -n piret_env --yes
conda install -c bioconda star hisat2 subread -n piret_env --yes
conda install -c bioconda subread stringtie -n piret_env --yes
conda install -c bioconda samtools bamtools bedtools -n piret_env --yes
conda install -c bioconda diamond=0.9.24 -n piret_env --yes
source activate piret_env
cd thirdparty
rm -rf eggnog-mapper
git clone https://github.com/mshakya/eggnog-mapper.git
cd eggnog-mapper
python download_eggnog_data.py -y
cd ..
cd ..
# install BiocManager
Rscript --no-init-file -e "if('BiocManager' %in% rownames(installed.packages()) == FALSE){install.packages('BiocManager',repos='https://cran.r-project.org')}";
# install optparse
Rscript --no-init-file -e "if('optparse' %in% rownames(installed.packages()) == FALSE){install.packages('optparse',repos='https://cran.r-project.org')}";
# install tidyverse
Rscript --no-init-file -e "if('tidyverse' %in% rownames(installed.packages()) == FALSE){install.packages('tidyverse',repos='https://cran.r-project.org')}";
# install R reshape2 packages
Rscript --no-init-file -e "if('reshape2' %in% rownames(installed.packages()) == FALSE){install.packages('reshape2',repos='https://cran.r-project.org')}";
# install R pheatmap packages
Rscript --no-init-file -e "if('pheatmap' %in% rownames(installed.packages()) == FALSE){install.packages('pheatmap',repos='https://cran.r-project.org')}";
# install R edgeR packages
Rscript --no-init-file -e "if('edgeR' %in% rownames(installed.packages()) == FALSE){BiocManager::install('edgeR')}";
# install R deseq2 packages
Rscript --no-init-file -e "if('DESeq2' %in% rownames(installed.packages()) == FALSE){BiocManager::install('DESeq2')}";
# install R pathview package
Rscript --no-init-file -e "if('pathview' %in% rownames(installed.packages()) == FALSE){BiocManager::install('pathview')}";
# install R gage package
Rscript --no-init-file -e "if('gage' %in% rownames(installed.packages()) == FALSE){BiocManager::install('gage')}";
# install R ballgown package
Rscript --no-init-file -e "if('ballgown' %in% rownames(installed.packages()) == FALSE){BiocManager::install('ballgown')}";
python setup.py install
$ git clone https://github.com/mshakya/piret.git
$ cd piret
$ ./installer.sh <conda_env>
For example:
$ git clone https://github.com/mshakya/piret.git
$ cd piret
$ ./installer.sh piret_env
Make sure that the environment name (e.g., piret_env) does not exist yet.
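Before running the installer, it can help to confirm that the environment name is free. The following is a minimal sketch and not part of PiReT: it assumes conda is on the PATH and relies on `conda env list --json`, which prints an `envs` list of environment paths. The helper names (`conda_env_json`, `env_names`, `env_exists`) are hypothetical, and the sample paths are made up for illustration.

```python
import json
import os
import subprocess


def conda_env_json() -> str:
    """Return the raw JSON printed by `conda env list --json` (needs conda on PATH)."""
    result = subprocess.run(
        ["conda", "env", "list", "--json"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as well
    )
    return result.stdout


def env_names(conda_json: str) -> set:
    """Extract environment names (last path component) from that JSON."""
    paths = json.loads(conda_json).get("envs", [])
    return {os.path.basename(os.path.normpath(p)) for p in paths}


def env_exists(name: str, conda_json: str) -> bool:
    """True if an environment with this name is already present."""
    return name in env_names(conda_json)


# Illustrative input only; on a real system use conda_env_json() instead.
sample = '{"envs": ["/home/user/miniconda3", "/home/user/miniconda3/envs/piret_env"]}'
print(env_exists("piret_env", sample))
print(env_exists("other_env", sample))
```

If the name is already taken, either remove that environment (for example with conda env remove -n piret_env) or pass a different name to installer.sh.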
Coming soon!
We provide a test data set to check whether the installation was successful. The fastq files can be found in tests/fastqs and the corresponding reference fasta files in tests/data. To run the tests, from within the piret directory:
For running tests on eukaryote datasets:
$ cd piret
$ source activate piret_env
$ LUIGI_CONFIG_PATH="/panfs/biopan01/scratch-311300/ecoli_usda/ecoli.cfg" bin/piret -c ecoli.cfg -d ecoli_piret -e exp_desn.txt
$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_euk.cfg" bin/piret -c tests/test_euk.cfg -d tests/test_euk -e tests/test_euk.txt
For running tests on prokarya datasets:
$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_prok.cfg" bin/piret -c tests/test_prok.cfg -d tests/test_prok -e tests/test_prok.txt
For running tests using both prokarya and eukarya datasets:
$ LUIGI_CONFIG_PATH="full_path_to/piret/tests/test_both.cfg" bin/piret -c tests/test_both.cfg -d tests/test_both -e tests/test_both.txt
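In the test commands, LUIGI_CONFIG_PATH is given as a full path, while the -c, -d, and -e arguments are relative to the piret directory. Here is a small sketch of assembling such an invocation from Python. The helper name `piret_test_invocation` is hypothetical, and the layout (tests/<name>.cfg, tests/<name>.txt, working directory tests/<name>) is assumed from the commands shown here.

```python
import os


def piret_test_invocation(repo_dir: str, name: str):
    """Return (env, argv) for a bundled test such as name='test_prok'."""
    repo = os.path.abspath(repo_dir)
    # LUIGI_CONFIG_PATH must be a full path to the config file.
    config_full_path = os.path.join(repo, "tests", name + ".cfg")
    env = dict(os.environ, LUIGI_CONFIG_PATH=config_full_path)
    # The remaining arguments are relative to the piret directory.
    argv = [
        os.path.join("bin", "piret"),
        "-c", os.path.join("tests", name + ".cfg"),
        "-d", os.path.join("tests", name),
        "-e", os.path.join("tests", name + ".txt"),
    ]
    return env, argv


env, argv = piret_test_invocation("piret", "test_prok")
print(env["LUIGI_CONFIG_PATH"])
print(" ".join(argv))
```

To actually launch the test, the result could be passed to subprocess.run(argv, env=env, cwd=repo_dir, check=True) from a shell where piret_env is activated.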
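To get KO IDs for genes, PiReT uses emapper (eggnog-mapper). The conda install of PiReT also includes emapper. However, its database needs to be downloaded separately, following the eggnog-mapper instructions. Briefly, run python download_eggnog_data.py -y from within the eggnog-mapper directory, as in the manual installation steps.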
PiReT requires the following dependencies, all of which should be installed and in the PATH.
- Python >=v3.6.3
- The pipeline is not compatible with Python 2.
- R >=v3.3.1
- Perl >=v5.26.2
- conda v4.2.13
If conda is not installed, INSTALL.sh will download and install miniconda, a "mini" version of conda that installs only a handful of packages compared to anaconda.
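A quick way to see which of these dependencies are visible on the PATH is sketched below. It is not part of PiReT. The executable names in REQUIRED_TOOLS are assumptions (for example, R may be exposed as R or Rscript on your system), and the sketch only checks presence plus the Python version, not the R, Perl, or conda versions.

```python
import shutil
import sys

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ("python", "R", "perl", "conda")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version=sys.version_info, minimum=(3, 6, 3)):
    """True if the given Python version meets the stated minimum."""
    return tuple(version[:3]) >= minimum


print("missing:", missing_tools())
print("python version ok:", python_ok())
```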
usage: piret [-h] -d WORKDIR -e EXPDSN -c CONFIG [-v]

piret

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

required arguments:
  -d WORKDIR            working directory where all output files will be
                        processed and written (default: None)
  -e EXPDSN             tab delimited experimental design file
  -c CONFIG, --config CONFIG
                        luigi config file for setting parameters that control
                        each step, see github repo for an example (default:
                        None)

Example runs:
  piret -d <workdir> -e <design file> -c <config file>
An experimental design file consists of the sample name (SampleID), the full path to the fastq files (Files), and the group each sample belongs to (Group). We recommend using a text editor such as BBEdit or TextWrangler to generate the tab delimited experimental design file. Exporting a tab delimited file directly from Excel tends to cause formatting problems. If possible, avoid special characters in sample names and group names.
For example:
samp1, samp_1: good names
samp 1, samp.1: not good names; they will likely cause errors.
A sample experimental design file can be found in the tests directory of the repository (for example, tests/test_prok.txt, which is used in the test commands).
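Generating the design file from a script avoids the Excel export problems mentioned above. The sketch below is an illustration, not PiReT code. It assumes the header row is exactly SampleID, Files, Group (taken from the column names given in this section) and that "no special characters" can be approximated by allowing only letters, digits, and underscores. The fastq paths are placeholders, and the exact syntax PiReT expects for paired-end files in the Files column is not covered here.

```python
import csv
import io
import re

# Assumption: a "good" name uses only letters, digits, and underscores.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def check_name(name: str) -> bool:
    """True for names like samp1 or samp_1, False for 'samp 1' or 'samp.1'."""
    return bool(SAFE_NAME.match(name))


def design_tsv(rows) -> str:
    """Build tab delimited text from (sample_id, fastq_path, group) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["SampleID", "Files", "Group"])
    for sample_id, files, group in rows:
        for label in (sample_id, group):
            if not check_name(label):
                raise ValueError("name may cause errors: %r" % label)
        writer.writerow([sample_id, files, group])
    return buf.getvalue()


# Placeholder paths; replace with the full paths to your own fastq files.
text = design_tsv([
    ("samp1", "/full/path/to/samp1.fastq", "control"),
    ("samp_2", "/full/path/to/samp_2.fastq", "treated"),
])
print(text, end="")
```

Writing `text` to a file (for example with open("exp_desn.txt", "w")) gives a file that can be passed to piret with -e.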
All options are set in the config file.
All the outputs will be within the working directory. The main output file is a concatenated JSON file called out.json.
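The exact layout of out.json is not described here. Assuming that "concatenated" means several JSON documents written back to back in one file, the standard json.load would stop with an error at the second document. The sketch below reads such a file one document at a time and also handles a file that holds a single document. The function name `iter_json_documents` is hypothetical and the sample text is made up.

```python
import json


def iter_json_documents(text: str):
    """Yield each JSON document found in a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up content standing in for the text of out.json.
sample_text = '{"a": 1}\n{"b": [2, 3]}\n'
for document in iter_json_documents(sample_text):
    print(document)
```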
- samp2: The name of this directory corresponds to the sample name. Within this folder there are two sub-folders:
  - mapping_results: This folder contains reads mapped using hisat2 in the following formats. If splice_sites_gff.txt is present, hisat2 aligns based on known splice sites.
    - *.sam: Outputs of hisat2.
    - *.bam: Generated from .sam.
    - mapping.log: Alignment summary file from hisat2.
    - *sTie.tab: Tab delimited file with Coverage, FPKM, and TPM for all the genes and novel transcripts. Generated using stringtie.
    - *sTie.gtf: Primary GTF formatted output of stringtie.
  - trimming_results: This folder contains results of quality trimming and filtering using FaQC.
    - *_qc_report.pdf: A QC report file with figures.
    - fastqCount.txt: A text file with a summary of read counts.
    - *trimmed.fastq: Pair of trimmed fastq files.
    - *unpaired.trimmed.fastq: fastq that did not have pairs after QC.
    - *.stats.txt: Summary file with numbers of reads before and after QC.
- ballgown: ballgown folder. The folder is to be read by the R package ballgown for finding significantly expressed genes. There is one folder per sample.
- *merged_transcript.gtf: Non-redundant list of transcripts in GTF format merged from all samples.
- featureCounts: A folder containing tables of counts from featureCounts.
  - CDS.count: Reads mapped to regions annotated as CDS.
  - CDS.count.summary: Summary of reads mapped and unmapped to CDS.
  - exon.count
  - exon.count.summary
  - prok_CDS.count: When the both option is used, prokaryote counts are in this file. Eukaryote counts are in the file named euk_CDS.count.
  - prok_CDS.count.summary: Corresponding summary file.
- edgeR: A folder containing tables and figures processed mainly using the R package edgeR to detect significantly expressed genes. Based on the options picked, the folder will have either one or two folders, prokarya and eukarya. Within these folders are the following files and figures.
  - *RPKM.csv: A table with RPKM values for all genes across all samples.
  - *CPM.csv: A table with CPM values for all features across all samples.
  - *feature_count_heatmap.pdf: Heatmap based on count data for the features listed in gff files.
  - *feature_count_CPM_histogram.pdf: A histogram of CPMs.
  - *MDS.pdf: An MDS plot based on reads mapped to samples.
  - group1__group2__gene__et.csv: Table with gene name, logFC, logCPM, PValue, and FDR comparing group1 vs. group2. This one contains all genes that have any counts.
  - group1__group2__gene__sig.csv: A subset of group1__group2__gene__et.csv with only the genes that are significant based on the specified P-value.
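The group1__group2__gene__et.csv table can also be filtered after the run with a different threshold. The sketch below is an illustration rather than PiReT's own method: it filters on the FDR column instead of the P-value that PiReT uses for the sig file. The column names logFC, logCPM, PValue, and FDR come from the description above, while the name of the gene column ("gene") and the example rows are assumptions.

```python
import csv
import io


def significant_rows(csv_text: str, cutoff: float = 0.05, column: str = "FDR"):
    """Return rows whose value in `column` is at or below `cutoff`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[column]) <= cutoff]


# Made-up rows standing in for the content of an *__et.csv file.
example = (
    "gene,logFC,logCPM,PValue,FDR\n"
    "geneA,2.5,6.1,0.0001,0.004\n"
    "geneB,-0.2,5.0,0.6,0.8\n"
)
hits = significant_rows(example)
print([row["gene"] for row in hits])
```

Passing column="PValue" applies the same cutoff to the raw P-values instead.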
Since all dependencies that are not already on your system are installed inside the PiReT folder, deleting that folder (rm -rf) is sufficient to uninstall the package. Before removing it, check whether any of your project files are within the PiReT directory.
- Migun Shakya
If you use PiReT, please cite the following papers:
- samtools: Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
- bowtie2: Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-359. [PMID: 22388286]
- bwa: Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
- DESeq2: Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. [PMID: 25516281]
- edgeR: McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), pp. 4288-4297. [PMID: 22287627]
- HTSeq: Anders, S., Pyl, P. T., & Huber, W. (2014). HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics. [PMID: 25260700]
- hisat2: Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357-360. [PMID: 25751142]
- BEDTools: Quinlan AR and Hall IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842. [PMID: 20110278]
- GAGE: Luo, Weijun, Michael S. Friedman, Kerby Shedden, Kurt D. Hankenson, and Peter J. Woolf. 2009. “GAGE: Generally Applicable Gene Set Enrichment for Pathway Analysis.” BMC Bioinformatics 10 (May): 161.
- Pathview: Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14). Oxford University Press: 1830–31.
- Ballgown: Frazee, Alyssa C., Geo Pertea, Andrew E. Jaffe, Ben Langmead, Steven L. Salzberg, and Jeffrey T. Leek. 2015. “Ballgown Bridges the Gap between Transcriptome Assembly and Expression Analysis.” Nature Biotechnology 33 (3): 243–46.
- featureCounts: Liao, Yang, Gordon K. Smyth, and Wei Shi. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30.
- StringTie: Pertea, Mihaela, Geo M. Pertea, Corina M. Antonescu, Tsung-Cheng Chang, Joshua T. Mendell, and Steven L. Salzberg. 2015. “StringTie Enables Improved Reconstruction of a Transcriptome from RNA-Seq Reads.” Nature Biotechnology 33 (3): 290–95.
Copyright (XXXX). Triad National Security, LLC. All rights reserved.
This program was produced under U.S. Government contract 89233218CNA000001 for Los Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC for the U.S. Department of Energy/National Nuclear Security Administration.
All rights in the program are reserved by Triad National Security, LLC, and the U.S. Department of Energy/National Nuclear Security Administration. The Government is granted for itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide license in this material to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.
This is open source software; you can redistribute it and/or modify it under the terms of the GPLv3 License. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL. Full text of the GPLv3 License can be found in the License file in the main development branch of the repository.