This pipeline is an R
, phyloseq
oriented pipeline to analyse raw metabardocing data using dada2
and based on the following tutorials: https://f1000research.com/articles/5-1492, https://benjjneb.github.io/dada2/tutorial.html.
Cite this package in addition to the developers of dada2
and the other packages you used in the pipeline i.e., DECIPHER
:
- Installation:
- Process your data with the
metabaRpipe
pipeline: - Exploring the outputs:
- Raw data organisation:
- Define your own presets:
- ETH FBT users:
- muse users:
- Getting some help:
- Additional functionalities:
- 1. Compute an ASV-phylogenetic tree in a
phyloseq
object: - 2. Adding/replacing taxonomical table in a
phyloseq
object: - 3. Adding/updating the metadata information in a
phyloseq
object: - 4. Post-clustering curation using
lulu:
- 5. Cluster ASV using
DECIPHER
: - 6.
picrust2
functional potential estimation: - 7. Export
qiime2
compatible files: - 8. Combine
phyloseq
objects:
- 1. Compute an ASV-phylogenetic tree in a
- To do:
-
Install conda e.g., for MAC users:
bash Miniconda3-latest-MacOSX-x86_64.sh
-
Create a dedicated conda environment:
conda create -n metabaRpipe -y
- Activate the conda environment:
conda activate metabaRpipe
- Install
R
andatropos
anddevtools
R package:
conda install -c conda-forge r-base=4.1 -y
conda install -c bioconda atropos=1.1.25 -y
conda install -c conda-forge r-devtools -y
- Confirm R was correctly installed within the conda environment:
which R
which should result in something like this, indicating you will use R installed within your conda environment.
/Users/xxx/miniconda3/envs/metabaRpipe/bin/R
- Start
R
from the terminal and install the requiredR packages
:
R
install.packages("optparse", repos = "https://cloud.r-project.org")
devtools::install_github('tidyverse/tidyverse')
devtools::install_github("benjjneb/dada2")
devtools::install_github("KlausVigo/phangorn")
install.packages("BiocManager", repos = "https://cloud.r-project.org")
BiocManager::install("ShortRead", update = FALSE)
BiocManager::install("DECIPHER", update = FALSE)
BiocManager::install("phyloseq", update = FALSE)
quit(save = "no")
- Clone the
metabaRpipe
repository
This step is required to get the Rscripts
locally.
Change the directory where you would like to clone the repository.
MY_DIR=/path_to_mydir/whereIwant/metabaRpipe/to/be/installed/
cd ${MY_DIR}
Use git clone
to clone on your computer the repository including the functions and test data.
git clone https://github.com/fconstancias/metabaRpipe.git
Everything is now ready to analyse your raw data. We can use Rscripts
from the terminal -or an HPLC cluster- enabling you to run the pipeline and generate a phyloseq
object directly from the raw sequencing data and using one single command.
- First, activate the dedicated conda environment:
conda activate metabaRpipe
- Use
Rscript
to run the pipeline and specify some parameters e.g.: databases
For instance, using the test-data
available from the repository:
Rscript \
${MY_DIR}/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript \
-i ${MY_DIR}/metabaRpipe/test-data/ \
--preset V3V4 \
--db ${MY_DIR}//metabaRpipe/databases/databases/GTDB_bac120_arc122_ssu_r202_Genus.fa.gz \
--db_species ${MY_DIR}/metabaRpipe/databases//databases/GTDB_bac120_arc122_ssu_r202_Species.fa.gz \
--metadata ${MY_DIR}/metabaRpipe/metadata.xlsx \
--save_out test_pipe_Rscript.RDS > mylogs.txt 2>&1
The > mylogs.txt 2>&1
trick will redirect what is printed on the screen to a file including potential errors and also parameters that you used.
- If you encounter a
Permission denied
error:
.../metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript: Permission denied
Mark the file as executable using chmod
.
chmod +x ${MY_DIR}/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript
The scripts can also be run within R
using the R functions stored under ${MY_DIR}metabaRpipe/Rscripts/functions.R
- examples will come later.
By default the script generates several outputs under a dada2
directory. Below the subfolders and outputs generated by the pipeline:
dada2/
├── 00_atropos_primer_removed
│ ├── 180914_M00842_0310_000000000-C3GBT
│ │ ├── R1F1-S66_primersout_R1_.fastq.gz
│ │ ├── R1F1-S66_primersout_R2_.fastq.gz
│ │ ├── R1F2-S300_primersout_R1_.fastq.gz
│ │ ├── R1F2-S300_primersout_R2_.fastq.gz
│ │ ├── R1F3-S90_primersout_R1_.fastq.gz
│ │ └── R1F3-S90_primersout_R2_.fastq.gz
│ └── 190719_M00842_0364_000000000-CKGHM
│ ├── Y2A15-2M-06-S78_primersout_R1_.fastq.gz
│ ├── Y2A15-2M-06-S78_primersout_R2_.fastq.gz
│ ├── Y2A15-2M-12-S77_primersout_R1_.fastq.gz
│ ├── Y2A15-2M-12-S77_primersout_R2_.fastq.gz
│ ├── Y3-R1F4-S136_primersout_R1_.fastq.gz
│ └── Y3-R1F4-S136_primersout_R2_.fastq.gz
├── 01_dada2_quality_profiles
│ ├── 180914_M00842_0310_000000000-C3GBT
│ │ ├── 180914_M00842_0310_000000000-C3GBT_forward.pdf
│ │ └── 180914_M00842_0310_000000000-C3GBT_reverse.pdf
│ └── 190719_M00842_0364_000000000-CKGHM
│ ├── 190719_M00842_0364_000000000-CKGHM_forward.pdf
│ └── 190719_M00842_0364_000000000-CKGHM_reverse.pdf
├── 02_dada2_filtered_denoised_merged
│ ├── 180914_M00842_0310_000000000-C3GBT
│ │ ├── 180914_M00842_0310_000000000-C3GBT.RData
│ │ ├── 180914_M00842_0310_000000000-C3GBT_seqtab.rds
│ │ ├── 180914_M00842_0310_000000000-C3GBT_track_analysis.tsv
│ │ ├── errors_180914_M00842_0310_000000000-C3GBT_fwd.pdf
│ │ ├── errors_180914_M00842_0310_000000000-C3GBT_rev.pdf
│ │ └── seq_distrib_180914_M00842_0310_000000000-C3GBT.pdf
│ └── 190719_M00842_0364_000000000-CKGHM
│ ├── 190719_M00842_0364_000000000-CKGHM.RData
│ ├── 190719_M00842_0364_000000000-CKGHM_seqtab.rds
│ ├── 190719_M00842_0364_000000000-CKGHM_track_analysis.tsv
│ ├── errors_190719_M00842_0364_000000000-CKGHM_fwd.pdf
│ ├── errors_190719_M00842_0364_000000000-CKGHM_rev.pdf
│ └── seq_distrib_190719_M00842_0364_000000000-CKGHM.pdf
├── 03_dada2_merged_runs_chimera_removed
│ ├── no-chim-seqtab.fasta
│ ├── no-chim-seqtab.rds
│ ├── physeq.rds
│ ├── seqtab_distrib.pdf
│ └── track_analysis.tsv
├── 04_dada2_taxonomy
│ ├── silva_nr99_v138_train_set_assignation.rds
│ └── silva_nr99_v138_train_set_table.tsv
└── phyloseq.RDS
There are several outputs generated during the steps performed:
- Reads after PCR primer removal
00_atropos_primer_removed
- removed by default. - Fwd and Rev reads quality profiles:
01_dada2_quality_profiles
- Reads quality filtering, error learning, ASV inference and Fwd and Rev reads merging:
02_dada2_filtered_denoised_merged
- Merging data from different sequencing runs and the removal of chimeras and ASV length filtering:
03_dada2_merged_runs_chimera_removed
- Taxonomy assignments:
04_dada2_taxonomy
- A
phyloseq
object combining all the data.
library(phyloseq)
readRDS("dada2/phyloseq.RDS")
phyloseq-class experiment-level object
otu_table() OTU Table: [ 322 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 18 sample variables ]
tax_table() Taxonomy Table: [ 322 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 322 reference sequences ]
This object contains the Amplicon Sequence Variants (ASV) sequences refseq()
, the ASV/sample count table otu_table()
, the taxonomic path of the ASV tax_table()
and the metadata sample_data()
. This enable an easy handling of all those facet of the metabarcoding dataset.
Let's have a look. We first load and store the object in R
.
library(phyloseq)
readRDS("dada2/phyloseq.RDS") -> ps
refseq()
ps %>% refseq()
DNAStringSet object of length 322:
width seq names
[1] 429 TGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCA...GACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG ASV001
[2] 406 TGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGA...GACACTGAGAGACGAAAGCTAGGGGAGCGAATGGG ASV002
[3] 429 TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCA...GACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG ASV003
[4] 429 TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCA...GACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG ASV004
[5] 429 TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCA...GACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG ASV005
... ... ...
[318] 409 TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCA...GACGCTGAGAAGCGAAAGCATGGGGAGCGAACAGG ASV318
[319] 405 TTGGGAATCTTGCACAATGGGGGAAACCCTGATGCA...GACGCTGAGGCGCGAAAGCGTGGGTAGCAAACAGG ASV319
[320] 428 TAGGGAATCTTGCGCAATGGGCGAAAGCCTGACGCA...GACGCTGAGACGCGAAAGCGTGGGGAGCGAACAGG ASV320
[321] 429 TAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGA...GACGCTGAGGTGCGAAAGCGTGGGGATCAAACAGG ASV321
[322] 423 TAGGGAATATTGGGCAATGGAGGAAACTCTGACCCA...GACGCTGAGGCACGAAAGTGTGGGGATCAAACAGG ASV322
tax_table()
ps %>% tax_table()
Taxonomy Table: [ 322 taxa by 7 taxonomic ranks ]:
Kingdom Phylum Class Order Family Genus Species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
ASV001 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae Ralstonia ""
ASV002 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Rhizobiaceae Aureimon… ""
ASV003 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae Massilia ""
ASV004 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae Paraburk… "unknow…
ASV005 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae Massilia "sp0128…
ASV006 Bacteria Proteobacteria Gammaproteobacteria Burkholderiales Burkholderiaceae Massilia ""
ASV007 Bacteria Actinobacteriota Actinomycetia Mycobacteriales Frankiaceae Frankia ""
ASV008 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomo… "unknow…
ASV009 Bacteria Bacteroidota Bacteroidia Cytophagales Hymenobacteraceae Hymenoba… ""
ASV010 Bacteria Bacteroidota Bacteroidia Flavobacteriales Blattabacteriaceae Walczuch… ""
# … with 312 more taxa
otu_table()
ps %>% otu_table()
OTU Table: [ 322 taxa and 6 samples ]:
Taxa are rows
`R1F1-S66` `R1F2-S300` `R1F3-S90` `Y2A15-2M-06-S78` `Y2A15-2M-12-S77` `Y3-R1F4-S136`
ASV001 2185 2214 54 0 0 0
ASV002 10 42 21 2765 65 0
ASV003 0 0 1253 0 213 0
ASV004 728 732 0 0 0 0
ASV005 0 0 172 0 1235 0
ASV006 0 0 1144 0 221 0
ASV007 0 25 18 579 16 0
ASV008 0 0 302 0 198 0
ASV009 0 0 358 0 94 0
ASV010 0 0 0 420 0 0
# … with 312 more taxa (rows)
sample_data()
ps %>% sample_data()
Sample Data: [ 6 samples by 18 sample variables ]:
input filtered denoisedF denoisedR merged tabled filtered_pc denoisedF_pc denoisedR_pc merged_pc
R1F1-S66 5614 5370 4797 4793 4099 4099 0.957 0.893 0.893 0.854
R1F2-S300 4678 4475 4233 4140 3671 3671 0.957 0.946 0.925 0.867
R1F3-S90 8123 7754 7331 7420 6624 6624 0.955 0.945 0.957 0.904
Y2A15-2M-… 7002 6593 6383 6440 6088 6088 0.942 0.968 0.977 0.954
Y2A15-2M-… 6071 5853 5730 5753 5391 5391 0.964 0.979 0.983 0.941
Y3-R1F4-S… 26 20 13 9 8 8 0.769 0.65 0.45 0.615
# … with 8 more variables: filtered_merged_pc <dbl>, input_merged_pc <dbl>, tabled_joined <dbl>,
# chimera_out <dbl>, length_filtered <dbl>, tabled_pc <dbl>, chimera_out_pc <dbl>,
# length_filtered_pc <dbl>
- A phylogenetic tree
phy_tree()
of the ASV sequences can also be stored in thephyloseq
object:
readRDS("dada2/physeq_phylo/phyloseq_phylo.RDS") %>% phy_tree()
Phylogenetic tree with 322 tips and 321 internal nodes.
Tip labels:
ASV001, ASV002, ASV003, ASV004, ASV005, ASV006, ...
Rooted; includes branch lengths.
More informations regarding phyloseq
object can be found here.
If --save_out test_pipe_Rscript.RDS
is specified, all the outputs are also saved within this R
object.
readRDS("test_pipe_Rscript.RDS") -> out
ls(out)
[1] "filtering_denoising" "merging" "physeq"
[4] "qplot" "taxo"
For more details, check the dada2 original tutorial.
Organisation of the raw sequencing data is crucial:
Fwd and Rev reads (R1 and R2, respectively) are placed in run specific directory - error learning and ASV inference has to be perform on a run basis. If you are only analysing one sequencing run, simply add only one subdirectory.
test-data/
├── 180914_M00842_0310_000000000-C3GBT
│ ├── R1F1-S66_L001_R1_001.fastq.gz
│ ├── R1F1-S66_L001_R2_001.fastq.gz
│ ├── R1F2-S300_L001_R1_001.fastq.gz
│ ├── R1F2-S300_L001_R2_001.fastq.gz
│ ├── R1F3-S90_L001_R1_001.fastq.gz
│ └── R1F3-S90_L001_R2_001.fastq.gz
├── 190719_M00842_0364_000000000-CKGHM
│ ├── Y2A15-2M-06-S78_L001_R1_001.fastq.gz
│ ├── Y2A15-2M-06-S78_L001_R2_001.fastq.gz
│ ├── Y2A15-2M-12-S77_L001_R1_001.fastq.gz
│ ├── Y2A15-2M-12-S77_L001_R2_001.fastq.gz
│ ├── Y3-R1F4-S136_L001_R1_001.fastq.gz
│ └── Y3-R1F4-S136_L001_R2_001.fastq.gz
└── metadata.xlsx
By default, sample names are retrieved from files names after the first _
, Fwd and Rev fastq files should be recognised by *_R1_*
and *_R2_*
, respectively and a sample_name
column in the metadata.xlsx
file links the sample names retrieved from the files together with the metadata.
The --preset
option is an important parameter here since it will be used to remove primers used for PCR amplification - as recommended by dada2 authors - and also adjust the length filtering and trimming parameters. For instance, V3V4
is related to the following primers, dada2 options.
Depending on the library preparation protocol you could also skip the PCR primer removal step using --rm_primers FALSE
open ${MY_DIR}/metabaRpipe/Rscripts/functions.R
...
if(V == "V3V4"){
PRIMER_F = "CCTAYGGGRBGCASCAG"
PRIMER_R = "GGACTACNNGGGTATCTAAT"
trim_length = c(240,600)
trunclen = c(260,250)
maxee = c(4,5)
minLen = 160
minover = 10
}
...
The sequences CCTAYGGGRBGCASCAG
and GGACTACNNGGGTATCTAAT
will be searched by atropos
as Fwd and Rev primers, respectively and removed. Only reads containing the expected primers will be kept.
Fwd and Rev reads will by truncated after 260
and 250
nucleotide positions, reads shorter then 160
nucleotides will be removed as well as the Fwd with a maximum expected error more then 4
and Rev of 5
. 10
nucleotides will be used to merged denoised Fwd and Rev reads and only ASV >240 length <600
will be kept.
For more details, check the dada2 original tutorial.
You can modify the ${MY_DIR}/metabaRpipe/Rscripts/functions.R
script by adding any preset
of your choice adding another if statement and using the same variable names. Then --preset mypreset
can be used to call the parameters you defined in ${MY_DIR}/metabaRpipe/Rscripts/functions.R
when running the script Rscript ${MY_DIR}/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript
. This allows automatisation of the process and the possibility to run the pipeline from your HPC cluster.
You can use the following commands to process the data targeting 16S V4 region using the PCR primers and parameters we use in the FBT group from the C18 bioinformatic workstation. Everything is installed and configured, you can follow the steps below to analyse your data.
-
Get familiar with the pipeline and important information
p/Documentation/ILLUMINA_Sequencing/16S-bioinformatic-pipeline/FBT_bioinfo_pipeline.pdf
. -
Book the C18 bioinformatic workstation using the calendar.
-
The information to access the C18 bioinformatic workstation can be found here:
p/Documentation/ILLUMINA_Sequencing/16S-bioinformatic-pipeline/FBT_bioinfo_pipeline.pdf
onslide 16
. -
Create a directory where you are going to place the raw sequencing data and generate the outputs under
/Users/fbt-group/Documents/WORKSHOP
. You could do that using Mac Finder or from the terminal.. -
Organise the raw sequencing files in Miseq-run-specific sub-directories as explained in the Raw data structure section above. It should look like
/Users/fbt-group/Documents/WORKSHOP/My_dir/My_analysis/raw/my_run_1/
and if you are analysing more than 1 run you would other sub-directories, e.g.,/Users/fbt-group/Documents/WORKSHOP/My_dir/My_analysis/raw/my_run_2/
-
Using the terminal, navigate to the directory you have created
cd
i.e., change directory command.
cd /Users/fbt-group/Documents/WORKSHOP/My_dir/My_analysis/
- Activate the conda environment.
conda activate metabaRpipe
- Run the pipeline with the default V4 FBT parameters - press enter to start. If you generated the library using the 2 PCR approach used in the lab use
--preset V4-2PCR
, check the next paragraph if you are analysing dataset generated using the 1 PCR approach:
Rscript /Users/fbt-group/github/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript \
-i raw/ -T 4 \
--db /Users/fbt-group/github/metabaRpipe/databases/silva_nr99_v138.1_train_set.fa.gz \
--db_species /Users/fbt-group/github/metabaRpipe/databases/silva_species_assignment_v138.1.fa.gz \
--metadata mapping_file.xlsx --preset V4-2PCR --atropos_binary cutadapt > run_pipe_logs.txt 2>&1
- If you are analysing dataset generated using the 1 PCR approach pleaase use the
--preset V4-1PCR
Rscript /Users/fbt-group/github/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript \
-i raw/ -T 4 --preset V4-1PCR \
--db /Users/fbt-group/github/metabaRpipe/databases/silva_nr99_v138.1_train_set.fa.gz \
--db_species /Users/fbt-group/github/metabaRpipe/databases/silva_species_assignment_v138.1.fa.gz \
--metadata mapping_file.xlsx > run_pipe_logs.txt 2>&1
- Two other presets are avialable to potentially resolve this issue
--preset V4-1PCR-up
or--preset V4-2PCR-up
if you generated your libraries using the 1-PCR or 2-PCR approaches respectively.
- Add a phylogenetic tree of the ASV directly to the R phyloseq object:
Rscript /Users/fbt-group/github/metabaRpipe/Rscripts/run_add_phylogeny_to_phyloseq.Rscript \
-p dada2/phyloseq.RDS \
-o dada2/phyloseq_phylo > add_phylo_logs.txt 2>&1
More details here.
- Export qiime2 compatible files:
Rscript /Users/fbt-group/github/metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript \
-i dada2/phyloseq_phylo/phyloseq_phylo.RDS \
-o dada2/qiime2
More details here.
You now have everything ready for analysis using your preferred platform !
Thanks to the flexibility of conda
, metabaRpipe can be installed and run from any cluster computing environment. First, you would have to follow the installation instructions.
Below and example of the script you could use to run your data under the muse SLURM cluster. You would need to adapt the job scheduler related informations based on your credentials as well as the job scheduler. More example regarding the SLURM parameters can be found here.
#!/bin/sh
#SBATCH --job-name=**metabaRpipe**
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH -N 1 # Nombre de nœuds
#SBATCH -n 6 # Nombre de tâches au total
#SBATCH --mem 10 # Mémoire réservée par nœud
#SBATCH --mail-user **[email protected]**
#SBATCH --ntasks-per-node=6
#SBATCH --ntasks-per-core=1
#SBATCH --partition=**my_queue**
module load python/Anaconda/3-5.1.0
eval "$(conda shell.bash hook)" #https://github.com/conda/conda/issues/7980
conda activate metabaRpipe
MY_DIR=/home/user/scratch/my_directory/
export OMP_NUM_THREADS=10
# JOB STARTS
Rscript \
${MY_DIR}/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript \
-i ${MY_DIR}/metabaRpipe/test-data/ \
--preset V3V4 \
--db ${MY_DIR}//metabaRpipe/databases/databases/GTDB_bac120_arc122_ssu_r202_Genus.fa.gz \
--db_species ${MY_DIR}/metabaRpipe/databases//databases/GTDB_bac120_arc122_ssu_r202_Species.fa.gz \
--metadata ${MY_DIR}/metabaRpipe/metadata.xlsx \
--save_out test_pipe_Rscript.RDS > mylogs.txt 2>&1
# JOB END
exit 0
In order to run that script you would create a bash file of those command and submit the job using qsub.
vim metabaRpipe.sbatch
# copy paste the commands above - after adjusting the paths and parameters [especially the **bold** parameters].
sbatch metabaRpipe.sbatch
Then you can follow squeue -j jobid
or squeue -u userid
- Options from a specific
Rscript
can be access using--help
argument.
${MY_DIR}/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript --help
Usage: /Users/test/Documents/GitHub/metabaRpipe/Rscripts/dada2_metabarcoding_pipeline.Rscript [options]
Options:
-i CHARACTER, --input_directory=CHARACTER
Path of the input directory containing raw _R1_ and _R2_ raw reads in their respective run sub-directories
e.g., -i raw [contains raw/run1 and raw/run2]
N.B.: sample name is extracted from .fastq.gz samples before the first '_' e.g., XXX-XX
sample names must be unambigious and unique e.g., sample-1 = sample-11, sample-01 != sample-10
--raw_file_pattern=CHARACTER
Patterns for R1 and R2 for raw files
-a CHARACTER, --atropos_binary=CHARACTER
Path of atropos program [used for primer removal]
--cut_dir=CHARACTER
Directory for atropos PCR primer removed reads
--cut_pattern=CHARACTER
Patterns for R1 and R2 of atropos files
--rm_primers=CHARACTER
Perform atropos PCR primer check & removal ?
--filt_dir=CHARACTER
Directory for filtered reads
--merged_run_dir=CHARACTER
Directory for run merging, chimera removal
--taxa_dir=CHARACTER
Directory for taxonomy step
--sep=CHARACTER
regex pattern to identify sample names [default: after first _]
--preset=CHARACTER
Will use default primers and parameters as defined in the run_dada2_pipeline() [primers, trunc, maxee, overlap, expected length, ...]
--PRIMER_F=CHARACTER
Sequence of the gene specific Fwd primer to be removed with atropos [if using -V V4 or V3V4, this parameter is already set]
--PRIMER_R=CHARACTER
Sequence of the gene specific Rev primer to be removed with atropos [if using -V V4 or V3V4, this parameter is already set]
--tax_threshold=CHARACTER
Threshold for taxonomic assignments [the minimum bootstrap confidence for assigning a taxonomic level.
--nbases=CHARACTER
Number of bases for error learning step
--pool=CHARACTER
Pooling strategy
--minover=CHARACTER
Minimum overlap for merginf R1 and R2 reads [if using -V V4 or V3V4, this parameter is already set]
--trunclen=CHARACTER
Nucleotide position to truncate the Fwd and Rev reads at [if using -V V4 or V3V4, this parameter is already set]
--trim_length=CHARACTER
ASV of length outside the range will be discarded [i.e., insilco size exclusion of ASV - if using -V V3 or V3V4, this parameter is already set]
--maxee=CHARACTER
Maximum expected error for Fwd and Rev reads [if using -V V4 or V3V4, this parameter is already set]
--minLen=CHARACTER
Minimul read length [if using -V V3 or V3V4, this parameter is already set]
--chimera_method=CHARACTER
Chimera removal strategy
--collapseNoMis=CHARACTER
Perform 100% similarity ASV clustering?
--tryRC=CHARACTER
Perform taxonomic assignments also on reversed complemented ASV sequence?
--metadata=CHARACTER
Path to excel document containing metadata [Sample identifier column should be sample_name]
--db=CHARACTER
Path to the taxonomic database
--db_species=CHARACTER
Path to the speies-level taxonomic database [only for --tax_metod dada]
--run_phylo=CHARACTER
Compute phylogenetic tree from the ASV sequence ?
--merge_phyloseq_export=CHARACTER
Path fof the phyloseq object
--output_phyloseq_phylo=CHARACTER
Path fof the phyloseq object
--save_out=CHARACTER
Path fof the R object containing all output from the pipeline
--export=CHARACTER
Export pdf, table and intermediate RDS files?
--remove_input_fastq=CHARACTER
Remove intermediate fastq.gz files
-T NUMERIC, --slots=NUMERIC
Number of threads to perform the analyses
--seed_value=NUMERIC
Seed value for random number generator
-f CHARACTER, --fun_dir=CHARACTER
Directory containing the R functions
-h, --help
Show this help message and exit
Based on the approach described here. This step can be intensive depending on your setup and the overall richness of your phyloseq object. This is therefore not computed by default running the pipeline.
conda activate metabaRpipe
Rscript \
${MY_DIR}/metabaRpipe/Rscripts/run_add_phylogeny_to_phyloseq.Rscript \
-p dada2/phyloseq.RDS \
-o dada2/phyloseq_phylo
N.B.: The phyloseq object should include the ASV sequences stored as refseq()
:
readRDS("dada2/phyloseq.RDS")
phyloseq-class experiment-level object
otu_table() OTU Table: [ 322 taxa and 6 samples ]:
sample_data() Sample Data: [ 6 samples by 18 sample variables ]:
tax_table() Taxonomy Table: [ 322 taxa by 7 taxonomic ranks ]:
refseq() DNAStringSet : [ 322 reference sequences ]
taxa are rows
Running the script above generated a new phyloseq
object which now includes a phylogenetic tree:
readRDS("dada2/physeq_phylo/phyloseq_phylo.RDS") %>% phy_tree()
Phylogenetic tree with 322 tips and 321 internal nodes.
Tip labels:
ASV001, ASV002, ASV003, ASV004, ASV005, ASV006, ...
Rooted; includes branch lengths.
N.B.: The phyloseq object should include the ASV sequences stored as refseq()
.
- The
run_phyloseq_dada2_tax.Rscript
Rscript allows to update the taxonomy usingdada2::assignTaxonomy
anddada2::assignSpecies
from the terminal.
The dada2 formatted
databases can be downloaded here. Some are also included in the metabaRpipe
repository:
ls ${MY_DIR}/metabaRpipe/databases/
GTDB_bac120_arc122_ssu_r202_Genus.fa.gz eHOMD_RefSeq_dada2_V15.22.fasta.gz
GTDB_bac120_arc122_ssu_r202_Species.fa.gz eHOMD_RefSeq_dada2_assign_species_V15.22.fasta.gz
Below the options we can specify to the script:
Rscript ${MY_DIR}/metabaRpipe/Rscripts/run_phyloseq_dada2_tax.Rscript --help
Loading required package: optparse
Usage: /Users/test//metabaRpipe/Rscripts/run_phyloseq_dada2_tax.Rscript [options]
Options:
-p CHARACTER, --phyloseq_path=CHARACTER
Path of the input phyloseq object
-o CHARACTER, --output=CHARACTER
Name of the output directory
--tax_threshold=NUMERIC
Minimum bootstrap value / threshold
--db=CHARACTER
path of database
--db_species=CHARACTER
path of species level database
--reverse_comp=CHARACTER
Reverse complement sequences for taxonomic assignments?
-T CHARACTER, --slots=CHARACTER
Number of threads to perform the analyses
-f CHARACTER, --fun_dir=CHARACTER
Directory containing the R functions
--seed=CHARACTER
seed number for random generator
-h, --help
Show this help message and exit
To illustrate this, we can update the taxonomic classification of the ASV from the phyloseq object we generated using the Human Oral Microbiome Database and stored in the metabaRpipe repository we cloned.
conda activate metabaRpipe
Rscript \
${MY_DIR}/metabaRpipe/Rscripts/run_phyloseq_dada2_tax.Rscript \
--phyloseq_path dada2/phyloseq.RDS \
--tax_threshold 60 \
--output 04_dada2_taxonomy \
--db ${MY_DIR}/metabaRpipe/databases/eHOMD_RefSeq_dada2_V15.22.fasta.gz \
--db_species ${MY_DIR}/metabaRpipe/databases/eHOMD_RefSeq_dada2_assign_species_V15.22.fasta.gz \
--reverse_comp TRUE \
-T 4
We could also run the function directly from R/ Rstudio:
Below the original tax_table()
require(tidyverse); require(phyloseq)
readRDS("dada2/phyloseq.RDS") %>% tax_table() %>% head()
Taxonomy Table: [ 6 taxa by 7 taxonomic ranks ]:
Kingdom Phylum Class Order Family Genus Species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
ASV001 Bacteria Proteobacteria Gammapr… Burkho… Burkhol… Ralsto… ""
ASV002 Bacteria Proteobacteria Alphapr… Rhizob… Rhizobi… Aureim… ""
ASV003 Bacteria Proteobacteria Gammapr… Burkho… Burkhol… Massil… ""
ASV004 Bacteria Proteobacteria Gammapr… Burkho… Burkhol… Parabu… "unknow…
ASV005 Bacteria Proteobacteria Gammapr… Burkho… Burkhol… Massil… "sp0128…
ASV006 Bacteria Proteobacteria Gammapr… Burkho… Burkhol… Massil… ""
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe-source/master/Rscripts/functions.R")
readRDS("dada2/phyloseq.RDS") %>%
phyloseq_dada2_tax(physeq = .,
threshold = 60,
db ="~/metabaRpipe/databases/eHOMD_RefSeq_dada2_V15.22.fasta.gz", #check the path
db_species ="~/metabaRpipe/databases/eHOMD_RefSeq_dada2_assign_species_V15.22.fasta.gz", #check the path
nthreads = 2,
tryRC = TRUE,
return = TRUE,
full_return = FALSE) -> physeq_new_tax
The eHOMD_RefSeq_dada2_V15.22
updated tax_table()
physeq_new_tax %>% tax_table() %>% head()
Taxonomy Table: [ 6 taxa by 7 taxonomic ranks ]:
Kingdom Phylum Class Order Family Genus Species
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
ASV0001 Bacteria Actinobacteria Actinob… Bifidob… Bifidob… Bifid… unknown
ASV0002 Bacteria Proteobacteria Gammapr… Enterob… Enterob… Esche… unknown
ASV0003 Bacteria Bacteroidetes Bactero… Bactero… Prevote… Prevo… unknown
ASV0004 Bacteria Bacteroidetes Bactero… Bactero… Prevote… Prevo… unknown
ASV0005 Bacteria Bacteroidetes Bactero… Bactero… Bactero… Bacte… unknown
ASV0006 Bacteria Firmicutes Bacilli Lactoba… Strepto… Strep… unknown
dada2
's authors recommendations on the minBoot
* i.e.*, threshold
parameter:
An important parameter to consider when running assignTaxonomy(...) is minBoot, which sets the minimum bootstrapping support required to return a taxonomic classification. The original paper recommended a threshold of 50 for sequences of 250nts or less (as these are) but a threshold of 80 generally. More details
- You can also perform taxonomic assignment of the ASV sequences using DECIPHER IDtaxa function.
The DECIPHER
formatted databases (i.e., Training sets for organismal classification (nucleotides) can be downloaded here.
As we have seen before, this can be performed from the terminal:
conda activate metabaRpipe
Rscript \
${MY_DIR}/run_phyloseq_DECIPHER_tax.Rscript \
--phyloseq_path dada2/phyloseq.RDS \
--export dada2/04_dada2_taxonomy \
--reverse_comp TRUE \
--db ~/db/DADA2/SILVA_SSU_r132_March2018.RData \ # check where you downloaded your database
--tax_threshold 60 \
-T 4
or from R/Rstudio.
require(tidyverse)
require(phyloseq)
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe-source/master/Rscripts/functions.R")
readRDS("dada2/physeq.RDS") %>%
phyloseq_DECIPHER_tax(physeq = .,
threshold = 60,
db="~/db/DADA2/SILVA_SSU_r132_March2018.RData" # check where you downloaded your database
tryRC = TRUE,
return = TRUE) -> physeq_new_tax
DECIPHER
's authors recommendations on the confidence level
* i.e.*, threshold
parameter.
Select a minimum confidence threshold for classifications. We recommend using a confidence of 60% (very high) or 50% (high). More details
This can be done within R/Rstudio:
require(tidyverse); require(phyloseq)
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe-source/master/Rscripts/functions.R")
readRDS("dada2/phyloseq.RDS") %>%
physeq_add_metadata(physeq = .,
metadata = "~/metabaRpipe/test-data/metadata.xlsx" %>%
readxl::read_xlsx()) -> ps_tax_phylo_meta
ps_tax_phylo_meta
Install the following tools/ packages:
R packages
- similarly as done before:
conda activate metabaRpipe
R
devtools::install_github("tobiasgf/lulu")
devtools::install_github("mikemc/speedyseq")
vsearch
in the dedicated conda environment:
conda activate metabaRpipe
conda install -c bioconda vsearch -y
source("https://raw.githubusercontent.com/fconstancias/metabaRpipe-source/master/Rscripts/functions.R")
readRDS("dada2/phyloseq.RDS") %>%
phyloseq_vsearch_lulu_cluster_ASV(vsearch ="/Users/test/miniconda3/envs/metabaRpipe4/bin/vsearch") -> phyloseq_lulu_clust
322 ASV were present in the original phyloseq object:
readRDS("dada2/phyloseq.RDS")
phyloseq-class experiment-level object
otu_table() OTU Table: [ 322 taxa and 6 samples ]:
sample_data() Sample Data: [ 6 samples by 18 sample variables ]:
tax_table() Taxonomy Table: [ 322 taxa by 7 taxonomic ranks ]:
refseq() DNAStringSet : [ 322 reference sequences ]
taxa are rows
321 vsearch-lulu
clustering :
phyloseq_lulu_clust$physeq_curated
phyloseq-class experiment-level object
otu_table() OTU Table: [ 321 taxa and 6 samples ]:
sample_data() Sample Data: [ 6 samples by 18 sample variables ]:
tax_table() Taxonomy Table: [ 321 taxa by 7 taxonomic ranks ]:
refseq() DNAStringSet : [ 321 reference sequences ]
taxa are rows
More information is stored under:
phyloseq_lulu_clust$curated_result$otu_map
Check the lulu repo for more information.
Load the phyloseq
object in R - 322 ASV:
readRDS("dada2/phyloseq.RDS") -> ps
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 322 taxa and 6 samples ]:
sample_data() Sample Data: [ 6 samples by 18 sample variables ]:
tax_table() Taxonomy Table: [ 322 taxa by 7 taxonomic ranks ]:
refseq() DNAStringSet : [ 322 reference sequences ]
taxa are rows
Run ASV clustering at 97% identity:
ps %>%
phyloseq_DECIPHER_cluster_ASV(threshold = 97) -> out
The updated phyloseq object contains now 234 cluster of ASV:
out$physeq_clustered
phyloseq-class experiment-level object
otu_table() OTU Table: [ 234 taxa and 6 samples ]:
sample_data() Sample Data: [ 6 samples by 18 sample variables ]:
tax_table() Taxonomy Table: [ 234 taxa by 7 taxonomic ranks ]:
refseq() DNAStringSet : [ 234 reference sequences ]
taxa are rows
More information about the clusters are stored in the generated object:
out$cluster_table %>% arrange(-cluster_size)
# A tibble: 322 × 18
# Groups: cluster_0.03 [234]
ASV cluster_0.03 cluster_size `R1F1-S66` `R1F2-S300` `R1F3-S90` `Y2A15-2M-06-S78` `Y2A15-2M-12-S77`
<chr> <int> <int> <int> <int> <int> <int> <int>
1 ASV005 4 8 0 0 172 0 1235
2 ASV018 4 8 0 0 0 0 251
3 ASV019 4 8 0 0 102 0 117
4 ASV027 4 8 0 0 0 14 124
5 ASV062 4 8 0 0 0 15 23
6 ASV067 4 8 0 0 34 0 0
7 ASV070 4 8 30 0 0 0 0
8 ASV108 4 8 0 0 16 0 0
9 ASV011 8 7 0 0 238 0 99
10 ASV023 8 7 0 0 92 0 90
# … with 312 more rows, and 10 more variables: Y3-R1F4-S136 <int>, Kingdom <chr>, Phylum <chr>,
# Class <chr>, Order <chr>, Family <chr>, Genus <chr>, Species <chr>, ASV_length <int>,
# ASV_sequence <chr>
Install picrust2
in the dedicated conda environment:
conda activate metabaRpipe
conda install -c bioconda -c conda-forge picrust2=2.3.0-b -y
Check the official picrust2 repository for more details regarding the parameters.
conda activate metabaRpipe
Rscript /Users/test/Documents/GitHub/metabaRpipe/Rscripts/run_phyloseq_picrust2.Rscript --help
Loading required package: optparse
Usage: /Users/test/Documents/GitHub/metabaRpipe/Rscripts/run_phyloseq_picrust2.Rscript [options]
Options:
-p CHARACTER, --phyloseq_path=CHARACTER
Path of the input phyloseq object
--picrust2_bin=CHARACTER
picrust2 script path
--working_dir=CHARACTER
Name of the working directory
--output=CHARACTER
Name of the output directory - should not exist before running the analysis
--traits=CHARACTER
Traits c(COG,EC,KO,PFAM,TIGRFAM)
--min_reads=CHARACTER
--min_samples=CHARACTER
-m CHARACTER, --hsp_method =CHARACTER
" SP method to use."mp": predict discrete traits using
max parsimony. "emp_prob": predict discrete traits
based on empirical state probabilities across tips.
"subtree_average": predict continuous traits using
subtree averaging. "pic": predict continuous traits
with phylogentic independent contrast. "scp":
reconstruct continuous traits using squared-change
parsimony "
--no_gap_fill=CHARACTER
Disabel gap filling
--add_description=CHARACTER
add description pf Pway
--load_picrust2_data=CHARACTER
import picrust2 data: could generate large files...
--return=CHARACTER
return picrust2 data / write to a directory
-T NUMERIC, --slots=NUMERIC
Number of threads to perform the analyses
-f CHARACTER, --fun_dir=CHARACTER
Directory containing the R functions
-h, --help
Show this help message and exit
Rscript ${MY_DIR}/metabaRpipe/Rscripts/run_phyloseq_picrust2.Rscript --phyloseq_path ~/dada2/phyloseq.RDS \
--output ~/dada2/picrust2-out \
--traits COG,EC,KO,PFAM \
--add_description TRUE
Check the official picrust2 repository for more details regarding the outputs.
conda activate metabaRpipe
Rscript ${MY_DIR}/metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript --help
Loading required package: optparse
Usage: metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript [options]
Options:
-i CHARACTER, --input=CHARACTER
Path of the phyloseq object
-o CHARACTER, --output=CHARACTER
output directory
-f CHARACTER, --fun_dir=CHARACTER
Directory containing the R functions
-h, --help
Show this help message and exit
Rscript \
${MY_DIR}/metabaRpipe/Rscripts/phyloseq_export_qiime.Rscript -i dada2/physeq_phylo/phyloseq_phylo.RDS \
-o qiime2
ls qiime2/
asv.fna asv_biom.biom asv_neweek.tre qiime1_mapping_file.txt qiime2_mapping_file.txt tax.txt
Install qiime2
in a dedicated environment https://docs.qiime2.org/2022.2/install/native/
wget https://data.qiime2.org/distro/core/qiime2-2022.2-py38-osx-conda.yml
conda env create -n qiime2-2022.2 --file qiime2-2022.2-py38-osx-conda.yml
# OPTIONAL CLEANUP
rm qiime2-2022.2-py38-osx-conda.yml
# Activate the conda environment
conda activate qiime2-2022.2
# Test your installation
qiime --help
Import the data in qiime2
:
conda activate qiime2-2022.2
# Navigate were the output of phyloseq_export_qiime.Rscript were generated:
cd qiime2/
# List/ check the files are here:
ls
asv.fna asv_biom.biom asv_neweek.tre qiime1_mapping_file.txt qiime2_mapping_file.txt tax.txt
# Import OTU/ASV table:
qiime tools import \
--input-path asv_biom.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV100Format \
--output-path qiime2_otu.qza
# Import taxonomic table:
qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path tax.txt \
--output-path qiime2_taxonomy.qza
# Import phylogenetic tree:
qiime tools import \
--input-path asv_neweek.tre \
--output-path asv_neweek.qza \
--type 'Phylogeny[Rooted]'
# Import ASV sequences:
qiime tools import \
--input-path asv.fna \
--output-path asv_rep_set.qza \
--type 'FeatureData[Sequence]'
# Import metadata:
qiime metadata tabulate \
--m-input-file qiime2_mapping_file.txt \
--o-visualization qiime2_metadata.qzv
You might want to ... in progress ..
QIIME™
(pronounced chime) stands for Quantitative Insights Into Microbial Ecology. It is a bioinformatic pipeline developed to ease the process of bioinformatic analysis of raw sequencing microbiome data. PCR primer removal and the main core of the pipeline are based on the same tools and we use similar parameters. metabaRpipe
allows to perform the steps within R
and offers the flexibility and power of all the available R packages/
functions
. In addition, presets
allow to process your data in a single command. Below the QIIME2
steps correspondong to the metabaRpipe
ones.
You can install QIIME2
on MAC
, windows
or linux
systems using conda Alternatively, QIIME2
cam be run using a virtual machine.
QIIME2
uses cutadapt - former version of atropos - in order to detect and remove PCR primers from raw metabarcoding sequencing data more details [here](cutadapt https://docs.qiime2.org/2022.2/plugins/available/cutadapt/trim-paired/).
QIIME2
relies on dada2
R
package to process PCR primers free sequencing into ASV / sample count table using the following qiime2 plugin which actually calls a dada2
R
script:
less /Users/test/miniconda3/envs/qiime2-2022.2/bin/run_dada_paired.R
#!/usr/bin/env Rscript
###################################################
# This R script takes an input two directories of
# .fastq.gz files, corresponding to matched forward
# and reverse sequence files,
# and outputs a tsv file of the dada2 processed sequence
# table. It is intended for use with the QIIME2 plugin
# for DADA2.
#
# Rscript run_dada_paired.R input_dirF input_dirR output.tsv track.tsv filtered_dirF filtered_dirR 240 160 0 0 2.0 2 pooled 1.0 0 100000
####################################################
####################################################
# DESCRIPTION OF ARGUMENTS #
####################################################
# NOTE: All numeric arguments should be zero or positive.
# NOTE: All numeric arguments save maxEEF/R are expected to be integers.
# NOTE: Currently the filterered_dirF/R must already exist.
# NOTE: ALL ARGUMENTS ARE POSITIONAL!
#
### FILE SYSTEM ARGUMENTS ###
#
# 1) File path to directory with the FORWARD .fastq.gz files to be processed.
# Ex: path/to/dir/with/FWD_fastqgzs
#
# 2) File path to directory with the REVERSE .fastq.gz files to be processed.
# Ex: path/to/dir/with/REV_fastqgzs
...
### LEARN ERROR RATES ###
# Dereplicate enough samples to get nreads.learn total reads
cat("2) Learning Error Rates\n")
errF <- suppressWarnings(learnErrors(filtsF, nreads=nreads.learn, multithread=multithread))
errR <- suppressWarnings(learnErrors(filtsR, nreads=nreads.learn, multithread=multithread))
### PROCESS ALL SAMPLES ###
# Loop over rest in streaming fashion with learned error rates
denoisedF <- rep(0, length(filtsF))
ddsF <- vector("list", length(filtsF))
ddsR <- vector("list", length(filtsF))
mergers <- vector("list", length(filtsF))
cat("3) Denoise samples ")
for(j in seq(length(filtsF))) {
drpF <- derepFastq(filtsF[[j]])
ddsF[[j]] <- dada(drpF, err=errF, multithread=multithread, verbose=FALSE)
drpR <- derepFastq(filtsR[[j]])
ddsR[[j]] <- dada(drpR, err=errR, multithread=multithread, verbose=FALSE)
cat(".")
}
cat("\n")
if(poolMethod == "pseudo") {
cat(" Pseudo-pool step ")
### TEMPORARY, to be removed once 1.12 makes its way to Q2
### Needed for now to manage pseudo-pooling memory, as 1.10 didn't do this appropriately.
### pseudo_priors code copied from dada2.R
stF <- makeSequenceTable(ddsF)
pseudo_priorsF <- colnames(stF)[colSums(stF>0) >= 2 | colSums(stF) >= Inf]
rm(stF)
stR <- makeSequenceTable(ddsR)
pseudo_priorsR <- colnames(stR)[colSums(stR>0) >= 2 | colSums(stR) >= Inf]
rm(stR)
### \pseudo_priors code copied from dada2.R
### code copied from previous loop through samples in this script
for(j in seq(length(filtsF))) {
drpF <- derepFastq(filtsF[[j]])
ddsF[[j]] <- dada(drpF, err=errF, priors=pseudo_priorsF,
multithread=multithread, verbose=FALSE)
drpR <- derepFastq(filtsR[[j]])
ddsR[[j]] <- dada(drpR, err=errR, priors=pseudo_priorsR,
multithread=multithread, verbose=FALSE)
cat(".")
}
cat("\n")
...
https://docs.qiime2.org/2022.2/tutorials/phylogeny/?highlight=phylogeny#id20
CIRAD users (cluster)- conda environment
exemple running from HPLC slurm / ...- phyloseq_to_clusters (enterotype)
change env name on C18 workstation -> metabaRpipe- FM's
vsearch swarm
pipeline. add https://zenodo.org/account/settings/github/ -> DOIadd phylogenetic tree to a phyloseq objectchange nameadd possibility to skip primer removal: skipping run_atropos() or changing atropos parameter?replace taxonomic assignments of a phyloseq object using alternative approach/ databasecluster ASV using DECIPHERcluster ASV using vsearch lulurun picrust2 from a phyloseq object