This README serves as a virtual lab notebook. It documents the scripts and the order in which they are run. This pipeline is built for use on a server that uses SLURM. I'm currently cleaning up this repo and improving the documentation, so expect things to be incomplete but improving.
The repo is being reorganized so that the scripts can be run in numeric order for simplicity. Expect some irrelevant bits that haven't been cleaned up yet.
- flexbar
- R 3.4.2 or later
- DADA2
- vegan
- LULU OTU curation package
- Plotly R package (properly configured for online use)
- phyloseq
- stringr
- ggplot2
- DESeq2
- parallel
- breakaway
- RDP Classifier
- Java
- BLCA classifier
- swarm2
- VSEARCH
- Drive5 python scripts, place in a folder named d5_py within this folder.
- ...likely more. To be added soon.
Make a main project folder. Inside it, make a subdirectory for each run (e.g. BCS1, BCS2, BCS3, BCS4, etc.). The raw reads (FASTQ or FASTQ.GZ) go into these subdirectories. Unzip any FASTQ.GZ files to keep the folders consistent. Eventually this step should not be needed (.GZ files are preferable), but the flexbar scripts need minor changes to accommodate compressed input first.
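The unzip step can be sketched as follows. This is a minimal example, not part of the repository's scripts: it assumes the reads end in .fastq.gz and sit directly inside each run subdirectory, and it works on a throwaway directory with a made-up read file (sample_R1.fastq) so it is safe to run as-is.

```shell
#!/usr/bin/env bash
set -eu

# Assumption: a temporary directory stands in for the real main project folder.
project_dir="$(mktemp -d)"

# Build a tiny stand-in run subdirectory with one compressed read file (hypothetical name).
mkdir -p "$project_dir/BCS1"
printf '@read1\nACGT\n+\nIIII\n' > "$project_dir/BCS1/sample_R1.fastq"
gzip "$project_dir/BCS1/sample_R1.fastq"

# Decompress every .fastq.gz file found directly inside a run subdirectory.
find "$project_dir" -mindepth 2 -maxdepth 2 -type f -name '*.fastq.gz' -exec gunzip {} +

# List the result: only the uncompressed file should remain.
ls "$project_dir/BCS1"
```

To use it on real data, point project_dir at the main project folder and drop the three lines that create the stand-in file.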
The main directory should also contain a tab-separated value file called key.txt that maps library number, adapter index number (sample number from the Illumina run), and primer tag number to ML ID numbers and other sample metadata. Also in the main project folder, make another plaintext file called libraries.txt that contains the names of each of the run subdirectories (e.g. BCS1, BCS2, BCS3, etc.), with one name on each line.
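If the run subdirectories share a common prefix, libraries.txt can be generated rather than typed by hand. The sketch below assumes the BCS prefix used in the examples above and, again, runs against a throwaway directory; adjust the glob if your runs are named differently.

```shell
#!/usr/bin/env bash
set -eu

# Assumption: a temporary directory stands in for the real main project folder.
project_dir="$(mktemp -d)"
mkdir -p "$project_dir/BCS1" "$project_dir/BCS2" "$project_dir/BCS3" "$project_dir/scripts"

# Write the name of each run subdirectory (BCS*) on its own line.
# The trailing slash in the glob restricts matches to directories; it is stripped before printing.
(
  cd "$project_dir"
  for d in BCS*/; do
    printf '%s\n' "${d%/}"
  done
) > "$project_dir/libraries.txt"

cat "$project_dir/libraries.txt"
```

The scripts folder is deliberately left out of the list because it does not match the BCS* pattern.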
Make a FASTA file (or multiple FASTA files if multiple barcoding schemes are used) containing the primer index barcodes in the main project folder. This is called ML_barcodes.fasta in these scripts.
You will need to make a similar FASTA file containing all of the adapters that you want flexbar to trim. Here, it is called truseq_adapters.fasta.
This repository can be cloned into its own folder inside of the main project folder (name it "scripts" or something similar). Inside your scripts folder, make a subdirectory called out_err_files to hold log files and error logs.
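Before running anything, it can help to confirm that the pieces described above are in place. The following is a hypothetical pre-flight check, not a script from this repository: check_layout is an illustrative name, and it assumes the repository was cloned into a folder literally named scripts.

```shell
#!/usr/bin/env bash
set -eu

# check_layout DIR: print every expected input that is missing from DIR.
# Returns 0 when everything is present and 1 otherwise.
check_layout() {
  layout_dir="$1"
  layout_missing=0
  for item in key.txt libraries.txt ML_barcodes.fasta truseq_adapters.fasta; do
    if [ ! -f "$layout_dir/$item" ]; then
      echo "missing file: $item"
      layout_missing=1
    fi
  done
  # Assumption: the cloned repository lives in a folder named "scripts".
  if [ ! -d "$layout_dir/scripts/out_err_files" ]; then
    echo "missing directory: scripts/out_err_files"
    layout_missing=1
  fi
  return "$layout_missing"
}

# Demonstration on a throwaway directory that lacks the adapter FASTA on purpose.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/scripts/out_err_files"
touch "$demo_dir/key.txt" "$demo_dir/libraries.txt" "$demo_dir/ML_barcodes.fasta"

layout_report="$(check_layout "$demo_dir" || true)"
echo "$layout_report"
```

On a real project, call check_layout with the path to the main project folder and fix whatever it reports.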
- make_file_root_list.sh
- flexbar.sh
- flexbar2.sh
Move any unassigned reads that remain after running flexbar2.sh into a new subdirectory called unassigned. The script should now do this automatically.
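If unassigned reads do need to be moved by hand, something like the following would work. It is only a sketch: it assumes the leftover files have "unassigned" somewhere in their names, which you should verify against your own flexbar output, and the file names used here are made up for the demonstration.

```shell
#!/usr/bin/env bash
set -eu

# Assumption: a temporary directory stands in for one run subdirectory.
run_dir="$(mktemp -d)"

# Hypothetical output files: one assigned to a tag, one left unassigned.
touch "$run_dir/lib_barcode_tag1_1.fastq" "$run_dir/lib_barcode_unassigned_1.fastq"

# Create the target folder and move every top-level file whose name contains "unassigned".
mkdir -p "$run_dir/unassigned"
find "$run_dir" -maxdepth 1 -type f -name '*unassigned*' -exec mv {} "$run_dir/unassigned/" \;

ls "$run_dir/unassigned"
```

The -maxdepth 1 and -type f options keep find from descending into, or trying to move, the unassigned folder itself.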
The phyloseq.sh and BCS_phyloseq.R scripts run community analyses and visualization (based on the Plotly tool). You may need to make major changes to this section for your own analysis, or configure Plotly if you want to use the existing visualizations. For some reason, the RDP implementation in DADA2 doesn't perform well for us, possibly due to memory or scaling issues. We run RDP outside of R/DADA2 to get around this.
R scripts can be placed in the same folder as the other scripts (the scripts folder inside the main project folder that holds the run subdirectories). Code for renaming the files generated by flexbar2.sh is kept here, as is the R code for the DADA2 analysis.
- rename.sh
- filtertrim.sh
- removechim.sh
- RDP_classify.sh
- phyloseq.sh
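On a SLURM server, one way to run the scripts in order is to chain them with job dependencies so that each step starts only after the previous one finishes successfully. The sketch below is an assumption about how you might do this, not how the repository does it: submit_chain is a hypothetical helper, the combined order simply follows the two lists in this README, and some scripts may be meant to be run directly rather than submitted. It defaults to a dry run that only prints the sbatch commands.

```shell
#!/usr/bin/env bash
set -eu

# DRY_RUN=1 (the default) prints the commands; set DRY_RUN=0 to actually submit.
DRY_RUN="${DRY_RUN:-1}"

# submit_chain SCRIPT...: submit each script so it waits for the previous one to succeed.
submit_chain() {
  prev_job=""
  for step in "$@"; do
    cmd="sbatch --parsable"
    if [ -n "$prev_job" ]; then
      cmd="$cmd --dependency=afterok:$prev_job"
    fi
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd $step"
      # Placeholder standing in for the job ID that sbatch would return.
      prev_job="JOBID_${step%.sh}"
    else
      prev_job="$($cmd "$step")"
      # --parsable may print "jobid;cluster"; keep only the job ID.
      prev_job="${prev_job%%;*}"
      echo "submitted $step as job $prev_job"
    fi
  done
}

plan="$(submit_chain make_file_root_list.sh flexbar.sh flexbar2.sh \
  rename.sh filtertrim.sh removechim.sh RDP_classify.sh phyloseq.sh)"
echo "$plan"
```

With afterok, a failed step leaves the later jobs pending with an unsatisfied dependency instead of running them on incomplete input, so check the queue and the logs in out_err_files if the chain stalls.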