This repository contains the complete analysis workflow used to benchmark the OptiFit algorithm in mothur and produce the accompanying manuscript. Find details on how to use OptiFit and descriptions of the parameter options on the mothur wiki: https://mothur.org/wiki/cluster.fit/.
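For context, fitting query sequences to existing reference OTUs with mothur's `cluster.fit` command looks roughly like the sketch below. The file names are placeholders and the exact set of required inputs depends on the fitting mode, so treat this as an illustration and consult the wiki page above for the authoritative parameter list.

```sh
# Illustrative only: placeholder file names, and required inputs vary by mode.
# See https://mothur.org/wiki/cluster.fit/ for the full parameter descriptions.
mothur "#cluster.fit(fasta=query.fasta, count=query.count_table, reffasta=ref.fasta, refcolumn=ref.dist, reflist=ref.list, method=closed)"
```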
Sovacool KL, Westcott SL, Mumphrey MB, Dotson GA, Schloss PD. 2022. OptiFit: An Improved Method for Fitting Amplicon Sequences to Existing OTUs. mSphere. http://dx.doi.org/10.1128/msphere.00916-21
A bibtex entry for LaTeX users:
```bibtex
@article{sovacool_optifit_2022,
  author  = {Kelly L. Sovacool and Sarah L. Westcott and M. Brodie Mumphrey and Gabrielle A. Dotson and Patrick D. Schloss},
  title   = {OptiFit: an Improved Method for Fitting Amplicon Sequences to Existing OTUs},
  journal = {mSphere},
  year    = {2022},
  doi     = {10.1128/msphere.00916-21},
  url     = {https://journals.asm.org/doi/10.1128/msphere.00916-21}
}
```
The workflow is split into five subworkflows:
- 0_prep_db — download & preprocess reference databases.
- 1_prep_samples — download, preprocess, & de novo cluster the sample datasets.
- 2_fit_reference_db — fit datasets to reference databases.
- 3_fit_sample_split — split datasets; cluster one fraction de novo and fit the remaining sequences to the de novo OTUs.
- 4_vsearch — run vsearch clustering for comparison.
The main workflow (`Snakefile`) creates plots from the results of the subworkflows and renders the paper.
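Once you are set up (see the quickstart below), you can also run a single subworkflow by pointing Snakemake at its Snakefile with the standard `--snakefile` flag. This is a sketch rather than the repository's documented entry point; check each subworkflow's README for the intended invocation.

```sh
# Run just the database preparation subworkflow (illustrative;
# see subworkflows/0_prep_db/README.md for the documented usage).
snakemake --cores 4 --snakefile subworkflows/0_prep_db/Snakefile
```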
To reproduce the full analysis:

- Before cloning, configure git symlinks:

  ```sh
  git config --global core.symlinks true
  ```

  Otherwise, git will create text files in place of symlinks.
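  If you have already cloned, one way to check whether the symlinks came through (an illustrative check, not part of the documented setup) is to list the files git tracks as symlinks and confirm they are not plain text files on disk:

  ```sh
  # Tracked symlinks have mode 120000 in git's index.
  git ls-files -s | awk '$1 == "120000"'
  ```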
- Clone this repository.

  ```sh
  git clone https://github.com/SchlossLab/Sovacool_OptiFit_mSphere_2022
  cd Sovacool_OptiFit_mSphere_2022
  ```
- Install the dependencies.

  Almost everything needed to run the analysis workflow is listed in the conda environment file.

  ```sh
  conda env create -f config/env.simple.yaml
  conda activate optifit
  ```

  Additionally, I used a custom version of [ggraph](https://github.com/kelly-sovacool/ggraph) for the algorithm figure. You can install it with `devtools` from R:

  ```r
  devtools::install_github('kelly-sovacool/ggraph', ref = 'iss-297_ggtext')
  ```

  If you do not already have LaTeX, you'll need to install a LaTeX distribution before rendering the manuscript as a PDF. You can use `tinytex` to do so:

  ```r
  tinytex::install_tinytex()
  ```

  I also used `latexdiffr` to create a PDF with changes tracked prior to submitting revisions to the journal.

  ```r
  devtools::install_github("hughjonesd/latexdiffr")
  ```
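  As a quick sanity check, you can confirm the main tools resolve inside the environment. This assumes snakemake, mothur, and R are all provided by `config/env.simple.yaml`, which is worth verifying against that file.

  ```sh
  # Confirm the key tools are on PATH in the optifit environment
  # (assumes they are installed via config/env.simple.yaml).
  conda activate optifit
  snakemake --version
  mothur --version
  R --version
  ```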
- Run the entire pipeline.

  Locally:

  ```sh
  snakemake --cores 4
  ```

  Or on an HPC running slurm:

  ```sh
  sbatch code/slurm/submit_all.sh
  ```

  (You will first need to edit your email and slurm account info in the submission script and cluster config.)
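  Before launching the full run, a dry run is a cheap way to preview the jobs Snakemake would schedule. This uses generic Snakemake flags rather than anything specific to this repository.

  ```sh
  # Preview the scheduled jobs without executing anything
  # (-n = dry run, -p = print the shell commands).
  snakemake --cores 4 -n -p
  ```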
The directory structure of this repository:

```
.
├── OptiFit.Rproj
├── README.md
├── Snakefile
├── code
│   ├── R
│   ├── bash
│   ├── py
│   ├── slurm
│   └── tests
├── config
│   ├── cluster.json
│   ├── config.yaml
│   ├── config_test.yaml
│   ├── env.export.yaml
│   ├── env.simple.yaml
│   └── slurm
│       └── config.yaml
├── docs
│   ├── paper.md
│   ├── paper.pdf
│   └── slides
├── exploratory
│   ├── 2018_fall_rotation
│   ├── 2019_winter_rotation
│   ├── 2020-05_May-Oct
│   ├── 2020-11_Nov-Dec
│   ├── 2021
│   │   ├── figures
│   │   ├── plots.Rmd
│   │   └── plots.md
│   ├── AnalysisRoadmap.md
│   └── DeveloperNotes.md
├── figures
├── log
├── paper
│   ├── figures.yaml
│   ├── head.tex
│   ├── msphere.csl
│   ├── paper.Rmd
│   ├── preamble.tex
│   └── references.bib
├── results
│   ├── aggregated.tsv
│   ├── stats.RData
│   └── summarized.tsv
└── subworkflows
    ├── 0_prep_db
    │   ├── README.md
    │   └── Snakefile
    ├── 1_prep_samples
    │   ├── README.md
    │   ├── Snakefile
    │   ├── data
    │   │   ├── human
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── marine
    │   │   │   └── SRR_Acc_List.txt
    │   │   ├── mouse
    │   │   │   └── SRR_Acc_List.txt
    │   │   └── soil
    │   │       └── SRR_Acc_List.txt
    │   └── results
    │       ├── dataset_sizes.tsv
    │       └── opticlust_results.tsv
    ├── 2_fit_reference_db
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── denovo_dbs.tsv
    │       ├── optifit_dbs_results.tsv
    │       └── ref_sizes.tsv
    ├── 3_fit_sample_split
    │   ├── README.md
    │   ├── Snakefile
    │   └── results
    │       ├── optifit_crit_check.tsv
    │       └── optifit_split_results.tsv
    └── 4_vsearch
        ├── README.md
        ├── Snakefile
        └── results
            └── vsearch_results.tsv
```
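Individual output files in the tree above can generally be requested from Snakemake by path, assuming the file is produced by a rule in the top-level Snakefile (worth confirming before relying on it):

```sh
# (Re)build a single result file from the main workflow, if it is a rule target.
snakemake --cores 4 results/aggregated.tsv
```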