MAGscreen - discovering new microbial species

Snakemake workflow to identify novel microbial species from a set of genomes.

Genomes are first quality-filtered based on their CheckM statistics, then compared against a genome database using Mash and MUMmer. Unknown hits are extracted, clustered at the species level using dRep, and further quality-controlled with GUNC.

Installation

  1. Install conda and snakemake

  2. Clone repository

git clone https://github.com/alexmsalmeida/magscreen.git

How to run

  1. Edit config.yml to set the input, output and databases arguments. The input should give the path to the directory containing the .fa assemblies to analyse and the path to the .csv file with CheckM completeness and contamination scores. The databases folder should contain the GUNC DIAMOND database and a custom Mash database (.msh) built from the genomes you want to screen against.

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)

snakemake --use-conda -k -j 4

  2. (option 2) Run the pipeline on a cluster (e.g., SLURM)

snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'
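
The custom Mash database mentioned in step 1 can be built with `mash sketch`. The following is a minimal sketch, not part of the workflow itself: the directory names `ref_genomes/` and `db/` and the output prefix `custom_db` are hypothetical, and Mash's default sketch parameters are assumed to be acceptable for the pipeline.

```shell
# Hypothetical layout: reference genomes in ref_genomes/, databases folder in db/
mkdir -p ref_genomes db

# Write one genome path per line so the list can be passed to mash with -l
find ref_genomes -name '*.fa' | sort > db/ref_genomes.txt

# Build the sketch only if mash is installed and at least one genome was found;
# "-o db/custom_db" makes mash write db/custom_db.msh
if command -v mash >/dev/null 2>&1 && [ -s db/ref_genomes.txt ]; then
  mash sketch -l db/ref_genomes.txt -o db/custom_db
fi
```

The databases argument in config.yml would then point at the folder holding the resulting .msh file alongside the GUNC database.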

Output

The main output is written to the new_species/ directory, which contains the best-quality representative genome (.fa file) of each new species. New species matching all of the following criteria are filtered out:

  • Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
  • Are singletons (dRep clusters with only one member)
  • Are <90% complete based on CheckM
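
The filtering rule above can be sketched as a small shell helper. This is an illustration only, not the workflow's own code: the function name `filter_decision` and its argument order are hypothetical, and it assumes a genome counts as flagged by GUNC only when all three listed thresholds are exceeded together.

```shell
# Hypothetical helper, not part of the workflow.
# Usage: filter_decision CSS CONTAMINATION_PORTION RRS CLUSTER_SIZE COMPLETENESS
filter_decision() {
  awk -v css="$1" -v cp="$2" -v rrs="$3" -v n="$4" -v comp="$5" 'BEGIN {
    # Assumption: GUNC flag requires all three thresholds to be exceeded
    gunc_flagged = (css + 0 > 0.45 && cp + 0 > 0.05 && rrs + 0 > 0.5)
    # Singleton: the dRep cluster has exactly one member
    singleton = (n + 0 == 1)
    # Incomplete: CheckM completeness below 90%
    incomplete = (comp + 0 < 90)
    # A genome is dropped only when every criterion holds
    if (gunc_flagged && singleton && incomplete) print "drop"; else print "keep"
  }'
}

# GUNC-flagged singleton at 85% completeness
filter_decision 0.60 0.10 0.70 1 85
# Same scores, but the cluster has three members
filter_decision 0.60 0.10 0.70 3 85
```

The first call prints `drop` because all three criteria are met; the second prints `keep` because the genome is not a singleton.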