Snakemake workflow to screen a set of genomes (e.g., MAGs) against an existing genome database and identify novel species

slambrechts/magscreen

MAGscreen - discovering new microbial species

Snakemake workflow to identify novel microbial species from a set of genomes.

Genomes are first quality-filtered using their CheckM statistics, then compared against a genome database using Mash and MUMmer. Genomes without a known match in the database are extracted, clustered at the species level using dRep, and further quality-controlled with GUNC.

Installation

  1. Install conda and snakemake

  2. Clone repository

git clone https://github.com/alexmsalmeida/magscreen.git

How to run

  1. Edit the config.yml file to point to the input, output, and databases directories. The input directory should contain the .fa assemblies to analyse and a .csv file with their CheckM completeness and contamination scores. The databases directory should contain the GUNC diamond database and a custom Mash database (.msh) built from the genomes you want to screen against.

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)

snakemake --use-conda -k -j 4

  2. (option 2) Run the pipeline on a cluster (e.g., LSF)

snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'bsub -n {cluster.nCPU} -M {cluster.mem} -o {cluster.output}'
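
Before launching either command, it can help to confirm that the directories named in config.yml hold the files described in step 1. The snippet below is a minimal sketch of such a check and is not part of the workflow: the function name check_layout is hypothetical, the directory paths are passed in by hand rather than read from config.yml, and the .dmnd extension for the GUNC diamond database is an assumption.

```python
from pathlib import Path


def check_layout(input_dir, db_dir):
    """Return a list of problems; an empty list means the layout looks plausible.

    Hypothetical helper, not part of magscreen. It only checks that files
    with the expected extensions exist, not that their contents are valid.
    """
    input_dir, db_dir = Path(input_dir), Path(db_dir)
    problems = []
    # Input directory: assemblies plus a CSV with CheckM scores.
    if not any(input_dir.glob("*.fa")):
        problems.append(f"no .fa assemblies found in {input_dir}")
    if not any(input_dir.glob("*.csv")):
        problems.append(f"no .csv file with CheckM scores found in {input_dir}")
    # Databases directory: custom Mash sketch plus the GUNC diamond database.
    if not any(db_dir.glob("*.msh")):
        problems.append(f"no Mash database (.msh) found in {db_dir}")
    # Assumption: the GUNC diamond database uses the .dmnd extension.
    if not any(db_dir.glob("*.dmnd")):
        problems.append(f"no diamond database (.dmnd) found in {db_dir}")
    return problems


if __name__ == "__main__":
    # Example paths only; replace with the directories set in config.yml.
    for problem in check_layout("input", "databases"):
        print("WARNING:", problem)
```

An empty result only means that files with the expected extensions are present; it does not validate the CSV columns or the database contents.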

Output

The main output is located in the new_species/ directory, which contains the best-quality representative genome (.fa file) of each new species. New species matching all of the following criteria are filtered out:

  • Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
  • Are singletons (dRep clusters with only one member)
  • Are <90% complete based on CheckM
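
The filter above can be read as a single boolean rule: a candidate species is dropped only when all three conditions hold together. The snippet below is a minimal sketch of that rule, not the workflow's own code. The function and parameter names are hypothetical, completeness is assumed to be a percentage, and the three GUNC thresholds are assumed to apply jointly.

```python
def gunc_flagged(clade_separation_score, contamination_portion,
                 reference_representation_score):
    """True if a genome exceeds the three GUNC thresholds listed above.

    Assumption: the three thresholds must all be exceeded for a flag.
    """
    return (clade_separation_score > 0.45
            and contamination_portion > 0.05
            and reference_representation_score > 0.5)


def is_filtered_out(flagged, cluster_size, completeness):
    """True if a candidate species meets all three removal criteria.

    flagged      -- result of gunc_flagged() for the representative genome
    cluster_size -- number of genomes in the dRep cluster (1 = singleton)
    completeness -- CheckM completeness, assumed to be in percent
    """
    return flagged and cluster_size == 1 and completeness < 90.0


if __name__ == "__main__":
    flagged = gunc_flagged(0.60, 0.10, 0.70)
    # Flagged singleton below 90% completeness: removed.
    print(is_filtered_out(flagged, cluster_size=1, completeness=85.0))
    # Same genome in a cluster of three: kept, since not all criteria match.
    print(is_filtered_out(flagged, cluster_size=3, completeness=85.0))
```

Under this reading, a genome flagged by GUNC is still retained if its cluster has more than one member or if it is at least 90% complete.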
