Snakemake workflow to screen a set of genomes (e.g., MAGs) against an existing genome database and identify novel species

alexmsalmeida/magscreen

MAGscreen - discovering new microbial species

Snakemake workflow to identify novel microbial species from a set of genomes.

Genomes are first quality-filtered based on their CheckM statistics, then compared against a genome database using Mash and MUMmer. Genomes without a match in the database are extracted, clustered at the species level using dRep and further quality-controlled with GUNC.

Installation

  1. Install conda and snakemake

  2. Clone repository

git clone https://github.com/alexmsalmeida/magscreen.git

How to run

  1. Edit config.yml with the input, output and databases arguments. The input arguments should give the path to the directory containing the .fa assemblies to analyse and the path to the .csv file with CheckM completeness and contamination scores. The databases folder should contain the GUNC diamond database and a custom Mash database (.msh) built from the genomes you want to screen against.

  2. (option 1) Run the pipeline locally (adjust -j based on the number of available cores)

snakemake --use-conda -k -j 4

  2. (option 2) Run the pipeline on a cluster (e.g., SLURM)

snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'
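
Before launching either option, it can help to confirm that the paths entered in config.yml in step 1 contain what the pipeline expects. The following is a minimal sketch, not part of the workflow itself: the function name check_inputs and its arguments are hypothetical, and the assumption that the GUNC diamond database file ends in .dmnd should be checked against your own download.

```python
from pathlib import Path


def check_inputs(assembly_dir, checkm_csv, db_dir):
    """Return a list of problems found; an empty list means the inputs look usable.

    All three arguments are paths you would otherwise write into config.yml.
    The names are hypothetical and do not mirror the actual config keys.
    """
    assembly_dir, checkm_csv, db_dir = Path(assembly_dir), Path(checkm_csv), Path(db_dir)
    problems = []

    # The input directory must hold at least one .fa assembly.
    if not assembly_dir.is_dir() or not any(assembly_dir.glob("*.fa")):
        problems.append(f"no .fa assemblies found in {assembly_dir}")

    # The CheckM completeness/contamination table must exist.
    if not checkm_csv.is_file():
        problems.append(f"CheckM .csv file not found: {checkm_csv}")

    # The databases folder needs a custom Mash sketch (.msh).
    if not db_dir.is_dir() or not any(db_dir.glob("*.msh")):
        problems.append(f"no Mash database (.msh) found in {db_dir}")

    # Assumption: the GUNC diamond database uses the .dmnd extension.
    if not db_dir.is_dir() or not any(db_dir.glob("*.dmnd")):
        problems.append(f"no GUNC diamond database (.dmnd) found in {db_dir}")

    return problems


if __name__ == "__main__":
    # Placeholder paths; replace with the values from your own config.yml.
    for problem in check_inputs("genomes", "checkm.csv", "databases"):
        print("WARNING:", problem)
```

An empty result only shows that the expected files are present; it does not validate their contents.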

Output

The main output is located in the new_species/ directory, which contains the best-quality representative genome (.fa file) of each new species. New species matching all of the following criteria are filtered out:

  • Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
  • Are singletons (dRep clusters with only one member)
  • Are <90% complete based on CheckM
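
The filtering rule above can be sketched as a small predicate. This is an illustration of the stated criteria only, not the workflow's own code: the function names are hypothetical, and the three GUNC thresholds are read as conditions that must all hold for a genome to count as flagged, which is an assumption about how the semicolons in the list should be interpreted.

```python
def gunc_flagged(clade_separation_score, contamination_portion,
                 reference_representation_score):
    """True if the GUNC scores exceed all three thresholds listed above.

    Assumption: the thresholds are combined with a logical AND.
    """
    return (clade_separation_score > 0.45
            and contamination_portion > 0.05
            and reference_representation_score > 0.5)


def is_filtered_out(clade_separation_score, contamination_portion,
                    reference_representation_score, cluster_size, completeness):
    """True if a candidate new species matches ALL filtering criteria."""
    flagged = gunc_flagged(clade_separation_score, contamination_portion,
                           reference_representation_score)
    singleton = cluster_size == 1          # dRep cluster with only one member
    incomplete = completeness < 90.0       # CheckM completeness in percent
    return flagged and singleton and incomplete


# A flagged, singleton, 85%-complete genome is removed ...
print(is_filtered_out(0.6, 0.1, 0.8, cluster_size=1, completeness=85.0))  # True
# ... but the same genome in a three-member cluster is kept.
print(is_filtered_out(0.6, 0.1, 0.8, cluster_size=3, completeness=85.0))  # False
```

Because all criteria must match, a genome that fails only one or two of them remains in new_species/.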
