MAGscreen - discovering new microbial species

Snakemake workflow to identify novel microbial species from a set of genomes.

Genomes are first quality-filtered based on their CheckM statistics, then compared against a genome database using Mash and MUMmer. Genomes without a known match are extracted, clustered at species level using dRep, and further quality-controlled with GUNC.

Installation

Install conda and snakemake
Clone the repository:

git clone https://github.com/alexmsalmeida/magscreen.git

How to run

Edit config.yml to set the input, output and databases arguments. The input arguments should give the path to the directory containing the .fa assemblies to analyse and the path to the .csv file with CheckM completeness and contamination scores. The databases folder should contain the GUNC diamond database and a custom Mash database (.msh) built from the genomes you want to screen against.

(option 1) Run the pipeline locally (adjust -j based on the number of available cores):

snakemake --use-conda -k -j 4

(option 2) Run the pipeline on a cluster (e.g., SLURM):

snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'sbatch -A ALMEIDA-SL3-CPU -p icelake-himem --time=12:00:00 --ntasks={cluster.nCPU} --mem={cluster.mem} -o {cluster.output}'
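
Before launching either option, it can help to confirm that the inputs described in this section are in place. The following is a minimal sketch and not part of the workflow itself: the function name check_inputs and its three arguments are hypothetical, and it only checks what the text above states (a directory with .fa files, a CheckM .csv file, and a .msh file in the databases folder). The GUNC diamond database is not checked because its file name is not given here.

```python
from pathlib import Path


def check_inputs(assembly_dir, checkm_csv, database_dir):
    """Return a list of problems found; an empty list means the inputs look usable.

    All three argument names are hypothetical and only mirror the
    description of config.yml, not its actual keys.
    """
    problems = []

    # The input directory must exist and hold at least one .fa assembly.
    assemblies = Path(assembly_dir)
    if not assemblies.is_dir():
        problems.append(f"assembly directory not found: {assemblies}")
    elif not any(assemblies.glob("*.fa")):
        problems.append(f"no .fa assemblies in {assemblies}")

    # The CheckM completeness/contamination table must exist.
    if not Path(checkm_csv).is_file():
        problems.append(f"CheckM .csv file not found: {checkm_csv}")

    # The databases folder must hold a custom Mash database (.msh).
    databases = Path(database_dir)
    if not databases.is_dir():
        problems.append(f"databases directory not found: {databases}")
    elif not any(databases.glob("*.msh")):
        problems.append(f"no Mash database (.msh) in {databases}")

    return problems
```

Calling check_inputs with the same paths entered in config.yml and printing the returned list gives a quick report of anything missing before Snakemake is started.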

Output

The main output is located in the directory new_species/, which contains the best-quality representative genome (.fa file) of each new species. New species matching all of the following criteria are filtered out:

Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
Are singletons (dRep clusters with only one member)
Are <90% complete based on CheckM
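
The criteria above can be sketched as a single predicate. This is an illustration of the rule as written, not the workflow's own code: the Genome class and its field names are hypothetical, and the sketch assumes that the GUNC flag requires all three listed thresholds to be exceeded together.

```python
from dataclasses import dataclass


@dataclass
class Genome:
    # Hypothetical record; field names are illustrative only.
    clade_separation_score: float
    contamination_portion: float
    reference_representation_score: float
    cluster_size: int        # number of members in the dRep cluster
    completeness: float      # CheckM completeness, in percent


def gunc_flagged(g):
    """Assumption: a genome is flagged only when all three thresholds are exceeded."""
    return (
        g.clade_separation_score > 0.45
        and g.contamination_portion > 0.05
        and g.reference_representation_score > 0.5
    )


def is_filtered_out(g):
    """True only when every listed criterion holds, as stated in the text."""
    is_singleton = g.cluster_size == 1
    is_incomplete = g.completeness < 90
    return gunc_flagged(g) and is_singleton and is_incomplete
```

Under this reading, a GUNC-flagged singleton that is at least 90% complete is kept, as is any flagged genome that represents a cluster with more than one member.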
