Skip to content

Repository contributing to a manuscript "DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic genomes"

Notifications You must be signed in to change notification settings

Mohamed-Elmanzalawi/DFAST_QC_Benchmark

Repository files navigation

DFAST_QC: Quality Assessment and Taxonomic Identification Tool for Prokaryotic Genomes

Repository contributing to the manuscript "DFAST_QC: Quality Assessment and Taxonomic Identification Tool for Prokaryotic Genomes"

You can read our manuscript at the following link: Mohamed Elmanzalawi, Takatomo Fujisawa, Hiroshi Mori, Yasukazu Nakamura, Yasuhiro Tanizawa; DFAST_QC: Quality Assessment and Taxonomic Identification Tool for Prokaryotic Genomes; bioRxiv 2024.07.22.604526. GCB


Authors: Mohamed Elmanzalawi and Yasuhiro Tanizawa

This repository contains scripts to replicate DFAST_QC v1.0.0 benchmarking results.

General note: in our benchmark analysis all the tools were installed in a conda environment called "dfast_qc". In each of these scripts, we activate the conda environment. The scripts:

So either delete the conda activation line in these scripts or create a "dfast_qc" conda environment and install all the Dependencies.

The conda env we used is available as yml file which can be used to create it by running the following command:

conda env create -f dfast_qc.yml

After activating the conda env export the path to GTDB-Tk database on your machine. For example:

export GTDBTK_DATA_PATH=/new/path/to/gtdbtk/db

Dependencies

  • DFAST_QC v1.0.0. Please ensure you are using version 1.0.0 as any change in the version might yield different results or errors.

    You can install Conda and run the following command:

conda install bioconda::dfast_qc=1.0.0
conda install anaconda::pandas=2.2.0 

Detailed Benchmarking steps

GenBank analysis:

1. Generate the benchmarking data.

python test_datasets_GenBank.py -g path/to/dfast_qc/genome/directory

Here -g is the pathway for the DFAST_QC genome directory. This is essential as the script will compile a list of genomes from this location and utilize them during the filtering process.

The script will download all the genome assemblies from GenBank from the assembly_summary_genbank.txt file using a modified version of dfast_file_downloader.py. Also ANI_report_prokaryotes.txt will be downloaded for filtering purposes. Both are available here. It then merges these datasets and filters them as follows:

  • Retrieve all genomes that are either Bacteria or Archaea. (found in assembly_summary_genbank.txt)

  • Collect only those genomes with the version status "Latest" and marked as "OK" in the taxonomy check. (found in ANI_report_prokaryotes.txt)

  • Exclude genomes found in the DFAST_QC genome directory. (the script will generate the list automatically)

  • Select only non-type material genomes. (found in assembly_summary_genbank.txt)

  • Removing uncultured and unidentified prokaryotes. (using specific keywords like uncultured,Candidatus .. etc)

  • Finally, select one random sample from every species.

This script will also automatically download dfast_file_downloader.py, which is necessary for downloading the genomes.

Alternatively, the script also provides a dummy data file named "genebank_prok_1_per_species_dummy.tsv" containing 5 genome accessions. This file serves as a test dataset to validate the workflow, enabling users to detect and resolve any errors caused by format changes in the files before processing the entire dataset.

2. Submit a Job to the NIG-SC to get DFAST_QC results.

Don't forget to change the number of genomes in argument[#$ -t {number_of_genomes}] in the job script run_dfastqc_GenBank.sh. The number is provided once you finish running test_datasets_GenBank.py.

Example: [#$ -t 9000]

Then You can run the script.

qsub run_dfastqc_GenBank.sh /home/melmanzalawi/dfast_qc genebank_test_data/genebank_prok_1_per_species_accession.tsv

In case you want to use the dummy data run the following command

qsub run_dfastqc_GenBank.sh /home/melmanzalawi/dfast_qc genebank_test_data/genebank_prok_1_per_species_accession_dummy.tsv
  • The First argument is the path to the DFAST_QC directory.

  • The second argument is the path to the dummy data or any species accession data. If not provided the script will use the real data generated for the test data script.

3. Combine all DFAST_QC results to get the summary file.

The script will also select genomes that have reference genomes only using "reference_genomes.tsv" which can be generated using DFAST_QC v1.0.0 by the following command:

dqc_admin_tools.py dump_sqlite_db

After copying the file "reference_genomes.tsv" to the same directory as the script, you may proceed to execute the script.

python summarize_GenBank_results.py -r DQC_genebank_results/ -as genbank_test_data/assembly_summary_genbank.txt -ani genbank_test_data/ANI_report_prokaryotes.txt 

-r is the pathway for the DFAST_QC results directory.

-as is the pathway for assembly_summary_genbank.txt file.

-ani is the pathway for ANI_report_prokaryotes.txt file.

You can also run it using the NIG-SC by running the following job:

qsub summary_GenBank_job.sh

GEMs analysis:

Genomic catalog of Earth’s microbiomes (GEMS)

1. Generate the benchmarking data.

python test_datasets_GEM.py

The script will download genomes from the genomic catalog of Earth’s microbiome and select 10000 random samples. Similar to the GenBank script, it performs functions such as generating dummy data.

2. Submit a Job to the NIG-SC to get DFAST_QC results.

qsub run_dfastqc_GEM.sh /home/melmanzalawi/dfast_qc gem_test_data/gem_mags_10000_ID_list.tsv

In case you want to use the dummy data run the following command

qsub run_dfastqc_GEM.sh /home/melmanzalawi/dfast_qc gem_test_data/gem_mags_5_ID_list_dummy.tsv
  • The First argument is the path to the DFAST_QC directory.

  • The second argument is the path to the dummy data or any species accession data. If not provided the script will use the real data generated for the test data script.

3. Submit a Job to the NIG-SC to get GTDB-TK results.

qsub run_gtdbtk_GEM.sh

In case you want to use the dummy data run the following command

qsub run_gtdbtk_GEM.sh dummy

3. Combine all GTDB-TK results into one file.

python summarize_GTDBtk_results.py

4. Combine both GTDB-TK final result file and DFASt_QC to get the summary file.

python summarize_GEM_results.py -r DQC_GEM_results/ -g result_GTDBtk_all.tsv

About

Repository contributing to a manuscript "DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic genomes"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published