PhyloProfile and the NCBI taxonomy database

Vinh Tran edited this page May 23, 2023 · 12 revisions

Working with NCBI taxonomy in PhyloProfile

PhyloProfile uses taxonomy information from the NCBI Taxonomy database to provide several related features, such as sorting the input taxa based on the selected reference species, compressing the resolution of the phylogenetic profile to higher taxonomic ranks, or sub-selecting taxa that belong to a specified taxonomic group.

The taxonomy information in PhyloProfile is stored in several files within the /R/library/PhyloProfile/PhyloProfile/data folder. To check where the PhyloProfile package is installed, type the following in the R terminal:

find.package("PhyloProfile")

The two most important files are preProcessedTaxonomy.txt and taxonomyMatrix.txt:

  • preProcessedTaxonomy.txt stores a local copy of the NCBI taxonomy database, pre-processed from the taxdmp files.
  • taxonomyMatrix.txt is used for sorting the input taxa in the profile plot and for dynamically changing the working systematic rank. By default, this file contains the taxonomy information for the QfO reference species.
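To confirm that both files are present in a given data folder, a small check like the one below can help. This is only a sketch: checkTaxonomyFiles is a hypothetical helper written for this page, not a PhyloProfile function, and the commented usage assumes the data folder sits under PhyloProfile/data inside the package directory, as described above.

```r
# Hypothetical helper (not part of PhyloProfile): report which of the
# expected taxonomy files exist in a folder.
checkTaxonomyFiles <- function(
    dataDir,
    expected = c("preProcessedTaxonomy.txt", "taxonomyMatrix.txt")
) {
    present <- file.exists(file.path(dataDir, expected))
    names(present) <- expected
    present
}

# Usage, assuming PhyloProfile is installed and uses the folder layout
# described above:
# dataDir <- file.path(find.package("PhyloProfile"), "PhyloProfile", "data")
# checkTaxonomyFiles(dataDir)
```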

Update, reset, import and export taxonomy database

Update the taxonomy DB

If your phylogenetic profiles include taxa that are not present in the existing set (by default, the QfO reference species), you need to parse the new taxa. If the new taxa are found in the preProcessedTaxonomy.txt file, their taxonomy information can be parsed easily: click the Parse taxonomy info button in the PhyloProfile Shiny app. However, if some of the new taxa are missing from the pre-processed NCBI taxonomy database because it is outdated, you will see the Add taxonomy info button instead. In this case, you can either add the taxa manually or update the preProcessedTaxonomy.txt file using the Update NCBI taxonomy DB function. The update may take a few minutes depending on your internet connection, so please wait until it has completed.

Note: This does not apply to taxa that do not have a valid NCBI taxonomy ID. In that case, please check this post.

Reset the taxonomy DB

If you experience any issues related to taxonomy in PhyloProfile, such as errors in parsing new taxa, missing input taxa, or unusual ordering of taxa in the profile plot, you should consider resetting the NCBI taxonomy database using the Reset NCBI taxonomy DB function.

Import and export taxonomy DB

The taxonomy database files of the current working project (newTaxa.txt, idList.txt, rankList.txt, taxonNamesReduced.txt, taxonomyMatrix.txt and, if available, preCalcTree.nw) can be exported to another location for backup or later reuse using the Export taxonomy DB files function.

From version 1.14.3 onwards, you have the option to use your exported taxonomy DB without replacing the data in the default location (/R/library/PhyloProfile/PhyloProfile/data).

However, if you don't want to give the path to your customised taxonomy DB every time you work with PhyloProfile, you can use the Import taxonomy DB files function to overwrite the default data with yours.

Working with taxa that do not have NCBI taxonomy IDs

Alt Text

How PhyloProfile sorts taxa

We first download the data from the NCBI taxonomy database and process it to generate the preProcessedTaxonomy.txt file. This tab-delimited file contains, for each node, the taxonomy ID, the scientific name, the taxonomic rank, and the ID of its direct parent node.
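As an illustration of this layout, the sketch below writes a toy tab-delimited file with the four fields and reads it back. The header line, the column names (ncbiID, fullName, rank, parentID) and all taxa and IDs are assumptions made for this example; check the real file before relying on them.

```r
# Write a toy file in the assumed layout: tab-delimited, one node per line.
# The header line and the column names are assumptions for this sketch.
toyFile <- tempfile(fileext = ".txt")
writeLines(c(
    "ncbiID\tfullName\trank\tparentID",
    "1\troot\tnorank\t1",
    "10\tFamily X\tfamily\t1",
    "20\tGenus y\tgenus\t10",
    "30\tGenus y species z\tspecies\t20"
), toyFile)

# read.delim() uses the tab character as the field separator by default.
taxDb <- read.delim(toyFile, header = TRUE, stringsAsFactors = FALSE)

# Look up one node by its taxonomy ID, then its direct parent.
node <- taxDb[taxDb$ncbiID == 30, ]
parent <- taxDb[taxDb$ncbiID == node$parentID, ]
cat(node$fullName, "->", parent$fullName, "\n")
```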

Next, we retrieve the complete taxonomy hierarchy string for each input species. This hierarchy includes all 42 defined ranks (e.g., strain, subspecies, species, genus) with fixed positions, as well as "undefined" ranks (e.g., clade, section, subsection) that can appear multiple times and in different positions. These undefined ranks are referred to as norank. The rank names and IDs for both defined and undefined ranks are stored in the rankList.txt and idList.txt files.
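Retrieving a hierarchy amounts to following the parent IDs from a taxon up to the root. The sketch below shows this on a toy node table; it is not PhyloProfile's own code. The function getLineage, the column names and the taxa are hypothetical, and the sketch assumes that the root node lists itself as its parent.

```r
# Toy node table with hypothetical taxa and IDs; column names are assumptions.
nodes <- data.frame(
    ncbiID   = c(1, 10, 20, 30),
    fullName = c("root", "Family X", "Genus y", "Genus y species z"),
    rank     = c("norank", "family", "genus", "species"),
    parentID = c(1, 1, 10, 20),
    stringsAsFactors = FALSE
)

# Hypothetical helper: collect the nodes from a taxon up to the root.
getLineage <- function(taxId, nodes) {
    lineage <- nodes[0, ]
    current <- taxId
    repeat {
        row <- nodes[nodes$ncbiID == current, ]
        if (nrow(row) != 1) stop("taxon ID not found or not unique: ", current)
        lineage <- rbind(lineage, row)
        # Assumption: the root is its own parent, which ends the walk.
        if (row$parentID == current) break
        current <- row$parentID
    }
    lineage
}

lin <- getLineage(30, nodes)
# Hierarchy as parallel rank and ID strings, most specific rank first.
cat(paste(lin$rank, collapse = ";"), "\n")
cat(paste(lin$ncbiID, collapse = ";"), "\n")
```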

We use the rankList.txt and idList.txt files to align all the taxonomy hierarchy strings, resulting in the taxonomyMatrix.txt file. In this process, missing ranks or IDs for each taxon are filled in with the previously available ones. This ensures that all taxa have taxonomy strings of the same length, which facilitates the reconstruction of the NCBI taxonomy tree. This approach improves the resolution of the taxonomy tree and results in a more accurate sorting of taxa in the phylogenetic profile.
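The filling step can be sketched as follows. This is a simplified illustration under stated assumptions, not the PhyloProfile implementation: the rank list is shortened, the taxa and IDs are made up, alignLineage is a hypothetical helper, and "previously available" is interpreted here as the nearest more specific rank in the string.

```r
# Shortened, hypothetical master rank order, most specific rank first.
rankOrder <- c("species", "genus", "family", "order", "class")

# Toy lineages: named vectors mapping rank -> taxon ID (hypothetical IDs).
lineages <- list(
    taxonA = c(species = 101, genus = 11, family = 5, order = 3, class = 1),
    taxonB = c(species = 102, family = 5, class = 1),            # no genus, no order
    taxonC = c(species = 103, genus = 12, order = 4, class = 1)  # no family
)

# Hypothetical helper: place a lineage on the master rank order and fill
# each missing rank with the ID of the previous (more specific) rank.
alignLineage <- function(lineage, rankOrder) {
    aligned <- lineage[rankOrder]   # NA wherever a rank is absent
    names(aligned) <- rankOrder
    for (i in seq_along(aligned)) {
        if (is.na(aligned[i]) && i > 1) aligned[i] <- aligned[i - 1]
    }
    aligned
}

# One row per taxon, one column per rank; all rows have the same length.
taxMatrix <- do.call(rbind, lapply(lineages, alignLineage, rankOrder = rankOrder))
print(taxMatrix)
# taxonB becomes 102, 102, 5, 5, 1 and taxonC becomes 103, 12, 12, 4, 1
```

Because every row now has an entry at every rank, taxa can be compared column by column, which is what makes the later tree reconstruction and sorting straightforward.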

We implemented this algorithm in PhyloProfile after improving the class2tree function of the taxize R package.