Skip to content

UITOTO is an R package designed for deriving, testing, and visualizing Diagnostic Molecular Combinations (DMCs). The package also features a Shiny App that can be executed either locally or online (please visit https://atorresgalvis.shinyapps.io/MolecularDiagnoses/).

License

Notifications You must be signed in to change notification settings

atorresgalvis/UITOTO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DOI

  

The UITOTO R package addresses the challenges associated with finding, testing, and visualizing reliable Diagnostic Molecular Combinations (DMCs), especially those arising from high-throughput taxonomy. The package also features a user-friendly Shiny App that can be accessed online or locally in RStudio.

🧰 Pre-Installation

You may need to complete a pre-installation process to ensure your environment is configured with the prerequisites required for UITOTO installation. The complete list of packages from CRAN required by UITOTO could be provided by typing in R:

packages <- c("dplyr", "ggplot2", "readr", "seqinr", "shiny", "shinyjs", "shinyWidgets")

Afterward, you can install the packages that have not yet been installed all at once with:

installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

As UITOTO uses some packages from Bioconductor (Biostrings and DECIPHER), it is highly recommended to follow the instructions included in https://www.bioconductor.org/install/. The BiocManager package is used for managing Bioconductor resources, so to get it you should use:

# Updated to 24/10/2024.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.20")

After that, you can install Biostrings:

BiocManager::install("Biostrings")

As well as the DECIPHER package:

BiocManager::install("DECIPHER")

Now you should be ready to install UITOTO. However, it is advisable to restart the RStudio session or simply close and reopen the program.

💾 Installation

Now that everything is ready, you have different ways for installing the package.

  1. You can install the released version of UITOTO from with:
# Install and load devtools, if you don't have it, with the commands:
# 		install.packages("devtools"); library(devtools)
devtools::install_github("atorresgalvis/UITOTO")
  1. Additionally, you can download the source package (e.g., the folder UITOTO_1.0.0.tar.gz) and install it using mouse-only navigation 🖱 in RStudio:
Installation of UITOTO using mouse-only navigation in RStudio.

Installation of UITOTO using mouse-only navigation in RStudio.

Then, you should load the package into your work session:

library(UITOTO)

👩‍💻 Get Started

  • 🏃‍♂️ Running the UITOTO Shiny app locally

    runUITOTO()

    ⚠️ IMPORTANT: By default, users of Shiny apps can only upload files up to 5 MB. You can increase this limit by setting the option before executing the UITOTO shiny app. For example, to allow up to 12 MB use:

    options(shiny.maxRequestSize = 12 * 1024^2)
    #And then run the UITOTO shiny app normally
    runUITOTO()
    UITOTO Shiny app home page.

    UITOTO Shiny app home page.

  • 🔍 Find Diagnostic Molecular Combinations (DMCs)

    You could use the module Find DMCs of the UITOTO Shiny app for identifying reliable DMCs. However, for very time-consuming searches ⏳, the command-driven version is strongly recommended. For this, you will need to use the OpDMC function. You will notice that the function includes eight different arguments for configuring searches. However, most of them are provided with default settings, enhancing user-friendliness while also enabling customized DMC searches. The complete syntax of the OpDMC function is outlined below. Information regarding each argument can be found in the package manual, function help documentation (use ?OpDMC in the console), or directly within the UITOTO Shiny app:

    OpDMC("FastaFile.fasta", 
          "SpeciesList.csv", 
          iter = 20000, 
          MnLen = 4, 
          exclusive = 4, 
          RefStrength = 0.25, 
          OutName = "OpDMC_output.csv", 
          GapsNew = FALSE
    )

    👀 Note well that the UITOTO Shiny app can also function as a scripter. This means you don’t have to worry about the syntax of the commands; you simply need to drag the files and modify the settings using mouse-only navigation 🖱. The app will then automatically display the equivalent R commands based on the actions you performed visually:

    Module 'Find DMCs' of the UITOTO Shiny app.

    Module 'Find DMCs' of the UITOTO Shiny app.

    The two mandatory arguments that the user must provide are the names of the input files to be used during the searches. The format and conventions of the input files used by UITOTO are quite simple. First, just provide a single-column csv file with the list of entities (e.g., species, genera, OTUs) you wish to diagnose:

    Megaselia_fengae
    Megaselia_oliverleei
    Megaselia_singaporensis
    Megaselia

    On the other hand, the OpDMC function reads each one of the rows of the previous file as a character/string vector and selects those sequences from the fasta file that have the string in its header. So, it doesn’t matter the format of the header of your sequence as long as it contains the recognition pattern/string. For example, the fasta file could look like this::

    >ZRCBDP0359666_Senph04Megaselia_oliverleei_holotype
    -tt-tcgtcatctattgcacatatcggctatgctgttgatttagcaattttctcccttcacttggccggaattttttttattttaggagcagtaaatttttttactacaattattggtatacgctcatcaggaatctcattcgaccgaatgcctttatttgtaaaatacgtaggaattaccgcccttttacttcttttat--------gtattagccggagctatcacaatattattaacagatcgaaattttaatacatcatttttcgaccctgcaggaggaggagatccccttttatatca----------
    >Megaselia_fengae
    tttctcgtcatctattgcacatatcggctatgctgttgatttagcaattttctcccttcacttggccggaattttttttattttaggagcagtaaatttttttactacaattattggtatacgctcatcaggaatctcattcgaccgaatgcctttatttgtaaaatacgtaggaattaccgcccttttacttcttttatcgtattagccggagctatcacaatattattaacagatcgaaattttaatacatcatttttcgaccctgcaggaggaggagatccccttttatatca-----------------   
    >ZRCBDP0364918_LPU09_PU14_Pulau-Ubin_19Apr2018|Megaselia_singaporensis
    cggagctatcacaatattatt---a--attg-g--ta-acg-tc--tca-c-g-aat-ct--tt-atttgtaaaatacg-taggaattaccg-cccttt-tacttcttttatcgtattagccggagctatcacaatattattaacagatcgaaattttaatacatcatttttcgaccctgcaggaggaggagatccccttttatatcaatctattgcacatatcggctatgctgttgatttagcaattttctcccttcacttgatctattgcacatatcggctatgctgttgatttagcaattttctcccttc

    That's it! Once you have your input files, you are ready to perform DMC searches using the command-drive version or the UITOTO Shiny app.

  • 🧬 Identifying unknown –aligned and unaligned– sequences using DMCs

    Once you have identified Diagnostic Molecular Combinations (DMCs) using UITOTO, you can use these DMCs to identify unknown sequences. This functionality is accessible via the Shiny app Graphical User Interface (GUI) under the module Taxonomic verification and identification using DMCs or through the command-line functions: Identifier, ALnID, and IdentifierU.

    The Identifier function is the simplest of the three, designed specifically for aligned sequences. It identifies unknown sequences by matching them against a pool of DMCs provided by the user. Therefore, it requires just two input files: a FASTA file with the unknown sequences and a CSV file containing the DMCs. The input files should be formatted as follows:

    >UnknownSequence1
    aactttatacttcatttttggagcttgagcaggaatagttggtacttcactaagaattataattcgagctgaattagggcacccaggtgctttaattggtgatgatcaaatttataatgtaattgtaactgcacatgcatttattataattttttttatagttataccaattataataggaggatttggaaattgattagtacctttaatattaggagctcctgatatagcctttcctcgaataaataatataagattttgaatattacctccatctttaacactattattggcaagtagtatagtagaaaacggagctggaacaggatgaactgtctatcctccactttcttcaaatattgctcatagtggagcttcggttgatttagcaattttttctctccacttagctggtatctcatctattttaggagctgtaaattttattacaacaattattaatatacgttcatcaggaatcacttttgatcgaatacctttatttgtatgatcagttggaattacagctattttactacttctttctttaccagtattagccggagctattactatacttttaacagatcgaaattttaatactgccttttttgaccctgccggaggaggagatccaattctttatcaacatttattt
    >UnknownSequence2
    aacgttatattttatttttggtgcctgagctggaatagtgggtacttctcttagtatcataattcgagcagaattgggtcaccctggtgcattaattggggacgatcaaatttataatgtaattgttaccgctcatgcatttattataattttctttatagtaataccaattataataggaggatttggaaattgattagtacctttaatactaggagctcctgatatagcttttcctcgaataaataatataagtttttgaatattacccccatctttaactcttttattagccagaagtatagtagaaaatggggctggaacaggatgaactgtttatcctccactttcttctagaatcgcccatagaggagcttctgtagacttagcaattttttctcttcatttagcgggaatttcctcaattttaggagctgttaattttattactactattattaatatacgatcatcaggaattacttttgaccgaatacctttatttgtctgatctgttggaattacagctcttttattacttttatcattaccagtattagctggagcaattactatattattaacagatcgaaattttaacacctcattttttgatccagctggaggaggagaccctattctttatcaacatttattt
    >UnknownSequence3
    aacgttatatttcatttttggagcatgagctggaatagtaggaacatcattaagaattataattcgagccgaattaggacatcctggtgctttaattggtgacgatcaaatttataatgtaattgttactgcacatgctttcattataattttctttatagtaatacctattataatgggaggatttggtaattgacttgtacctttaatattaggagctcctgatatagccttccctcgaataaataatataagtttttgaatattacctccttctttaactcttttattagccagaagtatagtagaaaatggagctgggacaggatgaacagtttacccacctctttcttctagtattgctcatagtggggcatctgtagatttagcaattttctcattacaccttgctggaatttcttcaattttaggagctgtaaattttattactacaattattaatatgcgagcttcaggaattacttttgatcgaatacctttatttgtttgatcagtaggaattacagctttactccttttattatcacttcctgttttagcaggagctattactatacttttaactgatcgaaattttaacacttctttcttcgatccagcaggaggaggagatccaattttatatcaacacttattt
    >UnknownSequence4
    aactttatattttatttttggggcatgagcaggaatagtgggaacttctttaagaataataattcgtgctgaattaagccaccccggtgctttaattggagatgatcaaatttacaatgttattgtcactgcccatgcatttattataattttttttatagttatacctattataataggtggatttggtaattgacttgtccctctaatactaggagcacctgatatagcttttcctcgtataaataatataagtttttgaatacttcctccatccctaacattacttattgcaagaagtatagtagaaaatggggctggtacaggttgaacagtttaccctcccctatcctcaagaattgctcatagtggcgcttctgttgatctagctattttttctcttcatatagcaggaatttcttctattttaggagcagttaattttattacaacaattattaatatacgatcaataggaatcacttttgaccgtatacctttatttgtttgatcagttggtattactgccattctattattactatcattgcctgttttagcaggagctattacaatacttttaacagaccgaaactttaacacttctttttttgatccggccggtggtggagacccaatcttataccaacatttattc
    "","Species","DMC","Alter-DMC"
    "1","Species_1","[58: C, 85: T, 127: G, 247: T, 274: C]","[1: G, 2: C, 127: G, 274: C, 313: T]"
    "2","Species_2","[85: T, 106: G, 118: C, 268: C]","[106: G, 118: C, 169: T, 172: T]"
    "3","Species_3","[92: A, 170: A, 184: A, 208: C, 256: C, 283: G]","[7: A, 62: C, 70: A, 92: A, 185: C, 283: G]"

    👀 Please note that the DMCs file produced by UITOTO (i.e., the output of the OpDMC function, as detailed in the previous section) includes three columns: “Species,” “DMC,” and “Alter-DMC.” This is because the file provides both the primary/final DMC (“DMC” column) and an alternative DMC (“Alter-DMC” column) for each species. However, the Identifier function only requires the “Species” and “DMC” columns; making the “Alter-DMC” column optional. Thus, DMC files generated by other software can also be used in UITOTO as long as they include the two mandatory columns.

    Once we have the two mandatory input files, we can run the Identifier function following its syntax outlined below:

    Identifier(
        DMC_Output = "OpDMC_output.csv",
        dataset = "DatasetAlignedQuery.fas",
        mismatches = 1,
        OutName = "IdentificationOutput.csv",
        MissLogFile = "LogMissing.csv"
    )

    As we mentioned, both input files are mandatory. However, the remaining arguments of the function are optional. The mismatches argument defines the maximum number of mismatches allowed between the DMC and the unknown sequences, with a default value of 1. OutName allows specifying the name of the output CSV file for the identification results (default is “IdentificationOutput.csv”), while MissLogFile defines the file name for logging comparisons that could not be performed due to missing data; setting this to “none” skips logging, which is not recommended for alignments with missing or gappy regions. Detailed information for each argument can be found in the package manual, or by typing ?Identifier in the console.

    💡 Note: The ALnID function operates similarly to the Identifier function, but first it aligns unaligned unknown sequences from a FASTA file with the sequences used to derive the DMCs. This function is particularly useful when the unknown sequences are not aligned with each other or with the sequences used to derive the DMCs. Finally, the IdentifierU function employs an alignment-free approach, using a dynamic sliding window to iteratively compare patterns with the provided DMC patterns.

    📊 Performance evaluation and cross-validation 📈
    In addition to identifying unknown sequences, DMCs can be used to evaluate their performance through cross-validation. This involves assessing the accuracy and reliability of the DMCs by applying them to a separate set of sequences and measuring how well they identify these sequences. To conduct this evaluation, you should first split your dataset into training and testing datasets. The training dataset is then used to derive the DMCs (e.g., by using the OpDMC function), while the testing dataset is used to assess their performance. Although UITOTO does not include a built-in cross-validation feature or module, you can manually split your data and use the ALnID or Identifier functions to evaluate the DMCs. Additionally, you may follow the cross-validation pipeline and scripts developed in the UITOTO software article (see citation section below), where metrics commonly used in machine learning techniques, such as discriminant analysis and deep learning, were employed to evaluate the effectiveness of DMCs as a classification tool. This pipeline was also used by Amorin et al., (2024).

  • 🖍 Visualizing DMCs 👩‍🎨🖌

    The “Visualize DMCs” module in UITOTO is designed to assist users in generating detailed, publication-quality visualizations of Diagnostic Molecular Combinations (DMCs). This module is accessible exclusively via the Shiny app interface and offers interactive features that facilitate the comparison and interpretation of DMCs across various taxa. Users can customize visualizations with options such as color schemes, plot types, and annotation features, tailoring the output to meet their specific needs. Additionally, the visualizations can be exported in high-resolution formats, making them suitable for inclusion in scientific publications and presentations.

    To use the module, open the UITOTO Shiny app and navigate to the “Visualize DMCs” tab. Begin by importing your DMC data file, which can be obtained using the OpDMC function (see above), along with a FASTA file containing the sequences on which the DMCs will be illustrated. Ensure that the FASTA file includes at least one specimen/sequence for each species represented in the DMC file, and that these sequences contain a recognizable string in their names; by default, this string is “_holotype” (i.e., verify that the FASTA file includes at least one sequence with this recognition-string for each species listed in the DMCs file).

    Once your data is loaded, utilize the interactive tools to create plots and tables, customizing the visual appearance to your preference. You can then export your visualizations in various formats or directly copy and paste them into your manuscripts or presentations. Here is an example of a visualization in the UITOTO Shiny app:

    Module 'Visualize DMCs' of the UITOTO Shiny app.

    Module ‘Visualize DMCs’ of the UITOTO Shiny app.

📝 Citation and references

  • Amorim, D.S., Oliveira, S.S., Balbi, M.I.P.A., Ang, Y., Torres, A., Yeo, D., Srivathsan, A., Meier, R. (2024). An integrative taxonomic treatment of the Mycetophilidae (Diptera: Bibionomorpha) from Singapore reveals 115 new species on 730 km². Integrative Systematics: Stuttgart Contributions to Natural History, in press.
  • Torres, A., Lee, L., Srivathsan, A., & Meier, R. (2024). UITOTO: User Interface for Optimal Molecular Diagnoses in High-Throughput Taxonomy (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.11236953

About

UITOTO is an R package designed for deriving, testing, and visualizing Diagnostic Molecular Combinations (DMCs). The package also features a Shiny App that can be executed either locally or online (please visit https://atorresgalvis.shinyapps.io/MolecularDiagnoses/).

Resources

License

Stars

Watchers

Forks

Packages

No packages published