Skip to content

Latest commit

 

History

History
36 lines (31 loc) · 5.68 KB

02.introduction.md

File metadata and controls

36 lines (31 loc) · 5.68 KB

Introduction

Proteins are fundamental nanoscale instruments of life, and knowledge of their local molecular features informs function.Advances in sequencing technology(citation?) increased the accessibility of discovering novel protein sequences, however, annotating these sequences with functional domains is hindered by database bias towards the ___ organisms representing ___% of publications in PubMed. Fewer than 1% of the at least 10 million animals have 100 or more annotated genes in the UniProtKB/SwissProt (2022_01, 23-Feb-2022) database, when most animals species have 1000s of predicted genes (Figure 1a). Fewer than 1% of the at least 10 million Eukaryota have 100 or more annotated genes and the top 20 most studied organisms comprise 21.5% of the total number of entries in the UniProtKB/SwissProt (2022_01, 23-Feb-2022) database @https://www.uniprot.org/uniprotkb/statistics [to be replaced with 2024_01 archive version when not available anymore]. As extinction rates increase due to changing global climate, the need to understand each species’ unique niche and contribution to life on earth is ever more urgent. Thus, as sequence annotation is traditionally performed by raw sequence similarity to previously observed data, over 99% of Metazoans have little chance of having sequence similarity to existing data, and it is very difficult to assign any annotation. In image-based machine learning, reference datasets are often expanded using data augmentation techniques such as flipping, translating, cropping, or rotating the original image to detect comparable features in a new context @doi:10.1109/IIPHDW.2018.8388338. However, in the biological sequence space, data augmentation is challenging as not many schemas exist. Recent work @https://doi.org/10.1101/2021.02.18.431877 investigated protein data augmentation using the replacement of individual amino acids, shuffling, reversion, and subsampling subsequences. Reduced amino acid alphabets accomplish the task of data augmentation by implicitly encoding many protein synonyms of similar biochemical structures, thus expanding the adjacent possible protein sequences.

Key: Degenerate amino acids are a form of data augmentation for protein sequences, expanding the “known unknown” of adjacent possible protein sequences that may be found in understudied organisms. Even the most effective tools are limited by their training data. Sequence-specific tools such as BLAST and HMMER may result in no matches for divergent organisms, as the sequence is so different. As a constant arms race between hosts and pathogens, immune gene sequences evolve at a much faster rate than other genes, amplifying large evolutionary distances and making immune genes especially difficult to annotate. While the full spectrum of protein sequences compatible with eukaryotic life have not yet been observed, biological sequence data is highly structured and nonrandom, and many proteins retain similar structures even with different underlying amino acid sequences. In machine learning, the lack of data is mitigated by data augmentation. For example, an expanded image dataset may be created by flipping, rotating, cropping, or translating the original images. Like flipping an image, degenerate amino acid alphabets inform adjacent possible protein sequences from existing protein sequence databases.

Key: The hydrophobic-polar representation of amino acids is a sequence-derived proxy for structural information. As structure informs function, a tool like AlphaFold complements functional annotation by providing 3D structural information of how the protein may execute the function. A proxy for three-dimensional structure can be approximated through hydrophobic and polar sequences. Early work @doi:10.1021/bi00327a032 recognized three-dimensional organization of proteins is often driven by hydrophobic and polar interactions, where hydrophobic residues tend to group together, while polar amino acids tend to be on the surface. By abstracting the protein sequence to consider only hydrophobic and polar properties of amino acid residues, orders of magnitude more possible underlying amino acid sequences can be represented in the same latent space.

In earlier work introducing “k-mer homology” @https://doi.org/10.1101/2021.07.09.450799, we used a six-letter degenerate amino acid alphabet to propagate cell type labels across scRNA-seq datasets from a species with a well-annotated genome, the mouse, to human and a species with a poorly annotated genome, the Chinese horseshoe bat. Key: We previously identified highly polymorphic adaptive immunity genes in unaligned sequences of bat single-cell RNA-seq using k-mers. We also identified previously unannotated genes in the Chinese horseshoe bat by finding unaligned sequences sharing amino acid k-mers with annotated human genes, such as components of the major histocompatibility complex (MHC), HLA-DRB1, and HLA-DRB6. Here, we extend our previous work by applying the concept of k-mer homology to identify distant homologs. While sequence data is readily available, the full spectrum of biologically valid sequences is unknown, limited to only sequences that have so far been observed.

Key: Reduced alphabets increase the effective size of the reference amino acid database. By expanding the reference database with related sequences, reduced alphabets help algorithms identify adjacently congruent sequences, that may not have exactly the same amino acid sequence but similar biochemical properties. The full extent of k-mers compatible with Eukaryotic life is not yet known but can be implied by reduced amino acid alphabets.