OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND).
Requires Python >= 3.8. Download the package from PyPI, resolving the dependencies, by using `pip install omamer`.
Alternatively, clone this repository and install manually.
Note: Python 3.12 is currently not supported, until the `numba` package is updated (issue).
Pre-built databases are available for the latest OMA release from the download section on the OMA Browser website.
- LUCA: https://omabrowser.org/All/LUCA.h5
- Metazoa: https://omabrowser.org/All/Metazoa.h5
- Viridiplantae: https://omabrowser.org/All/Viridiplantae.h5
- Saccharomyceta: https://omabrowser.org/All/Saccharomyceta.h5
- Primates: https://omabrowser.org/All/Primates.h5
Their names indicate the root-taxon parameter used. Other non-required parameters were left to default.
Note: databases included in the Zenodo upload from the manuscript are not supported by the most recent version of OMAmer. We recommend using the most recent release with databases built on the most recent OMA browser release.
Assign proteins to families and subfamilies in a pre-existing database.
Required arguments: `--db`, `--query`
usage: omamer search [-h] -d DB -q QUERY [--threshold THRESHOLD] [--family_alpha FAMILY_ALPHA] [-fo] [-n TOP_N_FAMS] [--reference_taxon REFERENCE_TAXON]
[-o OUT] [--include_extant_genes] [-c CHUNKSIZE] [-t {0,1,2,3,4,5,6,7,8}] [--log_level {debug,info,warning}] [--silent]
Short Flag | Flag | Default | Description |
---|---|---|---|
`-d` | `--db` | | Path to existing database (including filename) |
`-q` | `--query` | | Path to FASTA formatted sequences |
 | `--threshold` | 0.1 | Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the threshold, the more (over-)specific predicted HOGs will be. |
 | `--family_alpha` | 1e-6 | Significance threshold used when filtering families. |
`-fo` | `--family_only` | False | If set, only place at the family level. Useful for certain analyses. Note: `subfamily_medianseqlen` in the results is for the family level. |
`-n` | `--top_n_fams` | 1 | Number of top level families to place into. By default, placed into only the best scoring family. |
`-o` | `--out` | stdout | Path to output. If not set, defaults to stdout. |
 | `--include_extant_genes` | | Include extant gene IDs as a comma-separated entry in results |
`-c` | `--chunksize` | 10000 | Number of queries to process at once. |
`-t` | `--nthreads` | 1 | Number of threads to use |
 | `--log_level` | info | Logging level (options: debug, info, warning) |
 | `--silent` | | Set to silence the output. |
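As a rough sketch, a search run can be assembled from Python using only flags listed in the table above. The helper `build_search_command` and the file names `LUCA.h5`, `query.fa` and `results.tsv` are placeholders chosen for illustration; they are not part of OMAmer.

```python
import shlex


def build_search_command(db, query, out=None, threshold=None, nthreads=None):
    """Assemble an `omamer search` argument list (hypothetical helper)."""
    cmd = ["omamer", "search", "--db", db, "--query", query]
    # Optional flags are only added when a value is given, so defaults apply otherwise.
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if nthreads is not None:
        cmd += ["--nthreads", str(nthreads)]
    if out is not None:
        cmd += ["--out", out]
    return cmd


cmd = build_search_command("LUCA.h5", "query.fa", out="results.tsv", nthreads=4)
print(shlex.join(cmd))
# To actually run the search (requires omamer to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.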
Output is a tab-separated values (TSV) file, with metadata added to the header using `!<tag>: <value>`. A parser can be imported for further analysis in Python with `from omamer.results_reader import results_reader`.
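Where importing the bundled parser is not an option, the layout described above can also be read with the standard library. The following is a minimal sketch, assuming that metadata lines start with `!` and that the first remaining line holds the column names. The function `read_omamer_tsv`, the sample text, its tag and its two column names are illustrative assumptions, not real OMAmer output.

```python
import csv
import io


def read_omamer_tsv(handle):
    """Return (metadata dict, list of row dicts) from a TSV with '!tag: value' headers."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("!"):
            # Split on the first colon only, so values may themselves contain colons.
            tag, _, value = line[1:].partition(":")
            metadata[tag.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return metadata, rows


# Invented sample, for illustration only.
sample = (
    "!example-tag: example value\n"
    "qseqid\thogid\n"
    "seq1\tHOG:0487954.3l.27l\n"
)
metadata, rows = read_omamer_tsv(io.StringIO(sample))
print(metadata)
print(rows[0]["hogid"])
```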
The sequence identifier from the input FASTA-formatted sequences.
The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).
A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild.
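The dot-separated structure makes it easy to recover a HOG's ancestors from its identifier alone. A small sketch (the function name `hog_lineage` is ours, not part of OMAmer):

```python
def hog_lineage(hog_id):
    """List a HOG identifier's ancestors, from the root-HOG down to the HOG itself."""
    parts = hog_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


lineage = hog_lineage("HOG:0487954.3l.27l")
print(lineage)
# ['HOG:0487954', 'HOG:0487954.3l', 'HOG:0487954.3l.27l']
root_hog = lineage[0]  # the root-HOG, i.e. the family
```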
The taxonomic level that the predicted HOG is defined at.
p-value of having as many or more k-mers in common under a binomial distribution. Reported in negative natural log units.
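To illustrate the units, the sketch below computes an upper-tail binomial probability and converts it to negative natural log units. The function names, the parameters `n` and `p`, and the numbers used are invented for illustration; how OMAmer estimates the distribution for each family is not shown here.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def neg_ln_pvalue(k, n, p):
    """Upper-tail p-value in negative natural log units (larger means more significant)."""
    return -math.log(binomial_tail(k, n, p))


# Invented numbers: 12 shared k-mers out of 100 trials, 2% chance of a hit per trial.
score = neg_ln_pvalue(12, 100, 0.02)
print(round(score, 2))

# For comparison, a p-value of 1e-6 is about 13.82 in these units.
print(round(-math.log(1e-6), 2))
```

In these units a larger value means a smaller p-value, so a significance threshold such as `--family_alpha` would translate into a lower bound on the reported number.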
Count of k-mers in common with the family / root level HOG.
Family count, normalised by the expected number of hits for the query's sequence length, with the family's k-mer content.
The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs.
Count of k-mers in common with the sub-family / HOG.
Median length of the sequences that are present in the predicted HOG. In the case of family-only placement, this is instead reported at the root-HOG level.
The proportion of the query sequence overlapping with k-mers of reference root-HOGs. This may be helpful to reject partially homologous matches that are problematic in some applications.
Optionally printed (see `--include_extant_genes`). Comma-separated list of extant gene IDs of the predicted HOG. The OMA browser can be used to find out more information, in particular through its REST API or the Python API Client.
Building a database is currently reliant on the OMA browser's database file and the species phylogeny of HOGs. Building from OrthoXML files will be available shortly.
Required arguments: `--db`, `--oma_path`
usage: omamer mkdb [-h] --db DB [--nthreads NTHREADS] [--min_fam_size MIN_FAM_SIZE] [--min_fam_completeness MIN_FAM_COMPLETENESS] [--logic {AND,OR}]
[--root_taxon ROOT_TAXON] [--hidden_taxa HIDDEN_TAXA] [--species SPECIES] [--reduced_alphabet] [--k K] --oma_path OMA_PATH
[--log_level {debug,info,warning}]
Flag | Default | Description |
---|---|---|
`--db` | | Path to new database (including filename) |
`--nthreads` | 1 | Number of threads to use |
`--min_fam_size` | 6 | Only root-HOGs with a protein count passing this threshold are used. |
`--min_fam_completeness` | 0.5 | Only root-HOGs passing this threshold are used. The completeness of a HOG is defined as the number of observed species divided by the expected number of species at the HOG taxonomic level. |
`--logic` | OR | Logic used between the two above arguments to filter root-HOGs. Options are "AND" or "OR". |
`--root_taxon` | root of speciestree.nwk | HOGs defined at, or descending from, this taxon are used as root-HOGs. |
`--hidden_taxa` | | The proteins from these taxa are removed before the database computation. Usage: a file containing taxa on separate lines (scientific name). These must match EXACTLY with the node name in the tree given. |
`--reduced_alphabet` | | Use reduced alphabet from Linclust paper |
`--k` | 6 | k-mer length |
`--oma_path` | | Path to a directory with both OmaServer.h5 and speciestree.nwk |
`--log_level` | info | Logging level |
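The interplay of `--min_fam_size`, `--min_fam_completeness` and `--logic` can be sketched as follows. This illustrates the filtering rule as described in the table and is not OMAmer's own code; the function names, the use of `>=` for "passing" a threshold, and the example numbers are assumptions.

```python
def completeness(observed_species, expected_species):
    """Observed species divided by the species expected at the HOG's taxonomic level."""
    return observed_species / expected_species


def keep_root_hog(n_proteins, observed_species, expected_species,
                  min_fam_size=6, min_fam_completeness=0.5, logic="OR"):
    """Apply the size and completeness thresholds, combined with AND or OR."""
    size_ok = n_proteins >= min_fam_size
    complete_ok = completeness(observed_species, expected_species) >= min_fam_completeness
    if logic == "AND":
        return size_ok and complete_ok
    if logic == "OR":
        return size_ok or complete_ok
    raise ValueError('logic must be "AND" or "OR"')


# A small family (4 proteins) observed in 3 of 4 expected species.
print(keep_root_hog(4, 3, 4, logic="OR"))   # True: completeness 0.75 passes
print(keep_root_hog(4, 3, 4, logic="AND"))  # False: size 4 is below 6
```

With the default `OR`, a root-HOG is kept if it passes either threshold; `AND` is stricter and keeps only root-HOGs that pass both.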
- fixes issue #34 (numpy2 incompatibility)
- experimental support to build omamer databases from orthoxml/fasta files
- update github action to latest versions
- fixes issue #30
- update github action to latest versions
- changed method for hiding taxa in build process. Now takes a file containing taxa to hide on separate lines.
- checks and improved feedback for root taxon and requested taxa to hide.
- root taxon set by default to the root level in speciestree.nwk (previously hard-coded to default to LUCA)
- remove dependency for filehash library
- return a better error message if build dependencies are not met when trying to build an omamer database
- minor fixes
- Major update of database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
- Fixes an issue when storing the pre-computed statistics
- Improved loading time for standard search by pre-computing statistics
- Adding new command line option "info" to show the metadata of the dataset used to build the omamer database.
- Automated deployment to PyPI
- Removed PyHAM dependency
- Added `--min_fam_completeness`, `--logic`, `--score` and `--reference_taxon` options
- New output format
- Debugging
- Debugging
- Added hidden_taxa and threshold arguments
- Initial release
OMAmer is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
OMAmer is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with OMAmer. If not, see http://www.gnu.org/licenses/.
Victor Rossier, Alex Warwick Vesztrocy, Marc Robinson-Rechavi, Christophe Dessimoz, OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches, Bioinformatics, 2021, btab219, https://doi.org/10.1093/bioinformatics/btab219