question about OMArk rules and omamer result #35

Open
CAShuangchao opened this issue Oct 19, 2024 · 2 comments

Comments

@CAShuangchao

Hello, I would like to ask about the rules OMArk uses to determine gene conservation and consistency. What specific rules are applied to the OMAmer results to derive these values? Are there any detailed instructions? Thanks.

@YanNevers
Collaborator

Hello!

OMArk will automatically process the OMAmer results, while recovering some additional data from the OMAmer database. All details are described in OMArk's paper at https://www.nature.com/articles/s41587-024-02147-w .

However, here is a quick overview:

For completeness: OMArk queries the OMAmer database to extract all orthologous groups (HOGs) at the taxon of interest and selects the ones present in more than 80% of species (the conserved HOGs). If those HOGs are also found in the OMAmer input, it reports them as present (as single or duplicated, depending on the number of occurrences in the file).
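
As a rough illustration of the completeness rule described above, here is a minimal Python sketch. It is not OMArk's actual code: the function names, the input structures (`hog_species`, `placed_hogs`) and the "Not found" bucket are hypothetical, and the real implementation described in the paper handles more detail than this.

```python
from collections import Counter

def select_conserved_hogs(hog_species, n_species, min_fraction=0.8):
    """Keep HOGs present in more than `min_fraction` of the clade's species.

    hog_species: dict mapping a HOG id to the species in which it occurs
    (hypothetical input; OMArk obtains this from the OMAmer database).
    """
    return {
        hog
        for hog, species in hog_species.items()
        if len(set(species)) / n_species > min_fraction
    }

def completeness_report(conserved_hogs, placed_hogs):
    """Count conserved HOGs seen once, more than once, or not at all.

    placed_hogs: one HOG id per placed protein in the OMAmer file.
    The "Not found" bucket is an assumption added for illustration.
    """
    counts = Counter(placed_hogs)
    report = {"Single": 0, "Duplicated": 0, "Not found": 0}
    for hog in conserved_hogs:
        n = counts.get(hog, 0)
        if n == 0:
            report["Not found"] += 1
        elif n == 1:
            report["Single"] += 1
        else:
            report["Duplicated"] += 1
    return report
```

Note that the sketch uses a strict "more than 80%" comparison, so a HOG found in exactly 80% of species is not selected.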

For the consistency part, it is slightly more complicated. Again, OMArk queries the OMAmer database to obtain the list of all HOGs known to exist in the clade of interest. If the HOG that OMAmer found for a protein is in this list, the protein is classified as phylogenetically Consistent (blue); proteins with no match in the OMAmer file are classified as Unknown.
The remaining proteins are classified as Contamination if the placement corresponds to a contamination (which OMArk assesses earlier in the process; see the paper), or as Inconsistent otherwise.
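
The decision order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `classify_consistency`, `clade_hogs` and `contaminant_hogs` are hypothetical names, and the set of contaminant placements is taken as given here because OMArk determines it in an earlier step that is described in the paper.

```python
def classify_consistency(placed_hog, clade_hogs, contaminant_hogs):
    """Return the consistency category for a single protein.

    placed_hog: HOG id assigned by OMAmer, or None if there was no match.
    clade_hogs: set of HOG ids known to exist in the clade of interest.
    contaminant_hogs: set of HOG ids attributed to a detected contaminant
    (hypothetical input; assumed to be computed beforehand).
    """
    if placed_hog is None:
        return "Unknown"
    if placed_hog in clade_hogs:
        return "Consistent"
    if placed_hog in contaminant_hogs:
        return "Contamination"
    return "Inconsistent"

# Example with made-up HOG ids
clade = {"HOG:1", "HOG:2"}
contaminants = {"HOG:9"}
labels = [
    classify_consistency(h, clade, contaminants)
    for h in ["HOG:1", None, "HOG:9", "HOG:7"]
]
```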

For the structural consistency (fragment/partial mapping), OMArk uses the data provided in the OMAmer output directly.
A sequence is classified as Fragment if the query length (qseqlen in OMAmer) is less than half the median protein length in the HOG it was placed into (subfamily_medianseqlen in OMAmer).
A sequence is classified as Partial mapping if the k-mer matches are detected over only part of the sequence. In the OMAmer output, the qseq_overlap parameter corresponds to the proportion of the sequence that lies between the first and the last k-mer shared with the HOG of interest. If this value is under 0.8, OMArk will report the sequence as Partial mapping.
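
The two thresholds above translate into simple comparisons. The following Python sketch is illustrative only: the function names are hypothetical, the argument names follow the OMAmer output columns, and the two checks are kept separate because the description above does not say how OMArk combines them for a sequence that meets both conditions.

```python
def is_fragment(qseqlen, subfamily_medianseqlen):
    """True if the query is less than half the HOG's median protein length."""
    return qseqlen < 0.5 * subfamily_medianseqlen

def is_partial_mapping(qseq_overlap, min_overlap=0.8):
    """True if the span between the first and last shared k-mer covers
    less than `min_overlap` of the query sequence."""
    return qseq_overlap < min_overlap
```

For instance, a 100-residue query placed in a HOG with a median length of 250 would be flagged by `is_fragment`, while a query with `qseq_overlap` of 0.8 or more would not be flagged by `is_partial_mapping`.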

I hope this answers your question.

Best wishes,
Yannis

@CAShuangchao
Author

Thank you for your answer. OMArk is a great tool that has given me some inspiration in dealing with the logic of HOG attribution for target genes and dealing with some of the problems I encountered in my project.

Thanks,
huangchao
