Can cath-resolve-hits be used to merge InterProScan results? #70

Open
xunsheng opened this issue Apr 9, 2019 · 2 comments

Comments

@xunsheng

xunsheng commented Apr 9, 2019

Thanks for providing this awesome tool! InterProScan results contain evalues from multiple domain-identification programs. Can we use cath-resolve-hits to merge those results based on their evalues? Thanks.

@tonyelewis
Contributor

Thanks for using CRH and for getting in touch. We're glad to hear that you're pleased with it.

At present, CRH expects input data in either HMMER or "raw" format with either scores or evalues ( https://cath-tools.readthedocs.io/en/latest/tools/cath-resolve-hits/#getting-started ). If you want to combine scores from different programs, it's probably best to convert your data into the raw format.

In principle, there should be no problems with such data containing hits from different sources (such as different programs). Some possible issues…

  • In practice, the evalues from different input sources may not be entirely consistent or you may have reasons to prefer one source over another. In this case, it's probably best to directly manipulate the evalues you're passing to CRH to upweight/downweight the sources according to your needs.
  • It probably makes sense to do something like prepending the name of the source to the name of each match ID to allow you to identify each hit's source in the results.
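The two suggestions above (adjusting evalues per source and prepending the source name to each match ID) could be sketched roughly as follows. This is only an illustration under several assumptions, not a tested pipeline: it assumes InterProScan's tab-separated output with the protein ID, analysis name, signature accession, start, stop and score in columns 1, 4, 5, 7, 8 and 9; it assumes the score column holds an evalue for the sources being merged (that is not the case for every analysis); and it assumes the CRH raw evalue layout is `query_id match_id evalue start-stop`, which should be checked against the CRH documentation linked above. The function name `interpro_tsv_to_crh_raw`, the `evalue_factor` mapping and the `:` separator are all made up for this example.

```python
from typing import Dict, Iterable, Iterator


def interpro_tsv_to_crh_raw(
    tsv_lines: Iterable[str],
    evalue_factor: Dict[str, float],
) -> Iterator[str]:
    """Yield one whitespace-separated raw line per usable InterProScan hit.

    evalue_factor maps a source name (e.g. "Pfam") to a multiplier applied
    to its evalues: a factor below 1 favours that source, a factor above 1
    penalises it. Sources missing from the mapping are left unchanged.
    """
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a full InterProScan TSV row

        query, source, signature = fields[0], fields[3], fields[4]
        start, stop, raw_evalue = fields[6], fields[7], fields[8]

        try:
            evalue = float(raw_evalue)
        except ValueError:
            continue  # score column is "-" or otherwise not numeric

        adjusted = evalue * evalue_factor.get(source, 1.0)

        # Prefix the match ID with its source so it can be traced in the output.
        yield f"{query} {source}:{signature} {adjusted:.3g} {start}-{stop}"


if __name__ == "__main__":
    example = [
        "P1\tmd5\t300\tPfam\tPF00001\tdesc\t10\t120\t1.0E-20\tT\t01-01-2019",
        "P1\tmd5\t300\tCoils\tCoil\t\t50\t80\t-\tT\t01-01-2019",
    ]
    for raw_line in interpro_tsv_to_crh_raw(example, {"Pfam": 0.1}):
        print(raw_line)
```

The resulting lines would then be passed to CRH using its raw evalue input format. Multiplying evalues is just one simple way to express a preference between sources; any suitable factors would have to be chosen by the user.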

Have I understood your query correctly? Does this address your query?

We've previously considered adding better support via an optional input field that lets the user specify a source/category for each entry, and then:

  • allow some way to specify up-weighting/down-weighting for each source
  • include each hit's source in the output

Would a potential feature like that map closely to what you want here?

@xunsheng
Author

Thank you so much for the prompt response. Yes, your answer clears everything up.
InterProScan integrates more than 10 HMM-based protein domain prediction programs, and its results contain the unique domain IDs, so it's easy to infer each hit's source from its domain ID. Multiple sources are sometimes needed because no single database covers all the conserved domains, and not all domains have an annotated name or a description of their function.

Yes, the potential feature would be awesome! CRH is the best tool I have found so far for domain reduction based on scores/evalues and overlaps, which is much better than a script that only resolves overlaps. A further thought: experts with the right background could look into the 14 tools integrated by InterProScan and suggest weightings based on their algorithms.
Thanks again for the nice work!
