Annotate function emits lots of errors and is slow/verbose #803

Closed
justaddcoffee opened this issue Sep 22, 2024 · 3 comments

Comments

@justaddcoffee
Collaborator

For example, this command takes 1 minute to annotate one string:

(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ time runoak -i sqlite:obo:mondo annotate "cystic fibrosis" >& annotate.stdout

real    1m4.518s
user    0m59.803s
sys     0m2.804s

and generates 580 errors like this (all to do with OMIMPS IDs):

ERROR:root:Skipping statements(subject=MONDO:0000005,predicate=skos:exactMatch,object=<https://omim.org/phenotypicSeries/PS203655>,value=None,datatype=None,language=None,); ValueError: <https://omim.org/phenotypicSeries/PS203655> is not a valid URI or CURIE
(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ grep -c ERROR annotate.stdout 
580

and 144 warnings like this (all to do with NCBI genes):

WARNING:root:Skipping <http://identifiers.org/ncbigene/554324> as it is not a valid CURIE
(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ grep -c WARNING annotate.stdout
144

Please close if this is expected - just thought this might be worth looking into

I'm on commit 8746e90, Python 3.11, MacOS 12.5
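
The error and warning counts above can be broken down by URI namespace to confirm that they cluster in just a couple of sources. This is a minimal sketch, assuming the log lines look like the two samples quoted in this comment; the function name `skipped_namespaces` is hypothetical and not part of OAK.

```shell
# Tally the URI namespaces behind skipped entries in a saved OAK log.
# Assumes each relevant line starts with ERROR: or WARNING: and contains
# the offending URI in angle brackets, as in the samples above.
skipped_namespaces() {
  # Take the first <...> URI on each ERROR/WARNING line and keep everything
  # up to and including the last "/", i.e. drop the local identifier.
  sed -nE 's#^(ERROR|WARNING):[^<]*<(https?://[^>]*/)[^/>]*>.*#\2#p' "$1" |
    sort | uniq -c | sort -rn |
    awk '{ printf "%d\t%s\n", $1, $2 }'   # normalize to "count<TAB>namespace"
}
```

Calling `skipped_namespaces annotate.stdout` would then print one line per namespace, most frequent first.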

@justaddcoffee
Collaborator Author

Adding: when I use the -W flag this command finishes in around 2 seconds, without any errors

@cmungall
Collaborator

The errors aren't coming from the annotate command. OAK reports these when there are URIs without prefixes in the ontology. This was fixed in Mondo a while ago, so you likely have a very old copy of Mondo in your ~/.data.

Note that cache management is not something most users will need to care about after the next release, thanks to @gouttegd's edits in #799.

Yes, using -W to match the whole string is fast because it doesn't need an index (and this is what you want for your use case here).
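
As a rough illustration of the whole-string case, the call can be wrapped in a small helper. This is only a sketch: the function name `annotate_whole` is hypothetical, and it assumes `-W` is given to the `annotate` subcommand as in the run reported earlier in this thread.

```shell
# Whole-string lookup: the entire input text is matched as one label,
# so no lexical index has to be built.
annotate_whole() {
  # $1 = input selector (e.g. sqlite:obo:mondo), $2 = text to match as a whole
  runoak -i "$1" annotate -W "$2"
}
```

For the example in this issue, that would be `annotate_whole sqlite:obo:mondo "cystic fibrosis"`.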

If you do need to do actual NER, and you aren't using a backend that supports NER natively (e.g. ontoportal), then you should create and reuse an index:

runoak annotate -h:

Usage: runoak annotate [OPTIONS] [WORDS]...

  Annotate a piece of text using a Named Entity Recognition annotation.

  Some endpoints such as BioPortal have built-in support for annotation; in
  these cases the endpoint functionality is used:

  Example:

      runoak -i bioportal: annotate "enlarged nucleus in T-cells from
      peripheral blood"

  For other endpoints, the built-in OAK annotator is used. This currently uses
  a basic algorithm based on lexical matching.

  Example:

      runoak -i sqlite:obo:cl annotate "enlarged nucleus in T-cells from
      peripheral blood"

  Using the builtin annotator can be slow, as the lexical index is re-built
  every time. To preserve this, use the ``--lexical-index-file`` (``-L``)
  option to specify a file to save. On subsequent iterations the file is
  reused.

Creating the index does seem to take a lot longer than it used to; I will investigate. With -L this is a one-time cost, unless the ontology is updated. However, it looks like the Mondo index is so large that just loading it takes a long time. We can look into saving it in alternate formats to speed this up.
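
The create-and-reuse pattern from the help text above might look like the following sketch. The function name `annotate_with_index` and the index file name in the usage note are hypothetical; the only OAK feature relied on is the `-L` (`--lexical-index-file`) option quoted above.

```shell
# Annotate with a saved lexical index. Per the help text, the index file
# is written on the first run and reused on subsequent runs.
annotate_with_index() {
  # $1 = input selector, $2 = path to the index file, remaining args = text
  selector=$1
  index_file=$2
  shift 2
  runoak -i "$selector" annotate -L "$index_file" "$@"
}
```

Running `annotate_with_index sqlite:obo:mondo mondo-index.yaml "cystic fibrosis"` twice should pay the build cost only the first time, though for Mondo the load time noted above would still apply.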

@justaddcoffee
Collaborator Author

Thanks @cmungall

I didn't realize you could create and reuse indexes

Closing, since it looks as though everything is basically working as expected here
