Annotate function emits lots of errors and is slow/verbose #803

justaddcoffee · 2024-09-22T13:19:19Z

For example, this command takes 1 minute to annotate one string:

(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ time runoak -i sqlite:obo:mondo annotate "cystic fibrosis" >& annotate.stdout

real    1m4.518s
user    0m59.803s
sys     0m2.804s

and generates 580 errors like this (all to do with OMIMPS identifiers):

ERROR:root:Skipping statements(subject=MONDO:0000005,predicate=skos:exactMatch,object=<https://omim.org/phenotypicSeries/PS203655>,value=None,datatype=None,language=None,); ValueError: <https://omim.org/phenotypicSeries/PS203655> is not a valid URI or CURIE

(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ grep -c ERROR annotate.stdout 
580

and 144 warnings like this (all to do with NCBI genes):

(oaklib-py3.11) ~/PythonProject/ontology-access-kit main $ grep -c WARNING annotate.stdout 
144

WARNING:root:Skipping <http://identifiers.org/ncbigene/554324> as it is not a valid CURIE

Please close if this is expected - just thought this might be worth looking into

I'm on commit 8746e90, Python 3.11, macOS 12.5


justaddcoffee · 2024-09-22T13:48:05Z

Adding: when I use the -W flag this command finishes in around 2 seconds, without any errors

cmungall · 2024-09-22T18:28:06Z

The errors aren't coming from the annotate command. OAK reports these when there are URIs without prefixes in the ontology. This was fixed in Mondo a while ago; you likely have a very old copy of Mondo in your ~/.data.

Note that the cache management is not something most users will need to care about after the next release thanks to @gouttegd's edits in #799.

Yes, using -W to match the whole string is fast because it doesn't need an index (and this is what you want for your use case here).

If you do need to do actual NER, and you aren't using a backend that supports NER natively (e.g. ontoportal), then you should create and reuse an index:

runoak annotate -h:

Usage: runoak annotate [OPTIONS] [WORDS]...

  Annotate a piece of text using a Named Entity Recognition annotation.

  Some endpoints such as BioPortal have built-in support for annotation; in
  these cases the endpoint functionality is used:

  Example:

      runoak -i bioportal: annotate "enlarged nucleus in T-cells from
      peripheral blood"

  For other endpoints, the built-in OAK annotator is used. This currently uses
  a basic algorithm based on lexical matching.

  Example:

      runoak -i sqlite:obo:cl annotate "enlarged nucleus in T-cells from
      peripheral blood"

  Using the builtin annotator can be slow, as the lexical index is re-built
  every time. To preserve this, use the ``--lexical-index-file`` (``-L``)
  option to specify a file to save. On subsequent iterations the file is
  reused.

The creation of the index does seem to take a lot longer than it used to; I will investigate. But with -L this is a one-time cost, unless the ontology is updated. However, it looks like for Mondo the index is so large that just loading it takes a long time. We can look into saving it in alternative formats to speed this up.
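
As a rough sketch of the create-and-reuse pattern described in the help text above: the index file name `mondo-index.yaml` is a hypothetical choice (any writable path should work), and the `command -v` guard is only there so the snippet fails gracefully where OAK is not installed. Placing `-L` after `annotate` follows the `runoak annotate [OPTIONS] [WORDS]...` usage line; timings will depend on the ontology and machine.

```shell
# Hypothetical file name for the saved lexical index (an assumption, not an OAK default).
INDEX_FILE="mondo-index.yaml"

if command -v runoak >/dev/null 2>&1; then
  # First run: the index is built and saved to $INDEX_FILE (the slow, one-time step).
  runoak -i sqlite:obo:mondo annotate -L "$INDEX_FILE" "cystic fibrosis"

  # Later runs: the saved index file is reused instead of being rebuilt.
  runoak -i sqlite:obo:mondo annotate -L "$INDEX_FILE" "cystic fibrosis"
else
  echo "runoak not found; install oaklib first" >&2
fi
```

If the ontology is updated, the saved file would presumably need to be deleted and regenerated so that the index matches the new release.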

justaddcoffee · 2024-09-22T18:54:00Z

Thanks @cmungall

I didn't realize you could create and reuse indexes

Closing, since it looks as though everything is basically working as expected here

justaddcoffee closed this as completed Sep 22, 2024
