Ignoring duplicate exact synonyms that are acronyms in robot report #1175

Closed
allenbaron opened this issue Jan 8, 2024 · 8 comments · Fixed by #1179

Comments

@allenbaron
Contributor

Given information-artifact-ontology/ontology-metadata#135, is the plan now for robot report to exclude from warnings duplicate exact synonyms that are annotated as acronyms? Overlapping acronyms are fairly common.

This is a follow-up to the slightly tangential comment made in #748 (comment) by dosumis.

Slightly tangential, but we really need a way to mark synonyms as allowable duplicates of labels (maybe using synonym type?). We have many cases in FBbt where the same acronym is used in the literature for multiple distinct anatomical structures (pretty common in anatomy). We add these as synonyms with a reference to back them up. This is frequently useful to anyone, curators and users alike, looking to find a term based on what they find in the literature. I guess the rule originally comes from GO, where this is less of an issue with names for processes/MFs?

@matentzn
Contributor

@allenbaron I will help push this through. Do you know SPARQL? Could you try to redesign this query to achieve this goal: https://github.com/ontodev/robot/blob/master/robot-core/src/main/resources/report_queries/duplicate_exact_synonym.rq

If you have trouble with this you can ping @anitacaron (on Slack also), who may have a soft spot for someone with QC-related SPARQL problems :)

@matentzn
Contributor

One caveat I want to mention: if we do this, we have to use FILTER NOT EXISTS, which is extremely slow. Keep that in mind when you write this, and try it on something like DO, HPO and UBERON to be sure that it won't be too inefficient.

@anitacaron

Isn't it another exception for the label-synonym-polysemy-violation?

There's already an exception for abbreviation (OMO:0003000)

@allenbaron
Contributor Author

Yes, acronym (OMO:0003012) is a new synonym type that would also be an exception.

Honestly, the query at UBERON linked by @anitacaron (with minor modification) is probably the best bet for updating the duplicate_exact_synonym.rq query in ROBOT. Using a subquery only slows things down a bit compared to the current query but it's definitely simpler and probably faster for managing exceptions. I think the only changes to it would be:

  1. Remove rdfs:label from VALUES statement.
  2. Add a VALUES statement for the exceptions (abbreviation & now acronym).
  3. Possibly drop the use of UCASE.
    • The current duplicate_exact_synonym.rq query will not report duplicate synonyms that differ in case or language tag (Duplicate label/synonym checks need to normalize literal type #748). Were those intentional design choices? Just noting that the UBERON query also will not report duplicate synonyms if they differ in language tag.
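
As a rough illustration of the logic behind the three changes listed above, here is a minimal sketch in plain Python. It is not the SPARQL query and not ROBOT's or UBERON's implementation. The function name, the tuple layout, and the `EX:` identifiers are hypothetical; only the two synonym type IDs come from this thread. It also assumes that a synonym typed as an abbreviation or acronym is skipped entirely, which is only one possible reading of the exception.

```python
from collections import defaultdict

# Synonym types treated as allowed duplicates (IDs as quoted in this thread).
EXEMPT_TYPES = {"OMO:0003000", "OMO:0003012"}  # abbreviation, acronym


def duplicate_exact_synonyms(synonyms, ignore_case=True):
    """Map each shared synonym value to the sorted entities that use it.

    `synonyms` is an iterable of (entity, value, synonym_type) tuples, where
    synonym_type is None when the synonym has no type annotation.
    Only exact synonyms are expected as input (labels are left out).
    """
    entities_by_value = defaultdict(set)
    for entity, value, synonym_type in synonyms:
        if synonym_type in EXEMPT_TYPES:
            continue  # assumption: exempt synonyms never count as duplicates
        key = value.upper() if ignore_case else value  # UCASE-style normalization
        entities_by_value[key].add(entity)
    # Keep only values used by more than one entity.
    return {
        value: sorted(entities)
        for value, entities in entities_by_value.items()
        if len(entities) > 1
    }


# Hypothetical data for illustration only.
example = [
    ("EX:1", "Lung disease", None),
    ("EX:2", "lung disease", None),
    ("EX:3", "ALS", "OMO:0003012"),
    ("EX:4", "ALS", "OMO:0003012"),
]
print(duplicate_exact_synonyms(example))  # {'LUNG DISEASE': ['EX:1', 'EX:2']}
```

With `ignore_case=False` the same data yields no duplicates, which mirrors the open question about whether to keep UCASE.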

I know @jamesaoverton is particularly concerned with ROBOT's backward compatibility, which I appreciate. Would these changes be a concern in that regard?

@allenbaron
Contributor Author

allenbaron commented Jan 12, 2024

I decided to look more closely at execution time differences using doid-edit.owl and uberon.owl (because I had it on hand, not the edit file).

Just switching to the subquery approach, without excluding synonym types or using UCASE, takes about 1.07-1.43 times longer (DO: current = 6.13s, subquery = 6.57s; UBERON: current = 17.8s, subquery = 25.4s). Adding the exclusion and UCASE slows things down further by ~2s for both DO and UBERON.
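
For reference, the slowdown factors follow directly from the timings reported above. This short Python check only redoes the arithmetic; the variable and function names are illustrative.

```python
# Timings in seconds, as reported in the comment above.
timings = {
    "DO": {"current": 6.13, "subquery": 6.57},
    "UBERON": {"current": 17.8, "subquery": 25.4},
}


def slowdown(current, subquery):
    """Ratio of subquery run time to current-query run time."""
    return subquery / current


ratios = {name: round(slowdown(**t), 2) for name, t in timings.items()}
print(ratios)  # {'DO': 1.07, 'UBERON': 1.43}
```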

@matentzn
Contributor

@allenbaron thanks for the analysis!

Possibly drop the use of UCASE

I personally think we should introduce this now. I cannot imagine a single case where the duplicate synonym check should be case sensitive. Of course, this needs to be well documented!

variation in case or language tag

This is much more complicated, as you would want to

  1. reject duplicates within the same language and
  2. permit duplicates across languages.

Not sure how this should be solved!
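
One possible way to express the two rules above, sketched in plain Python rather than SPARQL, is to group synonyms by a (normalized value, language tag) key. This is only an illustration of the idea, not a proposed implementation. The function name, tuple layout, and `EX:` identifiers are hypothetical, and treating untagged literals as their own language group is an assumption.

```python
from collections import defaultdict


def duplicates_within_language(synonyms):
    """Report values shared by 2+ entities within the same language tag.

    `synonyms` is an iterable of (entity, value, lang) tuples; lang is ""
    for literals without a language tag (assumed to form their own group).
    """
    groups = defaultdict(set)
    for entity, value, lang in synonyms:
        # Case-insensitive on both the value and the language tag.
        groups[(value.upper(), lang.lower())].add(entity)
    return {key: sorted(ents) for key, ents in groups.items() if len(ents) > 1}


# Hypothetical data for illustration only.
example = [
    ("EX:1", "tumor", "en"),
    ("EX:2", "Tumor", "en"),  # same language: reported
    ("EX:3", "tumor", "es"),  # different language: permitted
]
print(duplicates_within_language(example))  # {('TUMOR', 'en'): ['EX:1', 'EX:2']}
```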

Do you want to make a PR and see how it goes?

@allenbaron
Contributor Author

As an alternative to creating an exclusion for abbreviations and acronyms, could we introduce a new synonym predicate, something like skos:closeMatch but for synonyms, e.g. oboInOwl:hasCloseSynonym?

I guess a new synonym predicate probably has more cons than pros. If we were really going to do something like this, we probably should've just made abbreviations and acronyms their own synonym predicates instead of making them synonym types.

I'll work to open a PR for updating the SPARQL query soon.

@matentzn
Contributor

could we introduce a new synonym predicate, something like skos:closeMatch but for synonyms, e.g. oboInOwl:hasCloseSynonym?

I don't think we should use that system for acronyms, which are "exact" synonyms, but now that you say this, it seems super weird to me that there are no close synonyms! I never noticed that! Wow!

I'll work to open a PR for updating the SPARQL query soon.

Thanks!!!
