Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK Adapters vs. Invalid URI/CURIEs #684

Open
glass-ships opened this issue Dec 4, 2023 · 5 comments
Open

OAK Adapters vs. Invalid URI/CURIEs #684

glass-ships opened this issue Dec 4, 2023 · 5 comments

Comments

@glass-ships
Copy link
Contributor

(Possibly related: #499)

It seems that any/many adapter methods are throwing errors/warnings about invalid URI/CURIEs:

...
ERROR:root:Skipping statements(subject=MONDO:8000018,predicate=skos:exactMatch,object=<http://identifiers.org/mesh/D065635>,value=None,datatype=None,language=None,); ValueError: <http://identifiers.org/mesh/D065635> is not a valid URI or CURIE
ERROR:root:Skipping statements(subject=MONDO:8000019,predicate=skos:exactMatch,object=<http://identifiers.org/mesh/C567620>,value=None,datatype=None,language=None,); ValueError: <http://identifiers.org/mesh/C567620> is not a valid URI or CURIE
...
WARNING:root:Skipping <https://www.wikidata.org/prop/P2701> as it is not a valid CURIE
WARNING:root:Skipping <https://www.wikidata.org/prop/P274> as it is not a valid CURIE
WARNING:root:Skipping <https://www.wikidata.org/prop/P275> as it is not a valid CURIE
...

It seems possible the issue is with phenio.db itself, but @madanucd informs me this happens with the mondo adapter as well.

The adapter does seem to work regardless, but it is possible that the resulting database adapter is missing information about everything it skipped.

Unsure how to best debug/approach this, thoughts?

@matentzn
Copy link
Contributor

matentzn commented Dec 5, 2023

Mid term solution: https://github.com/INCATools/semantic-sql/pull/77/files (this means these specific IRIs wont show up as invalid after the next OBO OAK release).

Please bring this issue to @cmungall attention.

@cmungall
Copy link
Collaborator

See also #688

In general the provider of upstream sources should try and provide prefix mappings such that the number of uncontracted URIs is minimal

I realize that this is harder than it should be for sql sources, due to the way semsql prefixes work. I apologize for this, and there will be a fix in place when we move away from rdftab. See INCATools/semantic-sql#41.

Even with this fix in place, there will inevitably be cases where URIs cannot be contracted (again, the number now is artifically high). In these cases we need to decide on the correct behavior:

  1. throw error
  2. treat URIs as if they are CURIEs (potentially renaming CURIE type to IDENTIFIER)
  3. Quote URIs to discriminate them from CURIEs (note: this is currently the behavior of the semsql adapter, but it's not the behavior of the rdf adapters, this inconsistency is confusing for users)

I tend towards 3. I think this was @joeflack4's instincts with #688. However, I would do it this way:

  1. rename CURIE to IDENTIFIER, such that this is used universally in signatures. This will be a large diff, but it will not affect runtime
  2. Formally define this as IDENTIFIER = CURIE | QUOTED_URI. Again, this is typing information that does not affect runtime, but it helps make the intentions clear

We would also need a PR to change the default behavior of the rdf adapter

@joeflack4
Copy link
Contributor

On the typing front, I approve of the IDENTIFIER proposal.

@matentzn
Copy link
Contributor

For completeness, there is also this option: ignore triples with entities that cannot be contracted and print a warning during semsql.

For programmatic usage its always going to be a bit annoying to have to distinguish between CURIE and "something else" based on some syntactic convention but I guess the "quote" solution is alright. Its not much different tbh the the <> situation we have right now, but at least we don't have to use a.replace('<','') anymore.

Maybe not a decision for Chrismtas period

@joeflack4
Copy link
Contributor

joeflack4 commented Dec 29, 2023

RE this option:

ignore triples with entities that cannot be contracted and print a warning during semsql.

I'm OK with that sort of thing so long as it is parameterized. It would suck to be have such things be ignored and no way to circumvent that.

I guess the "quote" solution is alright. Its not much different tbh the the <>

I think when Chris says "quoted", he means "enclosed with brackets" as such. That's the only URI "quoting", so to speak, that I've seen coming back from semsql. We should probably say "enclosed with brackets" instead anyway to avoid confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants