queries on ontology-backed fields, add "isa" operator? #234

Open
schristley opened this issue Aug 27, 2019 · 13 comments
Labels
ADC API V2 AIRR Data Commons API V2
Milestone

Comments

@schristley
Member

I brought this up in email, but creating an issue so it doesn't get lost.

This issue came to me the other day and I don't think we've talked about it yet, or at least not in detail. We might want to add this to our agenda to discuss in a future WG call.

Let's take cell_subset as an example. If a user performs a query and indicates "B cell" (CL_0000236), they can mean two possible things. One, they want data which is exactly cell_subset == "B cell". This is the current behavior of the API. Alternatively, they may want data that is B cell or any of its subtypes. Therefore, if a repertoire has cell_subset as "naive B cell" (CL_0000788), this repertoire will not be included in the former query but will be included in the latter query.

There currently isn't an easy way for a user to specify that latter query. Right now they would need to gather a list of all the subtypes of "B cell" and construct a large OR expression to capture all of them. That seems pretty onerous and error-prone for the user.

One suggestion is to add an "isa" operator which conveys this meaning. So the former query, cell_subset == "B cell", indicates exact match, while the latter query, cell_subset isa "B cell", indicates B cell or any of its subtypes.

Defining that additional query operator is easy. The challenge is how repositories implement it. If they are using an RDF triple store (which nobody is now) then it's easy, but typical SQL or NoSQL databases have a harder time. They would have to do something similar, constructing a large OR expression to capture all the subtypes (or other ideas just as ugly). This also means the repository needs to know about the ontology so that it can gather together the appropriate terms.
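To make the "large OR expression" concrete, here is a minimal Python sketch of how a repository might rewrite an "isa" filter before running it. Everything in it is an assumption for illustration: the "isa" op does not exist in the API, the `{"op": "or", "content": [...]}` shape is assumed, `SUBCLASS_OF`, `descendants` and `expand_isa` are hypothetical names, and the hierarchy is a toy one (only CL_0000236 and CL_0000788 come from this thread; `EX_0000001` is a placeholder, not a real term).

```python
# Toy child -> parents map. A real repository would load this from the
# ontology itself; EX_0000001 is a made-up placeholder term.
SUBCLASS_OF = {
    "CL_0000788": ["CL_0000236"],  # naive B cell is a subclass of B cell
    "EX_0000001": ["CL_0000788"],  # placeholder subtype of naive B cell
}


def descendants(term, subclass_of):
    """Return the term itself plus every term below it in the hierarchy."""
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    seen = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def expand_isa(flt, subclass_of):
    """Rewrite a hypothetical "isa" filter as an OR of exact-match filters."""
    if flt["op"] != "isa":
        return flt
    field = flt["content"]["field"]
    ids = sorted(descendants(flt["content"]["value"], subclass_of))
    return {
        "op": "or",
        "content": [
            {"op": "=", "content": {"field": field, "value": term_id}}
            for term_id in ids
        ],
    }


query = {"op": "isa",
         "content": {"field": "sample.cell_subset.id", "value": "CL_0000236"}}
expanded = expand_isa(query, SUBCLASS_OF)
```

The expansion happens on the repository side, so the user only ever writes the short "isa" filter. The cost is that the repository has to keep its copy of the hierarchy in sync with the ontology.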

@schristley schristley added the ADC API V1 AIRR Data Commons API V1 label Aug 27, 2019
@schristley
Member Author

schristley commented Aug 27, 2019

@bussec made this email response:

I agree, that's an important point. My (naive) expectation was that the current behavior would already be "isa"-ish. IIRC IEDB solved this problem for the species taxonomy by storing all nodes (i.e. between the ontology root and the annotated term) in a separate field, so that they can search across it... but I assume that this is what you consider "ugly" ;-)

@schristley
Member Author

Bjoern Peters had a follow-up:

The standard approach we use is to store transitive closure (https://en.wikipedia.org/wiki/Transitive_closure) of the taxonomy; essentially a table that has two columns storing 'parent id, child id' pairs. Queries for all children of a given parent can be integrated into standard SQL then and are lightning fast if the table is properly indexed even for very large taxonomies.
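As a rough illustration of the closure-table idea, the following sketch uses Python's built-in `sqlite3` module. The table and column names (`repertoire`, `closure`, `parent_id`, `child_id`) and the helper `repertoires_isa` are made up for this example and are not taken from IEDB or any AIRR repository. It also assumes the closure table stores a reflexive row for each term so that an exact match is included in the result; `EX_0000002` is a placeholder ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repertoire (id TEXT PRIMARY KEY, cell_subset_id TEXT);
-- One row per (ancestor, descendant) pair; the composite primary key
-- doubles as the index used when looking up all children of a parent.
CREATE TABLE closure (
    parent_id TEXT,
    child_id  TEXT,
    PRIMARY KEY (parent_id, child_id)
);
""")

conn.executemany("INSERT INTO repertoire VALUES (?, ?)", [
    ("rep1", "CL_0000236"),  # B cell
    ("rep2", "CL_0000788"),  # naive B cell
    ("rep3", "EX_0000002"),  # placeholder for an unrelated term
])
conn.executemany("INSERT INTO closure VALUES (?, ?)", [
    ("CL_0000236", "CL_0000236"),  # reflexive rows (assumption, see above)
    ("CL_0000788", "CL_0000788"),
    ("EX_0000002", "EX_0000002"),
    ("CL_0000236", "CL_0000788"),  # B cell is an ancestor of naive B cell
])


def repertoires_isa(conn, term_id):
    """Return ids of repertoires annotated with term_id or any descendant."""
    rows = conn.execute(
        """
        SELECT r.id
        FROM repertoire AS r
        JOIN closure AS c ON c.child_id = r.cell_subset_id
        WHERE c.parent_id = ?
        ORDER BY r.id
        """,
        (term_id,),
    )
    return [row[0] for row in rows]


b_cell_hits = repertoires_isa(conn, "CL_0000236")
naive_hits = repertoires_isa(conn, "CL_0000788")
```

With this layout the "isa" query is a single join, and the exact-match query stays a plain equality test on `cell_subset_id`.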

@schristley schristley added ADC API V2 AIRR Data Commons API V2 and removed ADC API V1 AIRR Data Commons API V1 labels Oct 10, 2019
@bcorrie
Contributor

bcorrie commented Dec 6, 2019

This came up in our local group discussion, as we are embarking on an implementation around ontologies, both at the user interface/gateway level as well as at the service query level.

So for confirmation, we have the following in ADC API v1:

  • {"op":"=", "content":{"field":"sample.cell_subset.value", "value":"B cell"}} searches for an exact match on the string "B cell" in the value of the ontology field cell_subset.
  • {"op":"=", "content":{"field":"sample.cell_subset.id", "value":"CL_0000236"}} searches for an exact match on the string "CL_0000236" in the id of the ontology field cell_subset.

We will work on the definition of an "isa" operator in ADC API v2 for taxonomy/ontology-based terms, which would capture the more powerful concept of finding all ontology entities that lie beneath the queried ontology node.

Is that correct?

@bcorrie
Contributor

bcorrie commented May 20, 2021

@schristley @bussec with the recent ontology sprint finishing, wondering if we can renew this discussion?

My previous comment above makes sense to me, should we try and move this forward? I think this is more of a definition/semantics thing as the spec doesn't need to change. It is the expected result of the query that needs to be defined.

And then of course our repositories need to implement it 8-)

@bcorrie
Contributor

bcorrie commented May 20, 2021

Hmm, I thought there already was an "isa" operator, but there is not. So we do need to add it.

@bussec
Member

bussec commented Jun 3, 2021

@bcorrie Looking at this again I think one thing that we need to clarify is which relation in an ontology we would follow. As far as I can see OBO uses subClassOf (http://www.w3.org/2000/01/rdf-schema#subClassOf).

@bcorrie
Contributor

bcorrie commented Jun 3, 2021

Good point... Not sure how variable that is and how many ontologies have complex relationships. Has anyone checked??? I have kind of assumed that most of our ontologies (or the way we think of them) are considered trees and therefore a subClassOf relationship probably makes sense (or would suffice). Not sure if we need to specify a relationship (can we assume) and if we need to, how do we do it???

@schristley
Member Author

schristley commented Jun 4, 2021

@bcorrie Looking at this again I think one thing that we need to clarify is which relation in an ontology we would follow. As far as I can see OBO uses subClassOf (http://www.w3.org/2000/01/rdf-schema#subClassOf).

That's the correct relation if you are talking about terms that are classes, and that's true for all the biomedical ontologies that I'm familiar with. Not all biomedical ontologies are trees, though; they are DAGs (i.e. multiple class inheritance).
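The DAG point matters for whoever builds the list of descendants: a term with two parents has to show up under both, without being visited or stored twice. A minimal sketch of computing the (ancestor, descendant) pairs from subClassOf edges might look like the following. The function name `transitive_closure` and the `EX_*` term IDs are placeholders invented for this example, and including reflexive pairs is an assumption, not something the thread specifies.

```python
def transitive_closure(subclass_of):
    """Return all (ancestor, descendant) pairs, including reflexive ones.

    subclass_of maps each term to a list of its direct parents, so a term
    with several parents (multiple inheritance) simply has a longer list.
    """
    terms = set(subclass_of)
    for parents in subclass_of.values():
        terms.update(parents)

    pairs = set()
    for term in terms:
        # Walk upward from the term; the seen set keeps shared ancestors
        # from being visited more than once.
        seen = {term}
        stack = [term]
        while stack:
            for parent in subclass_of.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        pairs.update((ancestor, term) for ancestor in seen)
    return pairs


# Diamond-shaped toy DAG: EX_D has two parents that share one ancestor.
EDGES = {
    "EX_D": ["EX_B", "EX_C"],
    "EX_B": ["EX_A"],
    "EX_C": ["EX_A"],
}
pairs = transitive_closure(EDGES)
```

Because the result is a set, the pair linking `EX_A` to `EX_D` appears once even though there are two paths between them. The same pairs could be loaded into a two-column closure table.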

@schristley schristley added this to the ADC V2 milestone Jan 17, 2022
@bcorrie
Contributor

bcorrie commented Nov 13, 2023

@schristley we are targeting v2.0 release for AIRR meeting in June. This issue seems to gel well with AIRR Knowledge Commons efforts, but likely this won't hit that deadline. I am suggesting we move this out of the ADC v2.0 Milestone to ADC v2.1 (https://github.com/airr-community/airr-standards/milestone/9). Any objections?

@schristley
Member Author

@schristley we are targeting v2.0 release for AIRR meeting in June. This issue seems to gel well with AIRR Knowledge Commons efforts, but likely this won't hit that deadline. I am suggesting we move this out of the ADC v2.0 Milestone to ADC v2.1 (https://github.com/airr-community/airr-standards/milestone/9). Any objections?

@bcorrie I guess that's going with the idea that the API version is updated even though there are no API changes, just the schema is changing. I still have mixed feelings about that, but I see pros/cons to both sides. Anyways, regarding the specific question, no objections. Also the PR #550 mentions adding the distinct operator too, not sure if that's a separate issue, if not probably should create one as I expect #550 is too old and will be deleted at some point?

@bcorrie
Contributor

bcorrie commented Dec 5, 2023

@schristley if we leave this in v2.0, it boils down to both VDJServer and iReceptor Turnkey implementing it. I am leaning towards leaving it in v2.0, as anything beyond v2.0 is very nebulous. If v2.0 is released in June at the AIRR Meeting, then we would want to have this implemented in the repositories shortly after that. I think it would be good to have this in v2.0 and implemented in the ADC in some short time frame after that. Thoughts? I think this is doable for iReceptor Turnkey.

@schristley
Member Author

@bcorrie That's reasonable to me. I think it's doable in the time frame under the assumption this only involves updating the data in the /repertoire end point. I've already started discussions with James Overton as part of AKC work about gathering ontologies (what they call Source of Terminologies - SOT) to support operations such as this. Our thought was to start with the airr-standards ontologies. There are different techniques that can be used to handle the query depending upon the database technology.

@bcorrie
Contributor

bcorrie commented Feb 14, 2024

Moving this to a non-v2.0 tag: based on discussions around AKC, I think we want to do this properly rather than rush for v2.0.

@bcorrie bcorrie modified the milestones: ADC 2.0, ADC 2.1 Feb 14, 2024