queries on ontology-backed fields, add "isa" operator? #234
@bussec made this email response: I agree, that's an important point. My (naive) expectation was that the current behavior would already be "isa"-ish. IIRC IEDB solved this problem for the species taxonomy by storing all nodes (i.e. those between the ontology root and the annotated term) in a separate field, so that they can search across it... but I assume that this is what you consider "ugly" ;-) |
Bjoern Peters had a follow-up: The standard approach we use is to store the transitive closure (https://en.wikipedia.org/wiki/Transitive_closure) of the taxonomy; essentially a table with two columns storing 'parent id, child id' pairs. Queries for all children of a given parent can then be integrated into standard SQL and are lightning fast, even for very large taxonomies, if the table is properly indexed. |
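A minimal sketch of the closure-table approach described above, using Python's built-in `sqlite3`. The table names, column names, and the placeholder IDs `EX_MATURE_B` and `EX_T_CELL` are assumptions made for illustration; they are not part of any AIRR or IEDB schema. The closure table here also stores each term paired with itself, so that a query for a term matches the term as well as its subtypes.

```python
import sqlite3

# Hypothetical closure pairs (parent_id, child_id), including reflexive pairs.
# EX_MATURE_B and EX_T_CELL are made-up placeholder IDs.
CLOSURE = [
    ("CL_0000236", "CL_0000236"),
    ("EX_MATURE_B", "EX_MATURE_B"),
    ("CL_0000788", "CL_0000788"),
    ("EX_T_CELL", "EX_T_CELL"),
    ("CL_0000236", "EX_MATURE_B"),
    ("CL_0000236", "CL_0000788"),
    ("EX_MATURE_B", "CL_0000788"),
]

# Hypothetical repertoires annotated with a single cell_subset term.
REPERTOIRES = [
    ("rep1", "CL_0000236"),
    ("rep2", "CL_0000788"),
    ("rep3", "EX_T_CELL"),
]


def build_db():
    """Create an in-memory database holding the closure and repertoire tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE closure (
            parent_id TEXT NOT NULL,
            child_id  TEXT NOT NULL,
            PRIMARY KEY (parent_id, child_id)  -- doubles as the lookup index
        );
        CREATE TABLE repertoire (
            repertoire_id  TEXT PRIMARY KEY,
            cell_subset_id TEXT NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO closure VALUES (?, ?)", CLOSURE)
    conn.executemany("INSERT INTO repertoire VALUES (?, ?)", REPERTOIRES)
    return conn


def isa_query(conn, term_id):
    """Return repertoires annotated with term_id or any of its subtypes."""
    rows = conn.execute(
        "SELECT r.repertoire_id FROM repertoire r "
        "JOIN closure c ON c.child_id = r.cell_subset_id "
        "WHERE c.parent_id = ? ORDER BY r.repertoire_id",
        (term_id,),
    )
    return [row[0] for row in rows]
```

With this layout the "isa" lookup is a single join, and the primary key on `(parent_id, child_id)` serves as the index on the parent column.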
This came up in our local group discussion, as we are embarking on an implementation around ontologies, both at the user interface/gateway level as well as at the service query level. So for confirmation, we have the following in ADC API v1:
We will work on the definition of an "isa" operator in ADC API v2 for taxonomy/ontology-based terms, which would capture the more powerful concept of finding all ontology entities that lie beneath the queried ontology node. Is that correct? |
@schristley @bussec with the recent ontology sprint finishing, wondering if we can renew this discussion? My previous comment above makes sense to me, should we try and move this forward? I think this is more of a definition/semantics thing as the spec doesn't need to change. It is the expected result of the query that needs to be defined. And then of course our repositories need to implement it 8-) |
Hmm, I thought there already was an "isa" operator, but there is not. So we do need to add it. |
@bcorrie Looking at this again I think one thing that we need to clarify is which relation in an ontology we would follow. As far as I can see OBO uses subClassOf ( |
Good point... Not sure how variable that is and how many ontologies have complex relationships. Has anyone checked? I have kind of assumed that most of our ontologies (or the way we think of them) are trees, and therefore a subClassOf relationship probably makes sense (or would suffice). Not sure if we need to specify a relationship (can we assume one?) and, if we need to, how do we do it? |
That's the correct relation if you are talking about terms that are classes, and that's true for all the biomedical ontologies that I'm familiar with. Not all biomedical ontologies are trees, though; they are DAGs (i.e. multiple class inheritance). |
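To illustrate why the DAG point matters, here is a small sketch that derives the full set of (ancestor, descendant) pairs from subClassOf edges when a class may have more than one parent. The function names and the single-letter class names are hypothetical. Tracking visited nodes keeps a class that is reachable along several paths from being processed more than once.

```python
from collections import defaultdict


def descendants(children, root):
    """Collect every class reachable from root by following child links."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:  # a DAG can reach the same class twice
                seen.add(child)
                stack.append(child)
    return seen


def transitive_closure(subclass_of):
    """Build (ancestor, descendant) pairs from (child, parent) edges.

    Each class is also paired with itself, so an "isa" lookup on a term
    includes the term.
    """
    children = defaultdict(set)
    nodes = set()
    for child, parent in subclass_of:
        children[parent].add(child)
        nodes.update((child, parent))
    return {(a, d) for a in nodes for d in descendants(children, a) | {a}}
```

For a diamond such as D subClassOf B, D subClassOf C, with B and C both subClassOf A, the pair (A, D) appears once even though D is reachable through two paths.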
@schristley we are targeting v2.0 release for AIRR meeting in June. This issue seems to gel well with AIRR Knowledge Commons efforts, but likely this won't hit that deadline. I am suggesting we move this out of the ADC v2.0 Milestone to ADC v2.1 (https://github.com/airr-community/airr-standards/milestone/9). Any objections? |
@bcorrie I guess that's going with the idea that the API version is updated even though there are no API changes, just the schema is changing. I still have mixed feelings about that, but I see pros/cons to both sides. Anyways, regarding the specific question, no objections. Also the PR #550 mentions adding the |
@schristley if we leave this in v2.0, it boils down to both VDJServer and iReceptor Turnkey implementing it. I am leaning towards leaving it in v2.0, as anything beyond v2.0 is very nebulous. If v2.0 is released in June at the AIRR Meeting, then we would want to have this implemented in the repositories shortly after that. I think it would be good to have this in v2.0 and implemented in the ADC within a short time frame after that. Thoughts? I think this is doable for iReceptor Turnkey. |
@bcorrie That's reasonable to me. I think it's doable in the time frame under the assumption this only involves updating the data in the /repertoire end point. I've already started discussions with James Overton as part of AKC work about gathering ontologies (what they call Source of Terminologies - SOT) to support operations such as this. Our thought was to start with the airr-standards ontologies. Different techniques can be used to handle the query, depending on the database technology. |
Moving this to a non v2.0 tag, as based on discussions around AKC I think we want to do this properly rather than rush for v2.0 |
I brought this up in email, but creating an issue so it doesn't get lost.
This issue came to me the other day and I don't think we've talked about it yet, or at least not in detail. We might want to add this to our agenda to discuss in a future WG call.
Let's take cell_subset as an example. If a user performs a query and indicates "B cell" (CL_0000236), they can mean two possible things. One, they want data which is exactly cell_subset == "B cell". This is the current behavior of the API. Alternatively, they may want data that is B cell or any of its subtypes. Therefore, if a repertoire has cell_subset as "naive B cell" (CL_0000788), this repertoire will not be included in the former query but will be included in the latter query.
There currently isn't an easy way for a user to specify that latter query. Right now they would need to gather a list of all the subtypes of "B cell" and construct a large OR expression to capture all of them. That seems pretty onerous and error-prone for the user.
One suggestion is to add an "isa" operator which conveys this meaning. So the former query, cell_subset == "B cell", indicates exact match, while the latter query, cell_subset isa "B cell", indicates B cell or any of its subtypes.
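The difference between the two operators can be sketched as follows. This is only an illustration of the intended semantics, not an implementation of the ADC API: the `PARENTS` map, the function names, and the direct link from "naive B cell" to "B cell" are simplifying assumptions.

```python
# Hypothetical mini-ontology: term -> set of direct parent terms.
# The direct naive B cell -> B cell edge is a simplification for this example.
PARENTS = {
    "CL_0000788": {"CL_0000236"},
}


def is_a(term, target, parents=PARENTS):
    """True if term is target itself or has target among its ancestors."""
    stack = [term]
    seen = set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False


def matches(op, annotated, queried):
    """Evaluate one query condition against a repertoire's annotated term."""
    if op == "=":
        return annotated == queried        # exact match, current behavior
    if op == "isa":
        return is_a(annotated, queried)    # the term or any of its subtypes
    raise ValueError(f"unsupported operator: {op}")
```

A repertoire annotated with CL_0000788 fails `matches("=", ...)` against CL_0000236 but passes `matches("isa", ...)`, which is the distinction drawn above.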
Defining that additional query operator is easy. The challenge is how repositories implement it. If they are using an RDF triple store (which nobody is now) then it's easy, but typical SQL or NoSQL databases have a harder time. They would have to do something similar, constructing a large OR expression to capture all the subtypes (or use other ideas just as ugly). This also means the repository needs to know about the ontology so that it can gather together the appropriate terms.
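As a rough sketch of the repository-side rewrite mentioned above, the function below turns an "isa" condition into a parameterized OR expression. The `SUBTYPES` lookup and the function name are hypothetical; a real repository would have to populate that lookup from the ontology itself.

```python
# Hypothetical precomputed lookup: term -> the term itself plus all subtypes.
SUBTYPES = {
    "CL_0000236": ["CL_0000236", "CL_0000788"],
}


def rewrite_isa(field, term, subtypes=SUBTYPES):
    """Rewrite `field isa term` as an OR expression with bound parameters.

    `field` is interpolated into the SQL text, so it must come from a fixed
    list of known column names, never from user input.
    """
    ids = subtypes.get(term, [term])  # unknown terms fall back to exact match
    clause = " OR ".join(f"{field} = ?" for _ in ids)
    return f"({clause})", list(ids)
```

The expression grows with the number of subtypes, which is the "ugly" part: a term near the ontology root could expand to a very long list.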