Change germline_set_ref to be a CURIE #770

Open

bcorrie
Contributor

@bcorrie bcorrie commented Mar 1, 2024

Closes #553

@bcorrie
Contributor Author

bcorrie commented Mar 1, 2024

@williamdlees I made an attempt to make germline_set_ref a CURIE. It passes the checks; have a look at what I did and see if it seems OK.

@bcorrie
Contributor Author

bcorrie commented Mar 1, 2024

The main issue I see is that the python test data has IDs from IMGT for germline_set_ref of the form:

"germline_set_ref": "IMGT:Homo sapiens:2022.1.31"

Would this be referring to one of these: https://www.imgt.org/download/V-QUEST/archives/

@bcorrie
Contributor Author

bcorrie commented Mar 1, 2024

Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

                "allele_description_id": "OGRDB:A00301",
                "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2DBF",

@williamdlees
Contributor

williamdlees commented Mar 1, 2024

Thanks Brian. This looks good to me and I think it is the way to go. If other providers of germline sets (I can only think of Mixcr and IMGT for the time being) wish to offer sets in MiAIRR format, we can encourage them to provide a standardised URL and create a CURIE. But until they do, it isn't really an issue. And we do have other fields for name, date and version that can fill in if necessary.

@schristley schristley self-requested a review March 2, 2024 18:20
@schristley
Member

Hi @williamdlees , my plan for our AKC 1-on-1 next week was to review this with you in the context of the OGRDB API because right now what's in the pull request does not resolve to a URL that returns the germline set. We will need to change the PR and/or the API so that it does. It won't be hard to do.

@schristley
Member

> Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

@bcorrie My suggestion is to not make separate CURIEs for each data type. If you remember how James described it, there is a global part and a local part. OGRDB: is a sufficient prefix for everything the global OGRDB service provides. The local part is something @williamdlees will have control over. That design allows OGRDB to provide additional services down the road without requiring new CURIE prefixes; instead, the local part can be extended.

@williamdlees
Contributor

williamdlees commented Mar 3, 2024 via email

@schristley
Member

> I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form.

That's correct, though there is a bit more to this. We want germline sets to conform to the FAIR principles and, more importantly, we want the OGRDB service (and the data it provides, such as germline sets) to conform to them.

> As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition.

Exactly, it is primarily shorthand for the full URL, but it is important not to think of it purely as "URL to download the germline set". It is more than that. It is a permanent ID so that if you look at two studies, you can simply compare the IDs to know that they are using the exact same germline set, or not. It is also "F"indable and "R"eusable because I can download the exact same germline set based upon its ID. Some people tend to interpret the "R" as reproducibility.

> The thing that worries me is that it bakes OGRDB into the definition. We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIEs to the schema definition for their systems?

Yes, if they want to conform to the FAIR principles, which they should. It is easy for us to add CURIEs for their systems. The "I"nteroperability of FAIR is where the standard schemas and formats become important. But interoperability doesn't imply that everybody has to agree on and conform to the same schema; there are alternative ways to be interoperable.

> To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets. Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.

I'm not sure why you think this would be more effort on our part. It's really effort on their part. The push toward FAIR-ness is pretty much unstoppable at this point, and if they want to be used and relevant then they need to conform to the FAIR principles. Note that this is not a blanket statement that "thou must use the AIRR standard". While "standard" can be used as a bludgeon to keep people in line, it can also just mean that a group agrees to the same set of general principles. The field does not (technically) require that germline sets be in the AIRR standard format, but because this is the AIRR Community and these are our AIRR Standards, there is that implication.

@williamdlees
Contributor

williamdlees commented Mar 3, 2024 via email

@williamdlees
Contributor

williamdlees commented Mar 3, 2024 via email

@schristley
Member

> Do we have to use a CURIE to comply with FAIR?

Nope, it is just a useful shorthand. In fact, IEDB isn't using CURIEs; you can see in their export table that they provide IRIs.

There are some small advantages to CURIEs: 1) A CURIE is shorter and thus uses less space in a database. That is likely not relevant for germline sets, but imagine rearrangements, where you could be talking about GBs of extra data. (I don't know if IEDB stores the complete IRI in the database; it may actually just store IDs and construct the IRI when generating the export, which makes this point moot.) 2) If there's ever a crazy reason for https://ogrdb.airr-community.org to be moved, changing the CURIE pointer is a lot easier than rewriting all of the IRIs. But these really are small points. On the flip side, there's an advantage to just having the IRI, because you don't need to do the CURIE resolution.
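
The second point can be sketched roughly as follows. This is only an illustration under stated assumptions: the `expand` helper, the local parts `example/set/1` and `example/set/2`, and the "moved" base URL `https://ogrdb.example.org/` are all made up for the example and are not part of OGRDB or the AIRR schema.

```python
def expand(curie: str, prefix_map: dict) -> str:
    """Expand 'PREFIX:local' to a full IRI using the given prefix map."""
    prefix, _, local = curie.partition(":")
    return prefix_map[prefix] + local


# Stored references stay as short CURIEs (the local parts are hypothetical).
stored_refs = ["OGRDB:example/set/1", "OGRDB:example/set/2"]

# Prefix map today, and after an imagined move of the service.
current_map = {"OGRDB": "https://ogrdb.airr-community.org/"}
moved_map = {"OGRDB": "https://ogrdb.example.org/"}  # hypothetical new location

# Only the prefix map changes; the stored CURIEs are untouched.
before = [expand(ref, current_map) for ref in stored_refs]
after = [expand(ref, moved_map) for ref in stored_refs]

print(before[0])
print(after[0])
```

With full IRIs stored in every record, the same move would mean rewriting each stored value instead of a single map entry.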

@javh
Contributor

javh commented Aug 12, 2024

Bump. Did we reach a consensus on this?

@javh javh added this to the AIRR 2.0 milestone Aug 12, 2024
@williamdlees
Contributor

williamdlees commented Aug 12, 2024 via email

@schristley
Member

schristley commented Nov 24, 2024

Hi @williamdlees , Now with the OGRDB API defined and live, I think we can resolve this properly. The path /germline/set/{germline_set_id}/{release_version} looks like the most specific way to reference a germline set. That's compatible with using a CURIE, e.g. OGRDB:germline/set/9606.IGH_VDJ/9 where OGRDB is defined as https://ogrdb.airr-community.org/api_v2/. What do you think?
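
A minimal sketch of how that CURIE would expand, assuming plain prefix substitution. The `PREFIX_MAP` dictionary and the `expand_curie` helper are illustrative names only and are not part of the AIRR library or the OGRDB API; only the OGRDB base URL and the example CURIE come from the proposal above.

```python
# Hypothetical prefix map; only the OGRDB entry comes from the proposal above.
PREFIX_MAP = {
    "OGRDB": "https://ogrdb.airr-community.org/api_v2/",
}


def expand_curie(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Expand 'PREFIX:local' into a full IRI by simple concatenation."""
    prefix, sep, local = curie.partition(":")
    if not sep or not local:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown CURIE prefix: {prefix!r}")
    return prefix_map[prefix] + local


print(expand_curie("OGRDB:germline/set/9606.IGH_VDJ/9"))
```

Splitting on the first colon keeps everything after it as the local part, so the path segments `germline/set/{germline_set_id}/{release_version}` pass through unchanged.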

@williamdlees
Contributor

williamdlees commented Nov 24, 2024 via email

@schristley
Member

Hi @bcorrie , in this current PR, there are two fields, germline_database and germline_set_ref, in data processing. germline_database holds a text description of the germline set, while germline_set_ref is a resolvable CURIE (currently only OGRDB supports this). Would the idea be to keep both of these fields in data processing? If yes, then I think that resolves William's issue above of having a way to describe the source when a CURIE isn't available.

@schristley
Member

@bcorrie Should we only use the CURIE if it is resolvable and provides an AIRR-compatible GermlineSet JSON response? Or should we also use a CURIE for IMGT, as you suggest above, to point to a web file? The former gives us the possibility of doing automated stuff (in the AKC, for example) because the data format is guaranteed. In the latter case, we would have to know in which cases we can do that (OGRDB) and in which cases we cannot (IMGT, and maybe others).
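
One way the second option could work, sketched under loud assumptions: a per-prefix registry records whether resolving the CURIE is expected to return AIRR GermlineSet JSON. The `PrefixInfo` class, the `REGISTRY` dictionary, the `can_automate` helper, and the IMGT base URL entry are hypothetical and exist only for this illustration; the OGRDB base URL is the one proposed earlier in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixInfo:
    base_url: str
    # True if resolving is expected to return AIRR GermlineSet JSON.
    airr_germline_set_json: bool


# Hypothetical registry. The IMGT entry is an assumption for illustration only.
REGISTRY = {
    "OGRDB": PrefixInfo("https://ogrdb.airr-community.org/api_v2/", True),
    "IMGT": PrefixInfo("https://www.imgt.org/download/V-QUEST/archives/", False),
}


def can_automate(germline_set_ref: str) -> bool:
    """Return True only when the prefix is known to serve AIRR GermlineSet JSON."""
    prefix, sep, _ = germline_set_ref.partition(":")
    info = REGISTRY.get(prefix) if sep else None
    return info is not None and info.airr_germline_set_json


print(can_automate("OGRDB:germline/set/9606.IGH_VDJ/9"))
print(can_automate("IMGT:Homo sapiens:2022.1.31"))
```

A tool such as the AKC could then attempt automated retrieval only for references where this check passes, and treat everything else as an identifier to compare but not to parse.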

@schristley
Member

> Or should we also use CURIE for IMGT as you suggest above to point to a web file?

Feedback from the AKC Schema KG is that we should also use "CURIE" for IMGT and other germline databases to have consistent reference IDs for them. "CURIE" is in quotes because it was mentioned that CURIE has a quite specific definition and the AIRR Standards' use isn't always consistent with that definition; we might want to call it something else just to avoid confusion. Regardless, an ID or URI would be preferable to the open text field of germline_database.

Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Use CURIEs to link Germline to Repertoire/Rearrangement
4 participants