Change germline_set_ref to be a CURIE #770

bcorrie · 2024-03-01T00:06:59Z

Closes #553

bcorrie · 2024-03-01T00:09:23Z

@williamdlees I made an attempt to make germline_set_ref a CURIE. It passes the checks. Have a look at what I did and see if it seems OK.

bcorrie · 2024-03-01T00:12:39Z

The main issue I see is that the python test data has IDs from IMGT for germline_set_ref of the form:

"germline_set_ref": "IMGT:Homo sapiens:2022.1.31"

Would this be referring to one of these? https://www.imgt.org/download/V-QUEST/archives/

bcorrie · 2024-03-01T00:22:05Z

Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

                "allele_description_id": "OGRDB:A00301",
                "allele_description_ref": "OGRDB:Mouse_IGH:IGHV-2DBF",

williamdlees · 2024-03-01T14:24:48Z

Thanks, Brian. This looks good to me and I think it is the way to go. If other providers of germline sets (I can only think of MiXCR and IMGT for the time being) wish to offer sets in MiAIRR format, we can encourage them to provide a standardised URL and create a CURIE. But until they do, it isn't really an issue. And we do have other fields for name, date and version that can fill in if necessary.

schristley · 2024-03-02T18:27:26Z

Hi @williamdlees , my plan for our AKC 1-on-1 next week was to review this with you in the context of the OGRDB API because right now what's in the pull request does not resolve to a URL that returns the germline set. We will need to change the PR and/or the API so that it does. It won't be hard to do.

schristley · 2024-03-02T18:39:11Z

> Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.

@bcorrie My suggestion is to not make separate CURIEs for each data type. If you remember how James described it, there is a global part and a local part. OGRDB: is a sufficient prefix for everything the global OGRDB service provides. The local part is something @williamdlees will have control over. That design allows OGRDB to provide additional services down the road without requiring new CURIE prefixes; instead, the local part can be enhanced.
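To illustrate the global/local split, here is a minimal sketch that separates a CURIE at its first colon, so that a single OGRDB prefix can carry differently shaped local parts. The function name `split_curie` is hypothetical; the two identifiers are the ones from the test data quoted earlier in this thread.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into (prefix, local part) at the first colon."""
    prefix, sep, local = curie.partition(":")
    if not sep or not prefix or not local:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local


# Both identifiers share the global prefix; only the local part differs.
for curie in ("OGRDB:A00301", "OGRDB:Mouse_IGH:IGHV-2DBF"):
    prefix, local = split_curie(curie)
    print(prefix, "|", local)
```

Because only the first colon is treated as the separator, the local part is free to contain further colons or slashes, which is what would let the service extend it later without a new prefix.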

williamdlees · 2024-03-03T09:30:23Z

I’ve been thinking about Scott’s helpful comments. Currently the field is described as ‘Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)’. I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form. That seems a reasonable thing to do, and doesn’t impact any user code out there, because the standardized form isn’t really usable by code today. An example of a URL from OGRDB, which works today, would be https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex.

As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition. The thing that worries me is that it bakes OGRDB into the definition. We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIEs to the schema definition for their systems? To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets.

Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.

All the best,
William
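For reference, the current standardized form (Repo:Label:Version) can be split mechanically, even though nothing can be downloaded from it. The sketch below is only an illustration: the names `GermlineSetRef` and `parse_germline_set_ref` are hypothetical, and it assumes exactly three non-empty colon-separated parts, using the IMGT value from the Python test data as input.

```python
from typing import NamedTuple


class GermlineSetRef(NamedTuple):
    repo: str
    label: str
    version: str


def parse_germline_set_ref(ref: str) -> GermlineSetRef:
    """Parse the 'Repo:Label:Version' form into its three parts."""
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Repo:Label:Version, got {ref!r}")
    return GermlineSetRef(*parts)


ref = parse_germline_set_ref("IMGT:Homo sapiens:2022.1.31")
print(ref.repo, "|", ref.label, "|", ref.version)
```

A full URL such as the OGRDB example above does not fit this three-part pattern and is rejected by the sketch, which shows why switching the field to a URL or CURIE is a change of format rather than a refinement of the existing one.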

schristley · 2024-03-03T21:09:42Z

> I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form.

That's correct, though there is a bit more to this. We want germline sets to conform to the FAIR principles, and more importantly for the OGRDB service (and the data it provides like germline sets) to conform to the FAIR principles.

> As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition.

Exactly, it is primarily shorthand for the full URL, but it is important not to think of it purely as "URL to download the germline set". It is more than that. It is a permanent ID so that if you look at two studies, you can simply compare the IDs to know that they are using the exact same germline set, or not. It is also "F"indable and "R"eusable because I can download the exact same germline set based upon its ID. Some people tend to interpret the "R" as reproducibility.

> The thing that worries me is that it bakes OGRDB into the definition. We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIES to the schema definition for their systems?

Yes, if they want to conform to the FAIR principles, which they should. It is easy for us to add CURIEs for their systems. The "I"nteroperability of FAIR is where the standard schemas and formats become important. But interoperability doesn't imply that everybody has to agree and conform to the same schema; there are alternative ways to be interoperable.

> To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets. Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.

I'm not sure why you think this would be more effort on our part. It's really effort on their part. The push toward FAIR-ness is pretty much unstoppable at this point, and if they want to be used and relevant then they need to conform to the FAIR principles. Note that this is not a blanket statement that thou must use the AIRR standard. And while "standard" can be used as a bludgeon to keep people in line, it can also just mean that a group agrees to the same set of general principles. While the field does not (technically) require that germline sets be in the AIRR standard format, because this is the AIRR Community and these are our AIRR Standards, there is that implication.

williamdlees · 2024-03-03T21:33:13Z

Scott, I take your point about the URI being a permanent identifier, but to my mind that has nothing to do with CURIEs. Do we have to use a CURIE to comply with FAIR? If not, I would much rather not in this instance, on the grounds that (1) we save no appreciable effort by using a CURIE here, and (2) it does take effort to check out the schema, make a change to it, coordinate with other changes, and do all the other things we end up doing when making a change to the standard.

Thanks for your help,
William

schristley · 2024-03-03T23:00:33Z

> Do we have to use a CURIE to comply with FAIR?

Nope, it is just a useful shorthand. In fact, IEDB isn't using CURIEs, you can see in their export table that they provide IRIs.

There are some small advantages to CURIEs: 1) a CURIE is shorter and thus uses less space in a database, which is likely not relevant for germline sets, but imagine rearrangements where you could be talking GBs of extra data (I don't know if IEDB stores the complete IRI in the database; it may actually just store IDs and construct the IRI when generating the export, which makes this point moot), and 2) if there's ever a crazy reason for https://ogrdb.airr-community.org to be moved, changing the CURIE pointer is a lot easier than rewriting all of the IRIs. But these really are small points. And on the flip side, there's an advantage to just having the IRI because you don't need to do the CURIE resolution.
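Both points can be sketched with a small prefix map. This is an illustration only: the `expand` helper, the prefix definition, and the relocated host `ogrdb.example.org` are assumptions, chosen so that the expansion reproduces the OGRDB URL quoted earlier in the thread.

```python
def expand(curie: str, prefix_map: dict[str, str]) -> str:
    """Expand a CURIE to a full IRI using a prefix map."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise KeyError(f"unknown or missing prefix in {curie!r}")
    return prefix_map[prefix] + local


# ASSUMPTION: illustrative prefix definition, not the one in the schema.
prefix_map = {"OGRDB": "https://ogrdb.airr-community.org/api/"}
stored = "OGRDB:germline/set/Human/IGH_VDJ/8/airr_ex"

iri = expand(stored, prefix_map)
print(iri)
print(len(stored), "characters stored as a CURIE vs", len(iri), "as an IRI")

# If the service moved (hypothetical new host), only the map entry changes;
# every stored CURIE stays exactly as it is.
prefix_map["OGRDB"] = "https://ogrdb.example.org/api/"
print(expand(stored, prefix_map))
```

The flip side mentioned above is visible too: a consumer needs the prefix map and the `expand` step before it can fetch anything, whereas a stored IRI can be used directly.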

javh · 2024-08-12T18:29:42Z

Bump. Did we reach a consensus on this?

williamdlees · 2024-08-12T20:11:27Z

I don't think so. I am all for defining a specific syntax for the germline set reference. From my point of view, there would be no problem making it a CURIE if we're OK with the right-hand side being a URL, which as far as I can see is permitted by the CURIE syntax definition. You could ask 'what's the point?', but I don't think the specific case for making the reference a CURIE has been fully articulated. I think we've probably gone through the arguments for and against other options, so I'll hold off repeating them!

All the best,
William

schristley · 2024-11-24T08:01:44Z

Hi @williamdlees, now with the OGRDB API (https://ogrdb.airr-community.org/api_v2/swagger/#/) defined and live, I think we can resolve this properly. The path /germline/set/{germline_set_id}/{release_version} looks like the most specific way to reference a germline set. That's compatible with using a CURIE, e.g. OGRDB:germline/set/9606.IGH_VDJ/9 where OGRDB is defined as https://ogrdb.airr-community.org/api_v2/. What do you think?
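As a sketch of this proposal, the CURIE can be built from the two path parameters and expanded with the OGRDB prefix definition given above. The helper names `germline_set_curie` and `resolve` are hypothetical, and the code only constructs strings; it makes no request to the API.

```python
# Prefix definition as proposed in this comment.
OGRDB_BASE = "https://ogrdb.airr-community.org/api_v2/"


def germline_set_curie(germline_set_id: str, release_version: str) -> str:
    """Build a CURIE following /germline/set/{germline_set_id}/{release_version}."""
    return f"OGRDB:germline/set/{germline_set_id}/{release_version}"


def resolve(curie: str) -> str:
    """Expand an OGRDB CURIE to its full URL."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix != "OGRDB":
        raise ValueError(f"not an OGRDB CURIE: {curie!r}")
    return OGRDB_BASE + local


curie = germline_set_curie("9606.IGH_VDJ", "9")
print(curie)           # OGRDB:germline/set/9606.IGH_VDJ/9
print(resolve(curie))  # https://ogrdb.airr-community.org/api_v2/germline/set/9606.IGH_VDJ/9
```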

williamdlees · 2024-11-24T15:28:23Z

Hi Scott,

As in previous mails on this thread, my concern isn't so much how OGRDB is represented in this field, but how other sets would be represented. How would a user represent an IMGT reference set, or one from the MiXCR site, for example? We can say that it would be helpful if MiXCR and IMGT defined CURIEs that we could incorporate into our schema, and committed to maintain them as permanent identifiers, but I think there's very little chance that they will. Perhaps we could have a compound of two fields, a CURIE and a URI, or also incorporate a third field containing text that describes the source and the revision date?

All the best,
William

schristley · 2024-11-24T17:52:14Z

Hi @bcorrie, in this current PR there are two fields in data processing, germline_database and germline_set_ref: germline_database holds a text description of the germline set, while germline_set_ref is a resolvable CURIE (currently only OGRDB supports this). Would the idea be to keep both of these fields in data processing? If yes, then I think that resolves William's issue above of having a way to describe the source when a CURIE isn't available.

schristley · 2024-11-24T17:58:50Z

@bcorrie Should we only use the CURIE if it is resolvable and provides an AIRR-compatible GermlineSet JSON response? Or should we also use a CURIE for IMGT, as you suggest above, to point to a web file? The former gives us the possibility of doing automated stuff (in the AKC, for example) because we are guaranteed the data format; in the latter case we would have to know in which cases we can do that (OGRDB) and in which cases we cannot (IMGT, and maybe others).
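If CURIEs were allowed for both kinds of source, the extra knowledge described in the second option could be held in a per-prefix table. The following is a hypothetical sketch only: the table `RETURNS_AIRR_GERMLINE_SET` and the function `can_automate` are illustrative names, and the flags simply encode the OGRDB and IMGT cases mentioned in the question.

```python
# ASSUMPTION: hypothetical registry recording whether a prefix resolves to an
# AIRR-compatible GermlineSet JSON response (OGRDB) or not (IMGT).
RETURNS_AIRR_GERMLINE_SET = {
    "OGRDB": True,
    "IMGT": False,
}


def can_automate(germline_set_ref: str) -> bool:
    """True only when the prefix is known to return AIRR GermlineSet JSON."""
    prefix, sep, _ = germline_set_ref.partition(":")
    return bool(sep) and RETURNS_AIRR_GERMLINE_SET.get(prefix, False)


for ref in ("OGRDB:germline/set/9606.IGH_VDJ/9", "IMGT:Homo sapiens:2022.1.31"):
    print(ref, "->", "automate" if can_automate(ref) else "manual handling")
```

Under the first option no such table is needed, because any value present in germline_set_ref would be guaranteed to resolve to the expected format.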

schristley · 2024-12-17T18:00:43Z

> Or should we also use CURIE for IMGT as you suggest above to point to a web file?

Feedback from the AKC Schema KG is that we should also use a "CURIE" for IMGT and other germline databases, so that we have consistent reference IDs for them. "CURIE" is in quotes because it was mentioned that CURIE has a quite specific definition and its use in the AIRR Standards isn't always consistent with that definition, so we might want to call it something else just to avoid confusion. Regardless, an ID or URI would be preferable to the open text field of germline_database.

Add OGRDB germline_set_ref as a CURIE

0dc4f75

schristley self-requested a review March 2, 2024 18:20

bcorrie mentioned this pull request Mar 5, 2024

CURIE conundrum airr-knowledge/issues#32

Closed

javh added this to the AIRR 2.0 milestone Aug 12, 2024
