Change germline_set_ref to be a CURIE #770
@williamdlees I made an attempt to make germline_set_ref a CURIE. It passes checks, have a look at what I did and see if it seems OK. |
The main issue I see is that the python test data has IDs from IMGT for germline_set_ref of the form:
Would this be referring to one of these: https://www.imgt.org/download/V-QUEST/archives/ |
Also, the CURIE PREFIX I used is OGRDB_GERMLINESET. CURIEs are supposed to make things short, and this prefix is long but if you are going to have other things that you want to look up (Alleles?) using a CURIE, we need something descriptive and different.
|
Thanks Brian. This looks good to me and I think it is the way to go. If other providers of germline sets (I can only think of Mixcr and IMGT for the time being) wish to offer sets in MiAIRR format, we can encourage them to provide a standardised URL and create a CURIE. But until they do, it isn't really an issue. And we do have other fields for name, date and version that can fill in if necessary. |
Hi @williamdlees , my plan for our AKC 1-on-1 next week was to review this with you in the context of the OGRDB API because right now what's in the pull request does not resolve to a URL that returns the germline set. We will need to change the PR and/or the API so that it does. It won't be hard to do. |
@bcorrie My suggestion is to not make separate CURIEs for each data type. If you remember how James described it, there is a global part and a local part. |
I’ve been thinking about Scott’s helpful comments.
Currently the field is described as ‘Unique identifier of the germline set and version, in standardized form (Repo:Label:Version)’. I think the overall motivation of this change is to provide a URI that will download the set, rather than this standardized form. That seems a reasonable thing to do, and doesn’t impact any user code out there, because the standardized form isn’t really usable by code today. An example of a URL from OGRDB, which works today, would be https://ogrdb.airr-community.org/api/germline/set/Human/IGH_VDJ/8/airr_ex. As I understand a CURIE, it would provide a shorthand that avoids the need to write out the URL in full in the rest of the schema definition. The thing that worries me is that it bakes OGRDB into the definition.
We don’t intend OGRDB to be the only source of germline sets. It would be great if others, for example IMGT and MiLabs, started to support the MiAIRR standard for germline sets. However, if they do this, would we need to add CURIEs to the schema definition for their systems? To do so would be much more effort on our part than we save by adding a URL shortcut. And I fear it might also discourage them, and send a general message that, for the MiAIRR standard, OGRDB is the single repository for reference sets.
Is there a way to use a CURIE optionally in a URI? If so, this might be a way forward, although I don’t see much benefit in the CURIE here myself. The alternative would be to change the field description to ‘URL of the germline set in the repository from which it can be downloaded’ and use the URL above as an example.
All the best
William
|
That's correct, though there is a bit more to this. We want germline sets to conform to the FAIR principles, and more importantly for the OGRDB service (and the data it provides like germline sets) to conform to the FAIR principles.
Exactly, it is primarily shorthand for the full URL, but it is important not to think of it purely as "URL to download the germline set". It is more than that. It is a permanent ID so that if you look at two studies, you can simply compare the IDs to know that they are using the exact same germline set, or not. It is also "F"indable and "R"eusable because I can download the exact same germline set based upon its ID. Some people tend to interpret the "R" as reproducibility.
Yes, if they want to conform to the FAIR principles, which they should. It is easy for us to add CURIEs for their systems. The "I"nteroperability of FAIR is where the standard schemas and formats become important. But interoperability doesn't imply that everybody has to agree and conform to the same schema; there are alternative ways to be interoperable.
I'm not sure why you think this would be more effort on our part. It's really effort on their part. The push toward FAIR-ness is pretty much unstoppable at this point, and if they want to be used and relevant then they need to conform to the FAIR principles. Note that this is not a blanket statement that "thou must use the AIRR standard". And while "standard" can be used as a bludgeon to keep people in line, it can also just mean that a group agrees to the same set of general principles. While the field does not (technically) require that germline sets be in the AIRR standard format, because this is the AIRR Community and these are our AIRR Standards, there is that implication. |
|
Scott, I take your point about the URI being a permanent identifier, but to my mind that has nothing to do with CURIEs. Do we have to use a CURIE to comply with FAIR? If not, I would much rather not in this instance, on the grounds that (1) we save no appreciable effort by using a CURIE here and (2) it does take effort to check out the schema, make a change to it, coordinate with other changes, and do everything else we end up doing when making a change to the standard.
Thanks for your help
William
|
Nope, it is just a useful shorthand. In fact, IEDB isn't using CURIEs; you can see in their export table that they provide IRIs. There are some small advantages to CURIEs: 1) a CURIE is shorter and thus uses less space in a database, likely not relevant for germline sets, but imagine rearrangements where you could be talking GBs of extra data (I don't know if IEDB stores the complete IRI in the database; it may actually just store IDs and construct the IRI when generating the export, which makes this point moot), and 2) if there's ever a crazy reason for https://ogrdb.airr-community.org to be moved, changing the CURIE pointer is a lot easier than rewriting all of the IRIs. But these really are small points. And on the flip side, there's an advantage to just having the IRI because you don't need to do the CURIE resolution. |
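To make the trade-off between a stored CURIE and a stored IRI concrete, here is a minimal Python sketch of CURIE expansion. The prefix OGRDB_GERMLINESET and the example URL come from earlier in this thread; the prefix map, the helper name `expand_curie`, and the relocated host `ogrdb.example.org` are assumptions made purely for illustration and are not what the PR or the AIRR library implements.

```python
# Hypothetical prefix map: prefix name from this thread, base URL assumed
# from the example URL quoted earlier (not taken from the PR itself).
PREFIX_MAP = {
    "OGRDB_GERMLINESET": "https://ogrdb.airr-community.org/api/germline/set/",
}

# Hypothetical map after the service moves to a different host.
MOVED_MAP = {
    "OGRDB_GERMLINESET": "https://ogrdb.example.org/api/germline/set/",
}


def expand_curie(curie: str, prefix_map: dict[str, str]) -> str:
    """Expand 'PREFIX:local_id' into a full IRI using the given prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise ValueError(f"unknown or missing CURIE prefix in {curie!r}")
    return prefix_map[prefix] + local_id


# The value stored in a record stays the same in both cases.
stored_ref = "OGRDB_GERMLINESET:Human/IGH_VDJ/8/airr_ex"

print(expand_curie(stored_ref, PREFIX_MAP))
print(expand_curie(stored_ref, MOVED_MAP))
```

The stored reference never changes; only the one-line prefix map does. The cost is the extra expansion step that a plain IRI would not need.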
Bump. Did we reach a consensus on this? |
I don't think so.
I am all for defining a specific syntax for the germline set reference.
From my point of view, there would be no problem making it a CURIE if
we're ok with the right-hand-side being a URL, which as far as I can see
is permitted by the CURIE syntax definition. You could say 'what's the
point'? but I don't think the specific case for making the reference a
CURIE has been fully articulated.
I think we've probably gone through the arguments for and against other
options, so I'll hold off repeating them!
All the best
William
|
Hi @williamdlees , now with the OGRDB API (https://ogrdb.airr-community.org/api_v2/swagger/#/) defined and live, I think we can resolve this properly. The path /germline/set/{germline_set_id}/{release_version} looks like the most specific way to reference a germline set. That's compatible with using a CURIE, e.g. OGRDB:germline/set/9606.IGH_VDJ/9 where OGRDB is defined as https://ogrdb.airr-community.org/api_v2/. What do you think? |
Hi Scott,
As in previous mails on this thread, my concern isn't so much how OGRDB
is represented in this field, but how other sets would be represented.
How would a user represent an IMGT reference set, or one from the Mixcr
site, for example? We can say that it would be helpful if Mixcr and IMGT
defined CURIEs that we could incorporate into our schema, and committed
to maintain them as permanent identifiers, but I think there's very
little chance that they will. Perhaps we could have a compound of two
fields: a CURIE and a URI, or also incorporate a third field that
contains a text field describing the source and the revision date?
All the best
William
|
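As a rough illustration of the proposal quoted in this thread (OGRDB:germline/set/9606.IGH_VDJ/9, with OGRDB defined as https://ogrdb.airr-community.org/api_v2/), the following Python sketch parses such a reference into the two path parameters and rebuilds the URL. The class and function names are hypothetical, and whether the API actually returns a germline set at the resulting URL is not verified here.

```python
from typing import NamedTuple

# Values taken from the proposal in this thread.
OGRDB_PREFIX = "OGRDB"
OGRDB_BASE = "https://ogrdb.airr-community.org/api_v2/"
PATH_TEMPLATE = "germline/set/{germline_set_id}/{release_version}"


class GermlineSetRef(NamedTuple):
    """Hypothetical structured form of a germline_set_ref value."""
    germline_set_id: str
    release_version: str


def parse_ref(curie: str) -> GermlineSetRef:
    """Split 'OGRDB:germline/set/<id>/<version>' into its two parameters."""
    prefix, _, local = curie.partition(":")
    parts = local.split("/")
    if prefix != OGRDB_PREFIX or len(parts) != 4 or parts[:2] != ["germline", "set"]:
        raise ValueError(f"not an OGRDB germline set reference: {curie!r}")
    return GermlineSetRef(parts[2], parts[3])


def to_curie(ref: GermlineSetRef) -> str:
    """Render the reference back into its compact CURIE form."""
    return f"{OGRDB_PREFIX}:" + PATH_TEMPLATE.format(**ref._asdict())


def to_url(ref: GermlineSetRef) -> str:
    """Build the full URL from the base and the path template."""
    return OGRDB_BASE + PATH_TEMPLATE.format(**ref._asdict())


ref = parse_ref("OGRDB:germline/set/9606.IGH_VDJ/9")
print(ref.germline_set_id, ref.release_version)
print(to_url(ref))
```

Because set ID and release version are both part of the reference, two studies use the same germline set exactly when their parsed references compare equal.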
Hi @bcorrie , in this current PR, there are two fields |
@bcorrie Should we only use the CURIE if it is resolvable and provides an AIRR-compatible GermlineSet JSON response? Or should we also use CURIE for IMGT as you suggest above to point to a web file? The former gives us the possibility of doing automated stuff (in the AKC for example) because we are guaranteed of the data format, in the latter case we would have to know in which cases we can do that (OGRDB) and which cases we cannot (IMGT, and maybe others) |
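One way to picture the distinction raised here is a prefix registry that records, per provider, whether the expanded reference is expected to return AIRR-compatible GermlineSet JSON. This is only a sketch: the `IMGT` prefix, its base URL (borrowed from the archive link earlier in the thread), the file name `example_release.zip`, and the `returns_airr_json` flag are all hypothetical, and nothing like this registry exists in the schema today.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provider:
    base_url: str
    returns_airr_json: bool  # assumed flag: can tools parse the response as GermlineSet?


# Hypothetical registry; only the OGRDB entry reflects the proposal in this thread.
PROVIDERS = {
    "OGRDB": Provider("https://ogrdb.airr-community.org/api_v2/", True),
    "IMGT": Provider("https://www.imgt.org/download/V-QUEST/archives/", False),
}


def plan_access(ref: str) -> Tuple[str, Optional[str]]:
    """Decide how a tool could treat a germline_set_ref value.

    Returns ("fetch_json", url) when automated processing is possible,
    ("link_only", url) when the reference merely points at a web file,
    and ("unknown", None) for free text or unregistered prefixes.
    """
    prefix, sep, local = ref.partition(":")
    provider = PROVIDERS.get(prefix) if sep else None
    if provider is None:
        return ("unknown", None)
    url = provider.base_url + local
    action = "fetch_json" if provider.returns_airr_json else "link_only"
    return (action, url)


print(plan_access("OGRDB:germline/set/9606.IGH_VDJ/9"))
print(plan_access("IMGT:example_release.zip"))  # hypothetical file name
print(plan_access("some free text"))
```

With a flag like this, consistent IDs can be used for every provider while automated tooling still knows which references it can safely dereference.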
Feedback from the AKC Schema KG is that we should also use "CURIE" for IMGT and other germline databases to have consistent reference IDs for them. "CURIE" in quotes because it was mentioned that CURIE has a quite specific definition and AIRR Standards use isn't always consistent with that definition, we might want to call it something else just to avoid confusion. Regardless, an ID or URI would be preferable to the open text field of |
Closes #553