MODs Identifier dump #2
Comments
Hi Julie - It would be super helpful to have (even informal) use cases for these kinds of requests? It would help (at least ZFIN) fulfill the request and possibly initiate a discussion about #6 in this repo. |
I'm gonna reboot this conversation if possible. Sierra - I think the purpose is for us to simply explore the range of IDs that are hosted at the MODs, for the eventual purpose of elements such as external cross references to other resources, pointers from annotation elements to things like GO ids or EC numbers, and to determine if the name-space/usage of those items is consistent across the MODs. The other purpose - I believe - is to look at the internal content - identifiers linked to organisms, strains, variants, genes - for a similar purpose: we are wondering how the practices for IDs are handled across the groups. Does that help? |
It doesn’t make sense to provide internal IDs. They are internal for a reason; we don’t share them because we don’t want users to use them. There was no response on this because the question didn’t make sense to us. |
It sounds to me like the concepts of public and resolvable might be being conflated? Just to break it down... This task can be scoped to just the IDs that appear in the data dumps. Any ID that is so deeply internal that it doesn't make it even as far as the dumps can be safely ignored for the purpose of this task. However, any ID that DOES make it to the dumps should be described in such a way that the ID is not abused by others using the dumps (e.g. mistaken for durable when it is not, mistaken for resolvable when it is not, mistaken for the same ID when it is different, or mistaken for different when it is the same).
Does this make sense? |
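To make the distinction above concrete, here is a minimal sketch of how a dump could describe each kind of ID so that durability and resolvability are stated explicitly rather than assumed. The class name, field names, prefixes, and the example.org URL template are all hypothetical; they are not the columns of the actual spreadsheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdDescription:
    """Hypothetical description of one kind of identifier found in a dump."""
    prefix: str                         # illustrative prefix, not a registered one
    example: str                        # one representative identifier
    durable: bool                       # safe to cite long-term?
    resolvable: bool                    # does a public URL exist for it?
    url_template: Optional[str] = None  # "{id}" marks where the local ID goes

    def resolve(self, local_id: str) -> Optional[str]:
        # Return a URL only when the ID type is declared resolvable,
        # so consumers cannot mistake an internal ID for a public one.
        if not self.resolvable or self.url_template is None:
            return None
        return self.url_template.format(id=local_id)


# Hypothetical entries: one public ID type and one that merely appears in dumps.
public_ids = IdDescription("EX", "EX:0001", True, True, "https://example.org/{id}")
dump_only_ids = IdDescription("EX_INTERNAL", "42", False, False)

print(public_ids.resolve("0001"))
print(dump_only_ids.resolve("42"))
```

Declaring the two flags separately is what lets a consumer tell "public but not resolvable" apart from "resolvable but not durable".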
Julie, Thanks, that clears up the question. It will now be straightforward to provide you our IDs that are on webpages or in dump files. MOD policies say these IDs should all be resolvable. I'll pass on this refined request. So the URL for the sheet is easier to find, copying it here again: https://docs.google.com/spreadsheets/d/1orgx-657PUQE0qBpFRPEbsKDDynaLfke-UxcEVR_pxA/edit#gid=0 |
For some background on the thorny edge cases of public but stable, public but not resolvable, etc. and why we (monarch) care about them, feel free to look here. The issue is very old and unresolved. Some of the comments are now obsolete / overtaken by events, but the principles are still the same. |
A while ago, I also wrote up a summary here of what the identifier surrogacy options are for integrators. Happy to have feedback on it. |
@jmcmurry Just to clarify: you would like each MOD to produce a file containing all of their (dump-containing) IDs, in the format described by the spreadsheet. Correct? Would you like us to host these files so that they can be picked up? Or should we deposit them somewhere central? |
Yes; please deposit in the cloud somewhere and send the link; thanks :) |
Just to get started, I created a google doc that is a copy of Julie's. |
P.S. Yes, it's editable |
No, please. I would very much like to suggest that we use a BDBag for this, not a spreadsheet. We can help you with this.
A BDBag has a proper manifest and checksums, can be retrieved with existing tooling, and can contain metadata. We have already seen, in the initial instance, that using a bag increased the FAIRness of the data exchange.
Thanks,
Carl
…----------------------------------------------------------
Dr. Carl Kesselman
Dean’s Professor, Epstein Department of Industrial and Systems Engineering
Fellow, Information Sciences Institute
Viterbi School of Engineering
Professor, Preventive Medicine
Keck School of Medicine
University of Southern California
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292-6695
Phone: +1 (310) 448-9338
Email: [email protected]<mailto:[email protected]>
Web: http://www.isi.edu/~carl
|
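As a rough illustration of what the manifest and checksums mentioned above provide, the following sketch computes a BagIt-style SHA-256 manifest for the files in a payload directory using only the Python standard library. It deliberately does not use the BDBag tooling itself, and the function names are made up for this example; a real bag would be created with the BDBag or BagIt tools.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(data_dir: Path) -> List[str]:
    """Build '<checksum>  data/<relative path>' lines, one per payload file."""
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(data_dir).as_posix()
        lines.append(f"{sha256_of(path)}  data/{relative}")
    return lines


if __name__ == "__main__":
    # Hypothetical usage: list checksums for everything under ./data
    for line in manifest_lines(Path("data")):
        print(line)
```

A recipient can recompute the same digests after download and compare them with the manifest, which is what makes a transferred dump verifiable.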
Since I don't know anything about BDBags, I can't comment or assist. I'm still confused as to the scope of this request, even with the qualifiers "publicly resolvable" and "in a dump file". Which dump file? Any dump file? ANYthing that's resolvable at MGI by any public identifier (MGI: or external)? I'm guessing there would be well over 100 lines for MGI. Or am I missing something? |
Hi @jmcmurry - Building on what @JoelRichardson said, do you have an ID-resolving map for all possible ID prefixes anywhere yet? I.e., it would probably be good if we provided cross references to the same resource with the same prefix, e.g. NCBI_Gene vs. Gene. There are many cases not covered by the file at GO nor by the Alliance one that we've started, if we include ontological cross references (which many of the MODs store, and which would fall under this generic request without clarification?). Thx again, Sierra |
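The prefix question can be sketched as a small synonym table that maps the variants different groups use onto one agreed prefix. The table below is hypothetical: treating "NCBI_Gene" as the canonical form is an assumption for illustration, not a decision recorded in this thread, and the function name is invented.

```python
# Hypothetical synonym table: keys are lower-cased prefix variants,
# values are the single prefix everyone would agree to emit.
PREFIX_SYNONYMS = {
    "gene": "NCBI_Gene",
    "ncbi_gene": "NCBI_Gene",
    "ncbigene": "NCBI_Gene",
}


def normalize_curie(curie: str) -> str:
    """Rewrite 'prefix:local_id' so that known prefix variants agree."""
    prefix, separator, local_id = curie.partition(":")
    if not separator:
        raise ValueError(f"not a prefixed identifier: {curie!r}")
    prefix = prefix.strip()
    # Unknown prefixes pass through unchanged rather than being guessed at.
    canonical = PREFIX_SYNONYMS.get(prefix.lower(), prefix)
    return f"{canonical}:{local_id.strip()}"


print(normalize_curie("Gene:12345"))
print(normalize_curie("NCBI_Gene:12345"))
```

With a shared table like this, two dumps that cross-reference the same external record produce identical strings, so the records can be joined without per-MOD special cases.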
@jmcmurry - re: the doc with ZIRC as an example - another option for your document, might be to use ZFIN (in this example, possibly other MODs as well) as the id resolver for these biological materials. As I understand it, since ZL#'s represent biological material that can be discontinued, they aren't good ids to use in perpetuity. Many resource centers are like this as you point out. ZFIN however, stores the representative content of these materials and could act as a "permanent" resource. |
Ok, so this is just a single file with the IRIs for the terms? If we want to include that term list with other data, or the term list is in more than one file, then you will likely want to have them in a well-defined aggregate. If it is just a single file, then what I would suggest is that we identify someplace to store it (AWS S3?), mint an identifier for it (we can do that), and use that to reference this dump.
Thanks,
Carl
|
Would you like this file regularly, or is this a one time survey? |
I (at least) would benefit from some additional guidance on the scope. I will use a specific example: in WormBase, gene records have primary ids of the form WBGene\d+ (e.g. WBGene00006763). These appear in our dumps, and are resolvable (kind of; see below). However, there are a bunch of other identifiers associated with a gene: symbol, systematic name, previous names etc. (e.g. "unc-26", "JC8.10", "CELE_JC8.10"). We conceptually treat these as properties of the gene records, rather than identifiers; and in general, they are not resolvable (although they appear in our dumps and are searchable, and if the search results in one clear unambiguous hit, a redirect to the entity page results). Is it correct that for the sake of this exercise, all of these should be considered as "identifiers", and included in the file? In that case, how should non-resolvable ids be represented in the file?
For WormBase, there is a further complication that our primary entity ids are generally resolvable only on a class-by-class basis. This is due to a decision taken early in the life of the project to re-use identifiers between classes (e.g. there are two objects identified as "JC8.10a", one being a Transcript and the other being the CDS of that transcript). In our interactions with GO, we have addressed this by treating WormBase as a collection of resources, with each class/data-type having its own prefix (e.g. WB for genes, WB_REF for publications, WBls for life-stage ontology terms). However, prefixes have only been assigned for data types that pop up in our exchanges with GO.
Since this exercise requires us to be comprehensive and consider all data types, we will need to generate more prefixes. Would you advise we do this unilaterally? Or is there a third-party central agency that we should work with to do that? |
I would consider symbols and names as disjoint from identifiers, but YMMV. I suggest that, as the Alliance is already using the GO prefix registry, this be extended for other types too. I can work with KC2 to ensure this propagates to identifiers.org / n2t.net (we're already doing this, e.g. ensuring the TAIR records are in sync). |
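The two WormBase points above can be sketched in a few lines: the WBGene\d+ pattern separates primary gene ids from names and symbols, and keying records by (class, id) shows why a bare id such as "JC8.10a" is ambiguous without a class-specific prefix. The registry contents and function names are hypothetical and only mirror the examples given in the comment.

```python
import re
from typing import Dict, List, Tuple

# Pattern taken from the comment: primary gene ids look like WBGene + digits.
WB_GENE_ID = re.compile(r"WBGene\d+")


def is_primary_gene_id(token: str) -> bool:
    """True only when the whole token matches the primary gene id pattern."""
    return WB_GENE_ID.fullmatch(token) is not None


# Hypothetical class-scoped registry: the same local id is reused across classes.
REGISTRY: Dict[Tuple[str, str], str] = {
    ("Transcript", "JC8.10a"): "transcript record",
    ("CDS", "JC8.10a"): "CDS of that transcript",
}


def classes_for(local_id: str, registry: Dict[Tuple[str, str], str]) -> List[str]:
    """List every class in which a bare local id occurs."""
    return sorted(cls for (cls, ident) in registry if ident == local_id)


for token in ["WBGene00006763", "unc-26", "JC8.10", "CELE_JC8.10"]:
    print(token, is_primary_gene_id(token))

# More than one class means the bare id cannot be resolved on its own.
print(classes_for("JC8.10a", REGISTRY))
```

Any id for which `classes_for` returns more than one class would need a per-class prefix before it could appear unambiguously in a cross-MOD file.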
Are the examples of SGD identifiers in this spreadsheet what is requested? [https://drive.google.com/open?id=1o54ZlW0fkIqOP8gnLtbXsEbapkAsiZ_o] Just want to be sure I am on the right track before generating an enormous file. |
Great question. The relationships themselves (annotation x or y on a gene) are not needed at this point, as that would admittedly be both onerous and noisy. Not only that, but it would be the worst way to retrieve this info :) High priority:
Extremely low priority / ignore:
** For literature, a pair of IDs per article is fine. Not every single data steward is going to have a perfectly complete set of literature mappings to PMID and DOI and PMCID, so these may need to be retrieved on the fly as required by use cases. |
Thanks for the feedback and clarification. @jmcmurry |
I took a shot at documenting all the different identifiers returned in calls to our API or files on the FTP site. It's currently in a Google doc but is there another location I should submit it? https://docs.google.com/spreadsheets/d/11VIdKEG2JPDNHmdoeK2AZ8KM2Kg5QuKlvzJ84HvhoEE/edit#gid=0 |
(a reference here is already better than we're usually doing, so +1 for starting with this :)
|
@ctb @jmcmurry - does what @jdepons and @gabinkley provided fulfill this request? |
I've updated my list of example identifiers and URLs. I removed the links to annotations and added links to external resources that are equivalent to SGD's identifiers, which @jmcmurry indicated were desired. Please see the updated spreadsheet below: https://docs.google.com/spreadsheets/d/1FtrS-ATOZdvcE3Bjhv8KYakHZexoL9JzalHIe0TQElE/edit?usp=sharing A final question that has been asked before, but hasn't been answered directly: is the request for a file of all identifiers for any example in the spreadsheet, or is just the list of examples sufficient right now? |
Just an update on our meeting today with this as a topic: we agreed to make spreadsheets (and post them here) for each of our MODs with representative ids that we mint at the MOD and are publicly available. The content of the spreadsheet (past this general idea), is up to the MOD. Some will have cross references, some will not. If you need something further, @jmcmurry, please let us know. |
Thanks Sierra, that sounds like a great start. Cross-references are encouraged but optional provided the xrefs can be derived from the raw data in other ways; if that isn't the case, please just let me know and we can revisit later. |
We would like a dump of all of the identifiers from each of the MODs in this format. Note that this includes any internal IDs, which may or may not be resolvable. This will require going beyond the information provided so far by the Alliance (the identifier documentation as well as the manifest recently uploaded to the Amazon cloud).