MODs Identifier dump #2

jmcmurry · 2018-03-09T18:59:14Z

We would like a dump of all the identifiers from each of the MODs in this format. Note that this includes any internal IDs, which may or may not be resolvable. This will require going beyond the information provided so far by the Alliance (the identifier documentation as well as the manifest recently uploaded to the Amazon cloud).

sierra-moxon · 2018-03-21T23:11:00Z

Hi Julie - It would be super helpful to have (even informal) use cases for these kinds of requests. It would help us (at least ZFIN) fulfill the request and possibly initiate a discussion about #6 in this repo.
thanks a bunch! Sierra

owhite · 2018-04-13T14:43:24Z

I'm gonna reboot this conversation if possible.

Sierra - I think the purpose is simply for us to explore the range of IDs that are hosted at the MODs, with an eventual view to elements such as external cross-references to other resources and pointers from annotation elements to things like GO IDs or EC numbers, and to determine whether the namespace/usage of those items is consistent across the MODs. The other purpose - I believe - is to look at the internal content (identifiers linked to organisms, strains, variants, genes) for a similar reason: we are wondering how ID practices are handled across the groups.

Does that help?

jmcherry-zz · 2018-04-14T01:10:20Z

Doesn’t make sense to provide internal IDs. They are internal for a reason, we don’t share them because we don’t want users to use them.

There was no response on this because the question didn’t make sense to us.

jmcmurry · 2018-04-19T04:04:36Z

It sounds to me like the concepts of public and resolvable may be getting conflated? Just to break it down...

This task can be scoped to just the IDs that appear in the data dumps. Any ID that is so deeply internal that it doesn't make it even as far as the dumps can be safely ignored for the purpose of this task. However, any ID that DOES make it to the dumps should be described in such a way that the ID is not abused by others using the dumps (e.g. mistaken for durable when it is not, mistaken for resolvable when it is not, mistaken for the same ID when it is different, or mistaken for different when it is the same).

| | Publicly resolvable | Not publicly resolvable |
|---|---|---|
| Appears in data dumps | high value | potentially valuable if durable |
| Doesn't appear in data dumps | not a thing | no one cares |

Does this make sense?
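
As a rough illustration only, the four cells of the table above can be written as a small lookup. The function name `triage` and the two boolean flags are hypothetical names chosen for this sketch, not part of any MOD or Alliance tooling; the returned strings are the cell labels from the table.

```python
# Minimal sketch of the 2x2 table above.
# `triage` and its parameter names are hypothetical (not from any real tool).

# Keys are (appears_in_dumps, publicly_resolvable); values are the table cells.
TRIAGE_TABLE = {
    (True, True): "high value",
    (True, False): "potentially valuable if durable",
    (False, True): "not a thing",
    (False, False): "no one cares",
}


def triage(appears_in_dumps: bool, publicly_resolvable: bool) -> str:
    """Return the table cell for one kind of identifier."""
    return TRIAGE_TABLE[(appears_in_dumps, publicly_resolvable)]


# Only identifiers that appear in the dumps are in scope for this task.
in_scope = [cell for (in_dumps, _), cell in TRIAGE_TABLE.items() if in_dumps]
```

Under this reading, only the first row of the table (the `in_scope` cells) needs to be described in the requested file.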

jmcherry-zz · 2018-04-19T04:27:09Z

Julie,

Thanks, that clears up the question. It will now be straightforward to provide you our IDs that are on webpages or in dump files. MOD policies say these IDs should all be resolvable. I'll pass on this refined request.

So the URL for the sheet is easier to find, copying it here again:

https://docs.google.com/spreadsheets/d/1orgx-657PUQE0qBpFRPEbsKDDynaLfke-UxcEVR_pxA/edit#gid=0

jmcmurry · 2018-04-19T04:53:15Z

For some background on the thorny edge cases of public but stable, public but not resolvable, etc., and why we (Monarch) care about them, feel free to look here. The issue is very old and unresolved. Some of the comments are now obsolete / overtaken by events, but the principles are still the same.

jmcmurry · 2018-04-19T04:56:36Z

A while ago, I also wrote up a summary here of what the identifier surrogacy options are for integrators. Happy to have feedback on it.

khowe · 2018-04-19T14:17:48Z

@jmcmurry Just to clarify: you would like each MOD to produce a file containing all of their (dump-containing) IDs, in the format described by the spreadsheet. Correct?

Would you like us to host these files so that they can be picked up? Or should we deposit them somewhere central?

jmcmurry · 2018-04-19T14:22:45Z

Yes; please deposit in the cloud somewhere and send the link; thanks :)

JoelRichardson · 2018-04-19T14:23:29Z

Just to get started, I created a google doc that is a copy of Julie's.
https://docs.google.com/spreadsheets/d/1A5-_doKRTdepELYTZoCVUTVgyAOK2enRaBhZxoNks1w/edit?usp=sharing

JoelRichardson · 2018-04-19T14:23:56Z

P.S. Yes, it's editable

carlkesselman · 2018-04-19T14:48:57Z

No, please. I would very much like to suggest we use a BDBag for this, not a spreadsheet. We can help you with this. A BDBag has a proper manifest and checksums, can be retrieved with tooling, and can contain metadata. We have already seen, in the initial instance, that using a bag increased the FAIRness of the data exchange. Thanks, Carl

JoelRichardson · 2018-04-19T15:01:23Z

Since I don't know anything about BDBags, I can't comment or assist.
In any case...

I'm still confused as to the scope of this request, even with the qualifiers "publicly resolvable" and "in a dump file". Which dump file? Any dump file? ANYthing that's resolvable at MGI by any public identifier (MGI: or external)? I'm guessing there would be well over 100 lines for MGI. Or am I missing something?

sierra-moxon · 2018-04-19T15:31:45Z

Hi @jmcmurry - Building on what @JoelRichardson said, do you have an ID-resolving map for all possible ID prefixes anywhere yet? I.e., it would probably be good if we provided cross-references to the same resource with the same prefix (e.g. NCBI_Gene vs. Gene). There are many cases covered by neither the file at GO nor the Alliance one that we've started, if we include ontological cross-references (which many of the MODs store, and which would fall under this generic request without clarification?). Thx again, Sierra

sierra-moxon · 2018-04-19T15:59:34Z

@jmcmurry - re: the doc with ZIRC as an example - another option for your document might be to use ZFIN (in this example; possibly other MODs as well) as the ID resolver for these biological materials. As I understand it, since ZL#s represent biological material that can be discontinued, they aren't good IDs to use in perpetuity. Many resource centers are like this, as you point out. ZFIN, however, stores the representative content of these materials and could act as a "permanent" resource.

carlkesselman · 2018-04-19T18:39:57Z

Ok, so this is just a single file with the IRIs for the terms? If we want to include that term list with other data, or the term list is in more than one file, then you will want to have them in a well-defined aggregate. If it is just a single file, then what I would suggest is that we identify someplace to store it (AWS S3?), mint an identifier for it (we can do that), and use that to reference this dump. Thanks, Carl

sierra-moxon · 2018-04-24T20:26:42Z

Would you like this file regularly, or is this a one-time survey?
Do DOIs, ISSNs, and ORCID iDs count? How about ontology xrefs? Ontology IDs? Or just IDs that the MOD mints itself and distributes? If a MOD mints an ID to provide an internal reference to an ontology or xref, do you want those? The Biolink column is optional (would a SO term ID work)? Is this all of our IDs, or just a representative sample filled into that spreadsheet? Are there any times already scheduled when we could hash this out on a phone call? Thanks a bunch, @jmcmurry @carlkesselman

khowe · 2018-04-25T09:59:36Z

I (at least) would benefit from some additional guidance on the scope. I will use a specific example: in WormBase, gene records have primary IDs of the form WBGene\d+ (e.g. WBGene00006763). These appear in our dumps, and are resolvable (kind of; see below). However, there are a bunch of other identifiers associated with a gene: symbol, systematic name, previous names, etc. (e.g. "unc-26", "JC8.10", "CELE_JC8.10"). We conceptually treat these as properties of the gene records, rather than identifiers, and in general they are not resolvable (although they appear in our dumps and are searchable, and if a search results in one clear, unambiguous hit, it redirects to the entity page).
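
To make the distinction concrete, here is a minimal sketch that separates primary gene IDs from names and symbols using the WBGene\d+ pattern quoted above. The function name `is_primary_gene_id` is hypothetical, and the pattern only checks the shape of a string; it says nothing about whether the ID exists or resolves.

```python
import re

# Pattern taken from the comment above: primary gene IDs look like WBGene\d+.
WBGENE_PATTERN = re.compile(r"WBGene\d+")


def is_primary_gene_id(value: str) -> bool:
    """True if the whole string has the shape of a primary WormBase gene ID.

    Hypothetical helper for illustration; it checks format only.
    """
    return WBGENE_PATTERN.fullmatch(value) is not None


# Example values from the comment above: one primary ID plus names/symbols.
examples = ["WBGene00006763", "unc-26", "JC8.10", "CELE_JC8.10"]
primary_ids = [value for value in examples if is_primary_gene_id(value)]
names_and_symbols = [value for value in examples if not is_primary_gene_id(value)]
```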

Is it correct that for the sake of this exercise, all of these should be considered as "identifiers", and included in the file? In that case, how should non-resolvable ids be represented in the file?

For WormBase, there is a further complication that our primary entity ids are generally resolvable only on a class-by-class basis. This is due to a decision taken early in the life of the project to re-use identifiers between classes (e.g. there are two objects identified as "JC8.10a", one being a Transcript and the other being the CDS of that transcript).

In our interactions with GO, we have addressed this by treating WormBase as a collection of resources, with each class/data type having its own prefix (e.g. WB for genes, WB_REF for publications, WBls for life-stage ontology terms). However, prefixes have only been assigned for data types that pop up in our exchanges with GO. Since this exercise requires us to be comprehensive and consider all data types, we will need to generate more prefixes. Would you advise we do this unilaterally? Or is there a third-party central agency that we should work with to do that?
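
A small sketch of why per-class prefixes help: the same local ID becomes two distinct prefixed identifiers. The prefixes WB, WB_REF and WBls are the ones mentioned above; the class names used as dictionary keys and the prefixes `WB_TRANSCRIPT` and `WB_CDS` are made up for illustration and are not registered anywhere.

```python
# Sketch only: class-scoped prefixes turn a reused local ID into distinct IDs.
PREFIX_BY_CLASS = {
    "Gene": "WB",                   # prefix mentioned in the thread
    "Publication": "WB_REF",        # prefix mentioned in the thread
    "LifeStage": "WBls",            # prefix mentioned in the thread
    "Transcript": "WB_TRANSCRIPT",  # hypothetical, for illustration only
    "CDS": "WB_CDS",                # hypothetical, for illustration only
}


def to_prefixed_id(entity_class: str, local_id: str) -> str:
    """Build a prefix:local_id string for a class that has a prefix."""
    if entity_class not in PREFIX_BY_CLASS:
        raise ValueError(f"no prefix assigned for class {entity_class!r}")
    return f"{PREFIX_BY_CLASS[entity_class]}:{local_id}"


# "JC8.10a" names both a Transcript and a CDS; the prefixes keep them apart.
transcript_id = to_prefixed_id("Transcript", "JC8.10a")
cds_id = to_prefixed_id("CDS", "JC8.10a")
```

A class without an assigned prefix raises an error here, which mirrors the gap described above: data types that never came up in exchanges with GO have no prefix yet.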

cmungall · 2018-04-25T22:56:02Z

I would consider symbols and names as disjoint from identifiers, but YMMV.

I suggest that, as the Alliance is already using the GO prefix registry, this be extended for other types too. I can work with KC2 to ensure this propagates to identifiers.org / n2t.net (we're already doing this, e.g. ensuring the TAIR records are in sync).

gabinkley · 2018-04-26T19:47:48Z

Are the examples of SGD identifiers in this spreadsheet what is requested?

[https://drive.google.com/open?id=1o54ZlW0fkIqOP8gnLtbXsEbapkAsiZ_o]

Just want to be sure I am on the right track before generating an enormous file.

jmcmurry · 2018-04-27T00:57:52Z

Great question. The relationships themselves (annotation x or y on a gene) are not needed at this point, as that would admittedly be both onerous and noisy. Not only that, but it would be the worst way to retrieve this info :)

High priority:

- nodes directly relevant to search (esp. genes, diseases, phenotypes, anatomy, species, functions)
- nodes relevant to integration (esp. literature**, genotypes, alleles, variants)

Extremely low priority / ignore:

Anything that identifies not the node itself but the connection between them:
- eg. ignore this https://www.yeastgenome.org/locus/S000003131/literature

** For literature, a pair of IDs per article is fine:
- The native node ID, e.g. https://www.yeastgenome.org/reference/S000207820
- An xref'd equivalent, e.g. http://dx.doi.org/10.1126/sciadv.aaq0236 [one of PMID, DOI, PMCID]

If the dump can couple the native ID and the xref equivalent, great, but this is not absolutely essential. Capturing the relationships is not in scope for this activity.

Not every single data steward is going to have a perfectly complete set of literature mappings to PMID and DOI and PMCID, so these may need to be retrieved on the fly as required by use cases.
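
As a rough sketch of what "a pair of IDs per article" could look like in a dump, the snippet below writes one native ID and one optional xref per row. The column names `native_id` and `xref` are assumptions, not the spreadsheet's actual headers, and the two URLs are simply the examples quoted above; they are not verified to refer to the same article.

```python
import csv
import io
from typing import NamedTuple, Optional


class LiteraturePair(NamedTuple):
    """One article: the MOD's native ID plus an optional PMID/DOI/PMCID xref."""
    native_id: str
    xref: Optional[str] = None  # may be missing; see the comment above


def to_tsv(pairs):
    """Serialize pairs as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["native_id", "xref"])  # hypothetical column names
    for pair in pairs:
        writer.writerow([pair.native_id, pair.xref or ""])
    return buffer.getvalue()


pairs = [
    # Example URLs from the thread, paired here for illustration only.
    LiteraturePair("https://www.yeastgenome.org/reference/S000207820",
                   "http://dx.doi.org/10.1126/sciadv.aaq0236"),
    # An article with no known xref still gets a row, with an empty xref cell.
    LiteraturePair("https://www.yeastgenome.org/reference/S000207820"),
]
tsv_text = to_tsv(pairs)
```

Leaving the xref cell empty, rather than dropping the row, keeps the native ID in the dump even when the mapping has to be retrieved later.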

gabinkley · 2018-04-27T13:11:34Z

Thanks for the feedback and clarification. @jmcmurry

jdepons · 2018-04-27T19:21:04Z

I took a shot at documenting all the different identifiers returned in calls to our API or files on the FTP site. It's currently in a Google doc but is there another location I should submit it?

https://docs.google.com/spreadsheets/d/11VIdKEG2JPDNHmdoeK2AZ8KM2Kg5QuKlvzJ84HvhoEE/edit#gid=0

ctb · 2018-04-28T14:18:50Z

(a reference here is already better than what we usually do, so +1 for starting with this :)

sierra-moxon · 2018-04-30T15:41:14Z

@ctb @jmcmurry - does what @jdepons and @gabinkley provided fulfill this request?

gabinkley · 2018-05-01T14:17:28Z

I've updated my list of example identifiers and URLs. I removed the links to annotations and added links to external resources that are equivalent to SGD's identifiers, which @jmcmurry indicated were desired. Please see the updated spreadsheet below:

https://docs.google.com/spreadsheets/d/1FtrS-ATOZdvcE3Bjhv8KYakHZexoL9JzalHIe0TQElE/edit?usp=sharing

A final question that has been asked before but hasn't been answered directly: is the request for a file of all identifiers for every example in the spreadsheet, or is just the list of examples sufficient right now?

sierra-moxon · 2018-05-01T19:32:11Z

Just an update on our meeting today with this as a topic: we agreed to make spreadsheets (and post them here) for each of our MODs with representative IDs that we mint at the MOD and that are publicly available. The content of the spreadsheet (past this general idea) is up to the MOD. Some will have cross-references, some will not. If you need something further, @jmcmurry, please let us know.

jmcmurry · 2018-05-02T07:27:30Z

Thanks Sierra, that sounds like a great start. Cross-references are encouraged but optional provided the xrefs can be derived from the raw data in other ways; if that isn't the case, please just let me know and we can revisit later.

jmcmurry mentioned this issue Mar 9, 2018

Identifier Dumps #4

Open

3 tasks

cmungall mentioned this issue Mar 12, 2018

Register data steward APIs with smartAPI #5

Closed

4 tasks

owhite added the help wanted Extra attention is needed label Mar 12, 2018

rpwagner mentioned this issue Apr 19, 2018

Team Oxygen - ETL of GTEx/TOPMed into DATS. #12

Open

owhite added the AGR label Apr 24, 2018
