MODs Identifier dump #2

Open
jmcmurry opened this issue Mar 9, 2018 · 28 comments
Labels
AGR help wanted Extra attention is needed

Comments

@jmcmurry
Contributor

jmcmurry commented Mar 9, 2018

We would like a dump of all of the identifiers from each MOD in this format. Note that this includes any internal IDs, which may or may not be resolvable. This will require going beyond the information provided so far by the Alliance (the identifier documentation as well as the manifest recently uploaded to the Amazon cloud).

@jmcmurry jmcmurry mentioned this issue Mar 9, 2018
3 tasks
@owhite owhite added the help wanted Extra attention is needed label Mar 12, 2018
@sierra-moxon

Hi Julie - It would be super helpful to have (even informal) use cases for these kinds of requests? It would help (at least ZFIN) fulfill the request and possibly initiate a discussion about #6 in this repo.
thanks a bunch! Sierra

@owhite
Contributor

owhite commented Apr 13, 2018

I'm gonna reboot this conversation if possible.

Sierra - I think the purpose is for us to simply explore the range of IDs that are hosted at the MODs, for the eventual purpose of elements such as external cross references to other resources and pointers from annotation elements to things like GO IDs or EC numbers, and to determine if the namespace/usage of those items is consistent across the MODs. The other purpose - I believe - is to look at the internal content (identifiers linked to organisms, strains, variants, genes) for a similar purpose: we are wondering how the practices for IDs are handled across the groups.

Does that help?

@jmcherry-zz

Doesn’t make sense to provide internal IDs. They are internal for a reason, we don’t share them because we don’t want users to use them.

There was no response on this because the question didn’t make sense to us.

@jmcmurry
Contributor Author

It sounds to me like the concepts of public and resolvable might be being conflated? Just to break it down...

This task can be scoped to just the IDs that appear in the data dumps. Any ID that is so deeply internal that it doesn't make it even as far as the dumps can be safely ignored for the purpose of this task. However, any ID that DOES make it to the dumps should be described in such a way that the ID is not abused by others using the dumps (e.g. mistaken for durable when it is not, mistaken for resolvable when it is not, mistaken for the same ID when it is different, or mistaken for different when it is the same).

| | Publicly resolvable | Not publicly resolvable |
| --- | --- | --- |
| Appears in data dumps | high value | potentially valuable if durable |
| Doesn't appear in data dumps | not a thing | no one cares |
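
As a rough illustration of the grid above, here is a minimal Python sketch that sorts an identifier into one of those cells. The `IdentifierRecord` type, its field names, and the label returned for non-durable IDs are assumptions made for illustration; they are not part of the requested spreadsheet format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierRecord:
    """Hypothetical description of one identifier type (field names are assumptions)."""
    identifier: str
    in_data_dumps: bool
    publicly_resolvable: bool
    durable: bool = False


def priority(record: IdentifierRecord) -> str:
    """Map a record onto the 2x2 grid from the comment above."""
    if not record.in_data_dumps:
        # Bottom row of the grid: "not a thing" / "no one cares".
        return "out of scope"
    if record.publicly_resolvable:
        return "high value"
    # Top-right cell: only interesting if the ID is durable.
    # The label for the non-durable case is an assumption, not from the thread.
    return "potentially valuable" if record.durable else "document as non-durable"


if __name__ == "__main__":
    # "EXAMPLE:..." identifiers are placeholders, not real MOD IDs.
    for rec in (
        IdentifierRecord("EXAMPLE:0001", in_data_dumps=True, publicly_resolvable=True),
        IdentifierRecord("EXAMPLE:0002", in_data_dumps=True, publicly_resolvable=False, durable=True),
        IdentifierRecord("EXAMPLE:0003", in_data_dumps=False, publicly_resolvable=False),
    ):
        print(rec.identifier, "->", priority(rec))
```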

Does this make sense?

@jmcherry-zz

Julie,

Thanks, that clears up the question. It will now be straightforward to provide you our IDs that are on webpages or in dump files. MOD policies say these IDs should all be resolvable. I'll pass on this refined request.

So the URL for the sheet is easier to find, copying it here again:

https://docs.google.com/spreadsheets/d/1orgx-657PUQE0qBpFRPEbsKDDynaLfke-UxcEVR_pxA/edit#gid=0

@jmcmurry
Contributor Author

For some background on the thorny edge case of public but stable, public but not resolvable, etc. and why we (Monarch) care about them, feel free to look here. The issue is very old and unresolved. Some of the comments are now obsolete / overtaken by events, but the principles are still the same.

@jmcmurry
Contributor Author

A while ago, I also wrote up a summary here of what the identifier surrogacy options are for integrators. Happy to have feedback on it.

@khowe

khowe commented Apr 19, 2018

@jmcmurry Just to clarify: you would like each MOD to produce a file containing all of their (dump-containing) IDs, in the format described by the spreadsheet. Correct?

Would you like us to host these files so that they can be picked up? Or should we deposit them somewhere central?

@jmcmurry
Contributor Author

Yes; please deposit in the cloud somewhere and send the link; thanks :)

@JoelRichardson

Just to get started, I created a google doc that is a copy of Julie's.
https://docs.google.com/spreadsheets/d/1A5-_doKRTdepELYTZoCVUTVgyAOK2enRaBhZxoNks1w/edit?usp=sharing

@JoelRichardson

P.S. Yes, it's editable

@carlkesselman

carlkesselman commented Apr 19, 2018 via email

@JoelRichardson

Since I don't know anything about BDBags, I can't comment or assist.
In any case...

I'm still confused as to the scope of this request, even with the qualifiers "publicly resolvable" and "in a dump file". Which dump file? Any dump file? ANYthing that's resolvable at MGI by any public identifier (MGI: or external)? I'm guessing there would be well over 100 lines for MGI. Or am I missing something?

@sierra-moxon

sierra-moxon commented Apr 19, 2018

Hi @jmcmurry - Building on what @JoelRichardson said, do you have an ID-resolving map for all possible ID prefixes anywhere yet? I.e., it would probably be good if we provided cross references to the same resource with the same prefix (e.g. NCBI_Gene vs. Gene). There are many cases covered neither by the file at GO nor by the Alliance one that we've started, if we include ontological cross references (which many of the MODs store and which would fall under this generic request without clarification?). thx again, Sierra
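
To make the prefix question concrete, here is a minimal Python sketch of a prefix synonym map that rewrites cross references so the same resource always carries the same prefix. The map contents, the choice of `NCBI_Gene` as the canonical form, and the local ID in the usage example are assumptions for illustration; they are not taken from the GO or Alliance prefix files.

```python
# Hypothetical synonym map: every known spelling of a prefix points to one
# canonical spelling. Which spelling is canonical is an assumption here.
PREFIX_SYNONYMS = {
    "NCBI_Gene": "NCBI_Gene",
    "Gene": "NCBI_Gene",
}


def normalize_curie(curie: str) -> str:
    """Rewrite 'prefix:local_id' so that synonymous prefixes collapse to one."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefix:local_id pair: {curie!r}")
    if prefix not in PREFIX_SYNONYMS:
        # Unknown prefixes are surfaced rather than silently passed through,
        # so gaps in the map are noticed.
        raise KeyError(f"unregistered prefix: {prefix!r}")
    return f"{PREFIX_SYNONYMS[prefix]}:{local_id}"


if __name__ == "__main__":
    # "12345" is a placeholder local ID, not a real gene record.
    print(normalize_curie("Gene:12345"))
    print(normalize_curie("NCBI_Gene:12345"))
```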

@sierra-moxon

@jmcmurry - re: the doc with ZIRC as an example - another option for your document, might be to use ZFIN (in this example, possibly other MODs as well) as the id resolver for these biological materials. As I understand it, since ZL#'s represent biological material that can be discontinued, they aren't good ids to use in perpetuity. Many resource centers are like this as you point out. ZFIN however, stores the representative content of these materials and could act as a "permanent" resource.

@carlkesselman

carlkesselman commented Apr 19, 2018 via email

@sierra-moxon

sierra-moxon commented Apr 24, 2018

Would you like this file regularly, or is this a one time survey?
Do DOIs, ISSNs, and ORCID iDs count? How about ontology xrefs? Ontology IDs? Or just IDs that the MOD mints itself and that are distributed? If a MOD mints an ID to provide an internal reference to an ontology or xref, do you want those? The Biolink column is optional (would a SO term ID work)? Is this all of our IDs, or just a representative sample filled into that spreadsheet? Are there any times already scheduled when we could hash this out on a phone call? thanks a bunch, @jmcmurry @carlkesselman

@khowe

khowe commented Apr 25, 2018

I (at least) would benefit from some additional guidance on the scope. I will use a specific example: in WormBase, gene records have primary IDs of the form WBGene\d+ (e.g. WBGene00006763). These appear in our dumps, and are resolvable (kind of; see below). However, there are a bunch of other identifiers associated with a gene: symbol, systematic name, previous names etc. (e.g. "unc-26", "JC8.10", "CELE_JC8.10"). We conceptually treat these as properties of the gene records rather than identifiers, and in general they are not resolvable (although they appear in our dumps and are searchable, and if the search results in one clear unambiguous hit, a redirect to the entity page results).

Is it correct that for the sake of this exercise, all of these should be considered as "identifiers", and included in the file? In that case, how should non-resolvable ids be represented in the file?

For WormBase, there is a further complication that our primary entity ids are generally resolvable only on a class-by-class basis. This is due to a decision taken early in the life of the project to re-use identifiers between classes (e.g. there are two objects identified as "JC8.10a", one being a Transcript and the other being the CDS of that transcript).

In our interactions with GO, we have addressed this by treating WormBase as a collection of resources, with each class/ data-type having its own prefix (e.g. WB for genes, WB_REF for publications, WBls for life-stage ontology terms). However, prefixes have only been assigned for data types that pop up in our exchanges with GO. Since this exercise requires us to be comprehensive and consider all data types, we will need to generate more prefixes. Would you advise we do this unilaterally? Or is there a third-party central agency that we should work with to do that?
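
A minimal Python sketch of the prefix-per-class idea described above, assuming it were extended to more data types. The `WB`, `WB_REF`, and `WBls` prefixes come from this comment; the `WB_TRANSCRIPT` and `WB_CDS` prefixes, the class names used as keys, and the function names are hypothetical and would need to be agreed with whoever maintains the prefix registry.

```python
import re

# Prefixes for Gene, Publication and LifeStage are the ones mentioned above.
# The Transcript and CDS prefixes are HYPOTHETICAL placeholders.
CLASS_PREFIXES = {
    "Gene": "WB",
    "Publication": "WB_REF",
    "LifeStage": "WBls",
    "Transcript": "WB_TRANSCRIPT",  # hypothetical
    "CDS": "WB_CDS",                # hypothetical
}

# Primary gene IDs have the form WBGene\d+ (e.g. WBGene00006763).
WBGENE_PATTERN = re.compile(r"WBGene\d+")


def is_primary_gene_id(value: str) -> bool:
    """True for primary gene IDs; False for symbols and names such as 'unc-26'."""
    return WBGENE_PATTERN.fullmatch(value) is not None


def qualify(data_class: str, local_id: str) -> str:
    """Attach the class-specific prefix so IDs reused across classes stay distinct."""
    return f"{CLASS_PREFIXES[data_class]}:{local_id}"


if __name__ == "__main__":
    print(is_primary_gene_id("WBGene00006763"), is_primary_gene_id("unc-26"))
    # The same local ID names two different objects; the prefix disambiguates them.
    print(qualify("Transcript", "JC8.10a"), qualify("CDS", "JC8.10a"))
```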

@cmungall

cmungall commented Apr 25, 2018

I would consider symbols and names as disjoint from identifiers, but YMMV.

I suggest that, as the Alliance is already using the GO prefix registry, this be extended for other types too. I can work with KC2 to ensure this propagates to identifiers.org / n2t.net (we're already doing this, e.g. ensuring the TAIR records are in sync).

@gabinkley

Are the examples of SGD identifiers in this spreadsheet what is requested?

[https://drive.google.com/open?id=1o54ZlW0fkIqOP8gnLtbXsEbapkAsiZ_o]

Just want to be sure I am on the right track before generating an enormous file.

@jmcmurry
Contributor Author

Great question. The relationships themselves (annotation x or y on a gene) are not needed at this point, as that would admittedly be both onerous and noisy. Not only that, but it would be the worst way to retrieve this info :)

High priority:

  • nodes directly relevant to search (esp genes, diseases, phenotypes, anatomy, species, functions)
  • nodes relevant to integration (esp literature**, genotypes, alleles, variants)

Extremely low priority / ignore:


** For literature, a pair of IDs per article is fine.
The native node ID eg. https://www.yeastgenome.org/reference/S000207820
And an xref'd equivalent eg. http://dx.doi.org/10.1126/sciadv.aaq0236 [one of PMID, DOI, PMCID]
If the dump can couple the native ID and the xref equivalent great, but not absolutely essential. Capturing the relationships is not in scope for this activity.

Not every single data steward is going to have a perfectly complete set of literature mappings to PMID and DOI and PMCID, so these may need to be retrieved on the fly as required by use cases.
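
The native-ID-plus-xref pair described above could be recorded along these lines. This is only a sketch: the `LiteratureIds` type, the `xref_kind` helper, and the `PMID:` / `PMCID:` / `DOI:` prefix spellings it recognises are assumptions for illustration, not a format anyone in this thread has agreed on. The two example URLs are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LiteratureIds:
    """One article: the MOD's native node ID plus an optional external equivalent."""
    native_id: str
    xref: Optional[str] = None  # may be missing; mappings are not always complete


def xref_kind(xref: str) -> str:
    """Guess which external scheme an xref uses (the prefix spellings are assumptions)."""
    if xref.startswith(("http://dx.doi.org/", "https://doi.org/", "DOI:")):
        return "DOI"
    if xref.startswith("PMCID:"):
        return "PMCID"
    if xref.startswith("PMID:"):
        return "PMID"
    return "unknown"


if __name__ == "__main__":
    article = LiteratureIds(
        native_id="https://www.yeastgenome.org/reference/S000207820",
        xref="http://dx.doi.org/10.1126/sciadv.aaq0236",
    )
    print(article.native_id, xref_kind(article.xref))
```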

@gabinkley

gabinkley commented Apr 27, 2018

Thanks for the feedback and clarification. @jmcmurry

@jdepons

jdepons commented Apr 27, 2018

I took a shot at documenting all the different identifiers returned in calls to our API or files on the FTP site. It's currently in a Google doc but is there another location I should submit it?

https://docs.google.com/spreadsheets/d/11VIdKEG2JPDNHmdoeK2AZ8KM2Kg5QuKlvzJ84HvhoEE/edit#gid=0

@ctb
Contributor

ctb commented Apr 28, 2018 via email

@sierra-moxon

sierra-moxon commented Apr 30, 2018

@ctb @jmcmurry - does what @jdepons and @gabinkley provided fulfill this request?

@gabinkley

I've updated my list of example identifiers and URLs. I removed the links to annotations and added links to external resources that are equivalent to SGD's identifiers that @jmcmurry indicated was desired. Please see updated spreadsheet below:

https://docs.google.com/spreadsheets/d/1FtrS-ATOZdvcE3Bjhv8KYakHZexoL9JzalHIe0TQElE/edit?usp=sharing

A final question that has been asked before but hasn't been answered directly: is the request for a file of all identifiers for each example in the spreadsheet, or is just the list of examples sufficient right now?

@sierra-moxon

Just an update on our meeting today with this as a topic: we agreed to make spreadsheets (and post them here) for each of our MODs with representative ids that we mint at the MOD and are publicly available. The content of the spreadsheet (past this general idea), is up to the MOD. Some will have cross references, some will not. If you need something further, @jmcmurry, please let us know.

@jmcmurry
Contributor Author

jmcmurry commented May 2, 2018

Thanks Sierra, that sounds like a great start. Cross-references are encouraged but optional provided the xrefs can be derived from the raw data in other ways; if that isn't the case, please just let me know and we can revisit later.
