Centrality of DOI/Crossref #33
Comments
Work in progress here: https://github.com/bnewbold/biblio-glutton/tree/fatcat
I just realized I might need to munge the fatcat JSON content into something more like Crossref for this to work with GROBID itself (that is, to have GROBID insert metadata into TEI-XML based on successful hits).
Currently the DOI is the main key for all databases:
In addition we have the identifier mapping:
So the process is as follows: 1. get the DOI:
To free us from the DOI constraint in a simple manner, I was considering using an internal invisible identifier (GluttonID) instead of the DOI, and treating the DOI as just another identifier. That would change the maps as follows:
In addition we have the identifier mapping:
One obvious issue is that there might be several full metadata record formats ("core metadata") in the future, in addition to CrossRef. That would mean converting everything to the same format (crossref UNIXREF?), which raises the problem of information loss if the format is not comprehensive enough. In the past, I resolved this by using TEI, which is very comprehensive (see https://github.com/kermitt2/Pub2TEI, which converts plenty of metadata formats into TEI). But here that is probably going a bit beyond the scope of the tool. So for adding fatcat metadata, one option would be to add routes to create two maps:
Then add an extra indexing step in elasticsearch and decide how to aggregate fatcat metadata.
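The GluttonID idea above (every external identifier, DOI included, resolves to one internal key, which in turn resolves to the full record) can be sketched with plain dictionaries. This is only an illustration: `IdentifierIndex`, its method names, and the lowercasing of keys are assumptions made for the example, not part of biblio-glutton, and a real implementation would sit on LMDB rather than in memory.

```python
import itertools
from typing import Dict, Optional, Tuple


class IdentifierIndex:
    """Toy in-memory stand-in for the key-value maps (hypothetical names)."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        # internal id -> core metadata record
        self.records: Dict[int, dict] = {}
        # (scheme, value) -> internal id; DOI is just one scheme among others
        self.ext_to_internal: Dict[Tuple[str, str], int] = {}

    def add(self, metadata: dict, **identifiers: str) -> int:
        """Store a record and register all of its external identifiers."""
        internal_id = next(self._next_id)
        self.records[internal_id] = metadata
        for scheme, value in identifiers.items():
            # Lowercasing is a simplification (DOIs are case-insensitive).
            self.ext_to_internal[(scheme, value.lower())] = internal_id
        return internal_id

    def lookup(self, scheme: str, value: str) -> Optional[dict]:
        """Resolve external identifier -> internal id -> record."""
        internal_id = self.ext_to_internal.get((scheme, value.lower()))
        if internal_id is None:
            return None
        return self.records[internal_id]


index = IdentifierIndex()
index.add({"title": "Example preprint"}, arxiv="1234.56789")
index.add({"title": "Example article"}, doi="10.1234/ABC", pmid="12345")
print(index.lookup("doi", "10.1234/abc"))
print(index.lookup("arxiv", "1234.56789"))
```

With this layout a record without a DOI is reachable through any other identifier, and adding a new identifier scheme does not change the record store.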
On arXiv Vanity, it looks like we have great matching of arXiv papers, but we’re cheating a bit.
So, nothing clever. I don’t think the latter technique is useful for you because it relies on it resolving to a DOI for Unpaywall (I think).
Thank you for your responses! It strikes me that starting with a sqlite table that had:
Could serve most needs. Instead of doing multiple lookups, one would query by identifier and get the whole row back. The same table could be used to populate elasticsearch. To look up a record based on an elasticsearch hit, one would use the first non-null identifier, instead of needing a new GluttonID. I generated something similar (now out of date) at: https://github.com/internetarchive/fatcat/tree/master/extra/extid_map
The problems I can see with this would be:
I think adding arxiv identifiers to the elastic index (and particularly the "biblio" string) should work as well or almost as well as @bfirsh's regex. The identifiers I can think of that end up in citation strings are DOI, PMID, PMCID, and arxiv identifier. Wikipedia often has "bibcode", but I don't see those anywhere else. It's also somewhat common to use "doi:10.1234/asdf" syntax (and "arxiv:" et al), or full URLs; a pre-processing step (in the metadata match path in biblio-glutton? or in GROBID?) to match and transform such strings could help with recall for those cases. Maybe GROBID already does this, I haven't checked. A gotcha with arxiv identifiers is that they are sometimes versioned and sometimes not. If used in elasticsearch, both forms probably need to be added to the "biblio" string, and the verification code needs to be aware of this. I recently discovered that PMC (PubMed Central) identifiers can also be versioned with a ".1", ".2" suffix.
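A pre-processing step of the kind described above could look roughly like the following sketch. The regular expressions and the `extract_identifiers` helper are assumptions made for illustration (they are not taken from GROBID or biblio-glutton) and cover only the common "doi:", "arxiv:", URL, PMID and PMCID spellings. Versioned arxiv and PMC identifiers are returned both as written and in unversioned form, so that both can be indexed or matched.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns; real-world citation strings are messier than this.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)
ARXIV_RE = re.compile(
    r"arxiv(?::|\.org/abs/)\s*"
    r"(\d{4}\.\d{4,5}|[a-z\-]+(?:\.[a-z]{2})?/\d{7})(v\d+)?",
    re.IGNORECASE,
)
PMCID_RE = re.compile(r"\b(PMC\d+)(\.\d+)?\b", re.IGNORECASE)
PMID_RE = re.compile(r"\bPMID:?\s*(\d+)\b", re.IGNORECASE)


def extract_identifiers(citation: str) -> Dict[str, Optional[str]]:
    """Pull identifiers out of a raw citation string, if present."""
    found: Dict[str, Optional[str]] = {
        "doi": None, "arxiv": None, "arxiv_base": None,
        "pmid": None, "pmcid": None, "pmcid_base": None,
    }
    m = DOI_RE.search(citation)
    if m:
        # Drop trailing sentence punctuation and normalise case.
        found["doi"] = m.group(1).rstrip(".,;)").lower()
    m = ARXIV_RE.search(citation)
    if m:
        base, version = m.group(1), (m.group(2) or "").lower()
        found["arxiv_base"] = base            # unversioned, for generic matching
        found["arxiv"] = base + version       # as cited, version preserved
    m = PMCID_RE.search(citation)
    if m:
        base, version = m.group(1).upper(), m.group(2) or ""
        found["pmcid_base"] = base
        found["pmcid"] = base + version
    m = PMID_RE.search(citation)
    if m:
        found["pmid"] = m.group(1)
    return found


r1 = extract_identifiers("Attention is all you need. arXiv:1706.03762v5, 2017.")
r2 = extract_identifiers("See https://doi.org/10.1234/ASDF.")
r3 = extract_identifiers("PMID: 12345678; PMCID: PMC1234567.2")
print(r1["arxiv"], r1["arxiv_base"])
print(r2["doi"])
print(r3["pmid"], r3["pmcid"], r3["pmcid_base"])
```

Keeping both the versioned and unversioned forms lets the verification step accept a hit on either, while still reporting the exact version that appeared in the reference.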
Thanks a lot for the feedback @bnewbold ! I think the advantage of LMDB is its speed: it's super efficient, much faster than sqlite (10-50 times faster for random reads according to benchmarks), in particular for large values (more than 200 times faster for random reads of large values, like the full crossref metadata). Apart from that, whether we use one or the other, a one-table design or plain key-value, probably does not matter much, though I may be wrong. lmdb is very stable and scales super well, but of course sqlite does too. entity-fishing manages more than 1 billion values (whole wikidata, 5 parsed wikipedias, 15M embeddings) and accesses them really fast (I think around 600k accesses per second) in multithreading even with 4GB RAM, so we were confident with that solution. I guess we could get the arXiv metadata via their OAI-PMH service, load that stuff and try to match some DOIs at the same time. We could import Wikidata identifiers for works (via entity-fishing it's very easy); it could be a useful source of metadata (for books for instance!). Yes, I was surprised to see "bibcode" in Wikipedia too! I think they are only used in astronomy, via the NASA ADS (I saw them a lot at CDS). It is strange to make this particular identifier visible at Wikipedia, since it usually relates only to bibliographical entries, not full texts. NASA ADS also has an OAI-PMH service and we could get more astronomy-specific metadata (though I don't think I've ever seen bibcodes in bibliographical references). But there's one question behind all of this :)
Interesting that LMDB is so much faster than sqlite. Having biblio-glutton import from fatcat would certainly be convenient for me! Fatcat currently supports DOI, PMID, PMCID, arxiv (full, versioned), Wikidata QID, JSTOR id, CORE id (unused), ARK (rare), and MAG id (Microsoft Academic, unused). Almost all arxiv, crossref, and pubmed identifiers are populated. I've also loaded the JALC (Japanese DOI registrar) corpus and intend to load Datacite DOIs. Would you mean including, e.g., the ISTEX identifier in fatcat as well? I'd be open to that, though as it would update 20+ million entities I'd want to go slow and test. The fatcat.wiki instance does have an API and can receive continuous updates by bots. It is also possible to run on your own, though the main SQL database is up to about 400 GBytes, so it needs a large disk and a decent amount of RAM. Or of course ISTEX et al can be "enriched" as it is now, from an LMDB table. I will continue working on getting fatcat working with glutton as a metadata source and for matching, and will probably patch GROBID to allow including fatcat identifiers. I want to experiment with that a bit, then think about how more integration could happen.
As an update on this thread, I seem to have things working pretty well now with fatcat. So far I have only loaded 1/3 of the corpus into LMDB, but once that is complete I'll probably make this available as an experimental public API. I finally noticed that the consolidate params have a
My biblio-glutton changes were hack-y and break regular crossref behavior, which makes it more of a fork. That's not great, and I'd like to have something I could merge back upstream, but I probably won't until we can think of the best way to do so. To continue supporting DOI look-ups I have two tables:
This could easily be extended using the same corpus/importer to
I have some confusion about whether a "MatchingDocument" represents the schema returned from elasticsearch (just minimal metadata) or the complete Crossref Work schema stored in one of the LMDB tables. In particular, what the
In my setup glutton returns the fatcat release schema over the wire to GROBID, so GROBID needed to be patched to support this schema. These GROBID changes are cleaner and could potentially be merged. I added
From informal playing around so far, and in line with @bfirsh's interests, I think an easy improvement for glutton/GROBID would be end-to-end support for arxiv identifiers. I think GROBID already parses these out of references and could pass them along like DOIs in glutton API requests; glutton would need an arxiv lookup table and complete arxiv metadata in the right schema (e.g., transformed from arxiv OAI-PMH). One "gotcha" with arxiv is handling the versioned identifiers. For GROBID/glutton it is probably best to use the generic (unversioned) identifier when matching without an identifier, but preserve version precision if a version is used in references. I might implement this in my fatcat branch (I already have arxiv metadata in fatcat).
Other improvements for the fatcat corpus would be books (for the humanities), datasets from datacite, and better conference proceedings coverage (for STEM; e.g., importing dblp).
In addition to the "linking identifier" question earlier in this thread, it might be worth changing the schema communicated between GROBID and biblio-glutton away from Crossref Work. I would propose Citation Style Language JSON as the best fit: both the crossref and fatcat schemas are pretty close already, and the schema is "useful" as-is because it can be used to render citation lists using existing tooling. It also feels like the best fit because it is explicitly designed to model references. Converting from Crossref to CSL in bulk or from the API shouldn't be too hard.
There might even be a way to get crossref.org to return lookup results in CSL schema (you can do individual document fetches using content negotiation). This would make tasks like "add arxiv metadata" or "add dblp metadata" a matter of implementing one-way conversions to that format, which might be useful for others as well.
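To give a feel for how close the two schemas are, here is a minimal sketch of a one-way Crossref Work to CSL JSON conversion. The function name `crossref_to_csl`, the partial type table, and the fallback CSL type are assumptions made for this example; only a handful of fields are handled, and the sample record is invented. For single documents, DOI content negotiation with the `application/vnd.citationstyles.csl+json` media type should return CSL JSON directly, so a converter like this would mainly matter for bulk dumps.

```python
from typing import Any, Dict, List

# Partial, illustrative mapping from Crossref work types to CSL item types.
CROSSREF_TO_CSL_TYPE = {
    "journal-article": "article-journal",
    "proceedings-article": "paper-conference",
    "book-chapter": "chapter",
    "book": "book",
    "dataset": "dataset",
}


def _first(value: Any) -> Any:
    """Crossref stores titles as lists; CSL expects a single string."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def _convert_authors(authors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    converted = []
    for author in authors:
        if "family" in author:
            name = {"family": author["family"]}
            if author.get("given"):
                name["given"] = author["given"]
        else:
            # Organisational authors carry only "name" in Crossref.
            name = {"literal": author.get("name", "")}
        converted.append(name)
    return converted


def crossref_to_csl(work: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a Crossref Work dict into a (partial) CSL JSON item."""
    csl = {
        "id": work.get("DOI"),
        "type": CROSSREF_TO_CSL_TYPE.get(work.get("type"), "article"),
        "DOI": work.get("DOI"),
        "title": _first(work.get("title")),
        "container-title": _first(work.get("container-title")),
        "volume": work.get("volume"),
        "issue": work.get("issue"),
        "page": work.get("page"),
        "author": _convert_authors(work.get("author", [])),
        "issued": work.get("issued"),  # both use {"date-parts": [[y, m, d]]}
    }
    # Drop empty fields so the output stays compact.
    return {k: v for k, v in csl.items() if v not in (None, [], "")}


sample_work = {
    "DOI": "10.1234/asdf",
    "type": "journal-article",
    "title": ["An Example Title"],
    "container-title": ["Journal of Examples"],
    "author": [{"given": "Ada", "family": "Lovelace"}, {"name": "Example Consortium"}],
    "issued": {"date-parts": [[2019, 5]]},
    "volume": "12",
}
csl_item = crossref_to_csl(sample_work)
print(csl_item)
```

Converters for arxiv OAI-PMH or dblp records would then target the same output shape, which is what makes a common exchange schema attractive.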
Here's an experimental glutton (and GROBID) API endpoint: http://glutton.qa.fatcat.wiki/
Another potentially larger source of metadata to match against would be Semantic Scholar, currently up to some 175 million works. I believe these are almost all Microsoft Academic Graph entities, so matching against MAG directly might make more sense. Both of these corpuses have their own pseudo-persistent identifiers.
In my spare time I am starting to add support for fatcat (https://fatcat.wiki) metadata to biblio-glutton. To start, this would probably be a branch; then we can see if it makes sense to upstream.
As I'm poking through the code I have some conceptual questions about how Crossref metadata and DOIs are currently used. My goal is to be able to match against fatcat releases which do not have DOIs (but always have a fatcat internal identifier, and may have other external identifiers like arxiv ID or PMID).
I'd be happy to submit a README update clarifying some of these once I understand it myself. Or maybe a new file as README is getting long!