Centrality of DOI/Crossref #33

Open
bnewbold opened this issue Jun 1, 2019 · 11 comments
@bnewbold
Contributor

bnewbold commented Jun 1, 2019

In my spare time I am starting to add support for fatcat (https://fatcat.wiki) metadata to biblio-glutton. To start, this would probably be a branch; then we can see if it makes sense to upstream.

As I'm poking through the code, I have some conceptual questions about how Crossref metadata and DOIs are currently used. My goal is to be able to match against fatcat releases which do not have DOIs (but always have a fatcat internal identifier, and may have other external identifiers like arXiv ID or PMID).

  • Is it currently possible to do bibliographic lookups for works that don't have a DOI? E.g., could PubMed metadata (title, authors, etc.) for works that have a PMID but no DOI be included in the Elasticsearch index? I think not, but want to confirm.
  • Are arXiv IDs supported anywhere? cc: @bfirsh, who has great matching and who I assumed had lookups by arXiv ID.
  • For the code path where a search match against Elasticsearch has been made, are all the "enrichments" of other metadata done on a DOI basis? Or are ISTEX/PII/PMID lookups chained together to find additional identifiers and do key/value lookups with those non-DOI identifiers? I think only DOIs are used, but want to confirm.
  • Are lookups by PMID/ISTEX/PII only performed for API lookups that supply that identifier directly? I think this is the case, but want to confirm.

I'd be happy to submit a README update clarifying some of these once I understand it myself. Or maybe a new file as README is getting long!

@bnewbold
Contributor Author

bnewbold commented Jun 1, 2019

Work in progress here: https://github.com/bnewbold/biblio-glutton/tree/fatcat

I just realized I might need to munge the fatcat JSON content into something more like crossref for this to work with GROBID itself (aka, have GROBID insert metadata into TEI-XML based on successful hits).

@kermitt2
Owner

kermitt2 commented Jun 1, 2019

  • The DOI is indeed currently the pivot between all the databases. It is currently not possible to do bibliographic lookups for works that don't have a DOI. It's not too complicated to change this; the required changes are outlined below.

  • arxiv ids are not supported yet, we will need an additional input route.

  • Yes, for the moment only records having a DOI are usable for lookup/matching. We first wanted to build a replacement for the CrossRef REST API, so we introduced this restriction.

  • PMID/PMC/ISTEX/PII are always retrieved when returning a bibliographical record.

@kermitt2
Owner

kermitt2 commented Jun 1, 2019

Currently the DOI is the main key for all databases:

  • DOI -> core metadata (crossref metadata in practice)
  • DOI -> {PMID, PMC ID} (in the future also MESH classes, already partially implemented)
  • DOI -> {ISTEX ID, ark, PII} (for the moment only records with an ISTEX ID have a PII, so many PIIs are missing, to be improved)
  • DOI -> unpaywall record

In addition we have the identifier mapping:

  • PMID -> DOI
  • PMC ID -> DOI
  • ISTEX ID -> DOI
  • PII -> DOI

So the process is as follows:

1. get the DOI:
1.1. we have a DOI as parameter, nothing to do
1.2. we have an identifier (PMID, PMC ID, ISTEX ID, PII), we use the above maps to get the DOI
1.3. we have a mixture of metadata: we optionally parse a raw biblio string, search via Elasticsearch, select the best match, post-validate it and get a DOI
2. get the core metadata with the DOI
3. get the other identifiers and extra-metadata with the DOI via above mapping
4. aggregate some metadata provided with the other identifiers to have the full final metadata record
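
A minimal sketch of the four steps above, using plain dictionaries as stand-ins for the key/value databases. All table names, function names and sample records here are made up for illustration, and the exact-title search only stands in for the real Elasticsearch search and post-validation; this is not biblio-glutton's actual code.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the databases (sample data is invented).
CORE_BY_DOI = {"10.1234/example": {"DOI": "10.1234/example", "title": "An example work"}}
EXTRA_BY_DOI = [
    {"10.1234/example": {"pmid": "12345", "pmcid": "PMC67890"}},
    {"10.1234/example": {"istexId": "ABCDEF0123", "pii": "S000000000000000"}},
    {"10.1234/example": {"oa_url": "https://example.org/paper.pdf"}},
]
DOI_BY_IDENTIFIER = {
    "pmid": {"12345": "10.1234/example"},
    "pmcid": {"PMC67890": "10.1234/example"},
    "istexId": {"ABCDEF0123": "10.1234/example"},
    "pii": {"S000000000000000": "10.1234/example"},
}


def search_doi(title: str) -> Optional[str]:
    """Stand-in for step 1.3 (search, best match, post-validation)."""
    for doi, record in CORE_BY_DOI.items():
        if record["title"].lower() == title.strip().lower():
            return doi
    return None


def lookup(doi: Optional[str] = None, title: Optional[str] = None,
           **identifiers: str) -> Optional[Dict]:
    # Step 1: get the DOI (1.1 given directly, 1.2 via identifier maps, 1.3 via search).
    if doi is None:
        for kind, value in identifiers.items():
            doi = DOI_BY_IDENTIFIER.get(kind, {}).get(value)
            if doi:
                break
    if doi is None and title:
        doi = search_doi(title)
    if doi is None or doi not in CORE_BY_DOI:
        return None
    # Step 2: core metadata keyed by DOI.
    record = dict(CORE_BY_DOI[doi])
    # Steps 3 and 4: fetch the other identifiers and extra metadata, then aggregate.
    for table in EXTRA_BY_DOI:
        record.update(table.get(doi, {}))
    return record
```

Note that every path funnels through the DOI, which is why a work without a DOI cannot be returned in this design.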

@kermitt2
Owner

kermitt2 commented Jun 1, 2019

To free ourselves from the DOI constraint in a simple way, I was considering using an internal, invisible identifier (GluttonID) instead of the DOI, and treating the DOI as just another identifier.

So that would change the maps as follows:

  • GluttonID -> {DOI metadata record}
  • GluttonID -> {PMID, PMC ID} (in the future also MESH classes, already partially implemented)
  • GluttonID -> {ISTEX ID, ark, PII} (for the moment only records with an ISTEX ID have a PII, so many PIIs are missing, to be improved)
  • GluttonID -> unpaywall record

In addition we have the identifier mapping:

  • DOI -> GluttonID
  • PMID -> GluttonID
  • PMC ID -> GluttonID
  • ISTEX ID -> GluttonID
  • PII -> GluttonID

One obvious issue is that, in the future, there might be several full metadata record formats ("core metadata") in addition to CrossRef. That would mean converting everything into the same format (Crossref UNIXREF?), which raises the problem of information loss if that format is not comprehensive enough.

In the past, I resolved that by using TEI, which is very comprehensive (see https://github.com/kermitt2/Pub2TEI, which converts plenty of metadata formats into TEI). But here it's probably going a bit beyond the scope of the tool.

So for adding fatcat metadata, one option would be to add routes to create two maps:

  • GluttonID -> fatcat record
  • fatcat identifier -> GluttonID

We would also need to add an extra indexing step in Elasticsearch and decide how to aggregate fatcat metadata.
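
A rough sketch of what the GluttonID indirection could look like, assuming an in-memory store and integer internal identifiers. The class name `GluttonStore`, its methods, and the `fatcat` identifier key are all hypothetical; nothing like this exists in biblio-glutton yet.

```python
import itertools
from typing import Any, Dict, Optional


class GluttonStore:
    """Internal id -> records per source, plus identifier -> internal id maps."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self.records: Dict[int, Dict[str, Any]] = {}   # GluttonID -> {source: record}
        self.id_maps: Dict[str, Dict[str, int]] = {}   # identifier kind -> {value: GluttonID}

    def resolve(self, **identifiers: str) -> Optional[int]:
        """Return the GluttonID for the first known identifier, if any."""
        for kind, value in identifiers.items():
            gid = self.id_maps.get(kind, {}).get(value)
            if gid is not None:
                return gid
        return None

    def add(self, source: str, record: Any, **identifiers: str) -> int:
        """Attach a record from one source, reusing a GluttonID when possible."""
        gid = self.resolve(**identifiers)
        if gid is None:
            gid = next(self._next_id)
        self.records.setdefault(gid, {})[source] = record
        for kind, value in identifiers.items():
            self.id_maps.setdefault(kind, {})[value] = gid
        return gid

    def lookup(self, **identifiers: str) -> Optional[Dict[str, Any]]:
        gid = self.resolve(**identifiers)
        return self.records.get(gid) if gid is not None else None
```

For example, a Crossref record added with `doi=` and `pmid=` and a fatcat record added later with `fatcat=` and the same `pmid=` would end up under one GluttonID, while a fatcat release with no DOI simply gets a fresh one. Keeping one record per source sidesteps the format-conversion question raised above, at the cost of deferring the aggregation decision.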

@bfirsh
Contributor

bfirsh commented Jun 1, 2019

On arXiv Vanity, it looks like we have great matching of arXiv papers, but we’re cheating a bit.

  • We have an aggressive regex that runs beforehand (something like \d+\.\d+), and if it matches, we assume it is an arXiv ID and don’t even touch biblio-glutton.
  • Unpaywall very often returns an arXiv link as an OA version. We run a regex against the Unpaywall URL returned by biblio-glutton, and if it’s an arXiv link, extract the arXiv ID.

So, nothing clever. I don’t think the latter technique is useful for you because it relies on it resolving to a DOI for Unpaywall (I think).
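
For illustration, the two tricks above might look roughly like this. This is a sketch, not the actual arXiv Vanity code: the function names are invented, the whole-string match for the loose pattern is an assumption, and the URL pattern assumes links of the form `arxiv.org/abs/<id>` or `arxiv.org/pdf/<id>`.

```python
import re
from typing import Optional

# Deliberately loose, in the spirit of the `\d+\.\d+` pattern mentioned above.
LOOSE_ARXIV_ID = re.compile(r"\d+\.\d+")

# Assumed URL shapes; accepts new-style (1706.03762) and old-style (hep-th/9901001) ids.
ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|pdf)/"
    r"((?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)",
    re.IGNORECASE,
)


def looks_like_arxiv_id(query: str) -> bool:
    """Trick 1: treat any 'digits.digits' query as an arXiv ID and skip matching."""
    return LOOSE_ARXIV_ID.fullmatch(query.strip()) is not None


def arxiv_id_from_url(url: Optional[str]) -> Optional[str]:
    """Trick 2: pull an arXiv ID out of an open-access URL, if it is an arXiv link."""
    if not url:
        return None
    match = ARXIV_URL.search(url)
    return match.group(1) if match else None
```

The loose pattern will also accept strings like `3.14`, which is the "aggressive" part.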

@bnewbold
Contributor Author

bnewbold commented Jun 1, 2019

Thank you for your responses!

It strikes me that starting with a sqlite table that had:

  • all relevant identifiers (DOI, PMID, PMCID, ISTEX, PII, arxiv, etc) as indexed columns
  • bibliographic metadata for elasticsearch (title, first author, page, journal abbreviation, etc)
  • bibliographic metadata needed for GROBID enrichment (mostly the same as above, maybe a couple extra fields)
  • any other "enrichment" metadata, like unpaywall links, maybe as free-form JSON

could serve most needs. Instead of doing multiple lookups, we would do a query by identifier and get the whole row back. The same table could be used to populate Elasticsearch. To look up based on an Elasticsearch hit, we would use the first non-null identifier, instead of needing a new GluttonID.
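
A small sketch of that single-table idea using Python's built-in `sqlite3` module. The table name, column names and helper functions are assumptions chosen for the example, not an existing schema.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple

ID_COLUMNS = ("doi", "pmid", "pmcid", "istex", "pii", "arxiv")
BIBLIO_COLUMNS = ("title", "first_author", "journal_abbrev", "first_page")


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    columns = ", ".join(f"{name} TEXT" for name in ID_COLUMNS + BIBLIO_COLUMNS + ("extra_json",))
    conn.execute(f"CREATE TABLE IF NOT EXISTS works ({columns})")
    # One index per identifier column, so any identifier can be the lookup key.
    for name in ID_COLUMNS:
        conn.execute(f"CREATE INDEX IF NOT EXISTS works_{name} ON works ({name})")
    return conn


def add_work(conn: sqlite3.Connection, extra: Optional[Dict[str, Any]] = None, **fields: str) -> None:
    unknown = set(fields) - set(ID_COLUMNS + BIBLIO_COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    names = list(fields) + ["extra_json"]
    values = list(fields.values()) + [json.dumps(extra or {})]  # free-form enrichment
    placeholders = ", ".join("?" for _ in names)
    conn.execute(f"INSERT INTO works ({', '.join(names)}) VALUES ({placeholders})", values)


def lookup(conn: sqlite3.Connection, kind: str, value: str) -> Optional[Dict[str, Any]]:
    if kind not in ID_COLUMNS:  # column names cannot be bound as SQL parameters
        raise ValueError(f"unsupported identifier: {kind}")
    row = conn.execute(f"SELECT * FROM works WHERE {kind} = ? LIMIT 1", (value,)).fetchone()
    return dict(row) if row else None


def first_identifier(hit: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Pick the first non-null identifier of a search hit to use as lookup key."""
    for kind in ID_COLUMNS:
        if hit.get(kind):
            return kind, hit[kind]
    return None
```

A work with a PMID and an arXiv ID but no DOI is a normal row here; `first_identifier` would simply return its PMID.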

I generated something similar (now out of date) at:

https://github.com/internetarchive/fatcat/tree/master/extra/extid_map

The problems I can see with this would be:

  • maybe sqlite isn't as fast as LMDB or other "just" key/value, though probably worth benchmarking in different configurations (eg, multi-threaded read-only modes). On the other hand, if multiple lookups from multiple key/value databases are needed, that's basically the same thing sqlite is doing internally
  • not as much/rich metadata gets returned from lookups (aka, not the full crossref metadata), though that metadata is already getting truncated, and the code to parse/transform that schema is currently duplicated (in GROBID, biblio-glutton matcher, biblio-glutton lookup). The full crossref metadata could be stored in a column in the table, though my intuition is that this would be bad for performance (hurts kernel page cache).
  • updating any of the source metadata requires re-computing this table; currently only the relevant key/value database is updated. managing a new "glutton identifier" might be just as much work though.

I think adding arxiv identifiers to the elastic index (and particularly the "biblio" string) should work as well or almost as well as @bfirsh's regex. The identifiers I can think of that end up in citation strings are DOI, PMID, PMCID, and Arxiv identifier. Wikipedia often has "bibcode", but I don't see those anywhere else. It's also somewhat common to use "doi:10.1234/asdf" syntax (and "arxiv:" et al), or full URLs; a pre-processing step (in the metadata match path in biblio-glutton? or in GROBID?) to match and transform such strings could help with recall for those cases. Maybe GROBID already does this, I haven't checked.
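
A sketch of such a pre-processing step, assuming it runs on the raw citation string before matching. The patterns are illustrative and far from exhaustive (for instance, they ignore bibcodes and PubMed URLs), and the function name is made up.

```python
import re
from typing import Dict

# Illustrative patterns only; real citation strings are messier than this.
PATTERNS = [
    ("doi", re.compile(r"(?:doi:\s*|https?://(?:dx\.)?doi\.org/)(10\.\d{4,9}/\S+)", re.IGNORECASE)),
    ("arxiv", re.compile(
        r"(?:arxiv:\s*|https?://arxiv\.org/abs/)"
        r"((?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[a-z]{2})?/\d{7})(?:v\d+)?)",
        re.IGNORECASE)),
    ("pmcid", re.compile(r"\b(PMC\d+)\b")),
    ("pmid", re.compile(r"pmid:\s*(\d+)", re.IGNORECASE)),
]


def extract_identifiers(raw_citation: str) -> Dict[str, str]:
    """Find prefixed or URL-style identifiers and return them in bare form."""
    found: Dict[str, str] = {}
    for kind, pattern in PATTERNS:
        match = pattern.search(raw_citation)
        if match:
            # Drop sentence punctuation that often trails an identifier.
            found[kind] = match.group(1).rstrip(".,;)")
    return found
```

The extracted identifiers could then be tried as direct lookups before falling back to the fuzzy metadata search.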

A gotcha with arxiv identifiers is that they are sometimes versioned and sometimes not. If used in elasticsearch both versions probably need to be added to the "biblio" string, and the verification code needs to be aware of this. I recently discovered that PMC (pubmed central) identifiers can also be versioned with a ".1", ".2" suffix.
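
One way to handle the arXiv version gotcha is to split off the `vN` suffix and index both forms. A minimal sketch with hypothetical helper names follows; it does not cover the PMC `.1`/`.2` suffixes.

```python
import re
from typing import List, Optional, Tuple


def split_version(arxiv_id: str) -> Tuple[str, Optional[int]]:
    """Split '1706.03762v5' into ('1706.03762', 5); version is None if absent."""
    match = re.fullmatch(r"(.+?)(?:v(\d+))?", arxiv_id.strip())
    if match is None:
        raise ValueError("empty arXiv identifier")
    base, version = match.group(1), match.group(2)
    return base, int(version) if version else None


def index_forms(arxiv_id: str) -> List[str]:
    """Forms to add to the 'biblio' string: the generic id, plus the versioned one."""
    base, version = split_version(arxiv_id)
    return [base] if version is None else [base, f"{base}v{version}"]


def same_work(a: str, b: str) -> bool:
    """Verification helper: two ids refer to the same work if their bases agree."""
    return split_version(a)[0] == split_version(b)[0]
```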

@kermitt2
Owner

kermitt2 commented Jun 2, 2019

Thanks a lot for the feedback @bnewbold !

I think the advantage of LMDB is its speed, it's super efficient, much faster than sqlite (for random read it's 10-50 times faster according to benchmarks), in particular for large values (like the full crossref metadata, more than 200 times faster for random read of large values).

Apart from that, I don't think the choice between a single-table design and plain key/value matters much, but I may be wrong.
Using LMDB has its own advantages: having only key/values simplifies the design and its evolution, with no schema at all at this stage (the schemas are only enforced upstream by each dataset), and with the possibility to load and update each dataset individually, add more as needed, etc. We can always optimize in the future. But just having a table has its own advantages, as you mention.

LMDB is very stable and scales super well, but of course sqlite does too. entity-fishing manages more than 1 billion values (whole Wikidata, 5 parsed Wikipedias, 15M embeddings), accessing them really fast (I think around 600k accesses per second) in multithreading, even with 4GB RAM, so we were confident with that solution.

I guess we could get the arXiv metadata via their OAI-PMH service, load that stuff and try to match some DOI at the same time.
For bibliographical reference parsing, GROBID has a nice, complicated, regex corresponding to the two arXiv ID schemes (pre- and post-2007), it is used as features for the machine learning model and it also tries to match the extracted identifier with it (versioning is covered if I remember well). There are many training examples with arxiv ids (added with someone from CERN) and it has been tested a lot.

We could import Wikidata identifier for works (via entity-fishing it's very easy), it could be a useful source of metadata (for books for instance!).

Yes, I was surprised to see "bibcode" in Wikipedia too! I think they are only used in astronomy, via the NASA ADS (I saw them a lot at CDS). It's strange to make this particular identifier visible on Wikipedia, since it usually only relates to bibliographical entries, not full texts. NASA ADS also has an OAI-PMH service, and we could get more astronomy-specific metadata (though I don't think I've ever seen bibcodes in bibliographical references).

But there's one question behind all of this :)
To some extent, maybe biblio-glutton should simply import metadata from fatcat? And the harvesting/aggregation would be handled only by fatcat? That would make both tools complementary and interoperable, addressing two different tasks?

@bnewbold
Contributor Author

bnewbold commented Jun 6, 2019

Interesting that LMDB is so much faster than sqlite.

Having biblio-glutton import from fatcat would certainly be convenient for me!

Fatcat currently supports DOI, PMID, PMCID, arxiv (full, versioned), Wikidata QID, JSTOR id, CORE id (unused), ARK (rare), and MAG id (microsoft academic, unused). Almost all of arxiv, crossref, and pubmed identifiers are populated. I've also loaded the JALC (Japanese DOI registrar) corpus and intend to load Datacite DOIs.

Do you mean including, e.g., ISTEX identifiers in fatcat as well? I'd be open to that, though as it would update 20+ million entities I'd want to go slow and test. The fatcat.wiki instance does have an API and can receive continuous updates by bots. It is also possible to run your own instance, though the main SQL database is up to about 400 GBytes, so it needs a large disk and a decent amount of RAM. Or of course ISTEX et al. can be "enriched" as they are now, from an LMDB table.

I will continue working on getting fatcat working with glutton as a metadata source and for matching, and will probably patch GROBID to allow including fatcat identifiers. I want to experiment with that a bit, then think about how more integration could happen.

@bnewbold
Contributor Author

As an update on this thread, I seem to have things working pretty well now with fatcat. So far I have only loaded 1/3 of the corpus into LMDB, but once that is complete I'll probably make this available as an experimental public API.

I finally noticed that the consolidate params have a 2 option to only insert identifiers instead of re-writing the full metadata. That's great for my use case, where I'd like to do matching/consolidation at the same time as parsing for efficiency, but want to keep the raw extracted metadata (title, references, etc.) around to debug data issues and write other matching/cleanup tools in the future, without having to double-extract and double-store XML. #442 would also be helpful for this.

My biblio-glutton changes were hack-y and break regular Crossref behavior, which makes it more of a fork. That's not great, and I'd like to have something I could merge back upstream, but I probably won't until we can think of the best way to do so. To continue supporting DOI look-ups I have two tables:

  • fatcat_ident -> fatcat release metadata JSON
  • DOI -> fatcat_ident

This could easily be extended using the same corpus/importer to arxiv, PMCID, PMID, etc (most things, though not PII and ISTEX currently).

I have some confusion about whether a "MatchingDocument" represents the schema returned from elasticsearch (just minimal metadata) or the complete Crossref Work schema stored in one of the LMDB tables. In particular, what the jsonOutput schema in LookupControllerTest.java is supposed to be.

In my setup glutton returns the fatcat release schema over the wire to GROBID, so GROBID needed to be patched to support this schema. These GROBID changes are cleaner and could potentially be merged. I added glutton_fatcat as a variant of the glutton consolidation back-end, which was just a file or two of new code. I also added better support for "raw" complete names (not separated into given/surname), because fatcat still has a lot of those from early days. I hope to remove most of these, but it may be good to retain support for things like collaborations (e.g., "The LIGO Collaboration" as an author name).

From informal playing around so far, and in-line with @bfirsh's interests, I think an easy improvement for glutton/GROBID would be end-to-end support for arxiv identifiers. I think GROBID already parses these out of references and could pass them along like DOI in glutton API requests; glutton would need an arxiv lookup table and complete arxiv metadata in the right schema (eg, transformed from arxiv OAI-PMH). One "gotcha" with arxiv is handling the versioned identifiers. For GROBID/glutton probably best to use the generic when matching without an identifier, but preserve version precision if used in references. I might implement this in my fatcat branch (I already have arxiv metadata in fatcat).

Other improvements for the fatcat corpus would be books (for the humanities), datasets from datacite, and better conference proceedings coverage (for STEM; eg importing dblp).

In addition to the "linking identifier" question earlier in this thread, it might be worth changing the schema communicated between GROBID and biblio-glutton away from Crossref Work. I would propose Citation Style Language JSON as being the best fit: both the crossref and fatcat schemas are pretty close already, and the schema is "useful" as-is because it can be used to render citation lists using existing tooling. It also feels like the best fit because it is explicitly designed to model references. Converting from Crossref to CSL in bulk or from the API shouldn't be too hard. There might even be a way to get crossref.org to return lookup results in CSL schema (you can do individual document fetches using content negotiation). This would make tasks like "add arxiv metadata" or "add dblp metadata" a matter of implementing one-way conversions to that format, which might be useful for others as well.
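
As a rough idea of how small such a one-way conversion could be, here is a sketch mapping a Crossref-style work dictionary to CSL-JSON. The field selection and the type table are partial and assumed for illustration; a real converter would need to cover many more fields and edge cases.

```python
from typing import Any, Dict

# Partial, assumed mapping from Crossref `type` values to CSL types.
TYPE_MAP = {
    "journal-article": "article-journal",
    "proceedings-article": "paper-conference",
    "book-chapter": "chapter",
    "book": "book",
}


def first(value: Any) -> Any:
    """Crossref wraps titles in lists; CSL expects plain strings."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def convert_author(author: Dict[str, str]) -> Dict[str, str]:
    if "family" in author:
        name = {"family": author["family"]}
        if "given" in author:
            name["given"] = author["given"]
        return name
    # Unsplit ("raw") or collective names go into CSL's `literal` field.
    return {"literal": author.get("name", "")}


def crossref_to_csl(work: Dict[str, Any]) -> Dict[str, Any]:
    csl: Dict[str, Any] = {"type": TYPE_MAP.get(work.get("type", ""), "article")}
    for key in ("title", "container-title"):
        value = first(work.get(key))
        if value:
            csl[key] = value
    # These fields are copied as-is when present.
    for key in ("DOI", "volume", "issue", "page", "issued"):
        if key in work:
            csl[key] = work[key]
    if work.get("author"):
        csl["author"] = [convert_author(a) for a in work["author"]]
    return csl
```

The `literal` name form is also a natural home for author names like "The LIGO Collaboration".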

@bnewbold
Contributor Author

bnewbold commented Jul 1, 2019

Here's an experimental glutton (and GROBID) API endpoint: http://glutton.qa.fatcat.wiki/

@bnewbold
Contributor Author

bnewbold commented Nov 5, 2019

Another potential, larger source of metadata to match against would be Semantic Scholar, currently up to some 175 million works. I believe these are almost all Microsoft Academic Graph entities, so matching against MAG directly might make more sense. Both of these corpora have their own pseudo-persistent identifiers.


4 participants