DOI parsing fails in a few cases #1

perrette · 2017-11-01T12:11:10Z

The current method to retrieve the DOI consists of searching for regular expressions over the first two pages and keeping the first match that appears.

Accepted prefixes are (lower or upper case):

'doi:', 'doi: ', 'doi ', 'dx\.doi\.org/', 'doi/'

DOI itself is searched as:

r"10\.\d\d\d\d/.*?"

And is expected to finish with:

r"[, \n]"
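As a rough illustration, the three pieces above (prefix, DOI body, terminator) could be assembled as follows. This is only a sketch of the described approach, not the project's actual code: the names `PREFIXES`, `DOI_RE` and `find_doi` are made up for this example, and it assumes the text of the first two pages has already been extracted to a string.

```python
import re

# Accepted prefixes as listed above (matched case-insensitively).
PREFIXES = ["doi:", "doi: ", "doi ", r"dx\.doi\.org/", "doi/"]

# prefix + lazily matched DOI body + terminator (comma, space or newline).
# Hypothetical assembly of the patterns quoted in this issue.
DOI_RE = re.compile(
    r"(?:" + "|".join(PREFIXES) + r")"
    r"(10\.\d\d\d\d/.*?)"
    r"[, \n]",
    re.IGNORECASE,
)


def find_doi(text):
    """Return the first DOI found in `text`, or None if nothing matches."""
    match = DOI_RE.search(text)
    return match.group(1) if match else None


print(find_doi("Published 2017. doi:10.1234/abc.def, Vol. 3\n"))
print(find_doi("Available at https://dx.doi.org/10.1000/xyz123 today"))
```

Because the body is matched lazily and must be followed by a terminator, a DOI sitting at the very end of the text with no trailing space or newline would not be found by this sketch.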

The method fails in a few cases:

when DOI spreads over two lines (e.g. here)
when other DOIs appear before the actual paper's DOI, for example here
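Both failure modes can be reproduced with a small script. The pattern below is an assumed reconstruction from the description in this issue (not copied from the code base), and the sample strings and the `first_doi` helper are invented for illustration.

```python
import re

# Assumed reconstruction: prefix, DOI body, then a comma/space/newline terminator.
DOI_RE = re.compile(
    r"(?:doi:|doi: |doi |dx\.doi\.org/|doi/)(10\.\d\d\d\d/.*?)[, \n]",
    re.IGNORECASE,
)


def first_doi(text):
    """Return the first match, mimicking the 'keep the first one' rule."""
    match = DOI_RE.search(text)
    return match.group(1) if match else None


# Case 1: the DOI is wrapped over two lines, so the match stops at the newline.
wrapped = "doi:10.1234/journal.\nname.2017.001 \n"
print(first_doi(wrapped))  # truncated: 10.1234/journal.

# Case 2: a cited DOI appears before the paper's own DOI.
cited_first = (
    "See doi:10.1111/other.paper, cited above.\n"
    "This article: doi:10.2222/this.paper\n"
)
print(first_doi(cited_first))  # wrong paper: 10.1111/other.paper
```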

These could be solved by more permissive DOI parsing, but the parsing is kept conservative for now, until a good solution is found.

Nevertheless, existing edits / fixes currently include:

an underscore sometimes gets converted into a space by pdftotext, so we also detect endings with any space followed by a digit. This solves at least one case.


boyanpenkov · 2023-04-17T17:13:06Z

https://github.com/MicheleCotrufo/pdf2doi might be a candidate, just to continue the conversation from #28

perrette · 2023-04-18T05:10:17Z

Yes. Thanks for the suggestion. If you end up exploring that sort of thing (with your large PDF database to test with!) I'd be glad if you could report back about what works best. And who knows, perhaps someone will show up who feels like merging all the good tools into @MicheleCotrufo's pdf2doi (if practical) or another stand-alone package. That should be a library with Python bindings, ideally not a verbose command line tool (so that it can be used in other command line tools), though one could certainly call it with subprocess (as is already done here with poppler utils).

perrette changed the title ~~DOI fetching fails in a few cases~~ DOI parsing fails in a few cases Nov 3, 2017

perrette added the help wanted label Apr 17, 2023
