when other DOIs appear before the actual paper's DOI, for example here
These could be solved by more permissive DOI parsing, but the parsing is kept conservative for now until a good solution is found.
Nevertheless, existing edits / fixes currently include:
An underscore is sometimes converted into a space by pdftotext, so a DOI ending with a space followed by a digit is also detected. This solves at least one case.
perrette changed the title from "DOI fetching fails in a few cases" to "DOI parsing fails in a few cases" on Nov 3, 2017
Yes. Thanks for the suggestion. If you end up exploring that sort of thing (with your large PDF database to test with!), I'd be glad if you could report back about what works best. And who knows, perhaps someone will show up who feels like merging all the good tools into @MicheleCotrufo's pdf2doi (if practical) or another stand-alone package. That should be a library with Python bindings, ideally not a verbose command line tool (so that it can be used in other command line tools), though one could certainly call it with subprocess (as is already done here with poppler utils).
The current method to retrieve the DOI consists of searching for regular expressions over the first two pages and keeping the first match.
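For readers who want to experiment, here is a rough sketch of the "first match over the first two pages" idea. It is not the code used in this project: `DOI_PATTERN`, `extract_first_pages` and `find_first_doi` are hypothetical names, and the regular expression is a generic DOI-like pattern assumed for illustration, not the project's actual prefixes or ending rules. It assumes poppler's `pdftotext` is on the PATH.

```python
import re
import subprocess

# Assumption: a generic DOI-like pattern, NOT this project's actual expression.
# "10." + a 4-9 digit registrant code + "/" + a run of non-whitespace characters.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def extract_first_pages(pdf_path, last_page=2):
    """Return the text of pages 1..last_page using poppler's pdftotext."""
    result = subprocess.run(
        # -f / -l select the first and last page; "-" writes the text to stdout.
        ["pdftotext", "-f", "1", "-l", str(last_page), pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_first_doi(text):
    """Return the first DOI-like match in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trim punctuation that commonly trails a DOI at the end of a sentence.
    return match.group(0).rstrip(".,;)")
```

Something like `find_first_doi(extract_first_pages("paper.pdf"))` would then return the first candidate. Because only the first match is kept, a DOI from a cited work that appears earlier in the text wins over the paper's own DOI.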
Accepted prefixes are (lower or upper case):
DOI itself is searched as:
And is expected to finish with:
The method fails in a few cases: