DOI parsing fails in a few cases #1

Open
perrette opened this issue Nov 1, 2017 · 2 comments
perrette commented Nov 1, 2017

The current method for retrieving the DOI consists of searching for regular expressions over the first two pages and keeping the first match.

Accepted prefixes are (lower or upper case):

'doi:', 'doi: ', 'doi ', 'dx\.doi\.org/', 'doi/'

The DOI itself is matched with:

r"10\.\d\d\d\d/.*?"

And is expected to finish with:

r"[, \n]"
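
For illustration, the three pieces above could be assembled roughly as follows. This is a minimal sketch rather than the project's actual code: the names `PREFIXES`, `DOI_PATTERN` and `find_first_doi` are hypothetical, and the way the pieces are joined is an assumption.

```python
import re

# Prefixes listed above; matched case-insensitively via re.IGNORECASE.
PREFIXES = ["doi:", "doi: ", "doi ", r"dx\.doi\.org/", "doi/"]

# Assumed assembly: one prefix, then the DOI body, then a terminator.
DOI_PATTERN = re.compile(
    r"(?:" + "|".join(PREFIXES) + r")"  # one of the accepted prefixes
    + r"(10\.\d\d\d\d/.*?)"             # DOI body, non-greedy
    + r"[, \n]",                        # expected terminator
    re.IGNORECASE,
)


def find_first_doi(text):
    """Return the first DOI-like match in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    return match.group(1) if match else None
```

Note that, as written, the pattern only accepts registrant codes of exactly four digits after `10.`.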

The method fails in a few cases:

  • when DOI spreads over two lines (e.g. here)
  • when other DOIs appear before the actual paper's DOI, for example here
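
Both failure modes can be reproduced with a simplified stand-in for the pattern. The sketch below is only illustrative: the DOIs are made up, and `PATTERN` and `first_doi` are hypothetical names, not the project's code.

```python
import re

# Simplified stand-in: "doi:" prefix, DOI body, then a terminator.
PATTERN = re.compile(r"doi:\s?(10\.\d\d\d\d/.*?)[, \n]", re.IGNORECASE)


def first_doi(text):
    """Return the first DOI-like match in text, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Failure 1: a DOI broken across two lines is cut at the newline.
split_text = "doi:10.1234/journal.\nabc.5678 \n"
truncated = first_doi(split_text)   # "10.1234/journal."

# Failure 2: a cited DOI appears before the paper's own DOI.
cited_first = "see doi:10.9999/other.work, this article doi:10.1234/this.paper \n"
wrong_one = first_doi(cited_first)  # "10.9999/other.work"
```

In the first case the newline acts as a terminator, so only the first half of the DOI is returned; in the second, keeping the first match picks up the cited work instead of the paper itself.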

Both cases could be addressed by more permissive DOI parsing, but the parser stays conservative for now, until a good solution is found.

Nevertheless, fixes already in place include:

  • pdftotext sometimes converts an underscore into a space, so we also detect a DOI ending with a space followed by a digit. This solves at least one case.
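
One possible reading of the underscore fix is sketched below. It is an interpretation, not taken from the project's code: it assumes that a single space immediately followed by a digit inside a DOI is a lost underscore, and the names `DOI_WITH_LOST_UNDERSCORE` and `recover_doi` are hypothetical.

```python
import re

# Hypothetical pattern: DOI body, optionally extended across single spaces
# that are immediately followed by a digit (assumed to be lost underscores).
DOI_WITH_LOST_UNDERSCORE = re.compile(
    r"10\.\d\d\d\d/[^, \n]+(?: \d[^, \n]*)*"
)


def recover_doi(text):
    """Return the first DOI found, with assumed lost underscores restored."""
    match = DOI_WITH_LOST_UNDERSCORE.search(text)
    if match is None:
        return None
    # Any space kept by the pattern is treated as a converted underscore.
    return match.group(0).replace(" ", "_")
```

The obvious risk of this approach is a false positive when a DOI is followed by an unrelated number, such as a year or a page number, which would then be glued onto the DOI.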

perrette changed the title from "DOI fetching fails in a few cases" to "DOI parsing fails in a few cases" on Nov 3, 2017

boyanpenkov commented

https://github.com/MicheleCotrufo/pdf2doi might be a candidate, just to continue the conversation from #28

perrette commented

Yes, thanks for the suggestion. If you end up exploring that sort of thing (with your large PDF database to test with!), I'd be glad if you could report back on what works best. And who knows, perhaps someone will show up who feels like merging all the good tools into @MicheleCotrufo's pdf2doi (if practical) or another stand-alone package. Ideally that would be a library with Python bindings rather than a verbose command-line tool, so that it can be used from other command-line tools, though one could certainly call it with subprocess (as is already done here with the poppler utils).
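
As a rough idea of what calling a poppler utility through subprocess looks like, here is a minimal sketch. It assumes `pdftotext` is installed and on the PATH; the helper names `pdftotext_command` and `extract_text` are hypothetical and not the functions used in this project.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=2):
    """Build the pdftotext command line; the final '-' sends text to stdout."""
    return [
        "pdftotext",
        "-f", str(first_page),  # first page to convert
        "-l", str(last_page),   # last page to convert
        pdf_path,
        "-",
    ]


def extract_text(pdf_path, first_page=1, last_page=2):
    """Run pdftotext and return the extracted text (raises on failure)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Limiting the call to the first two pages mirrors the search window described at the top of this issue.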
