different abstract extracted via ocr #44

Open
wo opened this issue Feb 19, 2016 · 1 comment

wo commented Feb 19, 2016

https://studies2.hec.fr/jahia/webdav/site/hec/shared/sites/mongin/acces_anonyme/page%20internet/O12.MonginExpectedHbk97.pdf

Here the publication info is treated as part of the abstract when the document is processed via ocr2xml, but not when it is processed via pdftohtml.

wo added the ocr2xml label Feb 19, 2016
wo added this to the new server start milestone Feb 19, 2016

wo commented Apr 4, 2016

The reason is probably that ocr2xml classifies the publication info as having the same font as the abstract, while pdftohtml does not. That in itself is not really a problem. There is a significant gap between the publication info and the abstract, though, and that gap should prevent the two from being treated as a single abstract.
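A minimal sketch of the gap check described above, assuming each extracted line carries a font label and vertical coordinates, and that the gap in question is vertical. The `Line` fields, `group_lines`, and the `gap_factor` threshold are hypothetical names for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    font: str
    top: float     # y-coordinate of the top edge, increasing downwards (assumed)
    bottom: float  # y-coordinate of the bottom edge


def group_lines(lines, gap_factor=1.5):
    """Group consecutive lines into blocks.

    A new block starts when the font changes or when the vertical gap to
    the previous line exceeds gap_factor times the median line height.
    """
    if not lines:
        return []
    # Use the median line height as the scale for what counts as a large gap.
    threshold = gap_factor * median(l.bottom - l.top for l in lines)
    blocks = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur.top - prev.bottom
        if cur.font != prev.font or gap > threshold:
            blocks.append([cur])      # font change or large gap: new block
        else:
            blocks[-1].append(cur)    # same font, normal spacing: same block
    return blocks


# Hypothetical layout: two lines of publication info, a large gap, then the
# abstract, all reported with the same font (as ocr2xml appears to do here).
page = [
    Line("In: Handbook ...", "f1", 0, 10),
    Line("Publisher, year", "f1", 12, 22),
    Line("Abstract line 1", "f1", 50, 60),
    Line("Abstract line 2", "f1", 62, 72),
    Line("Abstract line 3", "f1", 74, 84),
]
blocks = group_lines(page)
```

With the same font on every line, the split between the second and third line comes from the gap alone, so the publication info and the abstract end up in separate blocks.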

wo modified the milestones: someday maybe, new server start Apr 4, 2016