Compare Python PDF extraction libraries with sample files #30

mgorenstein opened this issue May 22, 2014 · 11 comments

Comments

@mgorenstein
Contributor

Write up a mini-paper comparing the performance of various text extractors on a document with available plaintext (possibly a particular edition of the Bible).

  • Find popular samples with clean and accurate plain text.
    • For each sample, find versions of varying quality.
      • One version should be a PDF I generate from the plain text.
      • One version should be a rich PDF where text is queryable.
      • One version should require OCR.
    • Trim extra info like page number and Project Gutenberg header, if possible.
      • An approach could be to locate the first and last sentence of the text, consider only between these.
      • Or, just leave it be? All extractors will pick up this noise and it's expected in typical use cases.
  • Determine which PDF extractor libraries to test.
    • Definitely PDFMiner, since I've already reverse-engineered it programmatically.
  • Determine the measures of extraction accuracy.
    • I've used various measures of string difference in the FuzzyWuzzy library with some success.
    • Measure speed.
    • Many of these libraries do layout analysis -- is this helpful or not? It surely has an effect on speed.
  • Run the conversions and calculate accuracy.
    • Create a testing suite that determines setting regimes with uniformly better accuracy and use those settings as the benchmark for a particular library.
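
The trimming and accuracy steps above can be sketched as follows. This is a minimal illustration, not the project's actual harness: it uses the stdlib's difflib ratio as a stand-in for the FuzzyWuzzy measures mentioned above, and the sample strings, function names, and sentence markers are all made up for the example.

```python
import difflib


def trim_to_body(text, first_sentence, last_sentence):
    """Keep only the span between the known first and last sentences,
    dropping surrounding noise such as the Project Gutenberg header."""
    start = text.index(first_sentence)
    end = text.index(last_sentence) + len(last_sentence)
    return text[start:end]


def similarity(extracted, ground_truth):
    """Character-level similarity in [0, 1]; difflib's ratio is a
    stdlib stand-in for FuzzyWuzzy-style string-difference measures."""
    return difflib.SequenceMatcher(None, extracted, ground_truth).ratio()


# Toy example: an "extraction" with header noise and a small typo.
ground_truth = "It is a truth universally acknowledged. She was happy."
extracted = ("*** START OF PROJECT GUTENBERG ***\n"
             "It is a truth universaly acknowledged. She was happy.\n"
             "*** END ***")

body = trim_to_body(extracted,
                    first_sentence="It is a truth",
                    last_sentence="She was happy.")
score = similarity(body, ground_truth)
print(round(score, 3))
```

The same ratio can then be computed per extractor (alongside wall-clock timing) to produce the comparison table.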
@mgorenstein mgorenstein self-assigned this May 22, 2014
@grahamsack
Contributor

Hi Mark -- I experimented a bit with PyPDF and PDFMiner for syllabus extraction. PyPDF seemed smoother to work with. I tried extracting syllabi into plain text and HTML. The extractors captured most of the text correctly, but in cases where the formatting was complicated or had lots of tables, they jumbled the order.

Best,

Graham


@denten
Member

denten commented May 22, 2014

We are planning to do a more formal comparison. Stay tuned.

@grahamsack
Contributor

If you want to leverage it, I put my code for the extractor in the opensyllabus/Classifiers folder.

@mgorenstein
Contributor Author

Thanks, Graham.

@mgorenstein
Contributor Author

Libraries

  • PDFMiner
    • Slate seems like a good wrapper for it, though it is possibly outdated.
    • Tutorial for annotation extraction in case we decide to go this route.
  • pyPDF2
  • PDFBox
  • PDFTextStream
    • Worth checking out. Some sites claim Python support, but if not, the Java version is still a tenable option.
    • Just invoke it through Python with os.system() (or subprocess).
  • GFX
  • pdfextract
    • two modes: line preservation and no preservation
  • xpdf
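
For the Java-only options above (PDFBox, possibly PDFTextStream), shelling out is one way to drive them from Python; subprocess is a safer variant of the os.system() idea. This is only a sketch: the jar filename and CLI arguments below follow PDFBox's standalone app jar convention, but the exact paths and flags depend on the installed version.

```python
import subprocess
from pathlib import Path


def build_pdfbox_command(jar_path, pdf_path, out_path):
    """Assemble the CLI invocation for PDFBox's ExtractText tool.
    Adjust the jar name/flags for your installed PDFBox version."""
    return ["java", "-jar", str(jar_path),
            "ExtractText", str(pdf_path), str(out_path)]


def extract_with_java_tool(jar_path, pdf_path, out_path):
    """Run the Java extractor and return the extracted text."""
    subprocess.run(build_pdfbox_command(jar_path, pdf_path, out_path),
                   check=True)
    return Path(out_path).read_text()


# Only the command construction is shown here; running it requires
# a JVM and the PDFBox jar on disk.
cmd = build_pdfbox_command("pdfbox-app.jar", "sample.pdf", "sample.txt")
print(cmd)
```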

Source Texts

@mgorenstein
Contributor Author

I'm going to move ahead with V1 given these resources. I'll make the platform flexible enough to support the addition of other PDF extractors in case we come across any serious contenders that I've missed.

Graham and Dennis: let me know if you have any suggestions, especially with the selection of source texts. I went with P&P because it's in the public domain, was written in English, and has a range of released PDFs.
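
A platform flexible enough to add extractors later suggests a registry pattern: each library gets a callable that takes a PDF path and returns text, and the benchmark loop treats them uniformly. The sketch below is hypothetical (the extractor functions are dummies standing in for PDFMiner, pyPDF2, etc., and difflib's ratio stands in for the real accuracy measures).

```python
import difflib
import time

# Hypothetical registry mapping a library name to an extractor callable.
# Real entries would wrap PDFMiner, pyPDF2, PDFBox, and so on; a new
# extractor is supported by registering one more function here.
EXTRACTORS = {}


def register(name):
    def deco(fn):
        EXTRACTORS[name] = fn
        return fn
    return deco


@register("dummy-exact")
def dummy_exact(pdf_path):
    return "It is a truth universally acknowledged."


@register("dummy-noisy")
def dummy_noisy(pdf_path):
    return "It is a truth unverisally acknowleged."


def benchmark(pdf_path, ground_truth):
    """Run every registered extractor; return (name, accuracy, seconds)."""
    rows = []
    for name, extract in EXTRACTORS.items():
        t0 = time.perf_counter()
        text = extract(pdf_path)
        elapsed = time.perf_counter() - t0
        acc = difflib.SequenceMatcher(None, text, ground_truth).ratio()
        rows.append((name, round(acc, 3), elapsed))
    return rows


rows = benchmark("pride_and_prejudice.pdf",
                 "It is a truth universally acknowledged.")
for name, acc, secs in rows:
    print(f"{name:12s} accuracy={acc:.3f} time={secs:.4f}s")
```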

mgorenstein added a commit that referenced this issue May 25, 2014
I reverse-engineered the latest version of pdfminer so that we can work
with PDFs programmatically.
@grahamsack
Contributor

I had read about slate while looking into PDFMiner, and I thought it sounded very good and comparatively user-friendly, but I wasn't able to get it working due to a dependency issue I was never able to resolve. If you can get it working, that's great, as it sounds like a good library.


mgorenstein added a commit that referenced this issue May 26, 2014
mgorenstein added a commit that referenced this issue May 28, 2014
mgorenstein added a commit that referenced this issue Jun 7, 2014
@mrenoch

mrenoch commented Aug 4, 2014

This could be worth checking out, to unify and maybe simplify text extraction:

http://datascopeanalytics.com/what-we-think/2014/07/27/extract-text-from-any-document-no-muss-no-fuss

https://github.com/deanmalmgren/textract

@samzhang111

Jumping in after not contributing very much... I'm familiar with some of the people maintaining Apache Tika out at NASA JPL. It's a project with a strong core team of developers, and it has overlapping goals with textract. The advantage of Tika (and textract) is that you don't need separate logic for each document format, and you also get standard metadata for each document.

Tika wraps PDFBox for PDF documents, which performed a 6-second extraction in the benchmarking stats file. I bet the slowness was caused by the boot-up time of the JVM, though. If you separate the JVM initialization from the conversion, I imagine it's more in the range of the pure-Python extractors. This is how I've used Tika in Python in the past: www.hackzine.org/using-apache-tika-from-python-with-jnius.html

Cheers,
Sam

@chrismattmann

Thanks @samzhang111 -- yep, happy to provide any info here on Tika if it helps.

@chrismattmann

Coming back here, just FYI we have a fully supported Tika port to Python using the JAX-RS REST server. FYI: https://github.com/chrismattmann/tika-python
