PDF extractor using PDFBox.
The jar file is available from the releases page.
- PDFExtract.jl: Julia wrapper for pdfextract
Extract text and draw operations from a PDF.
java -classpath pdfextract.jar paperai.pdfextract.PDFExtractor [file or directory]
For example,
java -classpath pdfextract.jar paperai.pdfextract.PDFExtractor xxx.pdf
In the figure, blue squares indicate font coordinates and red squares indicate glyph coordinates.
A gzip file is generated.
Each line is either Text or Draw, as follows.
Text
- Page number
- Character, or NO_UNICODE when the Unicode mapping is unavailable
- Font coordinate (x, y, width, height)
- Glyph coordinate (x, y, width, height)
Draw
- Page number
- Draw operation, one of [STROKE_PATH], [FILL_PATH], [FILL_STROKE_PATH], [CURVE_TO], [LINE_TO], [MOVE_TO], [RECTANGLE]
- Coordinates
1 P 107.551 793.155 5.478471 10.705882 107.551 795.58496 5.424672 5.8550596
1 r 113.02947 793.155 3.4879298 10.705882 113.43296 797.48584 3.290669 3.9541826
1 o 116.113914 793.155 4.4832 10.705882 116.35601 797.48584 3.9541826 4.052813
1 c 120.597115 793.155 3.981082 10.705882 120.866104 797.48584 3.5417283 4.052813
1 e 124.57819 793.155 3.981082 10.705882 124.856155 797.48584 3.4161987 4.052813
1 e 128.55928 793.155 3.981082 10.705882 128.83723 797.48584 3.4161987 4.052813
1 d 132.54036 793.155 4.4832 10.705882 132.67485 795.316 4.590797 6.240615
1 i 137.02356 793.155 2.4926593 10.705882 137.4629 795.576 1.9277761 5.9626565
1 n 139.51622 793.155 4.4832 10.705882 139.64175 797.48584 4.124544 4.03488
1 g 143.99942 793.155 4.4832 10.705882 144.07115 797.48584 4.16041 5.801261
1 s 148.48262 793.155 3.4879298 10.705882 148.62608 797.47687 3.13824 4.0797124
...
...
4 [MOVE_TO] 323.779 200.93103
4 [LINE_TO] 509.279 200.93103
4 [LINE_TO] 509.279 62.675964
4 [LINE_TO] 323.779 62.675964
4 [RECTANGLE] 323.779 200.93103 509.279 200.93103 509.279 62.676025 323.779 62.676025
4 [FILL_PATH]
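The sample lines above can be read back with a short script. The following is a minimal Python sketch and not part of pdfextract: the names `Text`, `Draw`, `parse_line`, and `read_records` are hypothetical. It assumes that fields are separated by whitespace, that the character field itself contains no whitespace, and that the gzip file holds UTF-8 text.

```python
import gzip
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple, Union

Box = Tuple[float, float, float, float]  # (x, y, width, height)

# Draw operation names as listed in this README (without brackets).
DRAW_OPS = {
    "STROKE_PATH", "FILL_PATH", "FILL_STROKE_PATH",
    "CURVE_TO", "LINE_TO", "MOVE_TO", "RECTANGLE",
}


@dataclass
class Text:
    page: int
    char: Optional[str]  # None when the line carries NO_UNICODE
    font_box: Box
    glyph_box: Box


@dataclass
class Draw:
    page: int
    op: str              # e.g. "MOVE_TO"
    coords: List[float]  # may be empty, e.g. for FILL_PATH


def parse_line(line: str) -> Union[Text, Draw]:
    """Parse one output line into a Text or Draw record."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) < 2:
        raise ValueError(f"too few fields: {line!r}")
    page, token = int(fields[0]), fields[1]

    # Draw lines carry a bracketed operation name in the second field.
    if token.startswith("[") and token.endswith("]") and token[1:-1] in DRAW_OPS:
        return Draw(page, token[1:-1], [float(v) for v in fields[2:]])

    # Text lines: page, character, 4 font values, 4 glyph values.
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields in a Text line: {line!r}")
    nums = [float(v) for v in fields[2:]]
    char = None if token == "NO_UNICODE" else token
    return Text(page, char, tuple(nums[:4]), tuple(nums[4:]))


def read_records(path: str) -> Iterator[Union[Text, Draw]]:
    """Yield records from a gzip output file, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:  # assumption: UTF-8
        for line in f:
            if line.strip():
                yield parse_line(line)
```

For instance, `parse_line("4 [MOVE_TO] 323.779 200.93103")` returns a `Draw` record with `op == "MOVE_TO"` and two coordinates.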
Extract images from a PDF in PNG format.
java -classpath pdfextract.jar paperai.pdfextract.ImageExtractor <file or directory> -dpi <dpi> -o <output directory>
For example,
java -classpath pdfextract.jar paperai.pdfextract.ImageExtractor xxx.pdf -dpi 300 -o /work
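To call the image extractor from a script, the command line can be assembled programmatically. This is a hedged Python sketch: `image_extractor_command` and `extract_images` are hypothetical helper names, the jar is assumed to sit at `pdfextract.jar` in the working directory, and `java` is assumed to be on the `PATH`. Only the arguments shown in the usage line above are used.

```python
import subprocess
from typing import List


def image_extractor_command(pdf_path: str, dpi: int, out_dir: str,
                            jar: str = "pdfextract.jar") -> List[str]:
    """Build the argument list for the ImageExtractor usage shown above."""
    return [
        "java", "-classpath", jar,
        "paperai.pdfextract.ImageExtractor",
        pdf_path, "-dpi", str(dpi), "-o", out_dir,
    ]


def extract_images(pdf_path: str, dpi: int, out_dir: str,
                   jar: str = "pdfextract.jar") -> None:
    """Run the extractor and raise CalledProcessError on a non-zero exit."""
    subprocess.run(image_extractor_command(pdf_path, dpi, out_dir, jar),
                   check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.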