Use PLAYA instead of pdfminer #1226

Draft: wants to merge 34 commits into base: develop

Commits (34):
dfa4b9e
feat: use playa instead of pdfminer
dhdaines Oct 1, 2024
ddd0532
feat: playa does the right thing for mcids
dhdaines Oct 1, 2024
87a0fa3
fix: playa exposes ncs/scs
dhdaines Oct 1, 2024
fffd551
fix: update to handle parsed pages
dhdaines Oct 1, 2024
e64d509
chore: format, lint
dhdaines Oct 1, 2024
5991da3
fix(deps): switch to unreleased playa
dhdaines Oct 1, 2024
f5ac9f8
feat: playa exposes these now (but... for how long)
dhdaines Oct 22, 2024
b05d6e2
fix: new API
dhdaines Oct 22, 2024
36e28cb
feat!: remove custom LAParams (just use pdfminer if you want them)
dhdaines Oct 23, 2024
d6b5106
refactor!: another useless pdfminer API removed
dhdaines Oct 23, 2024
c0f50c2
fix: numbertree is just iterable
dhdaines Oct 23, 2024
a2aeeb3
refactor!: remove structure as it is in playa
dhdaines Oct 23, 2024
ac185d6
fix: minimally support (not quite working) new PLAYA API
dhdaines Oct 31, 2024
8d70e02
fix: some updates for latest playa
dhdaines Nov 16, 2024
46a8ba2
fix: serialize namedtuple colors
dhdaines Nov 16, 2024
b370322
fix: adjust a few things for playa
dhdaines Nov 17, 2024
3188fba
fix: add page numbers to structure tests
dhdaines Nov 17, 2024
be99434
fix: updated playa names
dhdaines Nov 17, 2024
59c255b
fix: update for PLAYA 0.1
dhdaines Nov 20, 2024
9449b11
fix: lint and format and such
dhdaines Nov 20, 2024
07bfadf
fix(deps): playa is on pypi now
dhdaines Nov 20, 2024
4de84f6
fix(tests): PLAYA fixed its colors
dhdaines Nov 20, 2024
cdd3895
fix(deps): messed up playa again...
dhdaines Nov 20, 2024
d27fcb9
fix(tests): back to previous way of formatting colors (for now)
dhdaines Nov 22, 2024
7e7d354
fix: no longer needs repair as mediabox is normalized
dhdaines Nov 22, 2024
b3da221
fix: remove unused import
dhdaines Nov 22, 2024
b0b7e6c
feat: mostly implement using lazy api
dhdaines Dec 12, 2024
6de85ed
feat: complete the reimplementation using playa lazy api
dhdaines Dec 13, 2024
56bcab6
feat: lightly wrap playa structure
dhdaines Dec 13, 2024
4117380
fix: lint
dhdaines Dec 13, 2024
718a558
docs: update README and CHANGELOG
dhdaines Dec 15, 2024
0e6dc30
feat: expose render_mode (fixes: #1230)
dhdaines Dec 27, 2024
5caaefb
fix: correct the "size" of rotated glyphs
dhdaines Jan 5, 2025
c9d3848
fix(tests): new and more correct text objects
dhdaines Jan 7, 2025
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,13 @@

All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).

+## Unreleased
+
+### Changed
+
+- Switch to using [`PLAYA-PDF`](https://github.com/dhdaines/playa) for PDF parsing for increased speed and robustness.
+- Remove pdfminer-specific interfaces (chiefly `LAParams`)
+
## [0.11.5] - 2024-10-02

### Added
22 changes: 8 additions & 14 deletions README.md
@@ -4,7 +4,7 @@

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

-Works best on machine-generated, rather than scanned, PDFs. Built on [`pdfminer.six`](https://github.com/goulu/pdfminer).
+Works best on machine-generated, rather than scanned, PDFs. Built on [`PLAYA-PDF`](https://github.com/dhdaines/playa).

Currently [tested](tests/) on [Python 3.8, 3.9, 3.10, 3.11](.github/workflows/tests.yml).

@@ -50,7 +50,6 @@ The output will be a CSV containing info about every character, line, and rectan
|`--format [format]`| `csv` or `json`. The `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.|
|`--pages [list of pages]`| A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.|
|`--types [list of object types to extract]`| Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.|
-|`--laparams`| A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.|
|`--precision [integer]`| The number of decimal places to round floating-point numbers. Defaults to no rounding.|

## Python library
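
For reference, the `--pages` format described in the option table above can be sketched in a few lines. This is an illustration only: `parse_page_spec` is a hypothetical helper, not pdfplumber's actual CLI code, and accepting both commas and whitespace as separators is an assumption based on the `1, 11-15` example.

```python
import re
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a 1-indexed page spec such as "1, 11-15" into page numbers.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    pages: List[int] = []
    # Assumption: entries may be separated by commas, whitespace, or both.
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        if "-" in token:
            # A hyphenated entry is an inclusive range, e.g. "11-15".
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {token!r}")
            pages.extend(range(start, end + 1))
        else:
            page = int(token)
            if page < 1:
                raise ValueError(f"pages are 1-indexed: {token!r}")
            pages.append(page)
    return pages


print(parse_page_spec("1, 11-15"))  # [1, 11, 12, 13, 14, 15]
```

The result matches the README's own example: pages 1, 11, 12, 13, 14, and 15.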
@@ -77,8 +76,6 @@ The `open` method returns an instance of the `pdfplumber.PDF` class.

To load a password-protected PDF, pass the `password` keyword argument, e.g., `pdfplumber.open("file.pdf", password = "test")`.

-To set layout analysis parameters to `pdfminer.six`'s layout engine, pass the `laparams` keyword argument, e.g., `pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 })`.
-
To [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `"NFC"`, `"NFD"`, `"NFKC"`, or `"NFKD"`.

Invalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.
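
To make the four normalization forms named above concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It does not call pdfplumber; that `unicode_norm` applies the same forms to extracted text is taken from the README wording, and the sample string is made up for illustration.

```python
import unicodedata

# Sample text (invented for this sketch): the "fi" ligature U+FB01, a space,
# then "e" followed by a combining acute accent U+0301.
sample = "\ufb01 e\u0301"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, sample)
    # Show the code points so composed vs. decomposed results are visible.
    print(form, [f"U+{ord(ch):04X}" for ch in normalized])
```

The canonical forms (`NFC`, `NFD`) leave the ligature intact and only compose or decompose the accent, while the compatibility forms (`NFKC`, `NFKD`) also split the ligature into plain `f` and `i`, which is usually what matters when searching extracted text.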
@@ -132,12 +129,12 @@ Additional methods are described in the sections below:

### Objects

-Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to several types of PDF objects, all derived from [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six/) PDF parsing. The following properties each return a Python list of the matching objects:
+Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to several types of PDF objects, all derived from [`PLAYA-PDF`](https://github.com/dhdaines/playa/) PDF parsing. The following properties each return a Python list of the matching objects:

- `.chars`, each representing a single text character.
- `.lines`, each representing a single 1-dimensional line.
- `.rects`, each representing a single 2-dimensional rectangle.
-- `.curves`, each representing any series of connected points that `pdfminer.six` does not recognize as a line or rectangle.
+- `.curves`, each representing any series of connected points that `pdfplumber` does not recognize as a line or rectangle.
- `.images`, each representing an image.
- `.annots`, each representing a single PDF annotation (cf. Section 8.4 of the [official PDF specification](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf) for details)
- `.hyperlinks`, each representing a single PDF annotation of the subtype `Link` and having an `URI` action attribute
@@ -272,18 +269,13 @@ Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to seve
|`srcsize`| The image original dimensions, as a `(width, height)` tuple.|
|`colorspace`| Color domain of the image (e.g., RGB).|
|`bits`| The number of bits per color component; e.g., 8 corresponds to 255 possible values for each color component (R, G, and B in an RGB color space).|
-|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|
+|`stream`| Pixel values of the image, as a `playa.pdftypes.ContentStream` object.|
|`imagemask`| A nullable boolean; if `True`, "specifies that the image data is to be used as a stencil mask for painting in the current color."|
|`name`| "The name by which this image XObject is referenced in the XObject subdictionary of the current resource dictionary." [🔗](https://ghostscript.com/~robin/pdf_reference17.pdf#page=340) |
|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|
|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|
|`object_type`| "image"|

-### Obtaining higher-level layout objects via `pdfminer.six`
-
-If you pass the `pdfminer.six`-handling `laparams` parameter to `pdfplumber.open(...)`, then each page's `.objects` dictionary will also contain `pdfminer.six`'s higher-level layout objects, such as `"textboxhorizontal"`.
-
-
## Visual debugging

`pdfplumber`'s visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.
@@ -451,7 +443,7 @@ Both `vertical_strategy` and `horizontal_strategy` accept the following options:

Sometimes PDF files can contain forms that include inputs that people can fill out and save. While values in form fields appear like other text in a PDF file, form data is handled differently. If you want the gory details, see page 671 of this [specification](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf).

-`pdfplumber` doesn't have an interface for working with form data, but you can access it using `pdfplumber`'s wrappers around `pdfminer`.
+`pdfplumber` doesn't have an interface for working with form data, but you can access it using `pdfplumber`'s wrappers around `PLAYA-PDF`.

For example, this snippet will retrieve form field names and values and store them in a dictionary.

@@ -523,7 +515,9 @@ It's also helpful to know what features `pdfplumber` does __not__ provide:

### Specific comparisons

-- [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six) provides the foundation for `pdfplumber`. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.
+- [`PLAYA-PDF`](https://github.com/dhdaines/playa) provides the foundation for `pdfplumber`. It focuses on parsing PDFs and does not do layout analysis or text extraction.
+
+- [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six) focuses on parsing PDFs, with some functionality for analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.

- [`PyPDF2`](https://github.com/mstamy2/PyPDF2) is a pure-Python library "capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files." It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table-extraction, or visually debugging tools.

4 changes: 0 additions & 4 deletions pdfplumber/__init__.py
@@ -1,15 +1,11 @@
__all__ = [
    "__version__",
    "utils",
-    "pdfminer",
    "open",
    "repair",
    "set_debug",
]

-import pdfminer
-import pdfminer.pdftypes
-
from . import utils
from ._version import __version__
from .pdf import PDF
2 changes: 1 addition & 1 deletion pdfplumber/convert.py
@@ -1,7 +1,7 @@
import base64
from typing import Any, Callable, Dict, List, Optional, Tuple

-from pdfminer.psparser import PSLiteral
+from playa.parser import PSLiteral

from .utils import decode_text
