One big commit for v0.6.0-alpha
jsvine committed Feb 6, 2018
1 parent ee7b4c2 commit 26bf1c5
Showing 36 changed files with 1,841 additions and 547 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
TODO.md
/notebooks
examples-in-progress
.ipynb_checkpoints
.DS_Store
# Byte-compiled / optimized / DLL files
8 changes: 5 additions & 3 deletions .travis.yml
@@ -6,7 +6,9 @@ python:
- "3.5"
- "3.6"
install:
- pip install .
- pip install nose
- pip install -e .
- pip install pandas
script: nosetests
- pip install nose
- pip install coveralls
script: nosetests --with-coverage --cover-erase --cover-package pdfplumber
after_success: coveralls
42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,48 @@ All notable changes to this project will be documented in this file. Currently g

The format is based on [Keep a Changelog](http://keepachangelog.com/).

## [0.6.0-alpha] — 2018-02-05
### Added
- Color information for many objects, thanks to `pdfminer.six` updates
- Font size and name to results of `Page/utils.extract_words`; `match_fontsize`, `match_fontname`, and `fontsize_tolerance` keyword arguments to that method.
- Ability for `Page.crop`/etc. to accept rects and other `pdfplumber` objects
- `PageImage.draw_object`, which tries to replicate a given object's attributes
- `Page.find_text_edges`, which returns a list of lines that appear to define implicit borders/edges/alignment
- `char_threshold` argument for `Page.crop`, which lets you decide how much of a `char` needs to be within a cropping box to be retained
- `MANIFEST.in`
- `requirements.txt`

### Changed
- Big revamp/simplification of table extraction
- Upgrade to `pdfminer.six==20170419`

### Fixed
- Fix `utils.objects_overlap`, which failed when the second object entirely encompassed the first object

### Deprecated
- Access to `Page.annos`, which wasn't actually working in the first place. Hoping to re-add proper support.
- Access to `y0` and `y1` properties, which were redundant and a bit confusing
- `utils.resize_object`, which contained flawed assumptions and wasn't necessary

## [0.5.7] — 2018-01-20
### Added
- `.travis.yml`, but failing on `.to_image()`

### Changed
- Move from defunct `pycrypto` to `pycryptodome`
- Update `pdfminer.six` to `20170720`

## [0.5.6] — 2017-11-21
### Fixed
- Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.

## [0.5.5] — 2017-05-10
### Added
- Access to `__version__` from main namespace

### Fixed
- Fix issue #33, by checking `decode_text`'s argument type

## [0.5.4] — 2017-04-27
### Fixed
- Pin `pdfminer.six` to version `20151013` (for now), fixing incompatibility
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include *.txt *.md *.rst
70 changes: 26 additions & 44 deletions README.md
@@ -1,4 +1,4 @@
# PDFPlumber `v0.5.7`
# PDFPlumber `v0.6.0-alpha`

Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.

@@ -41,7 +41,7 @@ The output will be a CSV containing info about every character, line, and rectan
|----------|-------------|
|`--format [format]`| `csv` or `json`. The `json` format returns slightly more information; it includes PDF-level metadata and height/width information about each page.|
|`--pages [list of pages]`| A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.|
|`--types [list of object types to extract]`| Choices are `char`, `anno`, `line`, `curve`, `rect`, `rect_edge`. Defaults to `char`, `anno`, `line`, `curve`, `rect`.|
|`--types [list of object types to extract]`| Choices are `char`, `line`, `curve`, `rect`, `rect_edge`. Defaults to `char`, `line`, `curve`, `rect`.|
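
The `--pages` description above can be made concrete with a small sketch. The helper name `expand_pages` is hypothetical and is not taken from `pdfplumber/cli.py`; it only illustrates how a list of `1`-indexed page tokens with hyphenated ranges could expand into individual page numbers, assuming the shell has already split the tokens on whitespace.

```python
def expand_pages(tokens):
    """Expand tokens such as "1" or "11-15" into a flat list of page numbers.

    Hypothetical helper for illustration only; not pdfplumber's CLI code.
    """
    pages = []
    for token in tokens:
        if "-" in token:
            # A hyphenated range is inclusive on both ends.
            first, last = token.split("-")
            pages.extend(range(int(first), int(last) + 1))
        else:
            pages.append(int(token))
    return pages


print(expand_pages(["1", "11-15"]))  # [1, 11, 12, 13, 14, 15]
```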

## Python library

@@ -88,27 +88,32 @@ The `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll d

| Method | Description |
|--------|-------------|
|`.crop(bounding_box)`| Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box.|
|`.crop(bounding_box, char_threshold=0.5)`| Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If a character object's area is reduced by more than a ratio of `char_threshold`, it is removed.|
|`.within_bbox(bounding_box)`| Similar to `.crop`, but only retains objects that fall *entirely* within the bounding box.|
|`.filter(test_function)`| Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.|
|`.extract_text(x_tolerance=0, y_tolerance=0)`| Collates all of the page's character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.|
|`.extract_words(x_tolerance=0, y_tolerance=0)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`.|
|`.extract_text(x_tolerance = 3, y_tolerance = 3)`| Collates all of the page's character objects into a single string. Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.|
|`.extract_words(x_tolerance = 3, y_tolerance = 3, fontsize_tolerance = 0.25, keep_blank_chars = False, match_fontsize = True, match_fontname = True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. By default, characters are grouped into the same word only if they share the same font and their font sizes are within `fontsize_tolerance` of each other. Those defaults can be changed by setting `match_fontname` and/or `match_fontsize` to `False`.|
|`.find_text_edges(orientation, min_words = 3, extend = False, word_kwargs = {})`| Returns a list of edges that align with the borders of at least `min_words` words. The `orientation` parameter must be either `h` or `v` (i.e., horizontal or vertical). By default, the edges extend only to the outermost matching words; setting `extend = True` extends all edges to the extremity of the page. The `word_kwargs` dict is passed to `page.extract_words(...)`.|
|`.extract_tables(table_settings)`| Extracts tabular data from the page. For more details see "[Extracting tables](#extracting-tables)" below.|
|`.extract_table(table_settings)`| Extracts data from the largest detected table on the page, as measured by the number of cells. For more details see "[Extracting tables](#extracting-tables)" below.|
|`.to_image(**conversion_kwargs)`| Returns an instance of the `PageImage` class. For more details, see "[Visual debugging](#visual-debugging)" below. For conversion_kwargs, see [here](http://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image).|
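
The word-grouping rule in the `.extract_words` row can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `group_chars_into_words` and the sample `chars` are hypothetical, and the sketch ignores blank characters as well as the `match_fontname` / `match_fontsize` checks.

```python
def group_chars_into_words(chars, x_tolerance=3, y_tolerance=3):
    """Group char dicts into words using the gap rule from the table above.

    Hypothetical helper; each char is a dict with "text", "x0", "x1",
    and "doctop" keys, already in reading order.
    """
    words = []
    current = []
    for char in chars:
        if current:
            prev = current[-1]
            same_word = (
                char["x0"] - prev["x1"] <= x_tolerance
                and abs(char["doctop"] - prev["doctop"]) <= y_tolerance
            )
            if not same_word:
                # Close the word in progress and start a new one.
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


chars = [
    {"text": "H", "x0": 10, "x1": 16, "doctop": 100},
    {"text": "i", "x0": 17, "x1": 20, "doctop": 100},
    {"text": "y", "x0": 30, "x1": 36, "doctop": 100},  # horizontal gap of 10
    {"text": "o", "x0": 37, "x1": 43, "doctop": 100},
    {"text": "Z", "x0": 10, "x1": 16, "doctop": 120},  # 20 units further down
]
print(group_chars_into_words(chars))  # ['Hi', 'yo', 'Z']
```

Raising `x_tolerance` to 10 in this sample would merge "Hi" and "yo" into a single word, which is the kind of trade-off the tolerance arguments control.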

### Objects

Each instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to four types of PDF objects. The following properties each return a Python list of the matching objects:

- `.chars`, each representing a single text character.
- `.annos`, each representing a single annotation-text character.
- `.lines`, each representing a single 1-dimensional line.
- `.rects`, each representing a single 2-dimensional rectangle.
- `.rect_edges`, each representing one side of each rectangle.
- `.edges`, equivalent to `.lines` + `.rect_edges`, with the addition of an "orientation" property for all objects.
- `.horizontal_edges` (same as above, but only for those running horizontally)
- `.vertical_edges` (same as above, but only for those running vertically)
- `.curves`, each representing a series of connected points.

Each object is represented as a simple Python `dict`, with the following properties:

#### `char` / `anno` properties
#### `char` properties

| Property | Description |
|----------|-------------|
@@ -127,7 +132,7 @@ Each object is represented as a simple Python `dict`, with the following propert
|`top`| Distance of top of character from top of page.|
|`bottom`| Distance of bottom of the character from top of page.|
|`doctop`| Distance of top of character from top of document.|
|`object_type`| "char" / "anno"|
|`object_type`| "char" |

#### `line` properties

@@ -257,56 +262,33 @@ By default, `extract_tables` uses the page's vertical and horizontal lines (or r

```python
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "vertical_edges": None,
    "horizontal_edges": None,
    "snap_tolerance": DEFAULT_SNAP_TOLERANCE,
    "join_tolerance": DEFAULT_JOIN_TOLERANCE,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
    "text_kwargs": {}
}
```

| Setting | Description |
|---------|-------------|
|`"vertical_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"horizontal_strategy"`| Either `"lines"`, `"lines_strict"`, `"text"`, or `"explicit"`. See explanation below.|
|`"explicit_vertical_lines"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `x` coordinate of a line the full height of the page — or a dictionary describing the line, with at least the following keys: `x`, `top`, `bottom`. |
|`"explicit_horizontal_lines"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `y` coordinate of a line the full height of the page — or a dictionary describing the line, with at least the following keys: `top`, `x0`, `x1`.|
|`"vertical_edges"`| A list of vertical edges/lines that explicitly demarcate cells in the table. Items in the list should be either numbers (indicating the `x` coordinate of a line the full height of the page) or a dictionary describing the line, with at least the following keys: `x`, `top`, `bottom`. |
|`"horizontal_edges"`| A list of horizontal edges/lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `y` coordinate of a line the full height of the page — or a dictionary describing the line, with at least the following keys: `top`, `x0`, `x1`.|
|`"snap_tolerance"`| Parallel lines within `snap_tolerance` pixels will be "snapped" to the same horizontal or vertical position.|
|`"join_tolerance"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be "joined" into a single line segment.|
|`"edge_min_length"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|
|`"min_words_vertical"`| When using `"vertical_strategy": "text"`, at least `min_words_vertical` words must share the same alignment.|
|`"min_words_horizontal"`| When using `"horizontal_strategy": "text"`, at least `min_words_horizontal` words must share the same alignment.|
|`"keep_blank_chars"`| When using the `text` strategy, consider `" "` chars to be *parts* of words and not word-separators.|
|`"text_tolerance"`, `"text_x_tolerance"`, `"text_y_tolerance"`| When the `text` strategy searches for words, it will expect the individual letters in each word to be no more than `text_tolerance` pixels apart.|
|`"intersection_tolerance"`, `"intersection_x_tolerance"`, `"intersection_y_tolerance"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` pixels to be considered intersecting.|

### Table-extraction strategies

Both `vertical_strategy` and `horizontal_strategy` accept the following options:

| Strategy | Description |
|----------|-------------|
| `"lines"` | Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells. |
| `"lines_strict"` | Use the page's graphical lines — but *not* the sides of rectangle objects — as the borders of potential table-cells. |
| `"text"` | For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words. |
| `"explicit"` | Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`. |
|`"text_kwargs"`| Arguments to be passed to `utils.extract_text` inside each table cell.|
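
The `join_tolerance` row describes merging collinear segments whose ends are close together. A rough sketch of that idea, reduced to one dimension: the function name `join_segments` and the `(start, end)` tuple representation are assumptions made for illustration, and the library's own joining code may differ.

```python
def join_segments(segments, join_tolerance=3):
    """Merge (start, end) intervals lying on one line when the gap is small.

    Hypothetical helper; segments are positions along a single shared line.
    """
    joined = []
    for start, end in sorted(segments):
        if joined and start - joined[-1][1] <= join_tolerance:
            # Close enough to the previous segment: extend it.
            joined[-1] = (joined[-1][0], max(joined[-1][1], end))
        else:
            joined.append((start, end))
    return joined


print(join_segments([(0, 10), (12, 20), (30, 40)]))  # [(0, 20), (30, 40)]
```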

### Notes

- Often it's helpful to crop a page — `Page.crop(bounding_box)` — before trying to extract the table.
- Often it's helpful to crop a page — `Page.crop(bounding_box)` — before trying to extract the table.

- Using `Page.find_text_edges(...)` often works well for extracting tables where the text is well-aligned but missing explicit boundary lines.

- Table extraction for `pdfplumber` was radically redesigned for `v0.5.0`, and introduced breaking changes.
- Table extraction for `pdfplumber` was radically redesigned for `v0.6.0`, and introduced breaking changes.


## Demonstrations
@@ -323,7 +305,7 @@ Many thanks to the following users who've contributed ideas, features, and fixes
- [Jacob Fenton](https://github.com/jsfenfen)
- [Dan Nguyen](https://github.com/dannguyen)
- [Jeff Barrera](https://github.com/jeffbarrera)
- [Bob Lannon](https://github.com/boblannon-picwell)
- [Bob Lannon](https://github.com/boblannon)

## Feedback

40 changes: 30 additions & 10 deletions examples/notebooks/ag-energy-roundup-curves.ipynb

Large diffs are not rendered by default.

61 changes: 45 additions & 16 deletions examples/notebooks/extract-table-ca-warn-report.ipynb

Large diffs are not rendered by default.

69 changes: 47 additions & 22 deletions examples/notebooks/extract-table-nics.ipynb

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions examples/notebooks/san-jose-pd-firearm-report.ipynb

Large diffs are not rendered by default.

659 changes: 659 additions & 0 deletions examples/notebooks/stanislaus-county-jail-logs.ipynb

Large diffs are not rendered by default.

Binary file added examples/pdfs/pdfill-drawings.pdf
Binary file not shown.
Binary file added examples/pdfs/stanislaus-jail-log-2016-05-27.pdf
Binary file not shown.
11 changes: 7 additions & 4 deletions pdfplumber/__init__.py
@@ -1,16 +1,19 @@
from pdfplumber.pdf import PDF
import pdfplumber.utils
import pdfminer
import pdfminer.pdftypes
from ._version import __version__
from . import utils
from . import edge_finders
from .pdf import PDF

import pdfminer
import pdfminer.pdftypes
import pdfminer.pdfinterp
pdfminer.pdftypes.STRICT = False
pdfminer.pdfinterp.STRICT = False

def load(file_or_buffer, **kwargs):
    return PDF(file_or_buffer, **kwargs)

open = PDF.open

# Old idiom
from_path = PDF.open

2 changes: 1 addition & 1 deletion pdfplumber/_version.py
@@ -1,2 +1,2 @@
version_info = (0, 5, 7)
version_info = (0, 6, 0, "alpha")
__version__ = '.'.join(map(str, version_info))
2 changes: 1 addition & 1 deletion pdfplumber/cli.py
@@ -34,7 +34,7 @@ def parse_args():
    parser.add_argument("--encoding",
        default="utf-8")

    TYPE_DEFAULTS = [ "char", "anno", "line", "curve", "rect" ]
    TYPE_DEFAULTS = [ "char", "line", "rect", "curve" ]
    parser.add_argument("--types", nargs="+",
        choices=TYPE_DEFAULTS + [ "rect_edge" ],
        default=TYPE_DEFAULTS)
35 changes: 20 additions & 15 deletions pdfplumber/container.py
@@ -2,41 +2,46 @@
from pdfplumber import utils

class Container(object):
    cached_properties = [ "_rect_edges", "_edges", "_objects" ]
    cached_properties = [ "_rect_edges", "_edges", "_objects", "_objects_dict" ]

    def flush_cache(self, properties=None):
        props = self.cached_properties if properties == None else properties
        props = self.cached_properties if properties is None else properties
        for p in props:
            if hasattr(self, p):
                delattr(self, p)

    @property
    def objects_dict(self):
        if hasattr(self, "_objects_dict"): return self._objects_dict
        od = {}
        for o in self.objects:
            kind = o["object_type"]
            if kind in od:
                od[kind].append(o)
            else:
                od[kind] = [ o ]
        self._objects_dict = od
        return self._objects_dict

    @property
    def rects(self):
        return self.objects.get("rect", [])
        return self.objects_dict.get("rect", [])

    @property
    def lines(self):
        return self.objects.get("line", [])
        return self.objects_dict.get("line", [])

    @property
    def curves(self):
        return self.objects.get("curve", [])
        return self.objects_dict.get("curve", [])

    @property
    def images(self):
        return self.objects.get("image", [])

    @property
    def figures(self):
        return self.objects.get("figure", [])
        return self.objects_dict.get("image", [])

    @property
    def chars(self):
        return self.objects.get("char", [])

    @property
    def annos(self):
        return self.objects.get("anno", [])
        return self.objects_dict.get("char", [])

    @property
    def rect_edges(self):
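
Review note on `objects_dict`: the grouping it performs can be sketched in isolation. The helper name `group_by_object_type` and the sample dicts are hypothetical and not part of this commit, and `collections.defaultdict` is swapped in for the explicit `if kind in od` check, so treat this as an equivalent sketch rather than the committed code.

```python
from collections import defaultdict


def group_by_object_type(objects):
    """Bucket a flat list of object dicts by their "object_type" value."""
    grouped = defaultdict(list)
    for obj in objects:
        grouped[obj["object_type"]].append(obj)
    # Return a plain dict so missing keys do not get created on lookup.
    return dict(grouped)


objects = [
    {"object_type": "char", "text": "a"},
    {"object_type": "rect", "x0": 0},
    {"object_type": "char", "text": "b"},
]
grouped = group_by_object_type(objects)
print(len(grouped.get("char", [])))  # 2
print(grouped.get("image", []))      # []
```

Returning a plain `dict` keeps the `.get(kind, [])` lookups used by the accessor properties free of side effects.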