Update form-parsing example in README (#958)
Thanks to PR from @jeremybmerrill, who writes:

> This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect assumption that field names are unique, and includes the alternate field name in the output (which is often a very useful guide to what's in a field).

> The prior form-parsing example used the field name (the `T` entry) as the key in the `form_data` dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like `1` or `0`. The prior example also entirely ignored the alternate field name (the `TU` entry).
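
The collision described above can be reproduced without a PDF. The following is a minimal sketch that uses plain dictionaries as a hypothetical stand-in for resolved field objects; the field tree, `leaf_fields`, and both lookup tables are illustrative assumptions, not part of `pdfplumber` or `pdfminer`. It shows how keying on the partial name (`T`) alone drops a value when two nested fields share a name such as `0`.

```python
# Hypothetical stand-in for a resolved AcroForm field tree: plain dicts,
# not real pdfminer objects. Two parents each have a child named "0".
fields = [
    {"T": "STATE", "Kids": [{"T": "0", "V": "CA"}]},
    {"T": "CITY", "Kids": [{"T": "0", "V": "SAN FRANCISCO"}]},
]


def leaf_fields(field, prefix=None):
    """Yield (partial_name, full_name, value) for every field that has a value."""
    name = field["T"]
    full_name = f"{prefix}.{name}" if prefix else name
    for kid in field.get("Kids", []):
        yield from leaf_fields(kid, prefix=full_name)
    if "V" in field:
        yield name, full_name, field["V"]


leaves = [leaf for top in fields for leaf in leaf_fields(top)]

# Keyed on the partial name alone, the second "0" overwrites the first.
by_partial_name = {partial: value for partial, _, value in leaves}

# Keyed on the dotted full name, both values survive in this example.
by_full_name = {full: value for _, full, value in leaves}

print(by_partial_name)
print(by_full_name)
```

Collecting rows in a list, as the updated README example does, avoids relying on any name being unique.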
jeremybmerrill authored Aug 12, 2023
1 parent 7c2d46b commit 4149831
Showing 1 changed file with 35 additions and 8 deletions.
README.md
@@ -426,21 +426,48 @@ Sometimes PDF files can contain forms that include inputs that people can fill o

 `pdfplumber` doesn't have an interface for working with form data, but you can access it using `pdfplumber`'s wrappers around `pdfminer`.

-For example, this snippet will retrieve form field names and values and store them in a dictionary. You may have to modify this script to handle cases like nested fields (see page 676 of the specification).
+For example, this snippet will retrieve form field names and values and store them in a dictionary.

 ```python
-pdf = pdfplumber.open("document_with_form.pdf")
-
-fields = pdf.doc.catalog["AcroForm"].resolve()["Fields"]
+import pdfplumber
+from pdfplumber.utils.pdfinternals import resolve_and_decode, resolve

-form_data = {}
+pdf = pdfplumber.open("document_with_form.pdf")
+
+def parse_field_helper(form_data, field, prefix=None):
+    """ appends any PDF AcroForm field/value pairs in `field` to provided `form_data` list
+    if `field` has child fields, those will be parsed recursively.
+    """
+    resolved_field = field.resolve()
+    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get("T"))]))
+    if "Kids" in resolved_field:
+        for kid_field in resolved_field["Kids"]:
+            parse_field_helper(form_data, kid_field, prefix=field_name)
+    if "T" in resolved_field or "TU" in resolved_field:
+        # "T" is a field-name, but it's sometimes absent.
+        # "TU" is the "alternate field name" and is often more human-readable
+        # your PDF may have one, the other, or both.
+        alternate_field_name = resolve_and_decode(resolved_field.get("TU")) if resolved_field.get("TU") else None
+        field_value = resolve_and_decode(resolved_field["V"]) if 'V' in resolved_field else None
+        form_data.append([field_name, alternate_field_name, field_value])
+
+
+form_data = []
+fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]
 for field in fields:
-    field_name = field.resolve()["T"]
-    field_value = field.resolve()["V"]
-    form_data[field_name] = field_value
+    parse_field_helper(form_data, field)
 ```

+Once you run this script, `form_data` is a list containing a three-element tuple for each form element. For instance, a PDF form with a city and state field might look like this.
+```
+[
+ ['STATE.0', 'enter STATE', 'CA'],
+ ['section 2 accident infoRmation.1.0',
+  'enter city of accident',
+  'SAN FRANCISCO']
+]
+```
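
Because `form_data` is a list of rows rather than a dictionary, looking up a value means scanning the rows. The following is a minimal sketch that uses the sample output above as literal data, so it runs without a PDF; `find_by_alternate_name` is a hypothetical helper written for this illustration, not part of `pdfplumber`.

```python
# Sample rows copied from the example output: [field name, alternate name, value].
form_data = [
    ["STATE.0", "enter STATE", "CA"],
    ["section 2 accident infoRmation.1.0", "enter city of accident", "SAN FRANCISCO"],
]


def find_by_alternate_name(rows, text):
    """Return every row whose alternate field name contains `text` (case-insensitive).

    Rows with no alternate name (None) are skipped, since the "TU" entry is optional.
    """
    needle = text.lower()
    return [row for row in rows if row[1] is not None and needle in row[1].lower()]


city_rows = find_by_alternate_name(form_data, "city")
for field_name, alternate_name, value in city_rows:
    print(f"{field_name!r} ({alternate_name}): {value}")
```

Returning a list of matches, rather than a single row, keeps the helper safe when several fields share similar alternate names.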

## Demonstrations
