Pdfplumber Failing on a particular PDF. #565

adicognext · 2021-12-14T11:11:27Z

The bug

Hey, I wanted to extract tables from a PDF using pdfplumber. I tried the default settings and several join tolerances, but I only got empty tables. If anyone could suggest some settings, it would be very helpful.

Code to reproduce the problem

Paste it here, or attach a Python file.

PDF file

Concerned PDF
Navi 2019-20.pdf

Actual behavior

No tables were detected

Screenshots

Concerned tables

Environment

pdfplumber version: 0.5.28
Python version: 3.8.5
OS: Linux

jsvine · 2021-12-14T15:07:25Z

Maintainer

Hi @adicognext, and thanks for your interest in this library. To investigate why a table might not be found, you can always use page.to_image().debug_tablefinder(...). For instance, with your settings:

Inspecting that, you can see that there are no vertical lines appearing and that, perhaps, you meant this instead, using text for the vertical setting and the lines for horizontal setting, rather than vice versa:
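A minimal sketch of that swap, assuming `page` is a pdfplumber page that has already been opened from the PDF. The helper names `swapped_strategy_settings` and `debug_image` are hypothetical and only used here for illustration. The original settings are not shown in this thread, so only the corrected combination is sketched.

```python
def swapped_strategy_settings():
    """Table settings that take vertical edges from text alignment
    and horizontal edges from the lines drawn in the PDF."""
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
    }


def debug_image(page, table_settings):
    """Render the page and overlay what the table finder detects
    for the given settings (edges, intersections, cells)."""
    im = page.to_image()
    return im.debug_tablefinder(table_settings)
```

Calling `debug_image(page, swapped_strategy_settings())` should return an image object that can be displayed in a notebook or saved, which makes it easier to see which edges the table finder is actually picking up.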

That does find a table, but chops it up a bit finely, because of how many different alignments the text has, even within what presumably you'd call a single column. Instead, we can try to use the gaps between the dotted lines that seem to separate each column:

Because there's some nuance to how those lines are specified in the PDF, it took a little bit of trial and error, but this seems to get fairly close to what I imagine you want:

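import pdfplumber

# Assumption: the table is on the first page of the attached PDF;
# adjust the page index if it is elsewhere.
pdf = pdfplumber.open("Navi 2019-20.pdf")
page = pdf.pages[0]
im = page.to_image()

# Keep only the thin (0.5-width) lines in the upper part of the page,
# i.e. the dotted lines that separate the columns.
thin_lines = [
    line for line in page.lines
    if line["linewidth"] == 0.5 and line["top"] < 300
]

# This merges overlapping line segments, due to a
# quirk in the PDF's design.
thin_lines_joined = pdfplumber.table.join_edge_group(
    thin_lines,
    "h",
    tolerance=0,
)

im.reset().debug_tablefinder({
    "vertical_strategy": "explicit",
    "explicit_vertical_lines": (
        # Using both the left- and right-hand edges of each line
        [ x["x0"] for x in thin_lines_joined ] +
        [ x["x1"] for x in thin_lines_joined ]
    ),
    "horizontal_strategy": "lines",
    "join_tolerance": 50,
    "snap_tolerance": 5,
})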
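To make the merging step more concrete, here is a simplified, pure-Python illustration of the idea: overlapping or touching horizontal segments are joined, and the left and right ends of the joined segments become candidate column boundaries. This is not pdfplumber's implementation, only a sketch of the concept; the function names and the sample coordinates are made up for the example.

```python
def merge_overlapping_segments(segments, tolerance=0):
    """Merge (x0, x1) intervals that overlap, touch, or lie within
    `tolerance` of each other."""
    merged = []
    for x0, x1 in sorted(segments):
        if merged and x0 <= merged[-1][1] + tolerance:
            # Extend the previous segment instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])
    return [tuple(seg) for seg in merged]


def segment_endpoints(segments):
    """Sorted, de-duplicated x positions of both ends of each segment."""
    return sorted({x for seg in segments for x in seg})


# Hypothetical segments: the first two overlap, the next two touch.
segments = [(10, 40), (35, 60), (70, 100), (100, 120), (130, 150)]
merged = merge_overlapping_segments(segments)
print(merged)                     # [(10, 60), (70, 120), (130, 150)]
print(segment_endpoints(merged))  # [10, 60, 70, 120, 130, 150]
```

The gaps between the merged segments (60 to 70 and 120 to 130 in this made-up data) correspond to the gaps between the dotted lines, which is why both the `x0` and `x1` of each joined line are passed as explicit vertical lines.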

If you want to get rid of the errant cells at the top and bottom, you could use cropped = page.crop(...) first.
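The cropping step could be sketched roughly as follows, assuming `page` is a pdfplumber page and `table_settings` is the settings dictionary used above. The helper names and the `top` and `bottom` arguments are hypothetical; the right values depend on where the table sits on the page and would need to be found by inspection.

```python
def crop_box_between(page, top, bottom):
    """Full-width bounding box (x0, top, x1, bottom) covering only the
    vertical range between `top` and `bottom`."""
    x0, _, x1, _ = page.bbox
    return (x0, top, x1, bottom)


def extract_cropped_table(page, top, bottom, table_settings):
    """Crop the page to the given vertical range, then extract the table
    from the cropped region with the supplied settings."""
    cropped = page.crop(crop_box_between(page, top, bottom))
    return cropped.extract_table(table_settings)
```

Cropping before extraction keeps stray lines and text outside the table from producing extra cells, so the settings only have to describe the table itself.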

1 reply

adicognext Dec 15, 2021
Author

Yes, my bad. Thank you so much for your reply. It helped a lot.
