pdfplumber failing on a particular PDF #565
Hi @adicognext, and thanks for your interest in this library. To investigate why a table might not be found, you can always use:

Inspecting that, you can see that there are no vertical lines appearing and that, perhaps, you meant this instead, using:

That does find a table, but chops it up a bit finely, because of how many different alignments the text has, even within what presumably you'd call a single column. Instead, we can try to use the gaps between the dotted lines that seem to separate each column:

Because there's some nuance to how those lines are specified in the PDF, it took a little bit of trial and error, but this seems to get fairly close to what I imagine you want:

    import pdfplumber

    # Assumes `page` is the relevant pdfplumber page and
    # `im` is the image returned by `page.to_image()`.

    # Keep only the thin lines in the upper part of the page.
    thin_lines = [
        line for line in page.lines
        if line["linewidth"] == 0.5 and line["top"] < 300
    ]

    # This merges overlapping line segments, due to a
    # quirk in the PDF's design.
    thin_lines_joined = pdfplumber.table.join_edge_group(
        thin_lines,
        "h",
        tolerance=0,
    )

    im.reset().debug_tablefinder({
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": (
            # Using both the left- and right-hand edges of each line
            [x["x0"] for x in thin_lines_joined] +
            [x["x1"] for x in thin_lines_joined]
        ),
        "horizontal_strategy": "lines",
        "join_tolerance": 50,
        "snap_tolerance": 5,
    })

If you want to get rid of the errant cells at the top and bottom, you could use …
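To make the merge-and-collect step in the snippet above easier to follow, here is a minimal pure-Python sketch that runs on made-up line dicts rather than the actual PDF. It does not call pdfplumber: `merge_x_ranges` and `explicit_vertical_settings` are hypothetical helpers written for illustration, and `merge_x_ranges` only approximates what `join_edge_group` does with `tolerance=0`. The sample coordinates are invented, not taken from the PDF in question.

```python
def merge_x_ranges(lines, tolerance=0):
    """Merge segments whose x-ranges overlap, ignoring vertical position.

    Hypothetical stand-in for pdfplumber's edge joining; returns (x0, x1) tuples.
    """
    merged = []
    for line in sorted(lines, key=lambda l: l["x0"]):
        if merged and line["x0"] <= merged[-1][1] + tolerance:
            # Overlaps (or touches) the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], line["x1"])
        else:
            merged.append([line["x0"], line["x1"]])
    return [(x0, x1) for x0, x1 in merged]


def explicit_vertical_settings(lines, linewidth=0.5, max_top=300):
    """Build table settings using both ends of each merged thin line."""
    thin = [
        l for l in lines
        if l["linewidth"] == linewidth and l["top"] < max_top
    ]
    ranges = merge_x_ranges(thin)
    xs = sorted({x for r in ranges for x in r})
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": xs,
        "horizontal_strategy": "lines",
        "join_tolerance": 50,
        "snap_tolerance": 5,
    }


# Invented segments, shaped like entries of `page.lines`.
sample_lines = [
    {"x0": 10, "x1": 60, "top": 100, "linewidth": 0.5},
    {"x0": 50, "x1": 90, "top": 100, "linewidth": 0.5},    # overlaps the first
    {"x0": 100, "x1": 180, "top": 120, "linewidth": 0.5},
    {"x0": 0, "x1": 500, "top": 100, "linewidth": 1.0},    # too thick, skipped
    {"x0": 200, "x1": 260, "top": 400, "linewidth": 0.5},  # below cutoff, skipped
]

settings = explicit_vertical_settings(sample_lines)
print(settings["explicit_vertical_lines"])  # [10, 90, 100, 180]
```

With a real page, a dict like this could presumably be built from `page.lines` and passed to `im.debug_tablefinder(settings)` or `page.extract_table(settings)`. The `0.5` line width and `300` cutoff are specific to this PDF and would need adjusting for others.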
The bug
Hey, I wanted to extract tables from a PDF using pdfplumber. I tried the default settings and multiple join tolerances, but I only got empty tables. If anyone could suggest some settings, it would be very helpful.
Code to reproduce the problem
Paste it here, or attach a Python file.
PDF file
Concerned PDF
Navi 2019-20.pdf
Actual behavior
No tables were detected
Screenshots
Concerned tables
Environment