Incorrect row number when extract tables #924

tujinshu · 2023-07-04T01:51:06Z


Describe the bug

The row count is incorrect for the table area marked with the green circle.

[Screenshot: 企业微信截图_b25fcb8e-2910-4580-b734-7eded7cfeb8c]

Actual data (page 45)

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open("./wps.pdf")
p0 = pdf.pages[44]  # zero-based index, i.e. page 45 of the PDF
im = p0.to_image()
im.debug_tablefinder()  # overlay the lines and cells the table finder detects

PDF file

Original PDF: wps.pdf

Expected behavior

Only one row should be extracted.

Actual behavior

The table is split into 5 rows.

Environment

pdfplumber version: 0.9.0
Python version: 3.8.0
OS: Linux

jsvine · 2023-07-04T03:08:27Z

Maintainer

Hi @tujinshu, and thanks for your interest in this library. The situation you're encountering is fairly common, and is typically caused by "invisible" graphics (rectangles, lines, etc.) on the page. In this particular example, it's a series of rectangles:

im.reset().draw_rects(p0.rects)

In this particular situation, these rectangles can be identified in at least a couple of ways:

Their non_stroking_color is 1 instead of 0
Their height and width are both greater than 1. (The other rects are very narrow, making them look more like lines.)

To demonstrate:

im.reset().draw_rects([ r for r in p0.rects if (r["width"] > 1) and (r["height"] > 1)])
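To check how the rects on a page divide along those two criteria before filtering anything, a small tally can help. This is only a sketch: `tally_rects` and its `min_size` parameter are hypothetical names, not part of pdfplumber, and it assumes each rect is a dict with `width`, `height` and `non_stroking_color` keys, as in `p0.rects`.

```python
from collections import Counter


def tally_rects(rects, min_size=1):
    """Count rects by (non_stroking_color, is_large).

    A rect counts as "large" when both its width and height exceed
    min_size, which is the size test described above.
    """
    counts = Counter()
    for r in rects:
        color = r.get("non_stroking_color")
        # Colors may be given as lists; convert them so they can be dict keys.
        if isinstance(color, list):
            color = tuple(color)
        is_large = r["width"] > min_size and r["height"] > min_size
        counts[(color, is_large)] += 1
    return counts
```

Calling `tally_rects(p0.rects)` on the page above should show whether the large rects and the rects with a `non_stroking_color` of 1 are the same group.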

To ignore these invisible rectangles when extracting the tables, you'll want to use the page.filter(...) method. For examples of how this works, see the solutions to these discussions:
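A minimal sketch of that approach, using the size test from above: `page.filter(...)` takes a function that receives each object on the page and returns True for the objects to keep. The helper names (`is_invisible_rect`, `keep_object`, `extract_tables_ignoring_rects`) and the `min_size` threshold are illustrative assumptions, not part of pdfplumber.

```python
def is_invisible_rect(obj, min_size=1):
    """True for rects whose width and height both exceed min_size."""
    return (
        obj.get("object_type") == "rect"
        and obj["width"] > min_size
        and obj["height"] > min_size
    )


def keep_object(obj):
    """Filter test: keep everything except the suspected invisible rects."""
    return not is_invisible_rect(obj)


def extract_tables_ignoring_rects(path="./wps.pdf", page_index=44):
    # Imported here so the helpers above can be tried without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        # filter() returns a new page containing only the objects
        # for which keep_object returns True.
        filtered = page.filter(keep_object)
        return filtered.extract_tables()
```

The threshold of 1 comes from the observation about this particular PDF; for other files it may need adjusting, or the test could check `non_stroking_color` instead.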
