Incorrect row number when extracting tables #924
tujinshu started this conversation in Ask for help with specific PDFs
Replies: 1 comment
-
Hi @tujinshu, and thanks for your interest in this library. The situation you're encountering is fairly common, and typically is caused by "invisible" graphics (rectangles, lines, etc.) on the page. In this particular example, it's a series of rectangles:
    im.reset().draw_rects(p0.rects)
In this particular situation, these rectangles can be identified at least a couple of ways:
To demonstrate:
    im.reset().draw_rects([r for r in p0.rects if (r["width"] > 1) and (r["height"] > 1)])
To ignore these invisible rectangles when extracting the tables, you'll want to use the
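As a rough sketch of how rectangles like these could be left out of table extraction: the example below assumes pdfplumber's `Page.filter(...)` method, which keeps only the objects for which a test function returns true. The helper names `is_visible_rect` and `extract_table_without_thin_rects` are hypothetical, and the `min_size` threshold of 1 is simply borrowed from the width/height check above, so it may need tuning for other PDFs.

```python
def is_visible_rect(obj, min_size=1):
    """Test function for Page.filter (hypothetical helper).

    Non-rect objects (chars, lines, ...) are always kept. A rect is kept
    only if it is both wider and taller than min_size, mirroring the
    width/height check used in the draw_rects demonstration.
    """
    if obj.get("object_type") != "rect":
        return True
    return obj["width"] > min_size and obj["height"] > min_size


def extract_table_without_thin_rects(pdf_path, page_index):
    """Extract a table from one page after dropping the thin rects.

    pdf_path and page_index are supplied by the caller; nothing here is
    specific to the PDF discussed in this thread.
    """
    import pdfplumber  # imported here so the filter above works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # Page.filter returns a version of the page containing only the
        # objects for which the test function returns True.
        filtered_page = page.filter(is_visible_rect)
        return filtered_page.extract_table()
```

For the file in this thread, a call might look like `extract_table_without_thin_rects("wps.pdf", 44)`, assuming the page labelled 45 sits at zero-based index 44.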
-
Describe the bug
A clear and concise description of what the bug is.
The row count is incorrect in the area marked with the green circle.
actual data (page 45)
Code to reproduce the problem
Paste it here, or attach a Python file.
PDF file
original PDF
wps.pdf
Expected behavior
What did you expect the result should have been?
Only one row should be extracted.
Actual behavior
What actually happened, instead?
The row is split into 5 rows.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment
Additional context
Add any other context/notes about the problem here.