Extract SAP PDF Report which consists of vertical and horizontal lines #656
zbjit
started this conversation in
Ask for help with specific PDFs
-
This is a very interesting PDF, @zbjit! Rather than using graphical lines, it seems to use Unicode box-drawing characters. My suggestion would be to filter those out, with something like:
box_chars = list("─┬┌┴") # This list is incomplete, you'll have to add the others
filtered = my_page.filter(lambda obj: not (obj.get("text") in box_chars))
table = filtered.extract_text(...)
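A slightly fuller sketch of the same idea, under a few assumptions: instead of listing the characters by hand, it treats the whole Unicode "Box Drawing" block (U+2500 to U+257F) as line characters, and it wraps the filter in a hypothetical helper, `extract_without_box_chars`, that calls `extract_table` rather than `extract_text`. The names `BOX_CHARS`, `keep_object`, and the helper are illustrative and not part of pdfplumber. Whether the report uses only characters from that block has not been checked here.

```python
# Unicode "Box Drawing" block: U+2500 through U+257F (128 code points).
# Assumption: the report draws its grid only with characters from this block.
BOX_CHARS = frozenset(chr(cp) for cp in range(0x2500, 0x2580))


def keep_object(obj):
    """Return True for objects to keep: everything except box-drawing chars.

    Objects without a "text" key (lines, rects, ...) yield None and are kept.
    """
    return obj.get("text") not in BOX_CHARS


def extract_without_box_chars(pdf_path, page_index=0, table_settings=None):
    """Hypothetical helper: filter out box-drawing chars, then extract a table."""
    import pdfplumber  # imported here so the predicate above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        filtered = page.filter(keep_object)
        return filtered.extract_table(table_settings or {})
```

Something like `extract_without_box_chars("SAP_Report.pdf", 0, table_settings)` would then apply the settings from the original post to the filtered page.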
-
Hello all,
I have an SAP PDF report which consists of vertical and horizontal lines, not a real table. I defined explicit_vertical_lines for the vertical_strategy and used "text" as the horizontal_strategy to extract the report as a table, but all the horizontal lines were included in the extracted result.
The table_settings are:
table_settings = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "text",
    "explicit_vertical_lines": [32, 76, 132, 384, 428, 448, 488, 532, 552, 596, 644, 684, 728, 768, 812, 835],
}
The result of debug_tablefinder of page 1:
The output of extract_table:
As you can see, the horizontal lines were recognized as rows and included in the output. Do you have any ideas on how to skip these lines? Thank you.
The PDF and notebook are attached below.
SAP_Report.pdf
PDF_Table_Debug_5.ipynb.txt
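For rows like the ones described in this post, another option is to clean up the extracted table afterwards. The following is a minimal sketch, assuming the unwanted rows contain nothing but Unicode box-drawing characters (U+2500 to U+257F) and whitespace; `is_rule_row` and `drop_rule_rows` are hypothetical helper names, and note that fully empty rows are dropped as well.

```python
# Unicode "Box Drawing" block: U+2500 through U+257F.
BOX_CHARS = frozenset(chr(cp) for cp in range(0x2500, 0x2580))


def is_rule_row(row):
    """True if every cell is empty or holds only box-drawing chars/whitespace."""
    for cell in row:
        text = cell or ""  # extract_table can return None for empty cells
        if any(ch not in BOX_CHARS and not ch.isspace() for ch in text):
            return False
    return True


def drop_rule_rows(table):
    """Return the table without rows that are purely horizontal rules."""
    return [row for row in table if not is_rule_row(row)]


# Made-up rows standing in for the output of extract_table:
sample = [["┌──", "──┐"], ["A", "1"], ["", None], ["│ B", "2"]]
cleaned = drop_rule_rows(sample)
```

Cells that mix text with a stray vertical bar, such as `"│ B"` above, are kept as they are; stripping those characters would be a separate step.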