Need help to ignore extra cells #489
heylouiz
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment 2 replies
-
Hi @heylouiz, I appreciate your interest in the library. I don't think the table extraction settings provided by pdfplumber can give you the desired result on their own; you would still need to do some preprocessing. One approach I came up with:
import pdfplumber
pdf = pdfplumber.open("file.pdf")
p = pdf.pages[2] # Get the 3rd page.
# Get the minimum value for x0. This is needed because the horizontal lines in the grey background
# rows are not contiguous but instead made up of 3 lines. We get those lines by first getting those
# horizontal lines that are on the leftmost side.
min_x0 = min(edge["x0"] for edge in p.edges if (edge["orientation"] == "h" and not edge["stroke"]))
# Then, for all of those, we store the doctops. The 3 horizontal lines that together make up one long
# horizontal line all share the same doctop.
horizontal_line_doctops = [edge["doctop"] for edge in p.edges if (edge["x0"] == min_x0 and edge["orientation"] == "h" and not edge["stroke"])]
# Get the horizontal lines of the grey background rows by matching against the doctops.
horizontal_lines = [edge for edge in p.edges if (edge["doctop"] in horizontal_line_doctops and edge["orientation"] == "h" and not edge["stroke"])]
ts = {
"vertical_strategy": "lines",
"horizontal_strategy": "explicit",
"snap_tolerance": 10,
"explicit_horizontal_lines": horizontal_lines,
"join_tolerance": 30,
}
# For table extraction.
tables = p.extract_tables(table_settings=ts)
# For visual debugging.
im = p.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
Of course, there might be other more clever solutions available, but I hope this gives you a direction.
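To make the line-selection step easier to test in isolation, here is a minimal sketch that wraps the same filtering logic in a function operating on plain dictionaries. The helper name `select_background_lines` and the sample edges are hypothetical and only mimic the keys used above (`x0`, `doctop`, `orientation`, `stroke`); with a real page you would pass `p.edges` instead. Like the snippet above, it compares `x0` values for exact equality, which assumes the leftmost segments start at exactly the same coordinate.

```python
def select_background_lines(edges):
    """Return the non-stroked horizontal edges that share a doctop
    with the leftmost non-stroked horizontal edge(s)."""
    # Keep only horizontal edges without a stroke (the grey-row borders).
    candidates = [
        e for e in edges
        if e["orientation"] == "h" and not e["stroke"]
    ]
    if not candidates:
        return []
    # Leftmost starting position among the candidates.
    min_x0 = min(e["x0"] for e in candidates)
    # Doctops of the segments that start at the left margin.
    doctops = {e["doctop"] for e in candidates if e["x0"] == min_x0}
    # All segments on those doctops, i.e. the full (segmented) row borders.
    return [e for e in candidates if e["doctop"] in doctops]


# Hypothetical sample data standing in for p.edges.
sample_edges = [
    {"x0": 30.0, "x1": 200.0, "doctop": 100.0, "orientation": "h", "stroke": False},
    {"x0": 200.0, "x1": 400.0, "doctop": 100.0, "orientation": "h", "stroke": False},
    {"x0": 400.0, "x1": 560.0, "doctop": 100.0, "orientation": "h", "stroke": False},
    # Inner line that does not reach the left margin: should be dropped.
    {"x0": 200.0, "x1": 400.0, "doctop": 110.0, "orientation": "h", "stroke": False},
    # Stroked line: ignored.
    {"x0": 30.0, "x1": 560.0, "doctop": 130.0, "orientation": "h", "stroke": True},
    # Vertical edge: ignored.
    {"x0": 30.0, "x1": 30.0, "doctop": 100.0, "orientation": "v", "stroke": False},
]

selected = select_background_lines(sample_edges)
```

The returned list can be passed as `explicit_horizontal_lines` in the table settings, in the same way as `horizontal_lines` above.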
-
Hi, first of all, I would like to thank the developers for this project; it has helped me a lot!
I am having trouble extracting a table from this PDF.
The table in question is on the third page, under the title Stocks, Rights, ETNs, ETCs, Warrants, & ADRs.
I am using the method debug_tablefinder in order to find the best way to extract this table, without relying on using fixed lines.
Here is what I got so far:
im.debug_tablefinder({"snap_tolerance": 10, "join_tolerance": 30})
The debug result:
https://i.imgur.com/sC8fv3l.png
You can see that in the row with Austria, Belgium, Denmark, Finland, France1,... the middle cell was identified as 3 cells, which produces some weird rows when using extract_tables:
['Austria, Belgium, Denmark, Finland, France1, \nItaly1, Ireland, The Netherlands, Norway, \nPortugal, Spain, Sweden, Switzerland', '', '€60.00'], [None, '€ 4.00 + 0.05%', None], [None, '', None],
I was expecting this to be
['Austria..', '€ 4.00 + 0.05%', '€60.00']
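For illustration only, the expected row could in principle be recovered by post-processing the output of extract_tables. The sketch below is not a pdfplumber feature: `merge_split_rows` is a hypothetical helper, and it assumes that a continuation row can be recognised by `None` in its first cell, which holds for the output above but may not hold for other tables.

```python
def merge_split_rows(rows):
    """Fold rows whose first cell is None into the preceding row,
    filling only the cells that are still empty there."""
    merged = []
    for row in rows:
        if merged and row and row[0] is None:
            prev = merged[-1]
            for i, cell in enumerate(row):
                # Copy a non-empty cell into an empty slot of the previous row.
                if cell and i < len(prev) and not prev[i]:
                    prev[i] = cell
        else:
            # Start a new logical row (copy so the input is not mutated).
            merged.append(list(row))
    return merged


rows = [
    ["Austria, Belgium, Denmark, Finland, France1, \nItaly1, Ireland, "
     "The Netherlands, Norway, \nPortugal, Spain, Sweden, Switzerland",
     "", "€60.00"],
    [None, "€ 4.00 + 0.05%", None],
    [None, "", None],
]

fixed = merge_split_rows(rows)
```

This only patches up the symptom after extraction; it does not stop the table finder from splitting the cell in the first place.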
Any tips on how I should improve this? I was also thinking about including the first row of the table (EXCHANGE, FEE, MAXIMUM) if possible.
Thanks in advance!