Need help to ignore extra cells #489

heylouiz · 2021-08-06T19:27:39Z

Hi, first of all, I would like to thank the developers for this project, it helped me a lot!

I am having trouble extracting a table from this PDF.
The table in question is on the third page, under the title Stocks, Rights, ETNs, ETCs, Warrants, & ADRs.

I am using the debug_tablefinder method to find the best way to extract this table without relying on fixed lines.

Here is what I got so far:
im.debug_tablefinder({"snap_tolerance": 10, "join_tolerance": 30})

The debug result:
https://i.imgur.com/sC8fv3l.png

You can see that in the row with Austria, Belgium, Denmark, Finland, France1,... the middle cell was identified as 3 cells, which produces some odd rows when using extract_tables:

['Austria, Belgium, Denmark, Finland, France1, \nItaly1, Ireland, The Netherlands, Norway, \nPortugal, Spain, Sweden, Switzerland', '', '€60.00'], [None, '€ 4.00 + 0.05%', None], [None, '', None],

I was expecting this to be ['Austria..', '€ 4.00 + 0.05%', '€60.00']

Any tips on how I should improve this? I was also thinking about including the first row of the table (EXCHANGE, FEE, MAXIMUM) if possible.

Thanks in advance!

samkit-jain · 2021-08-07T19:08:20Z

samkit-jain
Aug 7, 2021
Collaborator

Hi @heylouiz, I appreciate your interest in the library. I don't think there is a solution that gives you the desired result using only the table extraction settings provided by pdfplumber. You would still need to do some preprocessing. One approach I came up with is:

import pdfplumber

pdf = pdfplumber.open("file.pdf")

p = pdf.pages[2]  # Get the 3rd page.

# Get the minimum value of x0. This is needed because each horizontal line in the grey background
# rows is not contiguous but is made up of 3 separate segments. We find those lines by first
# locating the unstroked horizontal edges that start at the leftmost position.
min_x0 = min(edge["x0"] for edge in p.edges if (edge["orientation"] == "h" and not edge["stroke"]))

# Then, for all of those, we store the doctops. The 3 segments that together make up one long
# horizontal line all share the same doctop.
horizontal_line_doctops = [edge["doctop"] for edge in p.edges if (edge["x0"] == min_x0 and edge["orientation"] == "h" and not edge["stroke"])]

# Get the horizontal lines of the grey background rows by matching against the doctops.
horizontal_lines = [edge for edge in p.edges if (edge["doctop"] in horizontal_line_doctops and edge["orientation"] == "h" and not edge["stroke"])]

ts = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "explicit",
    "snap_tolerance": 10,
    "explicit_horizontal_lines": horizontal_lines,
    "join_tolerance": 30,
}

# For table extraction.
tables = p.extract_tables(table_settings=ts)

# For visual debugging.
im = p.to_image(resolution=200)
im.reset().debug_tablefinder(ts)

The result you get is

Of course, there might be other more clever solutions available but I hope this gives you a direction.
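One such alternative is to keep the default extraction and clean up the rows afterwards. The sketch below is not part of pdfplumber; `merge_continuation_rows` is a hypothetical helper, and it assumes that any row whose first cell is None is a fragment of the row above it and has the same number of cells, as in the output quoted in the question. That assumption may not hold for other PDFs.

```python
def merge_continuation_rows(rows):
    """Fold rows whose first cell is None into the preceding row.

    Assumption: a row starting with None is a fragment of the row above it
    and has the same number of cells.
    """
    merged = []
    for row in rows:
        if merged and row and row[0] is None:
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell:  # skip None and empty strings
                    # Fill an empty cell, or append to the text already there.
                    previous[i] = f"{previous[i]} {cell}" if previous[i] else cell
        else:
            merged.append(list(row))
    return merged


# Rows shaped like the extract_tables output quoted in the question (shortened).
rows = [
    ["Austria, Belgium", "", "€60.00"],
    [None, "€ 4.00 + 0.05%", None],
    [None, "", None],
]
print(merge_continuation_rows(rows))
```

This only repairs the symptom after extraction; the explicit-lines approach above addresses the cause, so the two can be combined if some fragments still slip through.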

heylouiz Aug 10, 2021
Author

Thank you @samkit-jain, this worked, and I now understand a little better how to use the explicit mode.

I've tried this with a similar PDF, from the same company but for another country, and unfortunately it failed.
I noticed that min_x0 cannot be used, because pdfplumber detected a line close to the left edge elsewhere on the page. I then hardcoded min_x0 to a fixed value and managed to highlight most of the lines.

I had the same problem as before, where there are invisible lines inside a cell.
This is the other PDF: https://www.degiro.co.uk/data/pdf/uk/UK_Feeschedule.pdf
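The fixed-x0 selection described above can be sketched as follows. This is a minimal illustration, not the code actually used in this thread: `horizontal_edges_at`, `target_x0`, and `tolerance` are hypothetical names, the sample coordinates are made up, and the edge dictionaries only mimic the keys that the earlier snippet reads from `p.edges` (`x0`, `doctop`, `orientation`, `stroke`).

```python
def horizontal_edges_at(edges, target_x0, tolerance=1.0):
    """Return unstroked horizontal edges that share a doctop with an edge
    starting within `tolerance` of `target_x0`."""

    def is_candidate(edge):
        return edge["orientation"] == "h" and not edge["stroke"]

    # Doctops of the lines whose leftmost segment starts near the chosen x0.
    doctops = {
        edge["doctop"]
        for edge in edges
        if is_candidate(edge) and abs(float(edge["x0"]) - target_x0) <= tolerance
    }
    # All segments (left, middle, right) lying on those doctops.
    return [edge for edge in edges if is_candidate(edge) and edge["doctop"] in doctops]


# Made-up sample edges standing in for p.edges.
sample_edges = [
    {"x0": 50.2, "doctop": 100.0, "orientation": "h", "stroke": False},
    {"x0": 250.0, "doctop": 100.0, "orientation": "h", "stroke": False},
    {"x0": 10.0, "doctop": 400.0, "orientation": "h", "stroke": False},  # stray line
    {"x0": 50.0, "doctop": 120.0, "orientation": "v", "stroke": False},
]
print(len(horizontal_edges_at(sample_edges, target_x0=50.0)))
```

Comparing with a tolerance rather than `==` avoids missing lines whose x0 differs by a fraction of a point, and the stray line far to the left no longer affects the selection. The result can be passed as `explicit_horizontal_lines` in the same way as `horizontal_lines` in the earlier snippet.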

That being said, I don't think I can trust this code in the future. If the company updates the PDF, even changing something small in the table, this logic might fail and the cells could no longer be trusted.
What do you think? Is it a waste of time to try to periodically parse these PDFs to check for changes in these tables?

Thanks again for the help!

samkit-jain Sep 5, 2021
Collaborator

@heylouiz It depends on how many variations there are. If you have 2-3 variations, then you can probably come up with extraction logic that handles all the PDFs you will encounter. If there are more, or the number is unknown to you, then you will have to come up with a more nuanced algorithm to tackle these cases.
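Whichever extraction logic is used, a cheap safeguard is to check the extracted rows before trusting them, so that a layout change fails loudly rather than silently. The sketch below uses only the standard library; `validate_rows` and `table_fingerprint` are hypothetical helpers, and the three-column shape is an assumption based on the EXCHANGE, FEE, MAXIMUM header mentioned in the question.

```python
import hashlib
import json


def validate_rows(rows, expected_columns=3):
    """Return a list of human-readable problems; an empty list means the rows look sane."""
    problems = []
    for index, row in enumerate(rows):
        if len(row) != expected_columns:
            problems.append(f"row {index}: expected {expected_columns} cells, got {len(row)}")
        elif any(cell is None or not cell.strip() for cell in row):
            problems.append(f"row {index}: empty or missing cell")
    return problems


def table_fingerprint(rows):
    """Stable hash of the table contents, for comparing one run with the next."""
    payload = json.dumps(rows, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


good = [["Austria", "€ 4.00 + 0.05%", "€60.00"]]
bad = [["Austria", "", "€60.00"], [None, "€ 4.00 + 0.05%", None]]
print(validate_rows(good))
print(validate_rows(bad))
```

Storing the fingerprint from each run makes it possible to tell "the table content changed" apart from "the extraction broke": a changed fingerprint with no validation problems suggests updated fees, while validation problems suggest the layout no longer matches the extraction logic.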

Also, I think you referenced the same PDF. I don't see the page that you have shared in the screenshot in the PDF shared.
