Trying to extract data from a PDF file #625

jakobdo · 2022-03-15T13:27:27Z

Hello, I have used this library before and I am really amazed at how "easy" it is to extract data. But!
When the data is not easily extracted, I find it hard to tweak the table settings to get what I need.
Rather than just being handed a working solution, I would like to understand how to extract the data from this PDF: https://www.taggmbh.at/fileadmin/content/TAG-Website-Content-SM/2022_Maintenance_PROD_PDF.pdf

When using .debug_tablefinder(), the last row on page 1 is missing.

How do I tweak the table settings to get the last row/line?
I have read about the explicit_horizontal_lines and explicit_vertical_lines settings, but first I need a way to find the line positions.

jsvine (Maintainer) · 2022-03-15T22:30:31Z

Hi @jakobdo, and glad to hear that you've found the library useful! For your particular example, I'd recommend finding the position of the table's missing final line by taking the bottom-most edge of the existing page.rects objects (in this PDF, the table's lines are represented as rect objects):

# Bottom edge of the lowest rect on the page, used as the missing final line
line_pos = max(r["bottom"] for r in page.rects)
table = page.extract_table({
    "explicit_horizontal_lines": [line_pos]
})

Demonstrating via .debug_tablefinder(...):

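im = page.to_image()
# Same line position as above: the bottom edge of the lowest rect
line_pos = max(r["bottom"] for r in page.rects)
# Draw the table finder's detected lines and cells with the extra line added
im.reset().debug_tablefinder({
    "explicit_horizontal_lines": [line_pos]
})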
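
For anyone who wants to reuse this outside an interactive session, here is a minimal sketch of the same idea wrapped in functions. The helper names bottom_line_settings and extract_with_bottom_line, and the page_index parameter, are hypothetical and not part of pdfplumber; the sketch also assumes, as in this PDF, that the table's lines are stored as rect objects, so the lowest rect edge really is the missing bottom line.

```python
def bottom_line_settings(rects, base_settings=None):
    """Return table settings with an explicit horizontal line added at the
    bottom-most edge of the given rects (dicts shaped like page.rects items).

    Hypothetical helper: not part of pdfplumber itself.
    """
    settings = dict(base_settings or {})
    if not rects:
        # Nothing to measure; leave the settings untouched.
        return settings
    line_pos = max(r["bottom"] for r in rects)
    # Keep any explicit lines the caller already supplied.
    existing = list(settings.get("explicit_horizontal_lines", []))
    settings["explicit_horizontal_lines"] = existing + [line_pos]
    return settings


def extract_with_bottom_line(pdf_path, page_index=0):
    """Open a PDF and extract one page's table with the extra bottom line.

    Assumes pdfplumber is installed; pdf_path is whatever local copy
    of the PDF you have saved.
    """
    import pdfplumber  # imported here so the helper above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(bottom_line_settings(page.rects))
```

Calling extract_with_bottom_line on a locally saved copy of the PDF should give the same result as the snippet above for page 1. If a page contains rects that are not part of the table (for example a footer box), the max() would pick up the wrong edge, so filtering page.rects first may be necessary.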

jakobdo Mar 16, 2022
Author

@jsvine works like a charm, as always. I might post another question shortly, because I think I have another PDF that is producing some weird values (characters are doubled).
