No tables detected on LibreofficeDraw generated pdf #418

carl-krikorian · 2021-04-20T11:11:52Z

The Problem

I created an edited version of a PDF file, keeping the same format, and tried to extract its tables, but the extraction failed completely. Strangely, extraction from the original PDF works perfectly with the same code, and Tabula extracts the tables from the edited version with no problem. I suspect the issue lies in how the file was generated (I used LibreOffice Draw to export it as PDF), or in the fact that the outline of the page is being detected, as shown in the screenshots. I'd like to know if there are any fixes or causes I'm missing.

Code to reproduce the problem

import pdfplumber
pdf = pdfplumber.open('./pdfs/submit.pdf')
page = pdf.pages[0]
tables = page.extract_tables()
print(tables)

PDF file

submit.pdf

Screenshots

The curves also seem to be detected properly, but strangely always together with the outline of the page, even after cropping (this may also be the problem).

Environment

pdfplumber version: 0.5.27
Python version: 3.7.10
OS: Ubuntu 20.04

jsvine · 2021-04-20T13:12:07Z

Maintainer

Hi @carl-krikorian, and thanks for your interest in this library. This issue stems from how pdfminer.six (which pdfplumber uses for object extraction) identifies curves, lines, and rects. I have a PR awaiting approval on that library's repository that would fix this. Unfortunately, until it's merged, we're stuck dealing with the "curve-ification" (not a real term) of objects that should more properly be classified as lines or rects. (My guess is that LibreOffice Draw exports the PDF in a way that visually preserves the original while changing the underlying representation ever so slightly.)
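One way to check whether a given PDF is affected is to count how many objects of each type pdfplumber reports. This is a minimal sketch, assuming objects shaped like pdfplumber's (dicts carrying an `object_type` key); `count_object_types` is a hypothetical helper name, not part of the library, and the sample data is made up for illustration:

```python
from collections import Counter


def count_object_types(objects):
    """Count objects by their "object_type" value.

    `objects` is assumed to be an iterable of pdfplumber-style dicts.
    """
    return Counter(obj["object_type"] for obj in objects)


# Stand-in data for illustration; on a real page you could pass
# page.lines + page.rects + page.curves instead.
sample = [
    {"object_type": "curve"},
    {"object_type": "curve"},
    {"object_type": "char"},
]
counts = count_object_types(sample)
print(counts["curve"], counts["line"], counts["rect"])  # prints: 2 0 0
```

A page whose table borders show up only under `curve`, with few or no `line` or `rect` objects, is probably hitting the behavior described above.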

In the meantime, you should be able to extract the tables this way:

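import pdfplumber
pdf = pdfplumber.open('./pdfs/submit.pdf')
page = pdf.pages[0]
# Pass the objects that were classified as curves as explicit
# cell borders, for both the vertical and horizontal directions.
tables = page.extract_tables({
    "explicit_vertical_lines": page.curves,
    "explicit_horizontal_lines": page.curves,
})
print(tables)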
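If the page-sized outline mentioned in the question ends up adding unwanted cell borders, one option is to drop it before passing the curves in. This is a rough sketch, assuming the outline is a curve whose bounding box roughly matches the page. `drop_page_outline` and its `tol` parameter are hypothetical; the only pdfplumber features relied on are the `x0`, `x1`, `top`, and `bottom` keys that its objects carry:

```python
def drop_page_outline(curves, page_width, page_height, tol=2.0):
    """Return the curves whose bounding box does not span the whole page.

    Assumes pdfplumber-style dicts with x0, x1, top, and bottom keys.
    `tol` (in PDF points) is an arbitrary illustrative threshold.
    Values are converted with float() because some pdfplumber versions
    may report coordinates as Decimal rather than float.
    """
    kept = []
    for c in curves:
        width = float(c["x1"]) - float(c["x0"])
        height = float(c["bottom"]) - float(c["top"])
        spans_width = width >= float(page_width) - tol
        spans_height = height >= float(page_height) - tol
        if not (spans_width and spans_height):
            kept.append(c)
    return kept


# Stand-in data: a page-sized outline plus one horizontal table border.
outline = {"x0": 0, "x1": 612, "top": 0, "bottom": 792}
border = {"x0": 50, "x1": 300, "top": 100, "bottom": 100}
filtered = drop_page_outline([outline, border], 612, 792)
print(len(filtered))  # prints: 1
```

With a real page, the filtered list could be passed as both `explicit_vertical_lines` and `explicit_horizontal_lines`, using `page.width` and `page.height` for the page size. Whether the outline actually interferes with this particular file has not been tested here.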

carl-krikorian Apr 20, 2021
Author

I see, thank you for the fix and explanation! Great job on the library, works great.
