Text being incorrectly parsed in table #1149

enrac5 · 2024-06-07T02:12:50Z

enrac5
Jun 7, 2024

I have a very odd issue with the attached file:
test.pdf

Basically, the text in the second column is being read incorrectly and I'm not sure why. This is basically what I'm doing:

import pdfplumber

pdf_path = '/[path to attached]/test.pdf'
pdf = pdfplumber.open(pdf_path, repair=True)
for page in pdf.pages:
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            if row[1] is not None:
                print(row[1])

The output should be 13+13 but I get 113+13. This is just a small part of a 78 page document (a PDF printed from MS-Word). (Puts on Leia's clothes, "Help me @jsvine, you're my only hope")

jsvine · 2024-06-11T19:56:32Z

jsvine
Jun 11, 2024
Maintainer

Hi @enrac5, it looks like the PDF has two 1s written on top of one another:

print(page.extract_text(layout=True))

Produces:

...
1  113 +13 MT #1 FADES IN OVER  1 52+11  58+10  5+15 LUCY TO KOS
...

If you use page.dedupe_chars(), this seems to fix it:

print(page.dedupe_chars().extract_text(layout=True))

...
1  13 +13 MT #1 FADES IN OVER   1 52+11  58+10  5+15 LUCY TO KOS
...
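To illustrate why deduplication helps here, below is a minimal pure-Python sketch of the idea: drop any character that has the same text as an already-kept character at nearly the same position. This is not pdfplumber's actual implementation; the function name `dedupe_overlapping`, the coordinates, and the reduced set of keys (`text`, `x0`, `top`) are assumptions made for illustration, loosely modeled on the dicts in `page.chars`.

```python
def dedupe_overlapping(chars, tolerance=1.0):
    """Keep a char only if no already-kept char has the same text
    at (nearly) the same x0/top position."""
    kept = []
    for ch in chars:
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Hypothetical coordinates: two "1" glyphs stacked on top of one another,
# followed by a "3" further to the right.
chars = [
    {"text": "1", "x0": 10.0, "top": 50.0},
    {"text": "1", "x0": 10.2, "top": 50.0},
    {"text": "3", "x0": 16.0, "top": 50.0},
]

print("".join(c["text"] for c in chars))                      # prints 113
print("".join(c["text"] for c in dedupe_overlapping(chars)))  # prints 13
```

In real use, page.dedupe_chars() is the call to reach for; the sketch only shows why stacked glyphs turn 13 into 113 and why removing them restores the expected text.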

13 replies

jsvine Jun 14, 2024
Maintainer

Strange. I run the exact same code as you but get the correct output.

Ran that on Python 3.10.4 on MacOS, though neither should have any effect on the processing. Could you try upgrading pdfplumber and running print(pdfplumber.__version__) to confirm which version you're on?

enrac5 Jun 21, 2024
Author

Which version of MacOS are you on?

jsvine Jun 25, 2024
Maintainer

For an operation like this one, the operating system should not have any effect on the results. (Both theoretically, and from practical experience; I haven't encountered any PDF where an operation such as dedupe_chars performs differently on different OSes.) Different versions of Ghostscript (used for repair=True), however, may have an effect, so I'd focus on the output without repair=True.

Is it possible you're processing a slightly different version of the PDF than the one shared in this issue?
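One way to check both points (library/Ghostscript versions and whether the two machines really have the same file) is to print a small report on each machine and compare. This is only a sketch under stated assumptions: the helper names are hypothetical, it assumes the Ghostscript executable is named `gs` on PATH (typical on macOS and Linux/WSL), and it uses only the standard library, so it reports None for anything that isn't installed.

```python
import hashlib
import shutil
import subprocess
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def ghostscript_version():
    """Return the output of `gs --version`, or None if `gs` is not on PATH.
    Assumption: the executable is called `gs`."""
    gs = shutil.which("gs")
    if gs is None:
        return None
    result = subprocess.run([gs, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


def sha256_of(path):
    """Hash the file's bytes so two copies of a PDF can be compared."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def environment_report(pdf_path=None):
    """Collect the versions that could plausibly differ between machines."""
    report = {
        "python": sys.version.split()[0],
        "pdfplumber": package_version("pdfplumber"),
        "pdfminer.six": package_version("pdfminer.six"),
        "ghostscript": ghostscript_version(),
    }
    if pdf_path is not None:
        report["pdf_sha256"] = sha256_of(pdf_path)
    return report


print(environment_report())
```

Calling environment_report() with the path to the PDF on both the WSL and MacOS installs and diffing the two dicts should show whether the file, pdfplumber, pdfminer.six, or Ghostscript differs between them.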

enrac5 Jun 25, 2024
Author

I'm using the same PDF, though this is a stripped down example of another one I cannot share.

enrac5 Jun 25, 2024
Author

On my WSL instance, this works amazingly; on my MacOS install, however, it doesn't. That's the flummoxing part.
