Missing Words while extracting from PDF #167

Open
abhiwins opened this issue Aug 14, 2024 · 5 comments

@abhiwins

Many words are missing when text is extracted from the PDF.
Scenario: this happens on pages with a large amount of text (more than 1000 words).

@lfoppiano
Collaborator

Hi @abhiwins, could you please provide some examples, including the input PDF and the output? Also, on which OS/platform did you run it?

Thank you

@abhiwins
Author

abhiwins commented Aug 23, 2024

Attached are the PDF and an image of the output.

Validated on Ubuntu 20.04 and 24.04.

Test_pdf_word_issue
Test_pdf_word_issue.pdf

@calee88

calee88 commented Nov 29, 2024

I have a similar issue. When I run it on a specific page, it works. But when I process the whole file at once, a character is missing. pdftohtml v4.03 has no such problem with the same file: https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf
On page 221, the character 약 is removed.
This happens with both v0.4 and v0.5.
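
One way to pin down exactly what goes missing is to compare two plain-text extractions of the same pages, for example a single-page run against a whole-file run, or this tool against pdftohtml. The following is only a minimal sketch: the function name `missing_counts` and the sample strings are made up for illustration, and it assumes both outputs have already been converted to plain text.

```python
from collections import Counter


def missing_counts(reference: str, candidate: str, by_char: bool = False) -> dict:
    """Return the items that occur more often in `reference` than in `candidate`.

    With by_char=False the items are whitespace-separated words; with
    by_char=True they are single non-whitespace characters, which helps
    when only one character inside a word is dropped.
    """
    def units(text: str) -> list:
        if by_char:
            return [ch for ch in text if not ch.isspace()]
        return text.split()

    # Counter subtraction keeps only the positive differences.
    return dict(Counter(units(reference)) - Counter(units(candidate)))


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the attached PDFs.
    single_page_run = "alpha beta gamma beta"
    whole_file_run = "alpha beta gamma"
    print(missing_counts(single_page_run, whole_file_run))
```

Because the comparison uses counts rather than positions, it still works when the two outputs list the text in a different reading order.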

@lfoppiano
Collaborator

@calee88 apart from that, is the text extracted correctly and not garbled? I tried to run that file, but the output does not seem to be in the right order. By "Page 221 is removed", do you mean this page?

image

@lfoppiano
Collaborator

> attached PDF, Image Output.
>
> validated on ubuntu 20.04, 24.04,

@abhiwins I've processed the same PDF and checked the data inside, and I'm not sure which words are missing. If you are referring to "Chiesa di san carlo" or "Chiesa di sant'Andrea", they are in the XML:

image

Here is my XML output: Test_pdf_word_issue.xml.zip
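
To check whether a given phrase is present in the XML without opening it by hand, something like the sketch below could be used. It assumes the output is ALTO-style XML in which each word is a `String` element with a `CONTENT` attribute; the helper names `alto_words` and `contains_phrase` and the inline sample document are hypothetical and are not taken from the attached file.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text: str) -> list[str]:
    """Collect the CONTENT attribute of every String element, in document order."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        # Strip a "{namespace}" prefix, if any, before comparing the tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "CONTENT" in element.attrib:
            words.append(element.attrib["CONTENT"])
    return words


def contains_phrase(words: list[str], phrase: str) -> bool:
    """Case-insensitive check that the words of `phrase` appear consecutively."""
    target = phrase.lower().split()
    lowered = [word.lower() for word in words]
    n = len(target)
    return any(lowered[i:i + n] == target for i in range(len(lowered) - n + 1))


if __name__ == "__main__":
    # Hand-written sample, assumed to resemble the real output only in structure.
    sample = (
        '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout><Page>'
        '<PrintSpace><TextBlock><TextLine>'
        '<String CONTENT="Chiesa"/><SP/><String CONTENT="di"/><SP/>'
        '<String CONTENT="san"/><SP/><String CONTENT="carlo"/>'
        '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
    )
    print(contains_phrase(alto_words(sample), "Chiesa di san carlo"))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`. Note that a phrase split across two text lines is still matched, since only the word order in the document is considered.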

@lfoppiano lfoppiano self-assigned this Dec 25, 2024