Missing Words while extracting from PDF #167
Hi @abhiwins, could you please provide some examples? Thank you.
Attached are the PDF and an image of the output. Validated on Ubuntu 20.04 and 24.04.
I have a similar issue. When I run it on a specific page, it works, but when I process the whole file at once, it misses a character. pdftohtml v4.03 has no problem with the same file: https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf
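One way to pin down exactly which words go missing is to diff the word counts of a trusted extraction (for example, a single-page run or the pdftohtml output) against the whole-file output. This is only a minimal sketch, assuming both outputs have already been saved as plain text; `missing_words` is a hypothetical helper and not part of this project.

```python
import re
from collections import Counter


def missing_words(reference_text, extracted_text):
    """Return {word: count} for words that occur more often in
    reference_text than in extracted_text (hypothetical helper)."""
    def tokenize(text):
        # \w+ matches Unicode word characters, so Korean text is covered too.
        return re.findall(r"\w+", text)

    reference = Counter(tokenize(reference_text))
    extracted = Counter(tokenize(extracted_text))
    # Counter subtraction keeps only the positive differences.
    return dict(reference - extracted)
```

Calling it with the per-page text as `reference_text` and the whole-file text as `extracted_text` lists the words that were dropped, which makes a report easier to reproduce.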
@calee88 is the rest of the text extracted correctly and not garbled? I tried running that file, but the output does not seem to be in the right order. With
@abhiwins I've processed the same PDF and checked the data inside, but I'm not sure which words are missing. If you mean "Chiesa di san carlo" or "Chiesa di sant'Andrea", they are in the XML. Here is my XML output: Test_pdf_word_issue.xml.zip
A lot of words are missing when data is extracted from the PDF.
Scenario: pages with a large amount of text (more than 1000 words).