Missing Words while extracting from PDF #167

abhiwins · 2024-08-14T13:11:11Z

Many words are missing when text is extracted from the PDF.
Scenario: this happens on pages with a large amount of text (more than 1,000 words).

lfoppiano · 2024-08-22T11:57:18Z

Hi @abhiwins, could you please provide some examples, including the input PDF and the output? Also, on which OS/platform did you run it?

Thank you

abhiwins · 2024-08-23T14:00:29Z

I have attached the PDF and an image of the output.

Validated on Ubuntu 20.04 and 24.04.

Test_pdf_word_issue.pdf

calee88 · 2024-11-29T11:32:19Z

I have a similar issue. When I run it on a specific page, it works. But when I process the whole file at once, a character is missing. When I tried pdftohtml v4.03 on the same file, it had no problem. https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf
On page 221, the character 약 is removed.
This happens on both v0.4 and v0.5.
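
One way to pin down exactly which tokens disappear is to diff the per-page output against the whole-file output. The following is only a minimal sketch: it assumes both outputs have already been reduced to plain text, and the function name `missing_tokens` and the sample strings are hypothetical, not part of the tool.

```python
from collections import Counter


def missing_tokens(reference_text, candidate_text):
    """Return tokens that occur more often in reference_text than in candidate_text."""
    reference_counts = Counter(reference_text.split())
    candidate_counts = Counter(candidate_text.split())
    # Counter subtraction keeps only the tokens with a positive difference.
    return reference_counts - candidate_counts


if __name__ == "__main__":
    # Hypothetical sample strings standing in for the two extraction results.
    per_page_output = "통합 약 관"
    whole_file_output = "통합 관"
    print(missing_tokens(per_page_output, whole_file_output))
```

Whitespace tokenization is a simplification. If the extractor splits or merges words differently between the two runs, a character-level comparison may be more informative.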

lfoppiano · 2024-12-25T16:32:07Z

@calee88 is the rest of the text extracted correctly and not messed up? I tried to run that file, but the output does not seem to be in the right order. When you say that 약 is removed on page 221, do you mean this page?

lfoppiano · 2024-12-25T16:50:20Z

@abhiwins, regarding the PDF you attached (validated on Ubuntu 20.04 and 24.04): I've processed the same PDF and checked the data inside, and I'm not sure which words are missing. If you are referring to "Chiesa di san carlo" or "Chiesa di sant'Andrea", they are in the XML.

Here is my XML output: Test_pdf_word_issue.xml.zip
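
For anyone who wants to verify this on their own output, a check like the one below can confirm whether a phrase is present in the XML. It is a sketch under an assumption not stated in this thread: that the output is ALTO-style XML in which each word is a `String` element carrying its text in a `CONTENT` attribute. The helper names `xml_words` and `contains_phrase` are hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_words(xml_source):
    """Collect word texts from String elements (assumed ALTO-style layout)."""
    root = ET.fromstring(xml_source)
    words = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the check works with or without one.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and "CONTENT" in element.attrib:
            words.append(element.attrib["CONTENT"])
    return words


def contains_phrase(words, phrase):
    """Return True if the phrase's tokens appear consecutively, ignoring case."""
    target = [token.casefold() for token in phrase.split()]
    tokens = [word.casefold() for word in words]
    span = len(target)
    return any(tokens[i:i + span] == target for i in range(len(tokens) - span + 1))
```

To use it, read the XML file into a string and call, for example, `contains_phrase(xml_words(xml_text), "Chiesa di san carlo")`. The match relies on the phrase being split into words the same way the extractor split them, so a token such as "sant'Andrea" may need adjusting.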

lfoppiano self-assigned this Dec 25, 2024
