Unwanted letters in specific pdf file #1240
-
No. I would need an example file to find out.
-
I managed to narrow down the problem.
However, a few months ago I noticed that when I extracted the text like this, I often had trouble recovering it in the expected order. I am processing a batch of documents that are sometimes poorly formatted, so if a document contains text like "Begin line ........................ End line", it often happens that "Begin line" and "End line" are not part of the same line object, which is a problem for my implementation. I concluded that this was due to the PDF itself and not PyMuPDF, as the vertical positions of "Begin line" and "End line" were not exactly the same. To account for this, I am extracting each line like this:
So the goal is to combine the text of two lines into a single line when the vertical gap between them is very small. This usually works, but I have now found some documents where the unwanted characters appear. I could share the document with you privately, but it contains personal data and addresses, so I don't feel comfortable sharing it here.
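For readers following along, the merging idea described above can be sketched roughly as follows. This is not the code from the post (which is not shown here), only a minimal illustration. The `(y, x, text)` fragment shape, the function name `merge_close_lines`, and the tolerance `y_tol` are all assumptions made for the example; with PyMuPDF, such fragments could be built from the line or span bounding boxes returned by `page.get_text("dict")`.

```python
from typing import List, Tuple

# Hypothetical input shape: (y, x, text) for each extracted fragment.
Fragment = Tuple[float, float, str]


def merge_close_lines(fragments: List[Fragment], y_tol: float = 2.0) -> List[str]:
    """Join fragments whose vertical positions differ by at most y_tol."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f[0], f[1])):
        # Compare against the first fragment of the current row,
        # so a row cannot slowly drift down the page.
        if rows and abs(frag[0] - rows[-1][0][0]) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order fragments left to right before joining.
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda f: f[1]))
        for row in rows
    ]


fragments = [
    (100.0, 50.0, "Begin line"),
    (100.8, 400.0, "End line"),
    (120.0, 50.0, "Next"),
]
merged = merge_close_lines(fragments)
```

The tolerance is the sensitive part: too large a value may glue together text that belongs to different lines, so it probably needs tuning per document set.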
-
Layout-preserving text extraction should also address other pesky situations, as explained in the documentation and here, such as doubled characters used to simulate text shadows or bold type, or completely scrambled character sequences meant to prevent copy-paste, and so on.
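To illustrate the doubled-character case only: the sketch below drops a character when the same glyph was already seen at almost the same position. It is an assumption-laden example, not how PyMuPDF handles this. The `(char, x, y)` input shape, the name `drop_doubled_chars`, and the tolerance `tol` are hypothetical; character origins of this kind are available from `page.get_text("rawdict")`.

```python
from typing import List, Tuple

# Hypothetical input shape: (char, x, y) with the glyph origin in points.
Char = Tuple[str, float, float]


def drop_doubled_chars(chars: List[Char], tol: float = 1.0) -> str:
    """Keep only the first of several identical glyphs drawn at nearly the same spot."""
    kept: List[Char] = []
    for c, x, y in chars:
        is_duplicate = any(
            c == kc and abs(x - kx) <= tol and abs(y - ky) <= tol
            for kc, kx, ky in kept
        )
        if not is_duplicate:
            kept.append((c, x, y))
    return "".join(c for c, _, _ in kept)


shadowed = [
    ("B", 10.0, 100.0), ("B", 10.3, 100.2),
    ("o", 16.0, 100.0), ("o", 16.3, 100.2),
]
cleaned = drop_doubled_chars(shadowed)
```

Legitimate repeated letters such as "ll" survive because their origins are a full glyph width apart, which is normally larger than the tolerance.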
-
It does seem that the gettextbox() method shows these characters in this resume. Do you know whether these characters are layout information that is retrieved as text by mistake, for example i = indent, t = tab, or something like that? I could email you the file in question if that might help.
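One way to probe this without sharing the file might be to group the extracted text by font name and see whether the stray characters come from a separate font. Whether they do is only a guess; nothing in this thread confirms the cause. The sketch below assumes the nested blocks / lines / spans structure that `page.get_text("dict")` returns, and the helper name `text_by_font` is made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def text_by_font(page_dict: dict) -> Dict[str, List[str]]:
    """Collect span texts per font name from a get_text("dict")-style structure."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                grouped[span.get("font", "?")].append(span.get("text", ""))
    return dict(grouped)


# Hand-built sample in the same shape, standing in for a real page.
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "Helvetica", "text": "Experience"},
            {"font": "IconFont", "text": "i"},
        ]}]},
        {"type": 1},  # an image block without text lines
    ]
}
by_font = text_by_font(sample)
```

On a real document this would be called as `text_by_font(page.get_text("dict"))`, where `page` is a page of the opened PDF.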
-
Hello,
I have an issue when extracting text from a PDF file. I have done this on hundreds of documents with good results, but this particular PDF behaves unexpectedly.
Here is a look at the PDF:
And now the extracted text:
As you can see, a bunch of "i" and "t" characters appeared on every line of text. I thought it might be an issue with the pdf, but when trying an online pdf converter, I didn't get those characters.
Any idea on what the issue might be?