Unwanted letters in specific pdf file #1240

tristancatteeuw · 2021-09-01T10:30:56Z

Hello,
I have an issue when extracting text from a pdf file. I have done this on hundreds of documents with good results, but this particular pdf behaves unexpectedly.
Here is a look at the pdf :

and now the text extracted :

As you can see, a bunch of "i" and "t" characters appeared on every line of text. I thought it might be an issue with the pdf, but when trying an online pdf converter, I didn't get those characters.

Any idea on what the issue might be?

JorjMcKie · 2021-09-01T10:37:10Z

JorjMcKie
Sep 1, 2021
Maintainer

Any idea on what the issue might be?

No. I would need an example file to find out.

1 reply

JorjMcKie Sep 1, 2021
Maintainer

It could be all sorts of things, among them hidden text, which MuPDF does not suppress, or a font that is not correctly supported.

tristancatteeuw · 2021-09-01T13:05:15Z

tristancatteeuw
Sep 1, 2021
Author

I managed to narrow down the problem.
The following code snippet works as expected.


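import fitz  # PyMuPDF

doc = fitz.open("filename")
for page in doc:
    # flags=3 keeps ligatures and whitespace; newer PyMuPDF releases name this method get_text()
    blocks_bis = page.getText("dict", flags=3)["blocks"]
    for b in blocks_bis:
        for l in b["lines"]:
            for s in l["spans"]:
                text = s["text"]
                print(text)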

However, I noticed a few months ago that when I extracted the text like this, I often had trouble recovering it in the expected order. I am processing a bunch of documents that are sometimes poorly formatted, so if I have some text like "Begin line ........................ End line" in my document, it often happens that "Begin line" and "End line" are not part of the same line object, which is problematic in my implementation. I figured that this was due to the pdf itself and not PyMuPDF, as the vertical positions of "Begin line" and "End line" were not exactly the same.

So to account for this problem, I am extracting each line like this :

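import math
import re

import fitz  # PyMuPDF

def get_rounded_rect(rect):
    # Snap the top edge down and the bottom edge up to the nearest multiple of 5 points.
    return fitz.Rect(rect[0], math.floor(rect[1] / 5) * 5, rect[2], math.ceil(rect[3] / 5) * 5)

doc = fitz.open("filename")
rounded_line_rect = None  # rectangle of the most recently extracted line
for nb_page, page in enumerate(doc):
    blocks = page.getText("dict", flags=3)["blocks"]
    for b in blocks:
        for l in b["lines"]:
            # Skip lines already covered by the previous rounded rectangle.
            if rounded_line_rect is not None and l["bbox"] in rounded_line_rect:
                continue
            rounded_line_rect = get_rounded_rect(l["bbox"])
            line_text = page.getTextbox(rounded_line_rect).strip()
            # Ignore lines that contain no letters.
            if not re.search('[a-zA-Z]', line_text):
                continue
            print(line_text)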

So the goal is, when there is only a very small vertical gap between two lines, to combine their text into a single line. This usually works, but I have now noticed some documents where the unwanted characters appear. I could share the document with you privately, but it contains personal data and addresses, so I don't feel comfortable sharing it here.
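For readers following along: the same goal (treating text with nearly equal vertical positions as one line) can also be sketched directly on the "dict" output, without a second getTextbox() call. This is only a minimal sketch, not the approach used in this thread. The function names and the `tolerance` value are hypothetical, and it assumes the usual span keys "text" and "bbox" of the dict output.

```python
def group_spans_into_rows(blocks, tolerance=3.0):
    """Group spans whose vertical centres lie within `tolerance` points.

    `blocks` is the "blocks" list of a page's "dict" text output.
    `tolerance` is an illustrative value, not a PyMuPDF default.
    """
    spans = [
        s
        for b in blocks
        for l in b.get("lines", [])  # image blocks carry no "lines" key
        for s in l["spans"]
        if s["text"].strip()
    ]
    # Sort top to bottom so spans of the same visual row end up adjacent.
    spans.sort(key=lambda s: (s["bbox"][1] + s["bbox"][3]) / 2)

    rows = []
    for s in spans:
        centre = (s["bbox"][1] + s["bbox"][3]) / 2
        # Compare against the centre of the span that opened the current row.
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["spans"].append(s)
        else:
            rows.append({"centre": centre, "spans": [s]})

    # Inside each row, order spans left to right.
    for row in rows:
        row["spans"].sort(key=lambda s: s["bbox"][0])
    return [row["spans"] for row in rows]


def row_text(row):
    """Join the text of all spans in one row."""
    return " ".join(s["text"].strip() for s in row)
```

Because each row is still a list of span dictionaries, per-span information such as "size" or "font" stays available after merging.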

1 reply

JorjMcKie Sep 1, 2021
Maintainer

Okay, that's good.
You could try layout-preserving text extraction: python -m fitz gettext file.pdf. A number of options let you combine lines that differ only slightly in their vertical coordinates, suppress text with a small font completely, etc.
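The idea of suppressing text with a small font can also be tried on the "dict" output. The following is a minimal sketch under assumptions: `drop_small_spans` and the `min_size` threshold are hypothetical names and values, not options of the command-line tool. It relies only on the "size" and "text" keys that each span of the dict output carries.

```python
def drop_small_spans(blocks, min_size=4.0):
    """Return the non-empty spans whose font size is at least `min_size` points.

    `blocks` is the "blocks" list of a page's "dict" text output.
    `min_size` is an illustrative threshold and should be tuned per document.
    """
    kept = []
    for b in blocks:
        for l in b.get("lines", []):  # image blocks carry no "lines" key
            for s in l["spans"]:
                if s["size"] >= min_size and s["text"].strip():
                    kept.append(s)
    return kept
```

Printing the spans that are filtered out (rather than the ones kept) is a quick way to check whether the unexpected characters are set in an unusually small font.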

JorjMcKie · 2021-09-01T13:14:47Z

JorjMcKie
Sep 1, 2021
Maintainer

Layout-preserving text extraction should also address other pesky situations, as explained in the documentation and here, like doubled characters that simulate text shadows or bold characters, or completely scrambled character sequences meant to prevent text copy-paste, and so on.

1 reply

tristancatteeuw Sep 1, 2021
Author

I took a look at it and it sure seems interesting! I might try it to solve the problem.
However, at this point I have a fairly lengthy program that extracts the text of the pdf and organizes it into sections. I'm working with resumes, so I have a script using PyMuPDF to split the documents into sections like "education", "experience", "personal", etc. To do this I mostly use font information such as font size, color, and boldness to find titles, plus some natural language processing to make sure the right content ends up in the right section.
I'm a little scared that I would have to change my methodology completely by using the layout-preserving extraction instead of the dict... So if it's possible to do the same with this solution then I will try, but I don't see a way to do it right now. It doesn't seem to retrieve the text font, right?
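To illustrate the kind of font-based title detection described here, the following is a rough sketch and not the author's actual script. It assumes spans taken from the "dict" output, where each span has "size", "flags" and "text" keys and bit 4 of "flags" (value 16) marks bold text. The helper names and the 1.15 size ratio are hypothetical.

```python
from collections import Counter

BOLD = 16  # bit 4 of span["flags"] marks bold text in the "dict" output


def body_font_size(spans):
    """Estimate the body text size as the size covering the most characters."""
    counts = Counter()
    for s in spans:
        counts[round(s["size"], 1)] += len(s["text"])
    return counts.most_common(1)[0][0]


def looks_like_heading(span, body_size, ratio=1.15):
    """Heuristic: clearly larger than body text, or bold at body size or above.

    `ratio` is an illustrative threshold, not a value from the thread.
    """
    is_bold = bool(span["flags"] & BOLD)
    return span["size"] > body_size * ratio or (is_bold and span["size"] >= body_size)
```

A check like this needs per-span font data, which is why it fits the dict output rather than plain text output.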

tristancatteeuw · 2021-09-01T15:50:44Z

tristancatteeuw
Sep 1, 2021
Author

It does seem that the getTextbox() method is what produces these characters in this resume. Do you know if these characters are layout information that is retrieved as text by mistake? Like i = indent, t = tab or something like that? I could email you the file in question if that might help.
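One way to investigate this without assuming a cause is to check whether the rounded rectangle passed to getTextbox() also overlaps spans that belong to other lines. The sketch below only illustrates such a check. `rects_overlap` and `foreign_spans` are hypothetical helpers working on the "dict" output, and nothing in this thread establishes that overlap is the reason for the extra characters in this particular file.

```python
def rects_overlap(a, b):
    """True if two rectangles given as (x0, y0, x1, y1) share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def foreign_spans(blocks, line, clip):
    """Return spans from lines other than `line` whose bbox overlaps `clip`.

    `blocks` is the "blocks" list of a page's "dict" text output, `line` is one
    of its line dictionaries, and `clip` is the rectangle used for extraction.
    """
    found = []
    for b in blocks:
        for l in b.get("lines", []):  # image blocks carry no "lines" key
            if l is line:
                continue
            for s in l["spans"]:
                if rects_overlap(s["bbox"], clip):
                    found.append(s)
    return found
```

If this returns spans for the affected lines, printing their "text", "size" and "font" values shows what else sits inside the enlarged rectangle.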

2 replies

JorjMcKie Sep 1, 2021
Maintainer

I could email you the file in question if that might help.

yes, please do so
you can use my email if privacy concerns exist

tristancatteeuw Sep 2, 2021
Author

I just sent you the document
