Unsolicited Text Particles #187

Open
JorjMcKie opened this issue Nov 17, 2024 · 7 comments

Comments

@JorjMcKie
Contributor

This reopens an older issue.
In the attached file, some words are being repeated across multiple lines.
It happens in versions 0.0.17 and also 0.0.16.
image
A.World.of.Propensities.by.Karl.Popper.1997.pdf

@brucenielson

Found the new issue. Thanks.

Any ideas why it worked for you previously?

@JorjMcKie
Contributor Author

I am not sure.
The file does have some peculiarities in that some of its fonts contain errors, notably '*Minion Pro-5152'.
Recent versions of the base library MuPDF are capable of intercepting more cases of this sort of craziness. Here is the detail of two spans around the border of the problem:

image

So, after the word "It", the font changes to Minion Pro (for no apparent reason, to start with). Then, errors in font Minion Pro (look at the crazy ascender and bbox values) prevent the markdown generator from recognizing that these text particles belong together - a decision which of course is based on this geometric information.
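
To illustrate why a bad ascender breaks geometric grouping, here is a minimal sketch. It is not the actual PyMuPDF4LLM logic: the function names, the overlap threshold, and the formula deriving y0 / y1 from the baseline, font size, ascender and descender are assumptions made for this example only.

```python
# Illustrative sketch only; names and threshold are hypothetical,
# not taken from MuPDF, PyMuPDF or PyMuPDF4LLM.

def span_y_range(origin_y, fontsize, ascender, descender):
    """Vertical extent of a span; page coordinates grow downward."""
    y0 = origin_y - ascender * fontsize   # top of the span
    y1 = origin_y - descender * fontsize  # bottom of the span
    return y0, y1


def same_line(a, b, min_overlap=0.5):
    """Treat two spans as one line if they overlap enough vertically."""
    height_a = a[1] - a[0]
    height_b = b[1] - b[0]
    if height_a <= 0 or height_b <= 0:
        return False  # inverted or empty box, e.g. negative ascender
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    return overlap / max(height_a, height_b) >= min_overlap


good = span_y_range(100, 10, 0.9, -0.2)
huge = span_y_range(100, 10, 67306.242, -0.2)   # value from the PDF font definition
inverted = span_y_range(100, 10, -32.768, -0.2)  # value from the font binary

print(same_line(good, good))      # spans with sane metrics group together
print(same_line(good, huge))      # absurdly tall box no longer matches
print(same_line(good, inverted))  # inverted box no longer matches
```

With either of the two broken ascender values, a span sharing the same baseline is no longer recognized as being on the same line.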

As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening.

@brucenielson

"As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening."

Is that why you didn't get the error?

That is a good sign though, right?

When you say you are using a more advanced MuPDF, do you mean an unreleased version?

@JorjMcKie
Contributor Author

When you say you are using a more advanced MuPDF, do you mean an unreleased version?

Correct.
Fonts in PDFs are a notorious source of problems. Among these problems frequently are missing or incorrect ascender / descender values. They are essential for computing text bboxes (more precisely, the y0 / y1 values).
The two values typically are in the range 0.8 < ascender <= 1.4, -0.2 <= descender < 0.
The font definition in the PDF can also override the internal values in the font binary.
Your PDF (what the hell is so important about this specific file in Markdown format ...?) overrides the ascender value in the binary with an impossible value in the PDF font definition: /Ascent 67306242. The PDF spec expresses this value in 1/1000 units, so it means 67306.242 (crazy).
That value in the font binary is -32.768 - even more idiotic.

So MuPDF (hence PyMuPDF) is being given two alternatives for the ascender value, neither of which makes sense.
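
As a rough sketch of the sanity check described above (not the actual MuPDF code): the function names and the fallback value of 0.9 are assumptions for illustration, while the plausible ranges and the two bad values are the ones quoted in this comment.

```python
# Hypothetical helpers; only the ranges and sample values come from the discussion.

def ascender_is_plausible(value):
    return 0.8 < value <= 1.4


def descender_is_plausible(value):
    return -0.2 <= value < 0


def pick_ascender(pdf_ascent, binary_ascender, fallback=0.9):
    """Prefer the PDF font definition, then the font binary, else a fallback.

    pdf_ascent is the raw /Ascent entry, given in 1/1000 units.
    fallback=0.9 is an arbitrary assumption, not a MuPDF default.
    """
    for candidate in (pdf_ascent / 1000.0, binary_ascender):
        if ascender_is_plausible(candidate):
            return candidate
    return fallback


# Both candidates from this PDF are rejected, so the fallback is used.
print(pick_ascender(67306242, -32.768))
```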

@brucenielson

Can we please delete the PDF from the issue once you have downloaded it to work with? We probably need to remove links to the output too, since it probably contains the entire PDF.

@brucenielson

You asked: "what the hell is so important about this specific file in Markdown format ...?"

The answer is 'nothing.' I just by chance happened to test PyMuPDF4LLM on this one and happened to notice a problem and reported it.

I was using it in a Haystack pipeline that takes a number of PDFs and converts them to markdown and then converts to HTML and then feeds it over to an already existing HTML parser to try to grab metadata. This was my first attempt at that idea when I noticed it had this problem.

@brucenielson

Question: You mentioned this isn't a problem with the latest PyMu. When will the version you are on be released?
