Unsolicited Text Particles #187

JorjMcKie · 2024-11-17T21:59:15Z

This reopens an older issue.
In the attached file, some words are being repeated across multiple lines.
It happens in versions 0.0.17 and also 0.0.16.

A.World.of.Propensities.by.Karl.Popper.1997.pdf

brucenielson · 2024-11-18T14:39:33Z

Found the new issue. Thanks.

Any ideas why it worked for you previously?

JorjMcKie · 2024-11-18T15:30:37Z

I am not sure.
The file does have some peculiarities: some of its fonts contain errors, notably '*Minion Pro-5152'.
Recent versions of the base library MuPDF can intercept more cases of this sort of craziness. Here is the detail of two spans around the border of the problem:

So, after the word "It", the font changes to Minion Pro (for no apparent reason, to start with). Then, errors in that font (look at the crazy ascender and bbox values) prevent the markdown generator from recognizing that these text particles belong together, since that decision is of course based on this geometric information.
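To illustrate why a bad ascender breaks geometry-based grouping, here is a minimal sketch. It assumes a simplified model in which a span's vertical extent is derived from its baseline, font size, ascender and descender. The `Span` class, the `same_line` heuristic, its tolerance and the sample values are hypothetical illustrations, not the actual PyMuPDF4LLM logic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Span:
    text: str
    origin_y: float   # baseline y coordinate (y grows downward on the page)
    size: float       # font size in points
    ascender: float   # relative to font size 1
    descender: float  # negative for glyph parts below the baseline


def vertical_extent(span: Span) -> Tuple[float, float]:
    """Return (y0, y1) of the span bbox, derived from its font metrics."""
    y0 = span.origin_y - span.ascender * span.size
    y1 = span.origin_y - span.descender * span.size
    return y0, y1


def same_line(a: Span, b: Span, tol: float = 2.0) -> bool:
    """Hypothetical heuristic: two spans share a line if their tops and
    bottoms agree within `tol` points."""
    a0, a1 = vertical_extent(a)
    b0, b1 = vertical_extent(b)
    return abs(a0 - b0) <= tol and abs(a1 - b1) <= tol


# Two spans on the same baseline; only the font metrics differ.
# The second text and all numbers except -32.768 are made up.
good = Span("It", origin_y=100.0, size=10.0, ascender=1.0, descender=-0.2)
bad = Span("next", origin_y=100.0, size=10.0, ascender=-32.768, descender=-0.2)

print(vertical_extent(good))
print(vertical_extent(bad))
print(same_line(good, bad))
```

Although both spans sit on the same baseline, the broken ascender pushes the top of the second bbox far away, so a purely geometric comparison no longer treats them as neighbours.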

As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening.

brucenielson · 2024-11-19T00:03:54Z

"As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening."

Is that why you didn't get the error?

That is a good sign though, right?

When you say you are using a more advanced MuPDF, do you mean an unreleased version?

JorjMcKie · 2024-11-19T08:42:43Z

"When you say you are using a more advanced MuPDF, do you mean an unreleased version?"

Correct.
Fonts in PDFs are a notorious source of problems. Frequent among them are missing or incorrect ascender / descender values, which are essential for computing text bboxes (more precisely, the y0 / y1 values).
The two values are typically in the ranges 0.8 < ascender <= 1.4 and -0.2 <= descender < 0.
The font definition in the PDF can also override the internal values in the font binary.
Your PDF (what the hell is so important about this specific file in Markdown format ...?) overrides the ascender value in the binary with an impossible value in the PDF font definition: /Ascent 67306242. By the PDF spec this is given in units of 1/1000, so it means 67306.242 (crazy).
The corresponding value in the font binary is -32.768, which is even more idiotic.

So MuPDF (hence PyMuPDF) is being given two alternatives for the ascender value, neither of which makes sense.
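A plausibility check along these lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ranges are the typical ones quoted above, not hard limits; the helper names are hypothetical; and `page_dict` is assumed to have the nested blocks / lines / spans layout that PyMuPDF's `page.get_text("dict")` returns, with `font`, `ascender` and `descender` keys on each span. The sample dictionary is hand-built, not taken from the file in question.

```python
def pdf_units_to_em(value: float) -> float:
    """Convert a font descriptor value given in 1/1000 units to em units."""
    return value / 1000.0


def plausible_ascender(value: float) -> bool:
    # Typical range quoted in this thread: 0.8 < ascender <= 1.4
    return 0.8 < value <= 1.4


def plausible_descender(value: float) -> bool:
    # Typical range quoted in this thread: -0.2 <= descender < 0
    return -0.2 <= value < 0.0


def suspicious_spans(page_dict: dict) -> list:
    """Collect (font, text, ascender, descender) for spans whose metrics
    fall outside the typical ranges."""
    found = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                asc, desc = span["ascender"], span["descender"]
                if not (plausible_ascender(asc) and plausible_descender(desc)):
                    found.append((span["font"], span["text"], asc, desc))
    return found


# Both candidate ascender values mentioned above fail the check.
print(plausible_ascender(pdf_units_to_em(67306242)))
print(plausible_ascender(-32.768))

# Hand-built sample in the assumed layout (font names and texts are made up).
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"font": "FontA", "text": "It", "ascender": 1.0, "descender": -0.2},
            {"font": "FontB", "text": "next", "ascender": -32.768, "descender": -0.2},
        ]}]},
    ]
}
print(suspicious_spans(sample))
```

With a real document, one would pass in the dictionary obtained from each page and inspect which fonts show up in the result.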

brucenielson · 2024-11-19T16:54:28Z

Can we please delete the PDF from the issue once you have downloaded it to work with? We probably need to remove links to the output too, since it probably contains the entire PDF.

brucenielson · 2024-11-19T16:56:50Z

You asked: "what the hell is so important about this specific file in Markdown format ...?"

The answer is 'nothing.' I just by chance happened to test PyMuPDF4LLM on this one and happened to notice a problem and reported it.

I was using it in a Haystack pipeline that takes a number of PDFs and converts them to markdown and then converts to HTML and then feeds it over to an already existing HTML parser to try to grab metadata. This was my first attempt at that idea when I noticed it had this problem.

brucenielson · 2024-11-27T14:23:03Z

Question: You mentioned this isn't a problem with the latest PyMuPDF. When will the version you are on be released?

JorjMcKie added the fix developed label Nov 30, 2024
