Unsolicited Text Particles #187
Comments
Found the new issue. Thanks. Any ideas why it worked for you previously?
I was not clear. After the word "It", the font changes to Minion Pro (for no apparent reason, to start with). Errors in the Minion-Pro font (look at the crazy ascender and bbox values) then prevent the markdown generator from recognizing that these text particles belong together, a decision which is of course based on this geometric information. As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening.
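To illustrate the point about geometric information, here is a minimal sketch of how a bad ascender value can keep two spans on the same baseline from being recognized as one line. This is not the actual PyMuPDF4LLM logic: the `Span` class, the `same_line` helper, the tolerance, and all the numbers (including the ascender of 3.5) are hypothetical and chosen only to show the effect.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A hypothetical text span; y grows downward as in PDF page coordinates."""
    text: str
    baseline_y: float   # y coordinate of the text baseline
    fontsize: float
    ascender: float     # font ascender as a fraction of the font size
    descender: float    # font descender as a fraction (normally negative)

    def vertical_extent(self):
        # The top and bottom of the span box are derived from the font metrics.
        top = self.baseline_y - self.ascender * self.fontsize
        bottom = self.baseline_y - self.descender * self.fontsize
        return top, bottom


def same_line(a, b, tol=3.0):
    """Treat two spans as one line if their boxes roughly coincide vertically."""
    a_top, a_bottom = a.vertical_extent()
    b_top, b_bottom = b.vertical_extent()
    return abs(a_top - b_top) <= tol and abs(a_bottom - b_bottom) <= tol


# Same baseline and font size in all three spans; only the ascender differs.
it = Span("It", baseline_y=100, fontsize=10, ascender=0.9, descender=-0.2)
sane = Span("is", baseline_y=100, fontsize=10, ascender=0.9, descender=-0.2)
broken = Span("is", baseline_y=100, fontsize=10, ascender=3.5, descender=-0.2)

print(same_line(it, sane))    # spans with sane metrics are joined
print(same_line(it, broken))  # the inflated box no longer matches its neighbor
```

In this toy model the span with the nonsensical ascender gets a box far taller than its neighbor, so a grouping step that relies on the boxes alone treats it as a separate piece of text even though it sits on the same baseline.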
"As I am working on the new PyMuPDF, my script made use of the more advanced MuPDF when I was not aware of this happening." Is that why you didn't get the error? That is a good sign though, right? When you say you are using a more advanced MuPDF, do you mean an unreleased version?
Correct. So MuPDF (and hence PyMuPDF) is being given two alternatives for the ascender value, neither of which makes sense.
Can we please delete the PDF from the issue once you have downloaded it to work with? We probably need to remove the links to the output too, since it probably contains the entire PDF.
You asked: "what the hell is so important about this specific file in Markdown format ...?" The answer is 'nothing.' I just happened to test PyMuPDF4LLM on this file, noticed a problem, and reported it. I was using it in a Haystack pipeline that takes a number of PDFs, converts them to Markdown, converts that to HTML, and then feeds the HTML to an already existing HTML parser to try to grab metadata. This was my first attempt at that idea when I noticed it had this problem.
Question: You mentioned this isn't a problem with the latest PyMuPDF. When will the version you are on be released?
This reopens an older issue.
In the attached file, some words are being repeated across multiple lines.
It happens in versions 0.0.17 and also 0.0.16.
A.World.of.Propensities.by.Karl.Popper.1997.pdf