Handling Graphical Images & Superscripts #116
Comments
Please provide the script you used.
Don't let me guess please:
As regards the superscript handling improvement request, I guess what you're looking for is a feature that handles footnotes and footnote references. This would obviously be useful, but it would imply a major refactoring.
A naive approach would mean first detecting superscript text within the body text (this is already done), saving it in some data structure for further processing, then detecting the footnotes on the page and telling them apart from the body text, and finally matching the footnotes with the references. Footnotes are usually located at the bottom of the page while the footnote references sit inside the body text, and pymupdf4llm generates the string linearly. The script would therefore need to use the saved references to try to match the beginning of the lines at the bottom of the page. So far, not that difficult. However, once a footnote has been matched, we would have to go back into the already generated string to create the reference.
There are further complications. Sometimes footnote references are numbered per page, with the index reset on each page. In a single md string for a multi-page document, this would produce ambiguous footnotes and footnote references, so the script would also need to handle renumbering. Some documents also use several kinds of footnote markers at once (e.g. Arabic numbers and Roman numerals, to differentiate the author's footnotes from the publisher's or the translator's), and these would also need to be told apart and tracked in the data structure. Finally, superscript text might also be a reference to an endnote or mark other information (e.g. "tm", a copyright symbol, the "o" in the number sign "no", etc.). All this processing would probably have some performance impact.
So while the feature would obviously be welcome, all this makes it almost a package of its own. I personally think it would be better handled in a separate post-processing script that does only this and does it well, rather than directly in pymupdf4llm.
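To make the naive approach a bit more concrete, here is a minimal sketch of what such a post-processing script could look like. It is not part of pymupdf4llm, and it rests on loud assumptions: the input is a list of per-page Markdown strings, superscripts show up as bracketed labels such as `[1]`, and a footnote is any line that starts with such a label. The function name `link_footnotes` and both regular expressions are hypothetical and would need adapting to the real output.

```python
import re

# Assumption: a footnote definition is a line that starts with "[label] ".
DEFINITION = re.compile(r"^\[(\w+)\]\s+(.+)$")
# Assumption: a footnote reference is a bracketed label anywhere in the body.
REFERENCE = re.compile(r"\[(\w+)\]")


def link_footnotes(pages):
    """Turn per-page bracketed labels into document-wide Markdown footnotes."""
    body_out, notes_out = [], []
    next_id = 1  # numbering is global, so per-page resets cannot collide
    for page in pages:
        labels = {}  # label as printed on this page -> global footnote number
        body_lines = []
        for line in page.splitlines():
            match = DEFINITION.match(line)
            if match:
                labels[match.group(1)] = next_id
                notes_out.append(f"[^{next_id}]: {match.group(2)}")
                next_id += 1
            else:
                body_lines.append(line)

        def replace(match, labels=labels):
            number = labels.get(match.group(1))
            # Labels without a footnote on this page (e.g. "tm") stay as they are.
            return f"[^{number}]" if number else match.group(0)

        body_out.append(REFERENCE.sub(replace, "\n".join(body_lines)))
    return "\n\n".join(body_out + notes_out)


pages = [
    "Alpha[1] beta[tm].\n[1] First note.",
    "Gamma[1] delta[i].\n[1] Second note.\n[i] Translator note.",
]
print(link_footnotes(pages))
```

Working page by page is what resolves the per-page numbering ambiguity: the label `1` on the second page maps to a different global number than the label `1` on the first. The sketch deliberately ignores endnotes, footnotes that continue on the next page, and body lines that happen to start with a bracketed label, which is where most of the real work would be.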
@CedricLor - thank you for your thoughtful assessment on footnotes.
Embedded images are extracted to a dedicated folder, as I observed for some of my documents.
However, the PDF below contains some graphical images that are not extracted to that separate folder.
The PDF also contains superscripts, and these are not linked to the notes they reference.
sample_document.pdf