Handling Graphical Images & Superscripts #116

SBhat2615 · 2024-08-26T13:52:49Z

For some documents, I observed that embedded images are extracted to a dedicated folder.

However, some graphical images in the PDF below are not extracted to a separate folder.

The PDF also contains superscripts, which are not referenced.

sample_document.pdf

JorjMcKie · 2024-08-26T13:55:37Z

Please provide the script you used.

SBhat2615 · 2024-08-26T14:00:13Z

> Please provide the script you used.

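import pymupdf4llm

# input_path and output_path are defined elsewhere in the script.
md_text = pymupdf4llm.to_markdown(input_path, write_images=True)

# Write as UTF-8; the context manager closes the file even if writing fails.
with open(output_path, "w", encoding="utf-8") as output:
    output.write(md_text)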

JorjMcKie · 2024-08-26T15:09:23Z

Don't let me guess please:
On which page are you missing what?

SBhat2615 · 2024-08-27T05:43:10Z

> Don't let me guess please: On which page are you missing what?

Figures 1 and 2 are not extracted as images.
Tables 3, 5, and 6 are not extracted as images.

sample_document.md

SBhat2615 · 2024-08-27T05:45:17Z

For superscripts, if we can get output similar to this, that would be good as well.

CedricLor · 2024-09-07T10:44:03Z

Regarding the superscript handling request, I guess what you're looking for is a feature that handles footnotes and footnote references.

This would obviously be useful, but it would require a major refactoring.

A naive approach would first detect superscript text within the body text (this is already in place) and save it in some data structure for further processing, then detect the footnotes on the page and separate them from the body text, and finally match the footnotes with the references.

Footnotes are usually located at the bottom of the page, while the footnote references sit inside the body text, and pymupdf4llm generates the string linearly. The script would therefore need to use the saved references to try to match the beginnings of the lines at the bottom of the page. So far, not that difficult.
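A minimal sketch of that matching step, written as a standalone post-processing function rather than anything pymupdf4llm provides. It assumes that superscript references show up in the page text as bracketed markers such as `[1]` and that each footnote line starts with the same marker. Both conventions, and the name `match_footnotes`, are assumptions made for illustration.

```python
import re

# Assumption: a superscript reference appears as a short bracketed token, e.g. "[1]".
REF_PATTERN = re.compile(r"\[(\w{1,3})\]")


def match_footnotes(page_text: str) -> dict[str, str]:
    """Map each reference marker on one page to the text of its footnote line."""
    lines = page_text.splitlines()
    matches: dict[str, str] = {}
    # dict.fromkeys keeps the markers unique and in order of first appearance.
    for marker in dict.fromkeys(REF_PATTERN.findall(page_text)):
        prefix = f"[{marker}]"
        # Scan from the bottom, because footnotes usually close the page.
        for line in reversed(lines):
            stripped = line.strip()
            if stripped.startswith(prefix) and len(stripped) > len(prefix):
                matches[marker] = stripped[len(prefix):].strip()
                break
    return matches


page = (
    "Water boils at 100 degrees[1] at sea level.[2]\n"
    "\n"
    "[1] Celsius.\n"
    "[2] At standard pressure."
)
notes = match_footnotes(page)
```

Markers without a matching line at the bottom are simply left out of the result, which is one way to tolerate superscripts that are not footnotes.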

However, once a footnote has been matched, we would have to go back into the already generated string to create the reference.
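One way to picture the "go back into the string" step is a two-pass rewrite over the finished page text. The sketch below assumes bracketed markers such as `[1]` for superscripts and emits the `[^1]` footnote syntax that some Markdown flavors support. The function name `link_footnotes` and the marker convention are illustrative assumptions, not pymupdf4llm behavior.

```python
import re

# Assumption: references look like "[1]" and footnote lines like "[1] some text".
NOTE_LINE = re.compile(r"^\[(\w{1,3})\]\s+(\S.*)$")
REFERENCE = re.compile(r"\[(\w{1,3})\]")


def link_footnotes(page_text: str) -> str:
    """Rewrite matched references and footnote lines into [^n] footnote syntax."""
    lines = page_text.splitlines()

    # Pass 1: collect the markers that actually have a footnote line.
    defined = set()
    for line in lines:
        found = NOTE_LINE.match(line.strip())
        if found:
            defined.add(found.group(1))

    def to_reference(ref: re.Match) -> str:
        # Leave markers without a footnote untouched.
        marker = ref.group(1)
        return f"[^{marker}]" if marker in defined else ref.group(0)

    # Pass 2: go back over the text and rewrite definitions and references.
    rewritten = []
    for line in lines:
        found = NOTE_LINE.match(line.strip())
        if found:
            rewritten.append(f"[^{found.group(1)}]: {found.group(2)}")
        else:
            rewritten.append(REFERENCE.sub(to_reference, line))
    return "\n".join(rewritten)


linked = link_footnotes("Text[1] more[9].\n\n[1] Note one.")
```

The second pass is what a linear generator cannot do on its own: the body text has already been emitted by the time the footnote at the bottom of the page is seen.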

Moreover, footnote references are sometimes numbered per page, with the index reset on each page. In a single Markdown string for a multi-page document, footnotes and footnote references would then be ambiguous, so the script would also need to handle possible renumbering.
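A rough sketch of such a renumbering pass, assuming the text is available page by page and that numeric markers appear as `[1]`, `[2]`, and so on. The function name `renumber_pages` is hypothetical, and the sketch would also renumber any other bracketed numbers (for example numeric citations), so a real script would need a stricter way to recognise footnote markers.

```python
import re

# Assumption: numeric footnote markers appear as bracketed digits, e.g. "[1]".
NUMERIC_REF = re.compile(r"\[(\d+)\]")


def renumber_pages(pages: list[str]) -> list[str]:
    """Replace per-page footnote numbers with one document-wide sequence."""
    next_index = 1
    renumbered = []
    for text in pages:
        mapping: dict[str, int] = {}  # page-local number -> document-wide number

        def replace(ref: re.Match) -> str:
            nonlocal next_index
            local = ref.group(1)
            if local not in mapping:
                mapping[local] = next_index
                next_index += 1
            return f"[{mapping[local]}]"

        renumbered.append(NUMERIC_REF.sub(replace, text))
    return renumbered


pages = ["A[1] b[2]\n[1] x\n[2] y", "C[1]\n[1] z"]
merged = "\n\n".join(renumber_pages(pages))
```

Because the mapping is rebuilt for every page, a reference and its footnote on the same page always receive the same new number.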

Some documents also use several kinds of footnote markers at once (e.g. Arabic and Roman numerals, to differentiate the author's footnotes from the publisher's or the translator's), and these would also need to be differentiated and tracked in the data structure.

Finally, superscript text might also refer to endnotes or mark other information (e.g. "tm", the copyright symbol, the "o" in the number sign "no", and so on).
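To illustrate the bookkeeping this implies, here is a small classifier for superscript text. The categories, the list of non-footnote tokens, and the name `classify_marker` are assumptions chosen for the example. A single letter such as "c" is ambiguous (Roman numeral or plain letter) and is treated as Roman here.

```python
import re

# Strict Roman numeral pattern; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)
# Assumption: superscripts that mark something other than a note.
NOT_A_NOTE = {"tm", "©", "®", "o"}
# Assumption: typographic symbols commonly used as footnote markers.
SYMBOLS = {"*", "†", "‡", "§"}


def classify_marker(text: str) -> str:
    """Return a coarse category for a piece of superscript text."""
    token = text.strip().lower()
    if token in NOT_A_NOTE:
        return "not_a_note"
    if re.fullmatch(r"[0-9]+", token):
        return "arabic"
    if ROMAN.match(token):
        return "roman"
    if token in SYMBOLS:
        return "symbol"
    return "other"


kinds = [classify_marker(t) for t in ["3", "iv", "TM", "†", "abc"]]
```

Keeping one counter per category would let author, publisher, and translator notes be tracked separately.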

All this processing would probably have some performance impact.

So while the feature would obviously be welcome, this makes it almost a package of its own. I personally think it would be better handled in a dedicated post-processing script that does only this and does it well, rather than directly in pymupdf4llm.

JorjMcKie · 2024-09-07T11:00:02Z

@CedricLor - thank you for your thoughtful assessment of footnotes.
I totally agree with you:
This is something we will probably never support, for all the reasons you mentioned: it is simply out of scope.

JorjMcKie added the waiting for information label Aug 26, 2024
