image extraction broken in 0.17, worked on 0.16 #163

Open
kingennio opened this issue Oct 6, 2024 · 7 comments

@kingennio

I think the new version has introduced a bug in the output_images function: several images are no longer extracted.

The behavior is consistent across documents, but for demonstration consider this slide:
slide.pdf
With v0.0.16 two images are extracted, the photo and the logo. With v0.0.17 only the logo is extracted, not the main photo.
I stepped through the code. The problem appears to be that the loop deletes images from the list as they are extracted, while it is still iterating over that same list.

In 0.0.16, the loop iterated over a new sorted list built from the items, not over the original collection:
    for i, img_rect in sorted(
        [j for j in img_rects.items() if j[1].y1 <= text_rect.y0],
        key=lambda j: (j[1].y1, j[1].x0),
    ):

whereas 0.0.17 iterates directly over the original list:
    for i, img_rect in enumerate(parms.img_rects):
        if not img_rect.y1 <= text_rect.y0:
            continue

so when the image is deleted inside the loop
    del parms.img_rects[i]  # do not touch this image twice

every later item shifts one position to the left, but the iterator still advances to the next index, so the item that moved into the freed slot is skipped. In my example there are 2 images. The first is the logo, and it is extracted and then deleted. The photo moves to index 0, the loop moves on to index 1, finds that the list now has only one item, and exits without ever processing the photo.
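
The effect can be reproduced outside the library. Below is a minimal sketch that uses plain strings in place of image rectangles; the function name and the sample data are made up for illustration and are not part of pymupdf4llm.

```python
def process_with_inplace_delete(items):
    """Mimic the loop pattern: delete each item while enumerating the same list."""
    items = list(items)  # work on a copy so the caller's list is untouched
    processed = []
    for i, item in enumerate(items):
        processed.append(item)  # stands in for "save the image"
        del items[i]            # shifts all later items one slot to the left
    return processed, items


processed, remaining = process_with_inplace_delete(["logo", "photo"])
print(processed)  # ['logo']
print(remaining)  # ['photo']
```

With two items only the first one is processed, which matches what I see on the slide: the logo is saved and the photo is left behind in the list.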

@kingennio
Author

kingennio commented Oct 7, 2024

I think I fixed the code by keeping track of the indices to remove and deleting them only after the loops have finished (the changes are the processed_images lines and the commented-out del statements):

def output_images(parms, text_rect):
    """Output images and graphics above text rectangle."""
    if not parms.img_rects:
        return ""
    this_md = ""  # markdown string
    processed_images = []  # List to keep track of processed images
    if text_rect is not None:  # select images above the text block
        for i, img_rect in enumerate(parms.img_rects):
            if not img_rect.y1 <= text_rect.y0:
                continue
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):  # was there text at all?
                    this_md += img_txt
            #del parms.img_rects[i]  # do not touch this image twice
            processed_images.append(i)
    else:  # output all remaining images
        for i, img_rect in enumerate(parms.img_rects):
            pathname = save_image(parms.page, img_rect, i)
            if pathname:
                this_md += GRAPHICS_TEXT % pathname
            if force_text:
                img_txt = write_text(
                    parms,
                    img_rect,
                    tabs=None,
                    tab_rects={},  # we have no tables here
                    img_rects=[],  # we have no other images here
                    force_text=True,
                )
                if not is_white(img_txt):
                    this_md += img_txt
            #del parms.img_rects[i]  # do not touch this image twice
            processed_images.append(i)

    # Remove processed images from parms.img_rects after the loop
    for i in sorted(processed_images, reverse=True):
        del parms.img_rects[i]
    return this_md
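
The same idea in isolation, as a hedged sketch rather than the library code: record the indices during the loop and delete them afterwards in descending order, so that removing one entry never shifts an index that is still waiting to be removed. The function name, the should_process callback, and the sample strings are assumptions made for this example.

```python
def process_with_deferred_delete(items, should_process):
    """Process matching items, then remove them from `items` after the loop."""
    processed = []
    done = []  # indices of items handled in this call
    for i, item in enumerate(items):
        if not should_process(item):
            continue
        processed.append(item)  # stands in for "save the image"
        done.append(i)
    # Delete from the back so earlier indices stay valid
    for i in sorted(done, reverse=True):
        del items[i]
    return processed


rects = ["logo", "photo", "footer"]
out = process_with_deferred_delete(rects, lambda name: name != "footer")
print(out)    # ['logo', 'photo']
print(rects)  # ['footer']
```

Both matching items are processed, and only the unprocessed one remains in the list for a later call.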

@luc42ei

luc42ei commented Oct 14, 2024

yep, I have the same issue

@PedroFCM

I got the same problem with two images on the same PDF page.
@kingennio's code solved it for me.

@greengeek

I am seeing this same issue as well in pymupdf4llm 0.0.17 using Python 3.12.4

@JorjMcKie
Contributor

@kingennio Thank you for your contribution, I will include the idea in the next version.

@rajuptvs

I had similar issues. Downgrading from pymupdf4llm 0.0.17 to pymupdf4llm==0.0.16 let me extract all the images.
For anybody facing missing images during extraction:

  • try downgrading to 0.0.16 to see if it helps (this helped me a ton!!)

@HDembinski

It looks like there is a fix for this. Why is this issue not closed?


7 participants