Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow bbox merging algorithm prevents conversion of some pdfs #204

Open
HDembinski opened this issue Dec 3, 2024 · 5 comments · May be fixed by #208
Open

Slow bbox merging algorithm prevents conversion of some pdfs #204

HDembinski opened this issue Dec 3, 2024 · 5 comments · May be fixed by #208
Labels
enhancement New feature or request

Comments

@HDembinski
Copy link

This code in pymupdf_rag.py:795 has complexity O(n^2), which is ok if there are about 100 rectangles, but I have some pdfs in which >6000 rectangles are detected. In that case, this algorithm runs virtually forever.

        # sort descending by image area size
        img_info.sort(key=lambda i: abs(i["bbox"]), reverse=True)
        # run from back to front (= small to large)
        for i in range(len(img_info) - 1, 0, -1):
            r = img_info[i]["bbox"]
            if r.is_empty:
                del img_info[i]
                continue
            for j in range(i):  # image areas larger than r
                if r in img_info[j]["bbox"]:
                    del img_info[i]  # contained in some larger image
                    break

Two solutions come to mind:

  • Small images are dropped by default. They should be dropped before trying to merge them into larger images.
  • It is possible to turn this into a O(n log n) with a swipe line algorithm.
@JorjMcKie
Copy link
Contributor

The algorithm is indeed not optimized for crazy corner cases with several thousands of images. Normal are maybe a few handful of images.
I want to avoid experimenting with new algorithms and would rather reduce the number of candidates by excluding small ones.
Another option may be to simply join all image boundary boxes if the count exceeds some threshold.
But maybe you can contribute an algorithm.

@JorjMcKie JorjMcKie added the enhancement New feature or request label Dec 3, 2024
@JorjMcKie
Copy link
Contributor

Could you please let me have an example page causing this type of problem?

@HDembinski
Copy link
Author

I am working on a fix, will submit a PR. I cannot share the document.

@HDembinski
Copy link
Author

The O(n log n) algorithm is not vastly more complicated than your implementation, and it does not introduce external dependencies, which you guys seem to care about (for good reasons), so I took that in mind, too.

@JorjMcKie
Copy link
Contributor

Great, thank you in advance!

@HDembinski HDembinski linked a pull request Dec 4, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants