Slow bbox merging algorithm prevents conversion of some pdfs #204

HDembinski · 2024-12-03T11:30:01Z

This code in pymupdf_rag.py:795 has complexity O(n^2), which is ok if there are about 100 rectangles, but I have some pdfs in which >6000 rectangles are detected. In that case, this algorithm runs virtually forever.

        # sort descending by image area size
        img_info.sort(key=lambda i: abs(i["bbox"]), reverse=True)
        # run from back to front (= small to large)
        for i in range(len(img_info) - 1, 0, -1):
            r = img_info[i]["bbox"]
            if r.is_empty:
                del img_info[i]
                continue
            for j in range(i):  # image areas larger than r
                if r in img_info[j]["bbox"]:
                    del img_info[i]  # contained in some larger image
                    break

Two solutions come to mind:

Small images are dropped by default. They should be dropped before trying to merge them into larger images.
It is possible to turn this into an O(n log n) algorithm with a sweep line approach.
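Both ideas can be sketched together. The following is only an illustration, not the code from the linked PR: it assumes rectangles are plain `(x0, y0, x1, y1)` tuples rather than PyMuPDF `Rect` objects, and the function name `drop_small_and_contained` and the parameter `min_size` are hypothetical. It pre-filters small rectangles, then sweeps over the rest in order of their left edge while keeping only those survivors that can still contain something. This is a simplified sweep: it is fast when few rectangles overlap in x, but it is not a guaranteed O(n log n) bound in the worst case.

```python
import heapq


def drop_small_and_contained(rects, min_size=0.0):
    """Return the rectangles that are neither too small nor contained in another.

    rects: iterable of (x0, y0, x1, y1) tuples (an assumption of this sketch).
    min_size: hypothetical threshold; width and height must both exceed it.
    """
    # Step 1: cheap pre-filter, so empty and tiny rectangles never reach the sweep.
    cands = [
        r for r in rects
        if r[2] - r[0] > min_size and r[3] - r[1] > min_size
    ]

    # Step 2: sort so that any possible container comes before what it contains
    # (left edge ascending, then right edge descending, and likewise for y).
    cands.sort(key=lambda r: (r[0], -r[2], r[1], -r[3]))

    kept = []
    active = []  # min-heap of (x1, rect): survivors that may still contain later rects
    for r in cands:
        # Retire survivors whose right edge lies left of r's left edge;
        # they cannot contain r or anything that follows in sort order.
        while active and active[0][0] < r[0]:
            heapq.heappop(active)

        # Every active rect already has x0 <= r's x0, so check the other three edges.
        contained = any(
            a[1] <= r[1] and a[2] >= r[2] and a[3] >= r[3]
            for _, a in active
        )
        if not contained:
            kept.append(r)
            heapq.heappush(active, (r[2], r))
    return kept
```

Contained rectangles are never pushed onto the heap: anything they contain is also contained in their own container, so only survivors need to be compared against. Duplicates are handled the same way as in the original loop, in that one copy survives.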

JorjMcKie · 2024-12-03T17:41:22Z

The algorithm is indeed not optimized for crazy corner cases with several thousand images. Normally there are only a handful of images.
I want to avoid experimenting with new algorithms and would rather reduce the number of candidates by excluding small ones.
Another option may be to simply join all image bounding boxes if the count exceeds some threshold.
But maybe you can contribute an algorithm.
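The threshold fallback mentioned above could look roughly like this. It is a minimal sketch under the assumption that bounding boxes are `(x0, y0, x1, y1)` tuples; the name `join_if_too_many` and the default `max_count=100` are hypothetical and not taken from pymupdf_rag.py.

```python
def join_if_too_many(rects, max_count=100):
    """Collapse all boxes into one enclosing box when there are too many.

    rects: list of (x0, y0, x1, y1) tuples (an assumption of this sketch).
    max_count: hypothetical threshold above which the boxes are joined.
    """
    if len(rects) <= max_count:
        return list(rects)  # few enough boxes: leave them untouched

    # The union box spans the extreme edges of all input boxes.
    x0 = min(r[0] for r in rects)
    y0 = min(r[1] for r in rects)
    x1 = max(r[2] for r in rects)
    y1 = max(r[3] for r in rects)
    return [(x0, y0, x1, y1)]
```

This runs in linear time, at the cost of treating everything inside the union box as a single image area.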

JorjMcKie · 2024-12-03T17:49:10Z

Could you please let me have an example page causing this type of problem?

HDembinski · 2024-12-03T18:31:24Z

I am working on a fix, will submit a PR. I cannot share the document.

HDembinski · 2024-12-03T18:33:08Z

The O(n log n) algorithm is not vastly more complicated than your implementation, and it does not introduce external dependencies, which you guys seem to care about (for good reasons), so I kept that in mind, too.

JorjMcKie · 2024-12-03T20:38:32Z

Great, thank you in advance!

JorjMcKie added the enhancement label (New feature or request) Dec 3, 2024

HDembinski linked a pull request Dec 4, 2024 that will close this issue

Faster image filter #208

Open
