
Faster image filter #208

Open: wants to merge 2 commits into main
Conversation


@HDembinski commented Dec 4, 2024

Closes #204

This adds a simple fix for my edge case: pages with more than 6000 detected image rectangles. The algorithm that discards rectangles contained in other rectangles churns on such pages practically forever, because of its O(n²) complexity.
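For illustration, the quadratic discard step described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the project's actual code; the tuple representation and function names are assumptions:

```python
# Hypothetical sketch of the quadratic containment filter: every
# rectangle is tested against every other one, so n rectangles cost
# O(n^2) containment checks. Rectangles are (x0, y0, x1, y1) tuples.

def contains(outer, inner):
    """True if rectangle `inner` lies fully inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def discard_contained(rects):
    """Keep only rectangles that are not inside another rectangle."""
    return [r for r in rects
            if not any(o != r and contains(o, r) for o in rects)]
```

With thousands of rectangles, the inner `any(...)` scan over the full list for each rectangle is what makes the step explode.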

Since most of these rectangles are tiny, a few pixels in size, a sufficient fix is to filter out tiny images up front. By default there is already a filter that skips saving images smaller than image_size_limit. I apply that limit before deciding which rectangles to keep. The reasoning: if a small image is not inside a larger one, it would be discarded later anyway; and if it is inside one, removing it early does no harm either.

Since filtering by image size is O(n), this speeds up the code for my edge case.
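A minimal sketch of such an O(n) pre-filter, assuming rectangles as (x0, y0, x1, y1) tuples and assuming image_size_limit is a fraction of the page dimensions (the parameter name comes from the PR; its exact semantics here are an assumption for illustration):

```python
# Hypothetical O(n) pre-filter: one linear pass that drops rectangles
# whose width or height falls below image_size_limit, expressed as a
# fraction of the page width/height.

def prefilter_small(rects, page_width, page_height, image_size_limit=0.05):
    """Return only rectangles at least image_size_limit of the page size."""
    min_w = image_size_limit * page_width
    min_h = image_size_limit * page_height
    return [(x0, y0, x1, y1) for (x0, y0, x1, y1) in rects
            if (x1 - x0) >= min_w and (y1 - y0) >= min_h]
```

Running this pass before the containment check shrinks n drastically on pathological pages, which is where the quadratic step's cost comes from.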

This is a minimal change to solve the problem; please consider accepting it. I made the extraction algorithm configurable via the new parameter "image_extract_algorithm", which selects the new algorithm by default. This is just defensive programming, so that people can keep the old behavior if it is for some reason required, although I cannot see why the new algorithm should give different results from the old one, apart from running faster. In my tests they produced identical results.

I would in fact be so bold as to simply replace the old algorithm with the new one and not provide an option to switch.

@HDembinski (Author) commented Dec 4, 2024

I also tried out a sweepline algorithm, with an implementation from ChatGPT, but it produced different results (it missed some images) and, on top of that, was slower on benign pages (O(n log n) can be slower than O(n²) for small n), so I am not inclined to debug and submit that other algorithm.

@JorjMcKie (Contributor) commented

Thanks a lot for your in-depth considerations. A final request before I can accept your PR:
Please read our contributor license agreement here and accept by adding a comment in this thread with the wording:
"I have read Artifex' Contributor License Agreement and herewith accept it."

@HDembinski (Author) commented

Fine.

I have read Artifex' Contributor License Agreement and herewith accept it.

For the record, I previously contributed to marker-pdf, but did not sign their contributor license agreement, because I think that using free work by the community to improve your product that you want to sell is an unfair practice.

@JorjMcKie JorjMcKie self-requested a review December 6, 2024 08:49
Development

Successfully merging this pull request may close these issues.

Slow bbox merging algorithm prevents conversion of some pdfs