
Faster image filter #208

Open: wants to merge 2 commits into main
Conversation


@HDembinski commented Dec 4, 2024

Closes #204

This adds a simple fix for my edge case: pages with more than 6000 detected image rectangles. The algorithm that discards rectangles contained in other rectangles churns on such pages practically forever, because of its O(n²) complexity.
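For illustration, the quadratic discard step described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the project's actual code; the tuple representation and function names are assumptions:

```python
# Hypothetical sketch of the quadratic containment filter: every
# rectangle is tested against every other one, so n rectangles cost
# O(n^2) containment checks. Rectangles are (x0, y0, x1, y1) tuples.

def contains(outer, inner):
    """True if rectangle `inner` lies fully inside rectangle `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def discard_contained(rects):
    """Keep only rectangles that are not inside another rectangle."""
    return [r for r in rects
            if not any(o != r and contains(o, r) for o in rects)]
```

With thousands of rectangles, the inner `any(...)` scan over the full list for each rectangle is what makes the step explode.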

Since most of these rectangles are tiny, a few pixels in size, a sufficient fix is to filter out tiny images up front. By default there is already a filter that skips saving images smaller than image_size_limit. I apply that limit before deciding which rectangles to keep. The reasoning: if a small image is not inside a larger one, it would be discarded later anyway; and if it is inside one, removing it early does no harm either.

Since filtering by image size is O(n), this speeds up the code for my edge case.
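A minimal sketch of such an O(n) pre-filter, assuming rectangles as (x0, y0, x1, y1) tuples and assuming image_size_limit is a fraction of the page dimensions (the parameter name comes from the PR; its exact semantics here are an assumption for illustration):

```python
# Hypothetical O(n) pre-filter: one linear pass that drops rectangles
# whose width or height falls below image_size_limit, expressed as a
# fraction of the page width/height.

def prefilter_small(rects, page_width, page_height, image_size_limit=0.05):
    """Return only rectangles at least image_size_limit of the page size."""
    min_w = image_size_limit * page_width
    min_h = image_size_limit * page_height
    return [(x0, y0, x1, y1) for (x0, y0, x1, y1) in rects
            if (x1 - x0) >= min_w and (y1 - y0) >= min_h]
```

Running this pass before the containment check shrinks n drastically on pathological pages, which is where the quadratic step's cost comes from.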

This is a minimal change to solve the problem; please consider accepting it. I made the extraction algorithm configurable via the new parameter "image_extract_algorithm", which selects the new algorithm by default. This is just defensive programming, so that people can keep the old behavior if it is for some reason required, although I cannot see why the new algorithm should give different results from the old one, apart from running faster. In my tests they produced identical results.

I would in fact be so bold as to simply replace the old algorithm with the new one and not provide an option to switch.

@HDembinski (Author) commented Dec 4, 2024

I also tried out a sweepline algorithm, with an implementation from ChatGPT, but it produced different results (it missed some images) and, on top of that, was slower on benign pages (O(n log n) can be slower than O(n²) for small n), so I am not inclined to debug and submit that other algorithm.

@JorjMcKie (Contributor) commented

Thanks a lot for your in-depth considerations. A final request before I can accept your PR:
Please read our contributor license agreement here and accept by adding a comment in this thread with the wording:
"I have read Artifex' Contributor License Agreement and herewith accept it."

@HDembinski (Author) commented

Fine.

I have read Artifex' Contributor License Agreement and herewith accept it.

For the record, I previously contributed to marker-pdf, but did not sign their contributor license agreement, because I think that using free work by the community to improve your product that you want to sell is an unfair practice.

@JorjMcKie JorjMcKie self-requested a review December 6, 2024 08:49
Development

Successfully merging this pull request may close these issues.

Slow bbox merging algorithm prevents conversion of some pdfs