Closes #204
This adds a simple fix for an edge case I ran into: pages with more than 6000 detected image rectangles. The algorithm that discards rectangles contained in other rectangles effectively never finishes on such pages because of its O(n*n) complexity.
Since most of these rectangles are tiny, only a few pixels in size, a sufficient fix is to filter out tiny images first. By default there is already a filter that skips saving images smaller than image_size_limit. I now apply that limit before deciding which rectangles to keep. The reasoning: a small image that is not inside a larger one would not be stored afterwards anyway, and a small image that is inside a larger one would be discarded by the containment check, so removing it early does no harm.
Since filtering by image size is O(n), this speeds up the code for my edge case.
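To illustrate the idea, here is a minimal sketch and not the code from this PR. It assumes rectangles are plain (x0, y0, x1, y1) tuples and that the size limit is an absolute minimum side length in the same units. The names filter_small, drop_contained and min_size are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def filter_small(rects: List[Rect], min_size: float) -> List[Rect]:
    """O(n) pass: keep rectangles whose width and height are both >= min_size."""
    return [
        r for r in rects
        if (r[2] - r[0]) >= min_size and (r[3] - r[1]) >= min_size
    ]


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies completely within outer (edges may touch)."""
    return (
        outer[0] <= inner[0] and outer[1] <= inner[1]
        and outer[2] >= inner[2] and outer[3] >= inner[3]
    )


def drop_contained(rects: List[Rect]) -> List[Rect]:
    """O(n*n) pass: drop every rectangle that lies inside another one.

    Of several identical rectangles, only the first is kept.
    """
    kept = []
    for i, r in enumerate(rects):
        covered = any(
            j != i and contains(o, r) and (o != r or j < i)
            for j, o in enumerate(rects)
        )
        if not covered:
            kept.append(r)
    return kept


rects = [
    (0, 0, 100, 100),      # large image
    (10, 10, 20, 20),      # inside the large image
    (200, 200, 202, 202),  # tiny, stand-alone
    (300, 300, 400, 400),  # large image
]

# Old order: quadratic pass over everything, size filter afterwards.
old_result = filter_small(drop_contained(rects), min_size=5)
# New order: cheap size filter first, quadratic pass on what remains.
new_result = drop_contained(filter_small(rects, min_size=5))
```

In this sketch both orders keep the same two large rectangles, but the quadratic pass in the new order only sees the rectangles that survive the size filter.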
This is a minimal change to solve the problem, please accept it. I made the extraction algorithm configurable via the parameter "image_extract_algorithm", which selects the new algorithm by default. This is purely defensive programming, so that people can keep the old behavior if it is required for some reason, although I see no reason why the new algorithm should give different results from the old one, apart from running faster. In my tests both produced the same results.
I would in fact be so bold as to simply replace the old algorithm with the new one and not provide an option to switch.