Slow bbox merging algorithm prevents conversion of some pdfs #204
Comments
The algorithm is indeed not optimized for extreme corner cases with several thousand images. A normal page has maybe a handful of images.
Could you please let me have an example page causing this type of problem?
I am working on a fix and will submit a PR. I cannot share the document.
The O(n log n) algorithm is not vastly more complicated than your implementation, and it does not introduce external dependencies, which you seem to care about (for good reasons), so I kept that in mind, too.
Great, thank you in advance!
This code in pymupdf_rag.py:795 has complexity O(n^2), which is ok if there are about 100 rectangles, but I have some pdfs in which >6000 rectangles are detected. In that case, this algorithm runs virtually forever.
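To make the scaling issue concrete, here is a minimal sketch of a sort-and-sweep merge with union-find. It is not the code in pymupdf_rag.py and not the fix mentioned in the comments. The plain `(x0, y0, x1, y1)` tuples and the names `merge_pass` and `merge_rects` are assumptions made for illustration only. Sorting costs O(n log n); the sweep only compares rectangles whose x-ranges overlap, so it is usually far cheaper than checking all pairs, although it can still degrade to O(n^2) when most rectangles overlap in x.

```python
from typing import List, Tuple

# Hypothetical plain-tuple stand-in for a bbox: (x0, y0, x1, y1)
Rect = Tuple[float, float, float, float]


def merge_pass(rects: List[Rect]) -> List[Rect]:
    """Join rectangles that overlap (directly or transitively) into bounding boxes.

    Rectangles that merely touch at an edge are treated as overlapping.
    """
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])  # sweep by x0
    parent = list(range(len(rects)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    active: List[int] = []  # rectangles whose x-range still reaches the sweep line
    for i in order:
        x0, y0, x1, y1 = rects[i]
        active = [j for j in active if rects[j][2] >= x0]  # drop those ending left of x0
        for j in active:
            if rects[j][1] <= y1 and y0 <= rects[j][3]:  # y-ranges overlap as well
                parent[find(i)] = find(j)
        active.append(i)

    boxes = {}
    for i, (x0, y0, x1, y1) in enumerate(rects):
        root = find(i)
        if root in boxes:
            bx0, by0, bx1, by1 = boxes[root]
            boxes[root] = (min(bx0, x0), min(by0, y0), max(bx1, x1), max(by1, y1))
        else:
            boxes[root] = (x0, y0, x1, y1)
    return list(boxes.values())


def merge_rects(rects: List[Rect]) -> List[Rect]:
    """Repeat merge_pass until a pass merges nothing, so no two outputs overlap."""
    while True:
        merged = merge_pass(rects)
        if len(merged) == len(rects):
            return merged
        rects = merged
```

The outer loop is there because a merged bounding box can cover a rectangle that none of its members touched, so one pass is not always enough. Each repeat strictly reduces the number of boxes, so the loop terminates.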
Two solutions come to mind: