-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Machine based reading order integration: some refactoring and fixes #142
base: machine_based_reading_order_integration
Are you sure you want to change the base?
Conversation
…_contours_in_image: use mp.Pool, simplify
src/eynollah/eynollah.py
Outdated
img_res, is_image_enhanced, num_col_classifier, num_column_is_classified = self.run_enhancement(self.light_version) | ||
self.logger.info("Enhancing took %.1fs ", time.time() - t0) | ||
#print("text region early -1 in %.1fs", time.time() - t0) | ||
t1 = time.time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it does not make sense to move that from the lower indentation level to all the conditional branches in a copycat fashion. img_res
seems to be needed everywhere below, so why not just compute it once here?
(See follow-up comments below.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cfc6512 I have moved it up again, so that code does not need to be repeated.
I dug through most of the code base, still watching out for places that might act like a memory leak. To be able to read and understand the code, in 335aa27 I had to do further simplification and styling (esp. wrapping overlong lines) – I hope you don't mind these changes. I have tested them in various modes, no differences in output so far. Note: in 0ae28f7 I switched from stdlib ProcessPoolExecutor to loky, which is the origin of many bugfixes in stdlib between Python 3.9 and 3.11, which in my experience are needed for robustness, but won't be backported for 3.8. (When in the distant future we will have moved to 3.11 anyway, we can remove that dependency and switch back to stdlib.) I am still hunting the OOM failures (via instrumentation and monitoring), so stay tuned. (I'll judiciously compare performance gains/losses of the recent changes as soon as the code is sufficiently stable.) |
I tried to run the new branch on https://github.com/OCR-D/gt_structure_all/tree/main/datasets, but ran into a couple of problems:
dir_in
mode after a few hundred pagesgc.collect
del
(don't recall if this made a difference)eynollah
instead ofocrd_utils.setOverrideLogLevel
(which only affectsocrd.*
and some preconfigured loggers)cv2.resize
on zero-channel label array) if no text regions detectedIn doing so, or rather in order to achieve that, I had to simplify and refactor here and there to make it readable (to me). There were lots of extremely long lines, code duplication, unnecessary indentation etc.
In particular, I rewrote the parallel subprocessing by utilising concurrent.futures.ProcessPoolExecutor, and maximally reusing the executor instance to avoid the overhead of setting up processes, queues and threads. In my measurements, this reduced the average runtime per page from 26.8 secs to 14.3 secs. GPU utilization is still peaky, though:
(This interval was taken over 9 min or a few dozen pages.) I will address CPU-GPU pipelining another time.
I also added the detected deskewing angle to the regions as
@orientation
attribute.Moreover, I introduced
--overwrite
to ignore existing output XMLs, and changed the default behaviour to skip them (so one can easily complete a directory if a previous run failed or new images were added).@vahidrezanezhad, since there are so many large, but rather cosmetic diffs, I recommend going through the changes commit by commit instead of the aggregated file by file view.