Machine based reading order integration: some refactoring and fixes #142

Open

wants to merge 34 commits into base: machine_based_reading_order_integration
Conversation

@bertsky bertsky commented Dec 11, 2024

I tried to run the new branch on https://github.com/OCR-D/gt_structure_all/tree/main/datasets, but ran into a couple of problems:

  1. CUDA libraries (specifically, libcudnn) could not be installed properly, because the new OCR feature depends on Pytorch, which explicitly depends on (and is dynamically linked against) a newer version of nvidia-cudnn than the one Tensorflow implicitly needs (and is dynamically loaded)
    • fixed by manually downgrading, but to avoid that problem for unsuspecting users, I also made the OCR feature (and its dependencies) into an optional feature; same goes for matplotlib, which could drag in X11 libs IIRC
  2. CUDA OOM with dir_in mode after a few hundred pages
    • fixed by ensuring no models are reloaded in that mode, and adding some gc.collect
    • probably also helped by removing an explicit del (I don't recall whether this made a difference)
  3. no log output
    • fixed by setting the log level for our actual logger eynollah instead of ocrd_utils.setOverrideLogLevel (which only affects ocrd.* and some preconfigured loggers)
  4. no deskewing (always 0° results)
    • fixed by correct indentation for aggregating results (was behind exception handler)
  5. non-termination (sleep state) after a few hundred pages
    • fixed by using multiprocessing.Pool instead of custom Process/Queue loops for deskewing and contour extraction
  6. segfault after a few hundred pages
    • fixed by avoiding loop body (leading towards cv2.resize on zero-channel label array) if no text regions detected
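
The fix for item 5 can be sketched as follows. This is a minimal, hypothetical example (`estimate_angle` merely stands in for the actual per-image deskew computation), not the eynollah code itself:

```python
from multiprocessing import Pool

def estimate_angle(args):
    # hypothetical stand-in for the real per-image deskew computation
    index, values = args
    return index, sum(values) / len(values)

def deskew_all(batches, processes=2):
    # Pool's context manager terminates and joins its workers on exit,
    # so no child process is left sleeping on a Queue (the
    # non-termination symptom above); pool.map also preserves the
    # input order in its results.
    with Pool(processes=processes) as pool:
        results = pool.map(estimate_angle, enumerate(batches))
    return [angle for _, angle in results]
```

Compared with hand-rolled Process/Queue loops, the pool owns the full worker lifecycle, which is what removes the stuck-sleep failure mode.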

To achieve that, I had to simplify and refactor here and there to make the code readable (to me): there were lots of extremely long lines, code duplication, unnecessary indentation, etc.

In particular, I rewrote the parallel subprocessing by utilising concurrent.futures.ProcessPoolExecutor, and maximally reusing the executor instance to avoid the overhead of setting up processes, queues and threads. In my measurements, this reduced the average runtime per page from 26.8 secs to 14.3 secs. GPU utilization is still peaky, though:

(Figure: GPU utilization timeline, eynollah-light-cl-ocrd-gtsa-pool)

(This interval was taken over 9 min or a few dozen pages.) I will address CPU-GPU pipelining another time.
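
The executor-reuse idea behind that measurement can be sketched like this (a toy illustration with an assumed `work` function and `PageProcessor` class, not the actual eynollah code):

```python
from concurrent.futures import ProcessPoolExecutor

def work(x):
    # toy stand-in for a per-contour computation
    return x * x

class PageProcessor:
    # Hold one executor for the whole run: creating a pool spawns
    # processes, queues and feeder threads, so paying that setup cost
    # once instead of once per page is where the speedup comes from.
    def __init__(self, workers=2):
        self.executor = ProcessPoolExecutor(max_workers=workers)

    def process_page(self, items):
        # submit per-page work to the long-lived pool
        return list(self.executor.map(work, items))

    def shutdown(self):
        self.executor.shutdown(wait=True)
```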

I also added the detected deskewing angle to the regions as @orientation attribute.

Moreover, I introduced --overwrite to ignore existing output XMLs, and changed the default behaviour to skip them (so one can easily complete a directory if a previous run failed or new images were added).
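
The skip/overwrite logic could look roughly like this (hypothetical helper; only the `--overwrite` flag comes from the PR, the function and parameter names are assumed):

```python
import os

def should_process(image_name, out_dir, overwrite=False):
    # Derive the expected output PAGE-XML name from the image name and
    # skip the page if that file already exists, unless --overwrite was
    # given. This lets a rerun complete a partially processed directory.
    out_xml = os.path.join(out_dir, os.path.splitext(image_name)[0] + ".xml")
    return overwrite or not os.path.exists(out_xml)
```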

@vahidrezanezhad, since there are so many large, but rather cosmetic diffs, I recommend going through the changes commit by commit instead of the aggregated file by file view.

bertsky and others added 30 commits December 4, 2024 15:57
Comment on lines 4491 to 4494:

```python
img_res, is_image_enhanced, num_col_classifier, num_column_is_classified = self.run_enhancement(self.light_version)
self.logger.info("Enhancing took %.1fs ", time.time() - t0)
#print("text region early -1 in %.1fs", time.time() - t0)
t1 = time.time()
```
IMO it does not make sense to move that from the lower indentation level to all the conditional branches in a copycat fashion. img_res seems to be needed everywhere below, so why not just compute it once here?

(See follow-up comments below.)

@bertsky bertsky Dec 23, 2024
In cfc6512 I have moved it up again, so that code does not need to be repeated.

bertsky commented Dec 23, 2024

I dug through most of the code base, still watching out for places that might act like a memory leak. To be able to read and understand the code, in 335aa27 I had to do further simplification and styling (esp. wrapping overlong lines) – I hope you don't mind these changes. I have tested them in various modes, no differences in output so far.

Note: in 0ae28f7 I switched from the stdlib ProcessPoolExecutor to loky, which is the origin of many bugfixes that entered the stdlib between Python 3.9 and 3.11; in my experience those fixes are needed for robustness, but they won't be backported to 3.8. (When, in the distant future, we have moved to 3.11 anyway, we can drop that dependency and switch back to the stdlib.)
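
For illustration, loky's reusable executor is roughly a drop-in for the stdlib pool. A minimal sketch, with a fallback so it also runs where loky is not installed (`get_reusable_executor` is loky's real API; the worker function is a toy):

```python
try:
    from loky import get_reusable_executor
except ImportError:
    # fallback so the sketch also runs without loky installed
    from concurrent.futures import ProcessPoolExecutor
    _pool = None
    def get_reusable_executor(max_workers=2, **kwargs):
        global _pool
        if _pool is None:
            _pool = ProcessPoolExecutor(max_workers=max_workers)
        return _pool

def work(x):
    # toy stand-in for a per-page computation
    return x + 1

# repeated calls hand back the same executor, so worker processes
# (and any state they have loaded) are reused across pages
executor = get_reusable_executor(max_workers=2)
results = list(executor.map(work, [1, 2, 3]))
```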

I am still hunting the OOM failures (via instrumentation and monitoring), so stay tuned. (I'll judiciously compare performance gains/losses of the recent changes as soon as the code is sufficiently stable.)
