Machine based reading order integration: some refactoring and fixes #142

Open

wants to merge 34 commits into base: machine_based_reading_order_integration
Conversation

@bertsky bertsky commented Dec 11, 2024

I tried to run the new branch on https://github.com/OCR-D/gt_structure_all/tree/main/datasets, but ran into a couple of problems:

  1. CUDA libraries (specifically, libcudnn) could not be installed properly, because the new OCR feature depends on Pytorch, which explicitly depends on (and is dynamically linked against) a newer version of nvidia-cudnn than the one Tensorflow implicitly needs (and is dynamically loaded)
    • fixed by manually downgrading, but to avoid that problem for unsuspecting users, I also made the OCR feature (and its dependencies) into an optional feature; same goes for matplotlib, which could drag in X11 libs IIRC
  2. CUDA OOM with dir_in mode after a few hundred pages
    • fixed by ensuring no models are reloaded in that mode, and adding some gc.collect
    • probably also helped by removing an explicit del (I don't recall whether this made a difference)
  3. no log output
    • fixed by setting the log level for our actual logger eynollah instead of ocrd_utils.setOverrideLogLevel (which only affects ocrd.* and some preconfigured loggers)
  4. no deskewing (always 0° results)
    • fixed by correct indentation for aggregating results (was behind exception handler)
  5. non-termination (sleep state) after a few hundred pages
    • fixed by using multiprocessing.Pool instead of custom Process/Queue loops for deskewing and contour extraction
  6. segfault after a few hundred pages
    • fixed by avoiding loop body (leading towards cv2.resize on zero-channel label array) if no text regions detected
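
The fix for item 5 can be sketched as follows. This is a minimal, hypothetical example (`estimate_angle` merely stands in for the actual per-image deskew computation), not the eynollah code itself:

```python
from multiprocessing import Pool

def estimate_angle(args):
    # hypothetical stand-in for the real per-image deskew computation
    index, values = args
    return index, sum(values) / len(values)

def deskew_all(batches, processes=2):
    # Pool's context manager terminates and joins its workers on exit,
    # so no child process is left sleeping on a Queue (the
    # non-termination symptom above); pool.map also preserves the
    # input order in its results.
    with Pool(processes=processes) as pool:
        results = pool.map(estimate_angle, enumerate(batches))
    return [angle for _, angle in results]
```

Compared with hand-rolled Process/Queue loops, the pool owns the full worker lifecycle, which is what removes the stuck-sleep failure mode.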

To achieve that, I had to simplify and refactor here and there to make the code readable (to me): there were lots of extremely long lines, code duplication, unnecessary indentation, etc.

In particular, I rewrote the parallel subprocessing by utilising concurrent.futures.ProcessPoolExecutor, and maximally reusing the executor instance to avoid the overhead of setting up processes, queues and threads. In my measurements, this reduced the average runtime per page from 26.8 secs to 14.3 secs. GPU utilization is still peaky, though:

(Figure: GPU utilization timeline, eynollah-light-cl-ocrd-gtsa-pool)

(This interval was taken over 9 min or a few dozen pages.) I will address CPU-GPU pipelining another time.
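
The executor-reuse idea behind that measurement can be sketched like this (a toy illustration with an assumed `work` function and `PageProcessor` class, not the actual eynollah code):

```python
from concurrent.futures import ProcessPoolExecutor

def work(x):
    # toy stand-in for a per-contour computation
    return x * x

class PageProcessor:
    # Hold one executor for the whole run: creating a pool spawns
    # processes, queues and feeder threads, so paying that setup cost
    # once instead of once per page is where the speedup comes from.
    def __init__(self, workers=2):
        self.executor = ProcessPoolExecutor(max_workers=workers)

    def process_page(self, items):
        # submit per-page work to the long-lived pool
        return list(self.executor.map(work, items))

    def shutdown(self):
        self.executor.shutdown(wait=True)
```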

I also added the detected deskewing angle to the regions as @orientation attribute.

Moreover, I introduced --overwrite to ignore existing output XMLs, and changed the default behaviour to skip them (so one can easily complete a directory if a previous run failed or new images were added).
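
The skip/overwrite logic could look roughly like this (hypothetical helper; only the `--overwrite` flag comes from the PR, the function and parameter names are assumed):

```python
import os

def should_process(image_name, out_dir, overwrite=False):
    # Derive the expected output PAGE-XML name from the image name and
    # skip the page if that file already exists, unless --overwrite was
    # given. This lets a rerun complete a partially processed directory.
    out_xml = os.path.join(out_dir, os.path.splitext(image_name)[0] + ".xml")
    return overwrite or not os.path.exists(out_xml)
```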

@vahidrezanezhad, since there are so many large, but rather cosmetic diffs, I recommend going through the changes commit by commit instead of the aggregated file by file view.

bertsky and others added 30 commits December 4, 2024 15:57
Comment on lines 4491 to 4494:

```python
img_res, is_image_enhanced, num_col_classifier, num_column_is_classified = self.run_enhancement(self.light_version)
self.logger.info("Enhancing took %.1fs ", time.time() - t0)
#print("text region early -1 in %.1fs", time.time() - t0)
t1 = time.time()
```
IMO it does not make sense to move that from the lower indentation level to all the conditional branches in a copycat fashion. img_res seems to be needed everywhere below, so why not just compute it once here?

(See follow-up comments below.)

@bertsky bertsky Dec 23, 2024
In cfc6512 I have moved it up again, so that code does not need to be repeated.

bertsky commented Dec 23, 2024

I dug through most of the code base, still watching out for places that might act like a memory leak. To be able to read and understand the code, in 335aa27 I had to do further simplification and styling (esp. wrapping overlong lines) – I hope you don't mind these changes. I have tested them in various modes, no differences in output so far.

Note: in 0ae28f7 I switched from the stdlib ProcessPoolExecutor to loky, which is the origin of many bugfixes that entered the stdlib between Python 3.9 and 3.11; in my experience those fixes are needed for robustness, but they won't be backported to 3.8. (When, in the distant future, we have moved to 3.11 anyway, we can drop that dependency and switch back to the stdlib.)
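
For illustration, loky's reusable executor is roughly a drop-in for the stdlib pool. A minimal sketch, with a fallback so it also runs where loky is not installed (`get_reusable_executor` is loky's real API; the worker function is a toy):

```python
try:
    from loky import get_reusable_executor
except ImportError:
    # fallback so the sketch also runs without loky installed
    from concurrent.futures import ProcessPoolExecutor
    _pool = None
    def get_reusable_executor(max_workers=2, **kwargs):
        global _pool
        if _pool is None:
            _pool = ProcessPoolExecutor(max_workers=max_workers)
        return _pool

def work(x):
    # toy stand-in for a per-page computation
    return x + 1

# repeated calls hand back the same executor, so worker processes
# (and any state they have loaded) are reused across pages
executor = get_reusable_executor(max_workers=2)
results = list(executor.map(work, [1, 2, 3]))
```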

I am still hunting the OOM failures (via instrumentation and monitoring), so stay tuned. (I'll judiciously compare performance gains/losses of the recent changes as soon as the code is sufficiently stable.)
