override instead of wrap TesserocrRecognize in other processors #191

bertsky · 2023-03-14T17:57:01Z

fixes #190

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

- instead of wrapping a foreign instance and delegating the `process` call, inherit `process` but reparse parameters after rewriting them in the constructor - only overwrite the `process` docstring (This is necessary because the inner instance had its own `workspace` instance. Also, we don't want to rewrite `moduledir` and the version parser everywhere.)

bertsky · 2023-03-14T23:16:13Z

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

Ah, it actually did! (We just had no event on this repo trigger the CI since the last get_processor / run_processor changes in core.)

bertsky · 2023-03-14T23:28:49Z

Turns out I had to do a lot more to fix the CI:

build based on ocrd/core but with PIP and PYTHON variables did not work (anymore?) and is not useful: we want to test various Python versions natively (without the hassle of OS dependencies for them), so the base must be a matrix of Python CI images
Python 3.6 now (being end-of-life) builds much faster if older Numpy/OpenCV binaries get installed beforehand
the PPA for Tesseract does not work (and is not necessary) for the most up-to-date Python versions (i.e. OS versions), so this step (including the related apt-get update) must be optional
the PPA version of Tesseract has its tessdata directory (now identical to our ocrd-tesserocr moduledir, where resmgr wants to install additional models into) owned by root, so this must become writable for others

bertsky · 2023-03-15T00:05:54Z

Ok, I ran into OCR-D/core#998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

codecov · 2023-03-15T00:20:18Z

Codecov Report

Merging #191 (148d015) into master (c10f94d) will decrease coverage by 26.98%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #191       +/-   ##
==========================================
- Coverage   26.97%   0.00%   -26.98%     
==========================================
  Files          11      12        +1     
  Lines        1416    1377       -39     
  Branches      333     346       +13     
==========================================
- Hits          382       0      -382     
- Misses        981    1371      +390     
+ Partials       53       6       -47

Impacted Files	Coverage Δ
ocrd_tesserocr/binarize.py	`0.00% <0.00%> (-15.12%)`	⬇️
ocrd_tesserocr/config.py	`0.00% <ø> (-100.00%)`	⬇️
ocrd_tesserocr/crop.py	`0.00% <0.00%> (-14.29%)`	⬇️
ocrd_tesserocr/deskew.py	`0.00% <0.00%> (-13.52%)`	⬇️
ocrd_tesserocr/fontshape.py	`0.00% <0.00%> (-17.83%)`	⬇️
ocrd_tesserocr/recognize.py	`0.00% <0.00%> (-25.38%)`	⬇️
ocrd_tesserocr/segment.py	`0.00% <0.00%> (-36.00%)`	⬇️
ocrd_tesserocr/segment_line.py	`0.00% <0.00%> (-96.16%)`	⬇️
ocrd_tesserocr/segment_region.py	`0.00% <0.00%> (-96.43%)`	⬇️
ocrd_tesserocr/segment_table.py	`0.00% <0.00%> (-34.62%)`	⬇️
... and 1 more

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

joschrew

I have tested this code together with the processing-server PR and everything works now. So this PR seems to solve my problems regarding the tesser-ocrd-processors.
Additionally I have briefly gone through the python code changes (I ignored all ci-cd changes). I didn't find anything which I could remark here (except my question).

ocrd_tesserocr/segment_region.py

bertsky · 2023-03-15T14:20:42Z

@kba if you make a release out of OCR-D/core#999 today, I'll revert b357217 here and require that new core version.

kba · 2023-03-15T17:37:57Z

@kba if you make a release out of OCR-D/core#999 today, I'll revert b357217 here and require that new core version.

v2.47.0 has been released now.

stweil · 2023-03-15T17:43:26Z

v2.47.0 has been released now.

Too early, because we need OCR-D/core#1004.

bertsky · 2023-03-15T17:53:41Z

v2.47.0 has been released now.

Thanks! I'll adapt...

Too early, because we need OCR-D/core#1004.

This is not about Docker, but Python API.

This reverts commit b357217.

kba · 2023-03-15T18:04:18Z

v2.47.0 has been released now.

Too early, because we need OCR-D/core#1004.

Hotfix incoming.

bertsky · 2023-03-15T18:26:37Z

Ok, I ran into OCR-D/core#998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

v2.47.0 has been released now.

Thanks! I'll adapt...

Odd. CI failure did come back:

  File "/home/circleci/project/ocrd_tesserocr/cli.py", line 23, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 95, in run_processor
    instance_caching=instance_caching
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 329, in get_processor
    parameter=parameter
  File "/home/circleci/project/ocrd_tesserocr/segment_region.py", line 17, in __init__
    self.parameter['overwrite_segments'] = self.parameter['overwrite_regions']
KeyError: 'overwrite_regions'

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

bertsky · 2023-03-15T19:32:41Z

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

Aha! It turns out that my OCR-D/core#999 was premature: we also pass an unnecessary ocrd_tool in other places:

run_processor kwarg: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/helpers.py#L38
run_processor calling get_processor: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/helpers.py#L91
get_processor kwarg: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/helpers.py#L306
get_processor calling constructor: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/helpers.py#L325

Now, the big story here is: none of these places is needed (or used) – neither currently, nor for OCR-D/core#974 nor OCR-D/core#884!

joschrew · 2023-03-16T08:02:25Z

Other questions (hopefully it is ok to ask it here):

What is the reason to do this kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL]) instead of doing this: kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL] (this is the way it was done before). Why shouldn't the ocrd_tool always be set in the constructor?
I tried to get the idea behind the ocrd-tool-setting. Is it right that a processor-constructor is supposed to set the ocrd_tool always before calling the super-constructor? I have only viewed a few example processors and I am wondering if this (setting ocrd-tool in a processors __init__() before calling the super constructor) is something like a rule (In Java you would probably use something like an abstract method in the processor-base-class?)?

bertsky · 2023-03-16T11:48:40Z

@joschrew

What is the reason to do this kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL]) instead of doing this: kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL] (this is the way it was done before). Why shouldn't the ocrd_tool always be set in the constructor?

The reason is that here in ocrd_tesserocr many processors are basically just re-parameterizations (simplifications) of the "goliath" ocrd-tesserocr-recognize. This helps the user cope with complexity (and avoids code duplication while sustaining the older single-step processors) – see README.

To implement this, you have to know that ocrd.Processor does parameter parsing/instantiation/validation on the constructor:
https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/base.py#L146-L150

So you need to

enter with the (possibly empty) params for the wrapper processor,
let them validate (instantiate) against the wrapper processor's tool json,
translate them to the TesserocrRecognize's tool json,
re-validate (instantiate) against that tool json.

Now, in the old pattern, where every ocrd.Processor's subclass constructor just overwrites the kwarg for ocrd_tool with its own local tool json (from pkg_resources), we cannot bring TesserocrRecognize to accept any foreign tool json. Hence the setdefault pattern, which IMO is also better in general.

I tried to get the idea behind the ocrd-tool-setting. Is it right that a processor-constructor is supposed to set the ocrd_tool always before calling the super-constructor?

Yes, it must. Otherwise, the parameter validator in the superclass won't see the actual cmdline values. Also, the non-processing contexts (help, dump json, show resources) in the superclass constructor need to see the actual tool json.

I have only viewed a few example processors and I am wondering if this (setting ocrd-tool in a processors __init__() before calling the super constructor) is something like a rule

We could of course change the API in core to separate the parameter instantiation, but that would entail changing all processors (lots of diverse codebases with various maintainers). We have to do that for the new process_page API soon anyway, so it's a good time to think about alternatives...

(In Java you would probably use something like an abstract method in the processor-base-class?)?

Yes. We could define a method (say) init which (in the subclasses) loads the tool json, and gets called (by the superclass) before most of the work done in the current superclass constructor.

For the processor delegation pattern then, since in the new API, we will also have a setup for the processing context of a processor instance, we could do parameter re-instantiation for the new superclass there. So the subclasses would simply have to call super().setup() and then do their own ParameterValidator.validate.

mikegerber · 2023-03-22T18:39:03Z

With this PR (and ocrd 2.48.0) I seem to have problems relating to $TESSDATA_PREFIX:

$ ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-TEST-TEST-TEST
18:37:10.199 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-segment-region'
Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
[... shortened traceback, so remove noise (turned out to be notabug)]
RuntimeError: Failed to init API, possibly an invalid tessdata path:
$ echo $TESSDATA_PREFIX
/home/mike.gerber/.local/share/ocrd-resources/ocrd-tesserocr-recognize
$ ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata

kba · 2023-03-22T18:41:25Z

ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata

You need osd.traineddata I think. Can you try ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata and check if that fixes it?

mikegerber · 2023-03-22T18:55:32Z

Ah. Never had it working on this system, and I wrongly assumed it was a new bug.

I copied everything from /usr/share/.../tessdata for now, that fixed ocrd-tesserocr-segment-region! This PR's ocrd-tesserocr-recognize also works for me. OCR-D 2.48.0.

- improve subclassing - isolate workspaces from each other - use pytest-xdist for parallel tests - workaround for os.getenv failure after tmpdir removal, caused by Processor.__init__'s os.chdir - also test additional processors and parameters - also validate results (to some degree) - rely on (and ensure) Fraktur model being available

bertsky · 2023-03-23T13:21:42Z

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test 🙄 – do you have any idea what's going on?

kba · 2023-03-23T14:04:05Z

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test 🙄 – do you have any idea what's going on?

Really confusing, I think the problem is that the last time code coverage was calculated on master, only four of the five python versions uploaded the results to codecov successfully. Now it cannot match the coverage reports to the four it last had in master. I cannot verify this because CircleCI does not retain logs of the last run on master.

In other words, I think this will clear up once we merge to master. The code coverage calculation looks right:

======================= 4 passed, 38 warnings in 22.72s ========================
make[1]: Leaving directory '/home/circleci/project'
coverage report
Name                               Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------------
ocrd_tesserocr/__init__.py            10      0      0      0   100%
ocrd_tesserocr/binarize.py            85     72     34      0    11%
ocrd_tesserocr/config.py               3      0      0      0   100%
ocrd_tesserocr/crop.py               113     99     28      0    10%
ocrd_tesserocr/deskew.py             109     95     48      0     9%
ocrd_tesserocr/fontshape.py           96     79     34      0    13%
ocrd_tesserocr/recognize.py          861    515    500     79    35%
ocrd_tesserocr/segment.py             18      8      2      0    50%
ocrd_tesserocr/segment_line.py        19      0      2      1    95%
ocrd_tesserocr/segment_region.py      21      0      2      1    96%
ocrd_tesserocr/segment_table.py       19      9      2      0    48%
ocrd_tesserocr/segment_word.py        19      0      2      1    95%
--------------------------------------------------------------------
TOTAL                               1373    877    654     82    31%

(for python3.9)

bertsky · 2023-03-23T14:09:23Z

Understood. Thanks!

I'll merge and release then.

bertsky added 2 commits March 14, 2023 18:46

all: inherit from TesserocrRecognize, but override its logger

6181b91

bertsky requested review from kba and joschrew March 14, 2023 17:57

bertsky added 8 commits March 14, 2023 19:20

increase pylint bliss

b0551bf

update deprecated shapely .type

89aa533

CI: update+simplify (on cimg/python:* instead of ocrd/core)

3c87c31

CI: speedup Py36 deps

e28a70b

CI: seems to require sudo

de24a1c

CI: fix conditional step syntax

474bef4

CI: fix conditional step syntax

0bc0edb

CI: fix conditional step syntax

66da108

bertsky added 3 commits March 15, 2023 00:18

deps-ubuntu: allow PPA to fail (on newer distributions)

b50620d

CI: chmod PPA Tesseract tessdata g+w

51f04a5

deps-ubuntu: allow PPA to fail (on newer distributions)

04ebc00

constructor ocrd_tool: workaround for core#998

b357217

joschrew approved these changes Mar 15, 2023

View reviewed changes

ocrd_tesserocr/segment_region.py Show resolved Hide resolved

Revert "constructor ocrd_tool: workaround for core#998"

133fd0b

This reverts commit b357217.

require ocrd v2.47

181fcf3

bertsky mentioned this pull request Mar 15, 2023

run_processor / get_processor: do not pass (empty) ocrd_tool OCR-D/core#1009

Merged

This was referenced Mar 18, 2023

update 2023-03-16 OCR-D/ocrd_all#356

Merged

A lot of ShapelyDeprecationWarning: The 'type' attribute is deprecated #192

Closed

bertsky linked an issue Mar 22, 2023 that may be closed by this pull request

A lot of ShapelyDeprecationWarning: The 'type' attribute is deprecated #192

Closed

mikegerber mentioned this pull request Mar 22, 2023

ocrd-tesserocr-segment-region TypeError: __init__() got multiple values for argument 'workspace' #193

Closed

mikegerber mentioned this pull request Mar 22, 2023

Revert to using official release of ocrd_tesserocr again qurator-spk/ocrd-galley#72

Closed

bertsky added 5 commits March 23, 2023 08:23

_reinit before SetImage (for xpath/auto models)

6640c39

fix join_polygons Numpy vs Shapely interface

503828b

update ocrd_models kwargs to avoid deprecation warnings

24d5620

require core v2.48 for proper constructor delegation

eca0305

bertsky mentioned this pull request Mar 23, 2023

ocrd-tesserocr-segment does not work #189

Closed

bertsky linked an issue Mar 23, 2023 that may be closed by this pull request

ocrd-tesserocr-segment does not work #189

Closed

Merge branch 'master' into override-insteadof-wrap

148d015

bertsky merged commit c1bf59b into OCR-D:master Mar 23, 2023

bertsky mentioned this pull request Jun 4, 2023

Error when running processing server with ocrd-tesserocr OCR-D/core#942

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

override instead of wrap TesserocrRecognize in other processors #191

override instead of wrap TesserocrRecognize in other processors #191

bertsky commented Mar 14, 2023

bertsky commented Mar 14, 2023

bertsky commented Mar 14, 2023

bertsky commented Mar 15, 2023

codecov bot commented Mar 15, 2023 •

edited

Loading

joschrew left a comment •

edited

Loading

bertsky commented Mar 15, 2023

kba commented Mar 15, 2023

stweil commented Mar 15, 2023

bertsky commented Mar 15, 2023

kba commented Mar 15, 2023

bertsky commented Mar 15, 2023

bertsky commented Mar 15, 2023

joschrew commented Mar 16, 2023 •

edited

Loading

bertsky commented Mar 16, 2023 •

edited

Loading

mikegerber commented Mar 22, 2023 •

edited

Loading

kba commented Mar 22, 2023

mikegerber commented Mar 22, 2023 •

edited

Loading

bertsky commented Mar 23, 2023

kba commented Mar 23, 2023

bertsky commented Mar 23, 2023

override instead of wrap TesserocrRecognize in other processors #191

override instead of wrap TesserocrRecognize in other processors #191

Conversation

bertsky commented Mar 14, 2023

bertsky commented Mar 14, 2023

bertsky commented Mar 14, 2023

bertsky commented Mar 15, 2023

codecov bot commented Mar 15, 2023 • edited Loading

Codecov Report

joschrew left a comment • edited Loading

Choose a reason for hiding this comment

bertsky commented Mar 15, 2023

kba commented Mar 15, 2023

stweil commented Mar 15, 2023

bertsky commented Mar 15, 2023

kba commented Mar 15, 2023

bertsky commented Mar 15, 2023

bertsky commented Mar 15, 2023

joschrew commented Mar 16, 2023 • edited Loading

bertsky commented Mar 16, 2023 • edited Loading

mikegerber commented Mar 22, 2023 • edited Loading

kba commented Mar 22, 2023

mikegerber commented Mar 22, 2023 • edited Loading

bertsky commented Mar 23, 2023

kba commented Mar 23, 2023

bertsky commented Mar 23, 2023

codecov bot commented Mar 15, 2023 •

edited

Loading

joschrew left a comment •

edited

Loading

joschrew commented Mar 16, 2023 •

edited

Loading

bertsky commented Mar 16, 2023 •

edited

Loading

mikegerber commented Mar 22, 2023 •

edited

Loading

mikegerber commented Mar 22, 2023 •

edited

Loading