Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

override instead of wrap TesserocrRecognize in other processors #191

Merged
merged 22 commits into from
Mar 23, 2023

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Mar 14, 2023

fixes #190

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

bertsky added 2 commits March 14, 2023 18:46
- instead of wrapping a foreign instance
  and delegating the `process` call,
  inherit `process` but reparse parameters
  after rewriting them in the constructor
- only overwrite the `process` docstring

(This is necessary because the inner instance had its own
`workspace` instance. Also, we don't want to rewrite
`moduledir` and the version parser everywhere.)
@bertsky bertsky requested review from kba and joschrew March 14, 2023 17:57
@bertsky
Copy link
Collaborator Author

bertsky commented Mar 14, 2023

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

Ah, it actually did! (We just had no event on this repo trigger the CI since the last get_processor / run_processor changes in core.)

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 14, 2023

Turns out I had to do a lot more to fix the CI:

  • build based on ocrd/core but with PIP and PYTHON variables did not work (anymore?) and is not useful: we want to test various Python versions natively (without the hassle of OS dependencies for them), so the base must be a matrix of Python CI images
  • Python 3.6 now (being end-of-life) builds much faster if older Numpy/OpenCV binaries get installed beforehand
  • the PPA for Tesseract does not work (and is not necessary) for the most up-to-date Python versions (i.e. OS versions), so this step (including the related apt-get update) must be optional
  • the PPA version of Tesseract has its tessdata directory (now identical to our ocrd-tesserocr moduledir, where resmgr wants to install additional models into) owned by root, so this must become writable for others

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 15, 2023

Ok, I ran into OCR-D/core#998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

@codecov
Copy link

codecov bot commented Mar 15, 2023

Codecov Report

Merging #191 (148d015) into master (c10f94d) will decrease coverage by 26.98%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #191       +/-   ##
==========================================
- Coverage   26.97%   0.00%   -26.98%     
==========================================
  Files          11      12        +1     
  Lines        1416    1377       -39     
  Branches      333     346       +13     
==========================================
- Hits          382       0      -382     
- Misses        981    1371      +390     
+ Partials       53       6       -47     
Impacted Files Coverage Δ
ocrd_tesserocr/binarize.py 0.00% <0.00%> (-15.12%) ⬇️
ocrd_tesserocr/config.py 0.00% <ø> (-100.00%) ⬇️
ocrd_tesserocr/crop.py 0.00% <0.00%> (-14.29%) ⬇️
ocrd_tesserocr/deskew.py 0.00% <0.00%> (-13.52%) ⬇️
ocrd_tesserocr/fontshape.py 0.00% <0.00%> (-17.83%) ⬇️
ocrd_tesserocr/recognize.py 0.00% <0.00%> (-25.38%) ⬇️
ocrd_tesserocr/segment.py 0.00% <0.00%> (-36.00%) ⬇️
ocrd_tesserocr/segment_line.py 0.00% <0.00%> (-96.16%) ⬇️
ocrd_tesserocr/segment_region.py 0.00% <0.00%> (-96.43%) ⬇️
ocrd_tesserocr/segment_table.py 0.00% <0.00%> (-34.62%) ⬇️
... and 1 more

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@joschrew joschrew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this code together with the processing-server PR and everything works now. So this PR seems to solve my problems regarding the tesser-ocrd-processors.
Additionally I have briefly gone through the python code changes (I ignored all ci-cd changes). I didn't find anything which I could remark here (except my question).

ocrd_tesserocr/segment_region.py Show resolved Hide resolved
@bertsky
Copy link
Collaborator Author

bertsky commented Mar 15, 2023

@kba if you make a release out of OCR-D/core#999 today, I'll revert b357217 here and require that new core version.

@kba
Copy link
Member

kba commented Mar 15, 2023

@kba if you make a release out of OCR-D/core#999 today, I'll revert b357217 here and require that new core version.

v2.47.0 has been released now.

@stweil
Copy link
Contributor

stweil commented Mar 15, 2023

v2.47.0 has been released now.

Too early, because we need OCR-D/core#1004.

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 15, 2023

v2.47.0 has been released now.

Thanks! I'll adapt...

Too early, because we need OCR-D/core#1004.

This is not about Docker, but Python API.

@kba
Copy link
Member

kba commented Mar 15, 2023

v2.47.0 has been released now.

Too early, because we need OCR-D/core#1004.

Hotfix incoming.

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 15, 2023

Ok, I ran into OCR-D/core#998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

v2.47.0 has been released now.

Thanks! I'll adapt...

Odd. CI failure did come back:

  File "/home/circleci/project/ocrd_tesserocr/cli.py", line 23, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 95, in run_processor
    instance_caching=instance_caching
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 329, in get_processor
    parameter=parameter
  File "/home/circleci/project/ocrd_tesserocr/segment_region.py", line 17, in __init__
    self.parameter['overwrite_segments'] = self.parameter['overwrite_regions']
KeyError: 'overwrite_regions'

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 15, 2023

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

Aha! It turns out that my OCR-D/core#999 was premature: we also pass an unnecessary ocrd_tool in other places:

Now, the big story here is: none of these places is needed (or used) – neither currently, nor for OCR-D/core#974 nor OCR-D/core#884!

@joschrew
Copy link
Contributor

joschrew commented Mar 16, 2023

Other questions (hopefully it is ok to ask it here):

  • What is the reason to do this kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL]) instead of doing this: kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL] (this is the way it was done before). Why shouldn't the ocrd_tool always be set in the constructor?
  • I tried to get the idea behind the ocrd-tool-setting. Is it right that a processor-constructor is supposed to set the ocrd_tool always before calling the super-constructor? I have only viewed a few example processors and I am wondering if this (setting ocrd-tool in a processors __init__() before calling the super constructor) is something like a rule (In Java you would probably use something like an abstract method in the processor-base-class?)?

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 16, 2023

@joschrew

  • What is the reason to do this kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL]) instead of doing this: kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL] (this is the way it was done before). Why shouldn't the ocrd_tool always be set in the constructor?

The reason is that here in ocrd_tesserocr many processors are basically just re-parameterizations (simplifications) of the "goliath" ocrd-tesserocr-recognize. This helps the user cope with complexity (and avoids code duplication while sustaining the older single-step processors) – see README.

To implement this, you have to know that ocrd.Processor does parameter parsing/instantiation/validation on the constructor:
https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/base.py#L146-L150

So you need to

  1. enter with the (possibly empty) params for the wrapper processor,
  2. let them validate (instantiate) against the wrapper processor's tool json,
  3. translate them to the TesserocrRecognize's tool json,
  4. re-validate (instantiate) against that tool json.

Now, in the old pattern, where every ocrd.Processor's subclass constructor just overwrites the kwarg for ocrd_tool with its own local tool json (from pkg_resources), we cannot bring TesserocrRecognize to accept any foreign tool json. Hence the setdefault pattern, which IMO is also better in general.

  • I tried to get the idea behind the ocrd-tool-setting. Is it right that a processor-constructor is supposed to set the ocrd_tool always before calling the super-constructor?

Yes, it must. Otherwise, the parameter validator in the superclass won't see the actual cmdline values. Also, the non-processing contexts (help, dump json, show resources) in the superclass constructor need to see the actual tool json.

I have only viewed a few example processors and I am wondering if this (setting ocrd-tool in a processors __init__() before calling the super constructor) is something like a rule

We could of course change the API in core to separate the parameter instantiation, but that would entail changing all processors (lots of diverse codebases with various maintainers). We have to do that for the new process_page API soon anyway, so it's a good time to think about alternatives...

(In Java you would probably use something like an abstract method in the processor-base-class?)?

Yes. We could define a method (say) init which (in the subclasses) loads the tool json, and gets called (by the superclass) before most of the work done in the current superclass constructor.

For the processor delegation pattern then, since in the new API, we will also have a setup for the processing context of a processor instance, we could do parameter re-instantiation for the new superclass there. So the subclasses would simply have to call super().setup() and then do their own ParameterValidator.validate.

@mikegerber
Copy link
Contributor

mikegerber commented Mar 22, 2023

With this PR (and ocrd 2.48.0) I seem to have problems relating to $TESSDATA_PREFIX:

$ ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-TEST-TEST-TEST
18:37:10.199 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-segment-region'
Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
[... shortened traceback, so remove noise (turned out to be notabug)]
RuntimeError: Failed to init API, possibly an invalid tessdata path:
$ echo $TESSDATA_PREFIX
/home/mike.gerber/.local/share/ocrd-resources/ocrd-tesserocr-recognize
$ ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata

@kba
Copy link
Member

kba commented Mar 22, 2023

ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata

You need osd.traineddata I think. Can you try ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata and check if that fixes it?

@mikegerber
Copy link
Contributor

mikegerber commented Mar 22, 2023

Ah. Never had it working on this system, and I wrongly assumed it was a new bug.

I copied everything from /usr/share/.../tessdata for now, that fixed ocrd-tesserocr-segment-region! This PR's ocrd-tesserocr-recognize also works for me. OCR-D 2.48.0.

bertsky added 5 commits March 23, 2023 08:23
- improve subclassing
- isolate workspaces from each other
- use pytest-xdist for parallel tests
- workaround for os.getenv failure after tmpdir
  removal, caused by Processor.__init__'s os.chdir
- also test additional processors and parameters
- also validate results (to some degree)
- rely on (and ensure) Fraktur model being available
@bertsky bertsky linked an issue Mar 23, 2023 that may be closed by this pull request
@bertsky
Copy link
Collaborator Author

bertsky commented Mar 23, 2023

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test 🙄 – do you have any idea what's going on?

@kba
Copy link
Member

kba commented Mar 23, 2023

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test 🙄 – do you have any idea what's going on?

Really confusing, I think the problem is that the last time code coverage was calculated on master, only four of the five python versions uploaded the results to codecov successfully. Now it cannot match the coverage reports to the four it last had in master. I cannot verify this because CircleCI does not retain logs of the last run on master.

In other words, I think this will clear up once we merge to master. The code coverage calculation looks right:

======================= 4 passed, 38 warnings in 22.72s ========================
make[1]: Leaving directory '/home/circleci/project'
coverage report
Name                               Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------------
ocrd_tesserocr/__init__.py            10      0      0      0   100%
ocrd_tesserocr/binarize.py            85     72     34      0    11%
ocrd_tesserocr/config.py               3      0      0      0   100%
ocrd_tesserocr/crop.py               113     99     28      0    10%
ocrd_tesserocr/deskew.py             109     95     48      0     9%
ocrd_tesserocr/fontshape.py           96     79     34      0    13%
ocrd_tesserocr/recognize.py          861    515    500     79    35%
ocrd_tesserocr/segment.py             18      8      2      0    50%
ocrd_tesserocr/segment_line.py        19      0      2      1    95%
ocrd_tesserocr/segment_region.py      21      0      2      1    96%
ocrd_tesserocr/segment_table.py       19      9      2      0    48%
ocrd_tesserocr/segment_word.py        19      0      2      1    95%
--------------------------------------------------------------------
TOTAL                               1373    877    654     82    31%

(for python3.9)

@bertsky
Copy link
Collaborator Author

bertsky commented Mar 23, 2023

Understood. Thanks!

I'll merge and release then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants