Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure general workflow, cli, services, processors, docs, tests #40

Merged
merged 68 commits into from
Mar 27, 2018

Conversation

kba
Copy link
Member

@kba kba commented Mar 26, 2018

This is a rewrite based on our discussions.

All XML access is now in ocrd.model, pythonic API to METS and PAGE.

ocrd.resolver handles interactions with external web resources (such as downloading files), caching and with the data/model repository (the latter NIH yet).

ocrd.workspace represents the "working directory", a folder for the processors to work on, created by the resolver. Provides hooks to the METS file and images/pages contained therein.

ocrd.processor contains Processors, i.e. the actual functionality bundled in a class that is instantiated with a workspace and has a process method to execute that functionality on the workspace. In addition to accessing the files in the METS, parameters can be passed as an object for processors that require information beyond what's in the METS/PAGE XML.

The general workflow is:

resolver = Resolver(...)
workspace = Resolver.create_workspace(mets_url)
processor = FancyProcessorThatDoesGreatThings(workspace)
processor.process()

Reusable functionality moved to a ocrd.utils module (coordinate translation, string normalization, logging etc).

ocrd.cli contains the CLI (installed as ocrd) which has subcommands to process METS directly (process) and start services (server). Commands can be chained, e.g.

ocrd process -m /path/to/mets.xml characterize/exif segment_region/tesseract3 segment_line/tesseract3 recognize/tesseract3

Documentation (apidocs only atm, should move stuff from wiki there once it's agreed upon) can be built with sphinx (see Makefile).

TestCases are there but I couldn't get it running yet on circle ci because of tesseract bindings.

fixes #38
fixes #35
fixes #33
fixes #28
fixes #27
fixes #20
fixes #9
fixes #3
#16

@kba kba requested review from cneud and wrznr March 26, 2018 13:07
Copy link
Contributor

@wrznr wrznr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great piece of work! Many thanks. Only minor changes requested. Mainly, I do not understand the 3 in Tesseract3.

from ocrd.utils import getLogger, mets_file_id
from ocrd.constants import MIMETYPE_PAGE

import tesserocr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why the class is called Tesseract3.... You are importing tesserocr which in turn uses the tesseract installation provided at your system (which in my case is Tesseract4).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize that, because tesserocr bindings wouldn't build for me before with the 4.00 headers. But I realize now it was probably just a mismatch of headers and pkg-config. Sure, let's rename it everywhere tesseract3 -> tesseract (or tesserocr which would be clearer but conflict-prone, global package over local package priority change between 2/3).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: If you have an idea why this won't build on Circle (https://circleci.com/gh/kba/pyocrd/12) or Travis (https://travis-ci.org/OCR-D/pyocrd/jobs/358407881) I'd appreciate it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get:7 http://us-central1.gce.archive.ubuntu.com/ubuntu trusty/universe amd64 libtesseract-dev amd64 3.03.02-3 [1,284 kB]

I see the problem now, tesserocr supports only libtesseract >= 3.04 (semver for
tesseract would be a good idea to spot stuff like this).

I'll see how I can get a newer version or a newer OS in the CI container.

run_processor(ExifProcessor, request.args['mets_url'], resolver)
return 'DONE', 200

@app.route('/processor/segent_line/tesseract3', methods=['PUT'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: segment

run_processor(Tesseract3LineSegmenter, request.args['mets_url'], resolver)
return 'DONE', 200

@app.route('/processor/segent_region/tesseract3', methods=['PUT'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: segment.

Copy link
Member

@cneud cneud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! A major step forward in structure and control flow and covering a large amount of functionality. Only thing to keep in mind: while it is good to have something in place and working now for local/remote data management, we should align those parts closely with KIT and the LZA module project to ensure other parts of this implementation remain flexible enough to switch to using these repositories once they become available.

.travis.yml Outdated
- 2.7
- 3.6
before_install:
- sudo add-apt-repository ppa:alex-p/tesseract-ocr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tesseract 4.0 for Ubuntu 18.04 should be coming soon, then we can switch to using apt with no ppa.

See tesseract issue #1423 and recent release of 4.0.0-beta.1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem with travis is that you can only choose between 12.04 and 14.04. At least since 16.04 / Debian 7, 3.04 is shipped and tesserocr should build.

Even with the PPA it doesn't build for me since tesseract 4 seems to make use of some c++11 features that the 14.04 gcc version doesn't support.

I'm now using the same pre-install setup that tesserocr uses (download and build leptonica and tesseract). Very inefficient but it worked: https://travis-ci.org/OCR-D/pyocrd/jobs/358810995

tesseract-ocr-eng \
tesseract-ocr-deu \
tesseract-ocr-deu-frak

# Install python deps via pip
deps-pip:
pip3 install --user -r requirements.txt
$(PIP) install -r requirements.txt

# Clone the spec dir for sample files
spec:
git clone https://github.com/OCR-D/spec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that the example files in spec need updating - the mets.xml should be updated to reflect recent discussions and ideally we should pick some sample images that are a) lightweight and b) for which we already have ground truth in ocr-d.de/daten.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, tracking in #41

'mods': "http://www.loc.gov/mods/v3",
'xlink': "http://www.w3.org/1999/xlink",
'page': "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15",
'xsl': 'http://www.w3.org/1999/XSL/Transform#',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for ALTO was seen as highly useful given current practices in libraries and possible uses in OCR-D post-processing of existing OCR. Current version is 3.1 with 4.0 due for release in April 2018.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, ALTO should be supported. From a programmer's POV it doesn't seem that hard since most features should be easy to translate between the two, i.e. you could use the same API and change the XML handling in the implementation. The same probably goes for hOCR BTW. But we need to discuss the specifics of how we support this and the repercussions on performance and long-term archival aspects etc.

Handle Uploads, Downloads, Repository access and manage temporary directories
Optionally cache files.

Args:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

@kba
Copy link
Member Author

kba commented Mar 27, 2018

Builds now, the last thing I want to get right is fix a few python2-isms that make it break for 3.6 and then merge.

@wrznr
Copy link
Contributor

wrznr commented Mar 27, 2018

@kba Is the failing Travis check preventing the merge?

@kba
Copy link
Member Author

kba commented Mar 27, 2018

Yes, well not so much the checks but the python3 incompatibility, I'm on it now.

@kba kba merged commit 343ceae into OCR-D:master Mar 27, 2018
@kba kba deleted the wip-api branch March 27, 2018 19:06
bertsky pushed a commit to bertsky/core that referenced this pull request Apr 26, 2023
LABEL the image with git rev, build date, maintainer, microbadger bad…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants