-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure general workflow, cli, services, processors, docs, tests #40
Conversation
…, rename ocrd_mets_file ocrd_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great piece of work! Many thanks. Only minor changes requested. Mainly, I do not understand the 3 in Tesseract3
.
from ocrd.utils import getLogger, mets_file_id | ||
from ocrd.constants import MIMETYPE_PAGE | ||
|
||
import tesserocr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why the class is called Tesseract3...
. You are importing tesserocr
which in turn uses the tesseract installation provided at your system (which in my case is Tesseract4
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that, because tesserocr bindings wouldn't build for me before with the 4.00 headers. But I realize now it was probably just a mismatch of headers and pkg-config. Sure, let's rename it everywhere tesseract3
-> tesseract
(or tesserocr
which would be clearer but conflict-prone, global package over local package priority change between 2/3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: If you have an idea why this won't build on Circle (https://circleci.com/gh/kba/pyocrd/12) or Travis (https://travis-ci.org/OCR-D/pyocrd/jobs/358407881) I'd appreciate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get:7 http://us-central1.gce.archive.ubuntu.com/ubuntu trusty/universe amd64 libtesseract-dev amd64 3.03.02-3 [1,284 kB]
I see the problem now, tesserocr supports only libtesseract >= 3.04 (semver for
tesseract would be a good idea to spot stuff like this).
I'll see how I can get a newer version or a newer OS in the CI container.
ocrd/webservice/processor.py
Outdated
run_processor(ExifProcessor, request.args['mets_url'], resolver) | ||
return 'DONE', 200 | ||
|
||
@app.route('/processor/segent_line/tesseract3', methods=['PUT']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: segment
ocrd/webservice/processor.py
Outdated
run_processor(Tesseract3LineSegmenter, request.args['mets_url'], resolver) | ||
return 'DONE', 200 | ||
|
||
@app.route('/processor/segent_region/tesseract3', methods=['PUT']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: segment
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! A major step forward in structure and control flow and covering a large amount of functionality. Only thing to keep in mind: while it is good to have something in place and working now for local/remote data management, we should align those parts closely with KIT and the LZA module project to ensure other parts of this implementation remain flexible enough to switch to using these repositories once they become available.
.travis.yml
Outdated
- 2.7 | ||
- 3.6 | ||
before_install: | ||
- sudo add-apt-repository ppa:alex-p/tesseract-ocr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tesseract 4.0 for Ubuntu 18.04 should be coming soon, then we can switch to using apt with no ppa.
See tesseract issue #1423 and recent release of 4.0.0-beta.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem with travis is that you can only choose between 12.04 and 14.04. At least since 16.04 / Debian 7, 3.04 is shipped and tesserocr should build.
Even with the PPA it doesn't build for me since tesseract 4 seems to make use of some c++11 features that the 14.04 gcc version doesn't support.
I'm now using the same pre-install setup that tesserocr uses (download and build leptonica and tesseract). Very inefficient but it worked: https://travis-ci.org/OCR-D/pyocrd/jobs/358810995
tesseract-ocr-eng \ | ||
tesseract-ocr-deu \ | ||
tesseract-ocr-deu-frak | ||
|
||
# Install python deps via pip | ||
deps-pip: | ||
pip3 install --user -r requirements.txt | ||
$(PIP) install -r requirements.txt | ||
|
||
# Clone the spec dir for sample files | ||
spec: | ||
git clone https://github.com/OCR-D/spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that the example files in spec need updating - the mets.xml should be updated to reflect recent discussions and ideally we should pick some sample images that are a) lightweight and b) for which we already have ground truth in ocr-d.de/daten.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, tracking in #41
'mods': "http://www.loc.gov/mods/v3", | ||
'xlink': "http://www.w3.org/1999/xlink", | ||
'page': "http://schema.primaresearch.org/PAGE/gts/pagecontent/2017-07-15", | ||
'xsl': 'http://www.w3.org/1999/XSL/Transform#', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Support for ALTO was seen as highly useful given current practices in libraries and possible uses in OCR-D post-processing of existing OCR. Current version is 3.1 with 4.0 due for release in April 2018.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, ALTO should be supported. From a programmer's POV it doesn't seem that hard since most features should be easy to translate between the two, i.e. you could use the same API and change the XML handling in the implementation. The same probably goes for hOCR BTW. But we need to discuss the specifics of how we support this and the repercussions on performance and long-term archival aspects etc.
Handle Uploads, Downloads, Repository access and manage temporary directories | ||
Optionally cache files. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat!
Builds now, the last thing I want to get right is fix a few python2-isms that make it break for 3.6 and then merge. |
@kba Is the failing Travis check preventing the merge? |
Yes, well not so much the checks but the python3 incompatibility, I'm on it now. |
LABEL the image with git rev, build date, maintainer, microbadger bad…
This is a rewrite based on our discussions.
All XML access is now in
ocrd.model
, pythonic API to METS and PAGE.ocrd.resolver
handles interactions with external web resources (such as downloading files), caching and with the data/model repository (the latter NIH yet).ocrd.workspace
represents the "working directory", a folder for the processors to work on, created by the resolver. Provides hooks to the METS file and images/pages contained therein.ocrd.processor
contains Processors, i.e. the actual functionality bundled in a class that is instantiated with a workspace and has aprocess
method to execute that functionality on the workspace. In addition to accessing the files in the METS, parameters can be passed as an object for processors that require information beyond what's in the METS/PAGE XML.The general workflow is:
Reusable functionality moved to a
ocrd.utils
module (coordinate translation, string normalization, logging etc).ocrd.cli
contains the CLI (installed asocrd
) which has subcommands to process METS directly (process
) and start services (server
). Commands can be chained, e.g.Documentation (apidocs only atm, should move stuff from wiki there once it's agreed upon) can be built with sphinx (see Makefile).
TestCases are there but I couldn't get it running yet on circle ci because of tesseract bindings.
fixes #38
fixes #35
fixes #33
fixes #28
fixes #27
fixes #20
fixes #9
fixes #3
#16