Order the ground truth section by type ? #103

Open
PonteIneptique opened this issue Jan 8, 2019 · 1 comment

Comments

@PonteIneptique
Contributor

Hi there!
I found out that a GT (ground truth) section had been added just as I was tempted to create my own awesome list.
One thing I think would be great is to categorize this section a little more (Manuscript / Early print / Modern / Contemporary?). That would probably make the data easier to browse.

@kba
Owner

kba commented Jan 8, 2019

The list is @cneud's work and it's maintained at https://github.com/cneud/ocr-gt.

We're working on a project to make open-source OCR readily deployable in libraries, archives, etc. (https://github.com/OCR-D / http://ocr-d.de). Training is an important part of highly accurate OCR, especially for historical texts. Training requires the right ground truth, and for it all to work together, the ground truth itself, the corpora, the tools, etc. need to be described in a structured way.

Therefore, we want to define a JSON schema for describing ground truth and restructure the list/table into a JSON file, cf. cneud/ocr-gt#11. This should be aligned with OCR-D/spec#86, where we want to define schemas for both training data and trained models. Ideally, all inputs and outputs of the individual steps in an OCR workflow would be data defined by such a schema.
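
To make the idea concrete, here is a minimal sketch of what a structured ground-truth entry might look like and how it would allow ordering the list by type, as the issue asks. Everything in it is an assumption for illustration: the field names (`name`, `url`, `type`, `languages`), the allowed type values, and the sample records are hypothetical and are not taken from cneud/ocr-gt or OCR-D/spec. It uses only the Python standard library and a hand-written check rather than a real JSON Schema validator.

```python
import json
from collections import defaultdict

# Hypothetical field names and types; NOT taken from cneud/ocr-gt or OCR-D/spec.
REQUIRED_FIELDS = {"name": str, "url": str, "type": str, "languages": list}
# Hypothetical categories, loosely following the ones suggested in the issue.
ALLOWED_TYPES = {"manuscript", "early-print", "modern", "contemporary"}


def validate_entry(entry):
    """Return a list of problems found in one ground-truth entry (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for field: {field}")
    if isinstance(entry.get("type"), str) and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {entry['type']}")
    return problems


def group_by_type(entries):
    """Group the names of valid entries by their 'type', preserving input order."""
    groups = defaultdict(list)
    for entry in entries:
        if not validate_entry(entry):
            groups[entry["type"]].append(entry["name"])
    return dict(groups)


# Made-up example records, for illustration only.
SAMPLE = json.loads("""
[
  {"name": "example-manuscripts", "url": "https://example.org/a",
   "type": "manuscript", "languages": ["lat"]},
  {"name": "example-incunabula", "url": "https://example.org/b",
   "type": "early-print", "languages": ["deu"]},
  {"name": "example-broken", "url": "https://example.org/c",
   "type": "newspaper"}
]
""")

if __name__ == "__main__":
    print(group_by_type(SAMPLE))
    print(validate_entry(SAMPLE[2]))
```

With entries in this shape, a categorized listing could be generated from the JSON file instead of being maintained by hand; an actual schema would presumably be written in JSON Schema and checked with a dedicated validator.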

@mittagessen has been describing the models for his kraken OCR engine in such a way for a while.

@wrznr @tboenig @cneud
