
Metadata for OCR models and/or OCR model training sets #86

Open
wrznr opened this issue Oct 12, 2018 · 21 comments

wrznr commented Oct 12, 2018

We need to define a set of metadata for OCR models including at least:

  • engine (incl. version)
  • parameter setting for training
  • reference to OCR model training set
  • ...

We need to define a set of metadata for OCR model training sets including at least:

  • information on the training materials
  • (output) character sets
  • license
  • ...
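To make this concrete, here is a minimal, purely illustrative sketch of what such records could look like in YAML (every field name below is a placeholder, not a proposed schema):

```yaml
# Hypothetical OCR model record -- field names are illustrative only
model:
  engine: tesseract                 # engine (incl. version)
  engine_version: "4.0.0"
  training_parameters:              # parameter settings for training
    learning_rate: 0.001
    iterations: 100000
  training_set: "<URL or PID of the OCR model training set>"

# Hypothetical training-set record -- field names are illustrative only
training_set:
  materials: "19th-century German Fraktur prints"       # information on the training materials
  character_set: "Latin incl. Fraktur-specific glyphs"  # (output) character set
  license: CC-BY-SA-4.0
```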

wrznr commented Oct 12, 2018

@VolkerHartmann Relevant for the GT repository as well as the model repository.
@bertsky Relevant for post-correction.

Additions to the metadata entries and proposals for representation format(s) are very much welcome.


VolkerHartmann commented Oct 15, 2018

GT repository
@bertsky: Which attributes will be important for the selection of GT records?
I'm thinking of:

  • Font
  • Publication date
  • Print shop (?) (I have not seen this attribute yet, but it could be helpful, couldn't it?)
  • ...

Model repository
At the moment, no collection (for a training set) can be created and therefore none can be referenced. This feature is planned for future versions. Until then, all pages/data have to be listed.
What will the parameters look like? (To be most generic, a key-value implementation would be appropriate.)
Information on the training materials:
Part of the GT metadata, e.g. publication date, language, fonts, ...


bertsky commented Oct 15, 2018

@VolkerHartmann Sorry, I am not so sure what it is you are asking me for. This issue is about OCR model meta-data, and I already find the list of features for that mentioned by @wrznr in the original post sufficient for post-correction purposes. Are you actually addressing #85 here? And what does "selection of GT records" refer to (the selection of features for GT meta-data records perhaps)?

@VolkerHartmann

If the list of features is sufficient, that's fine.


wrznr commented Nov 6, 2018

@VolkerHartmann In which format can the necessary metadata be defined sufficiently (i.e. in a formal, machine-readable way)?

@VolkerHartmann

Most formats are easy to parse. I would prefer JSON or XML but key-value pairs are also ok if no hierarchy exists.

cneud self-assigned this Nov 6, 2018

kba commented Nov 6, 2018


wrznr commented Nov 6, 2018

@wrznr develops a proposal based on the above schema.


wrznr commented Nov 13, 2018

@wrznr Push.


cneud commented Nov 13, 2018

Just to let you know that I've been told today that PMML is the widely accepted standard to describe ML models. It is XML-based. Perhaps we can learn/borrow some things from there.

@VolkerHartmann

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml
format: hdf5, pyrnn, pronn, ...

HDF5 is a container format but not a format of the model, right? It could contain any models.
Is pyrnn a widely known standard extension? I can't find any information about that.
We could add PMML as a possible format.

Landing page for the model or homepage of the creator

In most cases the creator will be an algorithm.

I am missing information about the underlying font and (optionally) language variants, which is needed to select the appropriate model.
In addition, I would prefer a model type as defined in PMML (see MODEL-ELEMENT), e.g. "NeuralNetwork", plus information on which algorithms the model can be used with (OK, KRAKEN is compatible with ocropus). Are there other algorithms we could use later?
I think the format defined in description.schema.yml links both.

If the model is described in PMML, does a consumer have to support all variants?
In the future, there could be importers and exporters for different algorithms.
When the time comes, we can always store the models as PMML. :-)
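To illustrate (this is only a sketch; none of these fields exist in the current description.schema.yml, and all names are assumptions):

```yaml
# Purely illustrative extension -- not part of the actual schema
format: pyrnn               # existing field: hdf5, pyrnn, pronn, ...
model_type: NeuralNetwork   # analogous to the PMML MODEL-ELEMENT
compatible_engines:         # algorithms/engines the model can be used with
  - kraken
  - ocropus
font: Fraktur               # underlying font of the training material
language_variants:          # optional, helps select the appropriate model
  - de-1901
```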


mittagessen commented Jan 20, 2019

What is the status on this? I've hacked together a Zenodo-based thingy that uses the metadata schema of the old repository, but that is clearly insufficient.

If we're still on the schema proposed by @kba, I would suggest some additions and changes. For one, adding a field pointing to a training data set (by URL or PID) is somewhat important, and putting in at least a CER measurement might also be prudent.

With regard to using PMML, I'm not sure how or whether it is beneficial to describe OCR models on a functional level, as all engines come with their own formats, effectively making the model files opaque blobs. A functional description also doesn't aid model selection or implementation matching in any way.
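A minimal sketch of the two additions, assuming we stay with a YAML-based schema (field names and values are only placeholders):

```yaml
# Illustrative additions -- names and values are placeholders
training_data: "<URL or PID of the training data set>"
accuracy:
  cer: 0.023   # character error rate on a held-out test set
```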


wrznr commented Jan 20, 2019

@kba @tboenig @wrznr have a meeting on this issue next week. We'll get back to you asap.


wrznr commented Jan 29, 2019

@mittagessen

The repository isn't public.


kba commented Jan 29, 2019

@mittagessen See #105


wrznr commented Apr 16, 2019

@kba Can we involve @Doreenruirui here? She has a specification ready, right?


cneud commented May 21, 2019


Doreenruirui commented May 23, 2019 via email


kba commented Jun 17, 2019

See https://github.com/Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/datasets/dataset.py for the base class of datasets (image+transcription tuples) in calamari

@mittagessen

I would like to restart the discussion on this, as I've got a scalable-ish model repository working, but the metadata schema used right now is insufficiently powerful (both for print and manuscripts). The current state is here. It is already designed in a way to support multiple recognition engines through a free-text field in a searchable property. Each engine would define its own identifiers, ideally with different suffixes for functionally different model types, so multi- or cross-engine software would be able to effectively filter for supported models.

Currently, two requirements are missing:

  • proper automatic model selection support
  • reproducibility

My suggestion is to incorporate an opaque blob that encapsulates hyperparameters in a way that lets OCR engines or third-party software like okralact re-instantiate a model from scratch. This addresses the reproducibility requirement.

For automatic model selection, there should be the ability to encode script (already in there), transcription levels, some kind of validation/test loss or error curve(s), and references directly to the training data (if publicly available) or at least the source material. To support the methods the FAU team has developed, we should also incorporate some kind of global script-type embedding. It might be advisable to allow multiple of these, as the FAU system is currently fairly specific to the material OCR-D concerns itself with, while other people might have more specific embeddings.
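A rough, non-authoritative sketch of how these additions might look in a YAML record (all field names are assumptions; the hyperparameter blob is shown as an opaque engine-specific mapping):

```yaml
# Illustrative only -- not a finalized schema
hyperparameters:                 # opaque blob, interpreted by the engine or a tool like okralact
  engine: kraken
  spec: "engine-specific network/hyperparameter definition"
script:                          # already present in the current schema
  - Latin
transcription_level: grapheme    # e.g. grapheme vs. diplomatic transcription
validation_curve:                # per-epoch validation error
  - {epoch: 1, error: 0.21}
  - {epoch: 2, error: 0.11}
training_data: "<URL or PID, if publicly available>"
script_type_embeddings:          # global script-type embedding(s), e.g. from the FAU method
  - name: fau
    vector: [0.12, 0.87, 0.03]
```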
