
Metadata for OCR models and/or OCR model training sets #86

Open
wrznr opened this issue Oct 12, 2018 · 21 comments

wrznr commented Oct 12, 2018

We need to define a set of metadata for OCR models including at least:

  • engine (incl. version)
  • parameter setting for training
  • reference to OCR model training set
  • ...

We need to define a set of metadata for OCR model training sets including at least:

  • information on the training materials
  • (output) character sets
  • license
  • ...
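To make this concrete, here is a minimal, purely illustrative sketch of what such records could look like in YAML (every field name below is a placeholder, not a proposed schema):

```yaml
# Hypothetical OCR model record -- field names are illustrative only
model:
  engine: tesseract                 # engine (incl. version)
  engine_version: "4.0.0"
  training_parameters:              # parameter settings for training
    learning_rate: 0.001
    iterations: 100000
  training_set: "<URL or PID of the OCR model training set>"

# Hypothetical training-set record -- field names are illustrative only
training_set:
  materials: "19th-century German Fraktur prints"       # information on the training materials
  character_set: "Latin incl. Fraktur-specific glyphs"  # (output) character set
  license: CC-BY-SA-4.0
```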

wrznr commented Oct 12, 2018

@VolkerHartmann Relevant for the GT repository as well as the model repository.
@bertsky Relevant for post-correction.

Additions to the metadata entries and proposals for representation format(s) are very much welcome.


VolkerHartmann commented Oct 15, 2018

GT repository
@bertsky: Which attributes will be important for the selection of GT records?
I'm thinking of:

  • Font
  • Publication date
  • Print shop (?) (I have not seen this attribute yet, but it could be helpful, couldn't it?)
  • ...

Model repository
At the moment, no collection (for a training set) can be created and therefore none can be referenced. This feature is planned for future versions. Until then, all pages/data have to be listed.
What will the parameters look like? (To be most generic, a key-value implementation would be appropriate.)
Information on the training materials:
Part of the GT metadata, e.g. publication date, language, fonts, ...


bertsky commented Oct 15, 2018

@VolkerHartmann Sorry, I am not so sure what it is you are asking me for. This issue is about OCR model meta-data, and I already find the list of features for that mentioned by @wrznr in the original post sufficient for post-correction purposes. Are you actually addressing #85 here? And what does "selection of GT records" refer to (the selection of features for GT meta-data records perhaps)?

@VolkerHartmann

If the list of features is sufficient, that's fine.


wrznr commented Nov 6, 2018

@VolkerHartmann In which format can the necessary metadata be defined sufficiently (i.e. in a formal, machine-readable way)?

@VolkerHartmann

Most formats are easy to parse. I would prefer JSON or XML but key-value pairs are also ok if no hierarchy exists.

cneud self-assigned this Nov 6, 2018

kba commented Nov 6, 2018


wrznr commented Nov 6, 2018

@wrznr develops a proposal based on the above schema.


wrznr commented Nov 13, 2018

@wrznr Push.


cneud commented Nov 13, 2018

Just to let you know that I've been told today that PMML is the widely accepted standard to describe ML models. It is XML-based. Perhaps we can learn/borrow some things from there.

@VolkerHartmann

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml
format: hdf5, pyrnn, pronn, ...

HDF5 is a container format but not a format of the model, right? It could contain any models.
Is pyrnn a widely known standard extension? I can't find any information about that.
We could add PMML as a possible format.

Landing page for the model or homepage of the creator

In most cases the creator will be an algorithm.

I am missing information about the underlying font and (optionally) language variants, which is needed to select the appropriate model.
In addition, I would prefer a model type as defined in PMML (see MODEL-ELEMENT), e.g. "NeuralNetwork", plus information on which algorithms the model can be used with (OK, KRAKEN is compatible with ocropus). Are there other algorithms we could use later?
I think the format defined in description.schema.yml links both.

If the model is described in PMML, does a consumer have to support all variants?
In the future, there could be importers and exporters for different algorithms.
When the time comes, we can always store the models as PMML. :-)
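To illustrate (this is only a sketch; none of these fields exist in the current description.schema.yml, and all names are assumptions):

```yaml
# Purely illustrative extension -- not part of the actual schema
format: pyrnn               # existing field: hdf5, pyrnn, pronn, ...
model_type: NeuralNetwork   # analogous to the PMML MODEL-ELEMENT
compatible_engines:         # algorithms/engines the model can be used with
  - kraken
  - ocropus
font: Fraktur               # underlying font of the training material
language_variants:          # optional, helps select the appropriate model
  - de-1901
```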


mittagessen commented Jan 20, 2019

What is the status on this? I've hacked together a Zenodo-based thingy that uses the metadata schema of the old repository, but that is clearly insufficient.

If we're still on the schema proposed by @kba, I would suggest some additions and changes. For one, adding a field pointing to a training data set (by URL or PID) is somewhat important, and putting in at least a CER measurement might also be prudent.

With regard to using PMML, I'm not sure how or whether it is beneficial to describe OCR models on a functional level, as all engines come with their own formats, effectively making the model files opaque blobs. A functional description also doesn't aid model selection or implementation matching in any way.
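A minimal sketch of the two additions, assuming we stay with a YAML-based schema (field names and values are only placeholders):

```yaml
# Illustrative additions -- names and values are placeholders
training_data: "<URL or PID of the training data set>"
accuracy:
  cer: 0.023   # character error rate on a held-out test set
```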


wrznr commented Jan 20, 2019

@kba @tboenig @wrznr have a meeting on this issue next week. We'll get back to you asap.


wrznr commented Jan 29, 2019

@mittagessen

The repository isn't public.


kba commented Jan 29, 2019

@mittagessen See #105


wrznr commented Apr 16, 2019

@kba Can we involve @Doreenruirui here? She has a specification ready, right?


cneud commented May 21, 2019


Doreenruirui commented May 23, 2019 via email


kba commented Jun 17, 2019

See https://github.com/Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/datasets/dataset.py for the base class of datasets (image+transcription tuples) in calamari

@mittagessen

I would like to restart the discussion on this, as I've got a scalable-ish model repository working, but the metadata schema used right now is insufficiently powerful (both for print and manuscripts). The current state is here. It is already designed in a way to support multiple recognition engines through a free-text field in a searchable property. Each engine would define its own identifiers, ideally with different suffixes for functionally different model types, so multi- or cross-engine software would be able to effectively filter for supported models.

Currently, two requirements are missing:

  • proper automatic model selection support
  • reproducibility

My suggestion is to incorporate an opaque blob that encapsulates hyperparameters in a way that lets OCR engines or third-party software like okralact re-instantiate a model from scratch. This addresses the reproducibility requirement.

For automatic model selection, there should be the ability to encode script (already in there), transcription levels, some kind of validation/test loss or error curve(s), and references directly to the training data (if publicly available) or at least the source material. To support the methods the FAU team has developed, we should also incorporate some kind of global script-type embedding. It might be advisable to allow multiple of these, as the FAU system is currently fairly specific to the material OCR-D concerns itself with, while other people might have more specific embeddings.
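A rough, non-authoritative sketch of how these additions might look in a YAML record (all field names are assumptions; the hyperparameter blob is shown as an opaque engine-specific mapping):

```yaml
# Illustrative only -- not a finalized schema
hyperparameters:                 # opaque blob, interpreted by the engine or a tool like okralact
  engine: kraken
  spec: "engine-specific network/hyperparameter definition"
script:                          # already present in the current schema
  - Latin
transcription_level: grapheme    # e.g. grapheme vs. diplomatic transcription
validation_curve:                # per-epoch validation error
  - {epoch: 1, error: 0.21}
  - {epoch: 2, error: 0.11}
training_data: "<URL or PID, if publicly available>"
script_type_embeddings:          # global script-type embedding(s), e.g. from the FAU method
  - name: fau
    vector: [0.12, 0.87, 0.03]
```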
