Metadata for OCR models and/or OCR model training sets #86
@VolkerHartmann Relevant for the GT repository as well as the model repository. Additions to metadata entries and proposals for representation format(s) are very much welcome.
- GT repository
- Model repository
@VolkerHartmann Sorry, I am not so sure what you are asking me for. This issue is about OCR model metadata, and I already find the list of features mentioned by @wrznr in the original post sufficient for post-correction purposes. Are you actually addressing #85 here? And what does "selection of GT records" refer to (the selection of features for GT metadata records, perhaps)?
If the list of features is sufficient, that's fine.
@VolkerHartmann In which format can the necessary metadata be sufficiently (i.e. in a formal, machine-readable way) defined?
Most formats are easy to parse. I would prefer JSON or XML but key-value pairs are also ok if no hierarchy exists. |
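To make the format discussion concrete, here is a minimal sketch of what such a metadata record could look like as JSON (all field names are hypothetical, not an agreed-upon schema):

```python
import json

# Hypothetical OCR model metadata record; the field names are
# illustrative only, not part of any agreed schema.
record = {
    "name": "fraktur19",
    "engine": "tesseract",            # recognition engine the model targets
    "script": "Latin/Fraktur",        # script the model was trained on
    "creator": "training-algorithm",  # in most cases an algorithm, not a person
    "training_set": "https://example.org/gt/fraktur19",  # URL or PID
}

serialized = json.dumps(record, indent=2)
print(serialized)
```

Since the structure is flat, the same record could equally be expressed as key-value pairs or XML; JSON only becomes preferable once hierarchy (e.g. nested evaluation results) is needed.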
@wrznr develops a proposal based on the above schema. |
@wrznr Push. |
Just to let you know that I've been told today that PMML is the widely accepted standard to describe ML models. It is XML-based. Perhaps we can learn/borrow some things from there. |
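Since PMML is plain XML, its header metadata can be inspected with standard tooling. A minimal sketch, assuming the PMML 4.4 namespace (the fragment below is hand-written for illustration; a real PMML export would also contain a DataDictionary and a model element):

```python
import xml.etree.ElementTree as ET

# Hand-written PMML-style fragment for illustration only.
PMML_NS = "http://www.dmg.org/PMML-4_4"
doc = f"""<PMML xmlns="{PMML_NS}" version="4.4">
  <Header copyright="example" description="OCR model metadata sketch">
    <Application name="some-ocr-trainer" version="0.1"/>
  </Header>
</PMML>"""

root = ET.fromstring(doc)
header = root.find(f"{{{PMML_NS}}}Header")
print(header.get("description"))
```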
HDF5 is a container format, not a model format, right? It could contain arbitrary models.
In most cases the creator will be an algorithm. I am also missing information about the underlying font and (optionally) language variants, which is needed to select the appropriate model. If the model is described in PMML, does a consumer have to support all variants?
What is the status on this? I've hacked together a Zenodo-based thingy that uses the metadata schema of the old repository, but that is clearly insufficient. If we're still on the schema proposed by @kba, I would suggest some additions and changes. For one, adding a field pointing to a training data set (by URL or PID) is somewhat important, and putting in at least a CER measurement might also be prudent. With regard to using PMML, I'm not sure how/if it is beneficial to describe OCR models on a functional level, as all engines come with their own format, effectively making the model files opaque blobs. A functional description also doesn't aid in any way in model selection or implementation matching.
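A minimal sketch of a validator for the two suggested additions, a training-data pointer and a CER measurement (field names are hypothetical, pending an agreed schema):

```python
# Check the two suggested additions on a metadata record:
# a pointer to the training data (URL or PID) and a CER value.
def validate_additions(record: dict) -> list:
    errors = []
    ref = record.get("training_data")
    if not ref or not isinstance(ref, str):
        errors.append("training_data: URL or PID required")
    cer = record.get("cer")
    if not isinstance(cer, (int, float)) or not 0.0 <= cer <= 1.0:
        errors.append("cer: expected a rate between 0 and 1")
    return errors

# Example record with placeholder values.
print(validate_additions({"training_data": "https://example.org/gt", "cer": 0.021}))
```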
The repository isn't public. |
@mittagessen See #105 |
@kba Can we involve @Doreenruirui here? She has a specification ready, right?
Hi Clemens,
Yes, the schemas are designed according to the documentation of the parameters of each engine. They are mainly used to verify the parameters when a user uploads a configuration file.
Best,
Rui
Clemens Neudecker <[email protected]> wrote on Wed, 22 May 2019 at 00:49:
> @wrznr <https://github.com/wrznr> @kba <https://github.com/kba> @Doreenruirui <https://github.com/Doreenruirui> This is pretty close: https://github.com/Doreenruirui/okralact/tree/master/docs, https://github.com/Doreenruirui/okralact/tree/master/engines/schemas, no?
See https://github.com/Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/datasets/dataset.py for the base class of datasets (image+transcription tuples) in Calamari.
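The link above points to Calamari's own implementation; as a rough, engine-agnostic sketch, such a dataset base class boils down to an iterable of (image, transcription) pairs. All names below are invented, not Calamari's API:

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class Sample:
    image_path: str     # path to the line/page image
    transcription: str  # ground-truth text for that image

class Dataset:
    """Engine-agnostic sketch of a GT dataset: image+transcription tuples."""
    def __init__(self, samples: List[Sample]):
        self._samples = samples

    def __len__(self) -> int:
        return len(self._samples)

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        for s in self._samples:
            yield s.image_path, s.transcription

ds = Dataset([Sample("line_0001.png", "Beispieltext")])
```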
I would like to restart the discussion on this, as I've got a scalable-ish model repository working, but the metadata schema used right now is insufficiently powerful (both for print and manuscripts). The current state is here. It is already designed in a way to support multiple recognition engines through a free-text field in a searchable property. Each engine would define its own identifiers, ideally with different suffixes for functionally different model types, so multi- or cross-engine software would be able to effectively filter for supported models. Currently, two requirements are missing:
My suggestion is to incorporate an opaque blob that encapsulates hyperparameters in a way that OCR engines or third-party software like okralact can use to re-instantiate a model from scratch. For automatic model selection there should be the ability to encode script (already in there), transcription levels, some kind of validation/test loss/error curve(s), and references directly to training data (if publicly available) or at least source material. To incorporate the methods the FAU team has developed, we should also incorporate some kind of global script type embedding. It might be advisable to allow multiple of these, as the FAU system is currently fairly specific to the material OCR-D concerns itself with, while other people might have more specific embeddings.
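The cross-engine filtering described above can be sketched as a query over metadata records, with the free-text engine field holding engine-defined identifiers whose suffixes mark functionally different model types (all identifiers and field names below are invented examples, not an actual repository schema):

```python
# Invented example records; engine identifiers and their suffixes
# (_seg for segmentation, _rec for recognition) are hypothetical.
models = [
    {"name": "m1", "engine": "kraken_pytorch_seg", "script": "Arabic"},
    {"name": "m2", "engine": "kraken_pytorch_rec", "script": "Latin"},
    {"name": "m3", "engine": "calamari_rec", "script": "Latin"},
]

def find_models(supported_engines, script=None):
    """Filter repository records by supported engine identifiers and script."""
    return [
        m for m in models
        if m["engine"] in supported_engines
        and (script is None or m["script"] == script)
    ]

print([m["name"] for m in find_models({"kraken_pytorch_rec", "calamari_rec"}, script="Latin")])
```

A consumer that supports several engines simply passes all of its identifiers; models from unsupported engines never show up in the result.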
We need to define a set of metadata for OCR models including at least:
We need to define a set of metadata for OCR model training sets including at least: