-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add kraken OCR engine #89
base: main
Are you sure you want to change the base?
Conversation
This is a first implementation to support the Open Source kraken OCR engine. I created this draft pull request to allow public review and comments although the implementation is still incomplete. OCR for cropped image still needs testing, and there is also currently no unit test code for the new engine. |
Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks! |
Done, see https://phabricator.wikimedia.org/T345055. I also updated the PR here to solve a merge conflict. |
d773984
to
a704816
Compare
See also my test installation. |
f1764a1
to
6d69032
Compare
The test installation is meanwhile available on https://kraken-ocr.wmcloud.org/. |
Signed-off-by: Stefan Weil <[email protected]>
Both are not language specific, but support historic and current scripts used by many European languages. Signed-off-by: Stefan Weil <[email protected]>
Kraken is an Open Source OCR engine with trainable segmentation and OCR models. It can work with printed and handwritten texts. This initial implementation comes with two generic OCR models which can be used on a wide range of German publications, but also with other languages which are based on Latin script. Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Cropping is also implemented now, but still untested. Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
…test) Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Segmentation models are currently only supported for kraken. All other OCR engines return an empty list. Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This review focuses on the following main things
- Files related to Transkribus have also been modified, which I think is not ideal
- A new Transkribus model has also been added, which should be removed
- Normalise text has been added here for all engines (we would want that in a separate PR)
- Renaming the newly added API route
Overall, we'd like to keep the non-Kraken changes out of the way before testing kraken once again
/** | ||
* Get a list of available segmentation models for use with a specific OCR engine. | ||
* | ||
* @Route("/api/available_segmentation_models", name="apiSegmentationModels", methods={"GET"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This route, since it is related to Kraken, should be named /api/kraken/available_segmentation_models
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having segmentation models is not kraken specific. All OCR processes require a segmentation step, and if that step uses AI, it also requires a model. That's why I did not use a route with "kraken" here. So even if it is currently only used for kraken, I'd suggest to use a generic route.
"german-fraktur-19th-20th-century": { | ||
"transkribus": { | ||
"htr": 37738 | ||
} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Transkribus model should not be added along with Kraken changes, it would be better to separate it out
@@ -118,9 +118,6 @@ public function getResult( | |||
): EngineResult { | |||
$this->checkImageUrl( $imageUrl ); | |||
|
|||
$image = $this->getImage( $imageUrl, $crop ); | |||
$imageUrl = $image->getUrl(); | |||
|
|||
$points = ''; | |||
if ( $crop ) { | |||
$x = $crop['x']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd prefer to isolate changes other than those related to Kraken to another PR!
$this->transkribusEngine = $this->instantiateEngine( 'transkribus' ); | ||
|
||
$this->transkribusEngine = $this->instatiateEngine( 'transkribus' ); | ||
$this->krakenEngine = $this->instantiateEngine( 'kraken' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instantiate()
fixed in #115
description: A web service for Tesseract, Google and Transkribus OCR engines. | ||
version: 1.0.0 | ||
description: A web service for Kraken, Tesseract, Google and Transkribus OCR engines. | ||
version: 1.4.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it has been bumped up to 1.4.4 now.
areas: | ||
path_patterns: | ||
- ^/api$ | ||
- ^/api/available_langs$ | ||
- ^/api/available_segmentation_models$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to change this in accordance with my comment on route path
@@ -1,6 +1,6 @@ | |||
{ | |||
"@metadata": {}, | |||
"title": "WikimediaOCR", | |||
"title": "WikimediaOCR – Kraken Test", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should remain as WikimediaOCR
@@ -13,6 +13,7 @@ | |||
"regenerator-runtime": "^0.13.11", | |||
"select2": "^4.0.13", | |||
"select2-bootstrap-theme": "0.1.0-beta.10", | |||
"stylelint": "^15.10.3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stylelint
can be removed from this PR as well
@@ -23,5 +24,6 @@ | |||
"watch": "encore dev --watch", | |||
"build": "encore production --progress", | |||
"test": "grunt test" | |||
} | |||
}, | |||
"dependencies": {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty entries should be removed
@@ -65,6 +72,7 @@ abstract class EngineBase { | |||
'fro' => 'Franceis, François, Romanz (1400-1600)', | |||
'ger-hd-m1' => 'Transkribus German handwriting M1', | |||
'ger-15' => '15th-16th century German', | |||
'german-fraktur-19th-20th-century' => 'German Fraktur 19th-20th century', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a Transkribus model and needs to be removed from this PR
No description provided.