Add kraken OCR engine #89

stweil · 2023-08-19T11:13:20Z

No description provided.

stweil · 2023-08-19T11:17:05Z

This is a first implementation of support for the open-source kraken OCR engine.

I created this draft pull request to allow public review and comments, although the implementation is still incomplete. OCR for cropped images still needs testing, and there are currently no unit tests for the new engine.

samwilson · 2023-08-28T07:56:35Z

Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

stweil · 2023-08-28T08:35:59Z

> Also, would you mind opening a Phabricator task for this work, so it can be tracked there? Thanks!

Done, see https://phabricator.wikimedia.org/T345055. I also updated the PR here to solve a merge conflict.

stweil · 2023-08-28T08:57:38Z

See also my test installation.

stweil · 2023-09-22T16:18:30Z

The test installation is now available at https://kraken-ocr.wmcloud.org/.

Kraken is an Open Source OCR engine with trainable segmentation and OCR models. It can work with printed and handwritten texts. This initial implementation comes with two generic OCR models which can be used on a wide range of German publications, but also with other languages which are based on Latin script. Signed-off-by: Stefan Weil <[email protected]>

Parthiv-M

This review focuses on the following main points:

- Files related to Transkribus have also been modified, which I think is not ideal.
- A new Transkribus model has also been added, which should be removed.
- Text normalisation has been added here for all engines (we would want that in a separate PR).
- The newly added API route should be renamed.

Overall, we'd like to keep the non-Kraken changes out of the way before testing Kraken once again.

Parthiv-M · 2023-12-07T07:26:20Z

src/Controller/OcrController.php

+	/**
+	 * Get a list of available segmentation models for use with a specific OCR engine.
+	 *
+	 * @Route("/api/available_segmentation_models", name="apiSegmentationModels", methods={"GET"})


This route, since it is related to Kraken, should be named /api/kraken/available_segmentation_models

stweil · 2023-12-12

Having segmentation models is not kraken-specific. All OCR processes require a segmentation step, and if that step uses AI, it also requires a model. That's why I did not use a route with "kraken" here. So even though it is currently only used for kraken, I'd suggest using a generic route.
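To illustrate the argument for a generic route, here is a minimal sketch in plain PHP, not the code from this PR. All class names, the function name, the response keys, and the model name are hypothetical. It only shows the behaviour described in the commit message: every engine can be asked for its segmentation models, and engines without any return an empty list.

```php
<?php
declare( strict_types=1 );

// Hypothetical sketch: these classes are illustrative and do not exist in the PR.
abstract class SketchEngine {
	/**
	 * Segmentation models offered by this engine.
	 * Engines without selectable models inherit this empty default.
	 *
	 * @return string[]
	 */
	public function getSegmentationModels(): array {
		return [];
	}
}

class SketchTesseractEngine extends SketchEngine {
}

class SketchKrakenEngine extends SketchEngine {
	public function getSegmentationModels(): array {
		// Placeholder model name, invented for this example.
		return [ 'example-segmentation-model' ];
	}
}

/**
 * What a generic handler for /api/available_segmentation_models could return.
 * The response keys are assumptions made for this sketch.
 *
 * @param array<string, SketchEngine> $engines
 */
function availableSegmentationModels( array $engines, string $engineName ): array {
	if ( !isset( $engines[$engineName] ) ) {
		throw new InvalidArgumentException( "Unknown engine: $engineName" );
	}
	return [
		'engine' => $engineName,
		'available_segmentation_models' => $engines[$engineName]->getSegmentationModels(),
	];
}

$engines = [
	'kraken' => new SketchKrakenEngine(),
	'tesseract' => new SketchTesseractEngine(),
];
echo json_encode( availableSegmentationModels( $engines, 'tesseract' ) ), "\n";
```

With this shape, adding segmentation models to another engine later would only require overriding one method. The route and its clients would stay unchanged.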

Parthiv-M · 2023-12-07T07:26:30Z

public/langs.json

+    "german-fraktur-19th-20th-century": {
+        "transkribus": {
+            "htr": 37738
+        }
+    },


This Transkribus model should not be added along with the Kraken changes; it would be better to separate it out.

Parthiv-M · 2023-12-07T08:56:28Z

src/Engine/TranskribusEngine.php

@@ -118,9 +118,6 @@ public function getResult(
 	): EngineResult {
 		$this->checkImageUrl( $imageUrl );

-		$image = $this->getImage( $imageUrl, $crop );
-		$imageUrl = $image->getUrl();
-
 		$points = '';
 		if ( $crop ) {
 			$x = $crop['x'];


We'd prefer to isolate changes other than those related to Kraken to another PR!

Parthiv-M · 2023-12-07T08:59:01Z

tests/Engine/EngineBaseTest.php

-		$this->transkribusEngine = $this->instatiateEngine( 'transkribus' );
+		$this->transkribusEngine = $this->instantiateEngine( 'transkribus' );
+		$this->krakenEngine = $this->instantiateEngine( 'kraken' );


instantiate() fixed in #115

Parthiv-M · 2023-12-07T08:59:27Z

config/packages/nelmio_api_doc.yaml

-            description: A web service for Tesseract, Google and Transkribus OCR engines.
-            version: 1.0.0
+            description: A web service for Kraken, Tesseract, Google and Transkribus OCR engines.
+            version: 1.4.0


I believe it has been bumped up to 1.4.4 now.

Parthiv-M · 2023-12-07T09:00:39Z

config/packages/nelmio_api_doc.yaml

    areas:
        path_patterns:
            - ^/api$
            - ^/api/available_langs$
+            - ^/api/available_segmentation_models$


Will need to change this in accordance with my comment on route path

Parthiv-M · 2023-12-07T09:00:46Z

i18n/en.json

@@ -1,6 +1,6 @@
 {
    "@metadata": {},
-    "title": "WikimediaOCR",
+    "title": "WikimediaOCR – Kraken Test",


This should remain as WikimediaOCR

Parthiv-M · 2023-12-07T09:01:44Z

package.json

@@ -13,6 +13,7 @@
        "regenerator-runtime": "^0.13.11",
        "select2": "^4.0.13",
        "select2-bootstrap-theme": "0.1.0-beta.10",
+        "stylelint": "^15.10.3",


stylelint can be removed from this PR as well

Parthiv-M · 2023-12-07T09:01:51Z

package.json

@@ -23,5 +24,6 @@
        "watch": "encore dev --watch",
        "build": "encore production --progress",
        "test": "grunt test"
-    }
+    },
+    "dependencies": {}


Empty entries should be removed

Parthiv-M · 2023-12-07T09:02:53Z

src/Engine/EngineBase.php

@@ -65,6 +72,7 @@ abstract class EngineBase {
 		'fro' => 'Franceis, François, Romanz (1400-1600)',
 		'ger-hd-m1' => 'Transkribus German handwriting M1',
 		'ger-15' => '15th-16th century German',
+		'german-fraktur-19th-20th-century' => 'German Fraktur 19th-20th century',


This is a Transkribus model and needs to be removed from this PR

stweil marked this pull request as draft August 19, 2023 11:13

stweil force-pushed the kraken branch from 9e5607e to 4227fae Compare August 22, 2023 09:31

stweil force-pushed the kraken branch 2 times, most recently from d773984 to a704816 Compare August 28, 2023 08:39

stweil force-pushed the kraken branch from bca0c1e to 0edc944 Compare August 31, 2023 12:13

stweil force-pushed the kraken branch 15 times, most recently from f1764a1 to 6d69032 Compare September 22, 2023 16:02

stweil added 5 commits October 12, 2023 13:30

Fix typo in name of newly introduced method (instatiate -> instantiate)

b0c99b1

Signed-off-by: Stefan Weil <[email protected]>

Add OCR models Fraktur and Latin for Tesseract

4c0565d

Both are not language specific, but support historic and current scripts used by many European languages. Signed-off-by: Stefan Weil <[email protected]>

WebProfilerBundle

2c3fb6b

Signed-off-by: Stefan Weil <[email protected]>

Remove unneeded code for Transkribus OCR engine

803fe24

Signed-off-by: Stefan Weil <[email protected]>

stweil added 21 commits October 12, 2023 11:45

Add script for kraken OCR

2168998

Signed-off-by: Stefan Weil <[email protected]>

Update KrakenEngine to support language selection

b4683e4

Cropping is also implemented now, but still untested. Signed-off-by: Stefan Weil <[email protected]>

Add models for kraken OCR

28f8db2

Signed-off-by: Stefan Weil <[email protected]>

Suppress warning from phpcs because usage of popen

166ad14

Signed-off-by: Stefan Weil <[email protected]>

Add austriannewspapers model for kraken

e597b4d

Signed-off-by: Stefan Weil <[email protected]>

Add missing documentation for new OCR engine kraken (required for CI test)

a54c70c

Signed-off-by: Stefan Weil <[email protected]>

Update package.json

d039754

Signed-off-by: Stefan Weil <[email protected]>

Update package-lock.json

a0c799c

Signed-off-by: Stefan Weil <[email protected]>

Support segmentation model for kraken OCR engine

22703d5

Signed-off-by: Stefan Weil <[email protected]>

Update API version for new release with kraken OCR engine

bc93361

Signed-off-by: Stefan Weil <[email protected]>

Add new API /api/available_segmentation_models

c9723b7

Segmentation models are currently only supported for kraken. All other OCR engines return an empty list. Signed-off-by: Stefan Weil <[email protected]>

Add segmentation model for kraken OCR

85775c5

Signed-off-by: Stefan Weil <[email protected]>

Add more OCR models for Tesseract

c6fe3c8

Signed-off-by: Stefan Weil <[email protected]>

npm: Add missing dependency stylelint

3f316bb

Signed-off-by: Stefan Weil <[email protected]>

Fix iteration over Transkribus line models

903ccfb

Signed-off-by: Stefan Weil <[email protected]>

Fix description of OpenAPI parameters langs and crop

b1427e8

Signed-off-by: Stefan Weil <[email protected]>

Add Transkribus model german-fraktur-19th-20th-century

d7fc002

Signed-off-by: Stefan Weil <[email protected]>

Fix kraken_ocr script

4d67cfd

Signed-off-by: Stefan Weil <[email protected]>

Add new OCR parameter to normalize the result text

dac4739

Signed-off-by: Stefan Weil <[email protected]>

Modify title shown on test web page

b5640ec

Signed-off-by: Stefan Weil <[email protected]>

Fix code injection

f281e6e

Signed-off-by: Stefan Weil <[email protected]>

stweil force-pushed the kraken branch from 6d69032 to f281e6e Compare October 12, 2023 12:07

Improve code to fix code injection

fa67a34

Signed-off-by: Stefan Weil <[email protected]>

Parthiv-M requested changes Dec 7, 2023
