Add new OCR parameter to normalize the result text #112

Draft: wants to merge 1 commit into base: main

Conversation

stweil
Contributor

@stweil stweil commented Sep 22, 2023

No description provided.

@stweil
Contributor Author

stweil commented Sep 22, 2023

Example: Tesseract OCR with and without normalization.

The normalization works with any OCR engine. The cache always stores the original OCR text, so it is possible to switch to normalized text without a new OCR run.
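
The caching behaviour described here might look roughly like the following. This is a minimal sketch, not the tool's actual code: the `OcrCache` class, its methods, and the single sample replacement are all hypothetical.

```php
<?php
// Hypothetical sketch: cache the raw OCR text and normalize only on read.
// Class and method names are illustrative, not taken from this pull request.

final class OcrCache {
	/** @var array<string,string> Raw OCR text, keyed by image identifier. */
	private array $store = [];

	public function put( string $key, string $rawText ): void {
		$this->store[$key] = $rawText;
	}

	public function get( string $key, bool $normalize = false ): ?string {
		if ( !isset( $this->store[$key] ) ) {
			return null;
		}
		$text = $this->store[$key];
		// Normalization works on a copy; the cached original stays untouched.
		return $normalize ? strtr( $text, [ 'ſ' => 's' ] ) : $text;
	}
}

$cache = new OcrCache();
$cache->put( 'page1.jpg', 'Waſſer' );
echo $cache->get( 'page1.jpg' ), "\n";       // original text
echo $cache->get( 'page1.jpg', true ), "\n"; // normalized text
```

Because only the raw text is stored, toggling the parameter never requires a second OCR run.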

	 * Normalize result by replacing some historic characters
	 */
	public function normalize() {
		$this->text = strtr( $this->text, [
Contributor Author

@stweil stweil Sep 22, 2023

Some (and more) of these translations could be done with Normalizer::normalize( $this->text, Normalizer::FORM_KC ), but that causes a runtime conflict with the Symfony class which is also called Normalizer.
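
To illustrate the "some (and more)" point, here is a small sketch comparing an explicit table with NFKC. The replacement table is a made-up example (the one in the diff is not shown in full above), and the NFKC part only runs if a global `Normalizer` class is available.

```php
<?php
// Sketch only: this table is a hypothetical example, not the one from the diff.
$text = 'Waſſer ﬁnden, 5 m²';

// Explicit table: only the listed characters are replaced.
$viaTable = strtr( $text, [
	'ſ' => 's',  // U+017F LATIN SMALL LETTER LONG S
	'ﬁ' => 'fi', // U+FB01 LATIN SMALL LIGATURE FI
] );
echo $viaTable, "\n";

// NFKC applies every Unicode compatibility mapping, so it also rewrites
// characters that are not in the table above (here the superscript two).
$viaNfkc = null;
if ( class_exists( 'Normalizer' ) ) {
	$viaNfkc = \Normalizer::normalize( $text, \Normalizer::FORM_KC );
	echo $viaNfkc, "\n";
}
```

The table leaves `m²` alone, while NFKC turns it into `m2`; whether that is wanted depends on the project.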

Member

> but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you use `\Normalizer` here or `use Normalizer;` at the top, to use the intl extension's one. The Symfony class is a polyfill for it, used when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.
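
A minimal sketch of the import variant, assuming a namespaced file; the namespace and the function name are hypothetical:

```php
<?php
namespace Example\Ocr; // hypothetical namespace

// Import the global class so the short name cannot resolve to
// Example\Ocr\Normalizer or to another imported class of the same name.
use Normalizer;

function normalizeText( string $text ): string {
	$result = Normalizer::normalize( $text, Normalizer::FORM_KC );
	// normalize() returns false on failure; fall back to the input then.
	return $result === false ? $text : $result;
}
```

Writing `\Normalizer::normalize( ... )` without the `use` line resolves to the same global class. To require the extension, the usual composer.json entry is `"ext-intl": "*"` under `require`.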

Member

@samwilson samwilson left a comment

This looks like a good addition, but note that there have been various discussions over the years about how to normalize OCR output, and not always with much agreement, mainly because different Wikisources want to do things differently, and many already have gadgets in place for doing the exact replacements that they want.

For example, T278443 ("fix issue with lines being formatted incorrectly") and T250185 ("Make Wikisource-OCR handle paragraphs better").

I think there needs to be a way to make this configurable per-project, or perhaps retrieve a config from on-wiki (e.g. a normalize_config param could point to a JSON page's URL, where the actual replacement patterns are defined).
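
As a rough sketch of that idea: the JSON layout (a `replacements` object) and the function name below are assumptions made for illustration, and fetching the page from the wiki is left out.

```php
<?php
// Hypothetical sketch of per-project configuration. The JSON layout and the
// function name are assumptions; nothing here exists in the tool yet.

function normalizeWithConfig( string $text, string $configJson ): string {
	$config = json_decode( $configJson, true );
	// Missing or malformed configuration: return the text unchanged.
	if ( !is_array( $config ) || !is_array( $config['replacements'] ?? null ) ) {
		return $text;
	}
	// Keep only non-empty search strings that map to string replacements.
	$pairs = [];
	foreach ( $config['replacements'] as $from => $to ) {
		if ( (string)$from !== '' && is_string( $to ) ) {
			$pairs[$from] = $to;
		}
	}
	return $pairs ? strtr( $text, $pairs ) : $text;
}

// In practice the JSON would come from the on-wiki page; it is inlined here.
$json = '{"replacements": {"ſ": "s", "ﬁ": "fi"}}';
echo normalizeWithConfig( 'Waſſer ﬁnden', $json ), "\n";
```

Each Wikisource could then keep its own replacement list without any change to the tool's code.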

@stweil stweil marked this pull request as draft September 23, 2023 08:05
2 participants