Add new OCR parameter to normalize the result text #112
base: main
 * Normalize result by replacing some historic characters
 */
public function normalize() {
	$this->text = strtr( $this->text, [
Some (and more) of these translations could be done with `Normalizer::normalize( $this->text, Normalizer::FORM_KC )`, but that causes a runtime conflict with the Symfony class which is also called Normalizer.
> but that causes a runtime conflict with the Symfony class which is also called Normalizer.

It should work fine as long as you use `\Normalizer` here, or `use Normalizer;` at the top, to use the intl extension's one. The Symfony class is a polyfill for that, for when the intl extension isn't installed. If you use the former, then don't forget to add that extension to the requirements in composer.json.
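A minimal sketch of the suggestion above, assuming either the intl extension or the Symfony polyfill is available at runtime. The function name `normalizeOcrText` and the sample string are hypothetical and not part of this pull request; only `\Normalizer::normalize()` and `\Normalizer::FORM_KC` come from the discussion.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper: apply Unicode NFKC normalization to OCR text.
 * Needs the intl extension, or a polyfill that provides the same global class.
 */
function normalizeOcrText( string $text ): string {
	// The leading backslash selects the global Normalizer class,
	// whatever namespace or imports the surrounding file uses.
	$normalized = \Normalizer::normalize( $text, \Normalizer::FORM_KC );
	// normalize() returns false on failure (e.g. invalid UTF-8); keep the input in that case.
	return $normalized === false ? $text : $normalized;
}

// NFKC maps the long s (U+017F) to "s" and the fi ligature (U+FB01) to "fi".
echo normalizeOcrText( "Waſſer ﬁnden" ), PHP_EOL;
```

NFKC also folds many other compatibility characters, so it covers more than a hand-written replacement table would, which may or may not be what a given project wants.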
This looks like a good addition, but note that there have been various discussions over the years about how to normalize OCR output, and not always with much agreement, mainly because different Wikisources want to do things differently, and many already have gadgets in place for doing the exact replacements that they want.
For example, T278443 (fix issue with lines being formatted incorrectly) and T250185 (Make Wikisource-OCR handle paragraphs better).
I think there needs to be a way to make this configurable per-project, or perhaps to retrieve a config from on-wiki (e.g. a `normalize_config` param could point to a JSON page's URL, where the actual replacement patterns are defined).
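To illustrate the configurable approach, here is a rough sketch that applies replacement pairs read from a JSON document. The JSON shape (a flat object mapping each source string to its replacement) and the function name `applyNormalizeConfig` are assumptions made for illustration; fetching the JSON from the page that a `normalize_config` param points to is left out.

```php
<?php
declare( strict_types=1 );

/**
 * Hypothetical helper: apply replacement pairs defined in a JSON object.
 * The shape {"from": "to", ...} is an assumption, not an agreed format.
 */
function applyNormalizeConfig( string $text, string $configJson ): string {
	// Decode to an associative array; invalid JSON throws a JsonException.
	$pairs = json_decode( $configJson, true, 512, JSON_THROW_ON_ERROR );
	if ( !is_array( $pairs ) ) {
		throw new InvalidArgumentException( 'Config must be a JSON object.' );
	}
	// Keep only string replacements so strtr() receives a clean map.
	$pairs = array_filter( $pairs, 'is_string' );
	return strtr( $text, $pairs );
}

// Example config a wiki might define (hypothetical content).
$config = '{"ſ": "s", "ꝛ": "r"}';
echo applyNormalizeConfig( "Waſſer", $config ), PHP_EOL;
```

Reusing `strtr()` with an array keeps the behaviour close to the hard-coded table in this pull request, so only the source of the pairs changes.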
Signed-off-by: Stefan Weil <[email protected]>
No description provided.