Danish traineddata file doesn't include the "@" character #29
Comments
That's a problem of the model (traineddata), not of Tesseract. See dan.unicharset for a list of supported characters. If you want, you can send a pull request which fixes the list of desired characters.
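To make the check against dan.unicharset concrete, here is a minimal Python sketch that reports which characters are absent from a unicharset listing. It assumes the usual plain-text layout (an entry count on the first line, then one entry per line starting with the character, with the token NULL standing for a space) and that the file has already been unpacked from the traineddata, for example with `combine_tessdata -u`. The function names and the sample data are hypothetical and only for illustration.

```python
# Made-up, heavily simplified sample; a real unicharset has more property fields.
SAMPLE_UNICHARSET = (
    "3\n"
    "NULL 0 NULL 0\n"
    "a 3 0,255,0,255 Latin 1 0 1 a\n"
    "§ 10 0,255,0,255 Common 2 0 2 §\n"
)


def parse_unicharset(text: str) -> set[str]:
    """Return the set of characters listed in the contents of a unicharset file."""
    chars = set()
    # Assumption: the first line is the entry count, so it is skipped.
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        # Assumption: each entry starts with the character, then a space.
        token = line.split(" ", 1)[0]
        # Assumption: the literal token NULL represents the space character.
        chars.add(" " if token == "NULL" else token)
    return chars


def missing_characters(unicharset_text: str, wanted: str) -> list[str]:
    """List the characters from `wanted` that the unicharset does not contain."""
    supported = parse_unicharset(unicharset_text)
    return [c for c in wanted if c not in supported]


# With the made-up sample above, only "@" is reported as missing.
print(missing_characters(SAMPLE_UNICHARSET, "@§"))
```

For a real check, the contents of the unpacked dan.unicharset would be read as UTF-8 text and passed in place of the sample string.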
There won't be a fixed
@stweil Thanks for the response.
It is possible to enhance the existing dan.traineddata with the missing characters through additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files containing their transcriptions.
Thank you for your response. I do not think this is a viable option for me, but thanks for your reply and for the information!
It lacks '§' as well, which is used in every single legal document in existence...
@Furtifk, @poizan42, especially for older Danish texts you could also try one of the models which I trained recently, for example Fraktur_50000000.502_198857.traineddata. It was trained based on script/Fraktur with lots of historic documents, and in my experience it works well, although I did not add a dictionary. You will therefore get a warning at runtime, but you could add a Danish dictionary if needed.
Have there been any improvements recently with the Danish dictionary?
No, and I am afraid there won't be an improvement unless someone works on it.
Environment
Current Behavior: Danish traineddata file doesn't include the "@" character
Expected Behavior: Danish traineddata file should include the "@" character
Suggested Fix: Danish traineddata file should include the "@" character
File to run OCR on:
To reproduce the problem, I have a zip file that I can send so that you can run a very basic test which displays the results for the eng and dan traineddata files side by side. Whoever looks into this issue, please contact me to receive it.
This is quite a pressing issue, so any response is appreciated.
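A comparison of this kind could be sketched in Python roughly as follows. This is not the test from the zip file; it only assumes that the `tesseract` command-line tool is installed with both the eng and dan models, and the helper names and image path are hypothetical.

```python
import subprocess


def ocr_command(image_path: str, lang: str) -> list[str]:
    """Build a tesseract call that prints the recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path: str, lang: str) -> str:
    """Run tesseract for one language and return the recognized text."""
    result = subprocess.run(
        ocr_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def chars_only_in_first(first: str, second: str) -> list[str]:
    """Characters that occur in `first` but nowhere in `second`."""
    return sorted(set(first) - set(second))


def compare_languages(image_path: str) -> list[str]:
    """Characters that the eng model produced but the dan model did not."""
    return chars_only_in_first(run_ocr(image_path, "eng"), run_ocr(image_path, "dan"))
```

Calling `compare_languages("sample.png")` on an image containing an e-mail address (the file name is a placeholder) would be expected to list "@" if only the eng model can output that character.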