Dutch hyphenation #46
Maybe a useful tip from CBB (Christelijke Bibliotheek voor Blinden en Slechtzienden): they use a version of hyph_nl_NL.dic from OpenTaal.
The OpenTaal data sounds promising. Will look at this next week and maybe you can fill me in on the best way to implement this in mod-braille (from what I read there is OpenOffice data available for this dict).
I'm guessing this is the hyphenation dictionary from OpenTaal.org that CBB is using. Maybe I can use the same approach as in snaekobbi/issues#2 for this?
The dictionary you linked is the one that is already included in Pipeline. I think CBB was maybe referring to an updated version. We'd have to ask them. We need test data before we can do anything else. Then, if you need to modify the dictionary, it's best you copy the file to a new project (like Jukka did with pipeline-mod-celia) because the dictionary from LibreOffice is downloaded and packaged automatically.
I believe the OpenTaal data dates from 2011, but I'll see if I can confirm this with someone from CBB.
Hyphenated words I guess. I understand you may not have that kind of data just lying around. But if there's nothing to test then our job is done. Then we just take what's currently available. I think at the minimum we should have a small test, if only so we can easily add more to it later. Jukka's test data is also very limited, but it's easy to add more. He did it in pipeline-mod-celia because that's where his dictionary lives, but we could have your tests in functional-testing.
So if I understand this correctly, we have:
And we need:
For Finnish the test data is in the JUnit test case. I could clone this into another module, but I think it would be a bit nicer to have something similar to liblouis' harness tests for this, i.e. experts only worry about JSON or some other format, and the JUnit tests pull these in and run them. Also, which of the three libs (Libhyphen, Hyphenator, TexHyphenator) should we use?
I suggest we use XML instead of JSON. Something like this. If everybody includes test data in that format in the functional-testing repo, then I can have one test (JUnit or XSpec) that runs them all. Of course, from the point of view of the developer it is nice to have the tests closer to the implementation, but since you don't intend to modify the dictionary for the time being, that's not a problem. Later we can still copy/move the test to its own module. Which of the libraries we should use is not so important, I think. What I've done with Finnish is convert the patterns into several formats at build time so that several implementations become available in DP2. As long as all implementations behave the same (which they should in theory, and we can easily test each of them with the same test data), we don't have to worry about which one is actually used.
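To make the idea of data-driven tests concrete, here is a rough sketch of a harness that reads hyphenation test cases from XML and runs the same cases against every available implementation. It is written in Python for brevity only; the real test would be JUnit or XSpec as suggested above. The XML element and attribute names (`hyphenation-tests`, `test`, `word`, `expected`), the sample words, and the stub hyphenators are all assumptions made up for illustration. They are not the format linked in this thread and not the actual Libhyphen, Hyphenator or TexHyphenator APIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical test-data format (an assumption, not the format proposed in the thread).
# Each <test> pairs a word with its expected hyphenation, using "-" as the break marker.
SAMPLE_XML = """\
<hyphenation-tests lang="nl">
  <test word="water" expected="wa-ter"/>
  <test word="fiets" expected="fiets"/>
</hyphenation-tests>
"""


def load_cases(xml_text):
    """Parse the XML and return a list of (word, expected) pairs."""
    root = ET.fromstring(xml_text)
    return [(t.get("word"), t.get("expected")) for t in root.iter("test")]


def run_cases(cases, hyphenators):
    """Run every case against every implementation.

    `hyphenators` maps an implementation name to a callable word -> hyphenated word.
    Returns a list of (implementation, word, expected, actual) for each mismatch.
    """
    failures = []
    for name, hyphenate in hyphenators.items():
        for word, expected in cases:
            actual = hyphenate(word)
            if actual != expected:
                failures.append((name, word, expected, actual))
    return failures


def make_stub(table):
    """Stand-in for a real hyphenator: looks words up in a fixed table,
    and returns the word unhyphenated when it is not in the table."""
    return lambda word: table.get(word, word)


if __name__ == "__main__":
    cases = load_cases(SAMPLE_XML)
    hyphenators = {
        "stub-a": make_stub({"water": "wa-ter"}),  # agrees with the test data
        "stub-b": make_stub({}),                   # never hyphenates anything
    }
    for failure in run_cases(cases, hyphenators):
        print(failure)
```

Because the harness only depends on a word-to-string callable, adding another implementation means adding one entry to the `hyphenators` mapping, and language experts only have to touch the XML file.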
See snaekobbi/issues#2 for the various options for implementing a hyphenator.