'taakku' — all analyses lost in grammar checker #5
Comments
Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding |
I hadn't realised that the CG reading syntax requires it to start with |
The analysis in Xerox format is:
$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol
taakku TA+una+Gram/Dem+Pron+Abs+Pl 0,000000
taakku TA+una+Gram/Dem+Pron+Rel+Pl 0,000000
My guess is that hfst-tokenise has to do some guesswork to work out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma and tags are here either. |
The assumption for
Placement of string in the analysis is not considered in the Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward. |
We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, A decade ago it used to be analyzed as CG has always had a somewhat strict stream format. CG-2 required the first tag to be either |
I see that both kal-tokenise and kal-generate are Perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be OK to replace that part of the scripts with some simple (Rust/C/whatever) code that moves prefix tags back and forth as needed, and leave the rest for now? In the end I would like to have all the functionality of the Perl scripts encoded in FST, CG, or a compiled binary, but I suggest we start with the prefix tags and see how that works. |
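To make the prefix-moving idea concrete, here is a minimal sketch in Python (chosen for brevity, not the Rust/C proposed above). It is not the logic of kal-tokenise or kal-generate, whose exact output is not shown in this thread. The function names, the `Prefix/` marker, and the assumption that the lemma contains no `+` are all hypothetical; only the `TA` prefix and the sample analysis come from the hfst-lookup output quoted earlier.

```python
def fst_to_cg(analysis, prefixes=("TA",), marker="Prefix/"):
    """Turn 'TA+una+Pron' into a CG-style reading that starts with the lemma.

    Assumptions (not from the real scripts): tags are '+'-separated, the lemma
    itself contains no '+', and moved prefixes are marked with `marker`.
    """
    parts = analysis.split("+")
    lead = []
    # Collect prefix tags that precede the lemma, always leaving one part as lemma.
    while len(parts) > 1 and parts[0] in prefixes:
        lead.append(parts.pop(0))
    lemma, tags = parts[0], parts[1:]
    marked = [marker + p for p in lead]
    return " ".join(['"' + lemma + '"'] + marked + tags)


def cg_to_fst(reading, marker="Prefix/"):
    """Inverse of fst_to_cg: put marked prefixes back in front of the lemma."""
    end = reading.rindex('"')          # closing quote of the base form
    lemma = reading[1:end]
    tags = reading[end + 1:].split()
    lead = [t[len(marker):] for t in tags if t.startswith(marker)]
    rest = [t for t in tags if not t.startswith(marker)]
    return "+".join(lead + [lemma] + rest)


cg = fst_to_cg("TA+una+Gram/Dem+Pron+Abs+Pl")
print(cg)
print(cg_to_fst(cg))
```

The two functions round-trip on the `taakku` analysis, which is the property a replacement for the prefix-handling part of the scripts would need to preserve.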
Sure. There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get |
What is the semantic tag module, and how is it used? |
I think I implemented most of it at on the way some time ago but it needs testing (and development) since it's the least portable and standardised code. |
Almost all of our semantic tags come from our online dictionary interface, Katersat, and not the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics and provide translations and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development. We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation. |
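As a rough illustration of that last step, the sketch below appends semantic tags to a CG-style reading after analysis. It is not the katersat module: the lookup keyed on lemma alone, the function name, and the `Sem/Hypothetical` tag and sample entry are assumptions made up for the example.

```python
def add_semantic_tags(reading, semantics):
    """Append semantic tags for the reading's lemma, if the table has any.

    `semantics` stands in for data extracted from a dictionary database;
    here it is a plain dict from lemma to a list of tags (an assumption).
    """
    end = reading.rindex('"')      # closing quote of the base form
    lemma = reading[1:end]
    extra = semantics.get(lemma, [])
    return " ".join([reading] + list(extra))


# Invented sample table, for illustration only.
table = {"illu": ["Sem/Hypothetical"]}
print(add_semantic_tags('"illu" N Abs Sg', table))
print(add_semantic_tags('"una" Pron Abs Pl', table))
```

Readings whose lemma is missing from the table pass through unchanged, so the step can sit in the pipe without affecting words that have no semantic tagging yet.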
Cf this:
In this first step everything is correct. But in the next step something strange is happening:
The word form has been moved to after the analyser output. I have no idea why and how. In any case, this leads to the analyses getting lost later, leaving the bare word form:
Any idea, @TinoDidriksen @unhammer @flammie ?