'taakku' — all analyses lost in grammar checker #5

Open
snomos opened this issue Dec 18, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@snomos
Member

snomos commented Dec 18, 2024

Cf this:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram0-morph.mode
"<taakku>"
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
:

In this first step everything is correct. But in the next step something strange is happening:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram1-blanktag.mode
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
"<taakku>"
:

The word form has been moved to after the analyser output. I have no idea why and how. In any case, this leads to the analyses getting lost later, leaving the bare word form:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram.mode 
"<taakku>"
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0>
;	"marluk" Num Rel Pl <W:0.0> REMOVE:2385:tidlig0020A
;	"marluk" Orth/Alt N Abs Sg <W:0.0> REMOVE:2189:0001P
:

Any idea, @TinoDidriksen @unhammer @flammie ?

@snomos snomos added the bug Something isn't working label Dec 18, 2024
@TinoDidriksen
Member

TA "una" Gram/Dem Pron Abs Pl <W:0.0> is not a valid CG-reading, as it doesn't start with ". Thus it is treated as text and moved out of the cohort it was in.

Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding "una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>. And it performs other needed corrections.
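A minimal sketch of the prefix move described above, not the actual kal-tokenise code. It assumes that the only prefixes are the bare tokens `TA` and `AA` in front of the quoted baseform, and that they are rewritten as `Prefix/…` tags placed directly after the baseform. The function name `move_prefix` is hypothetical.

```python
import re

# Assumption: the only prefix tokens are the two mentioned in this thread.
PREFIXES = {"TA", "AA"}

# Indentation, optional bare tokens, the quoted baseform, then the remaining tags.
READING = re.compile(r'^(\s+)((?:\S+\s+)*?)("[^"]*")(.*)$')


def move_prefix(line):
    """Turn '    TA "una" Tags' into '    "una" Prefix/TA Tags'.

    Lines that are not indented, or that have no known prefix token in
    front of the baseform, are returned unchanged.
    """
    m = READING.match(line)
    if not m:
        return line
    indent, before, baseform, rest = m.groups()
    tokens = before.split()
    if not tokens or any(t not in PREFIXES for t in tokens):
        return line
    prefix_tags = "".join(f" Prefix/{t}" for t in tokens)
    return f"{indent}{baseform}{prefix_tags}{rest}"


if __name__ == "__main__":
    print(move_prefix('    TA "una" Gram/Dem Pron Abs Pl <W:0.0>'))
```

Cohort lines such as `"<taakku>"` and readings that already start with a baseform pass through untouched, so the function could be mapped over every line of the stream.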

@snomos
Member Author

snomos commented Dec 18, 2024

I hadn't realised that the CG reading syntax requires a reading to start with ". This is quite problematic when the input is FST analyses of prefix-heavy languages. What do you do then? In a grammar checker context, tag order is important: it needs to be retained for word form generation at the end of the processing.

@flammie
Contributor

flammie commented Dec 18, 2024

the analysis in xerox format is:

$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol 
taakku	TA+una+Gram/Dem+Pron+Abs+Pl	0,000000
taakku	TA+una+Gram/Dem+Pron+Rel+Pl	0,000000

I guess hfst-tokenise needs to do some guesswork to work out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma/tags are here either.

@snomos
Member Author

snomos commented Dec 18, 2024

The assumption for hfst-tokenise is very simple, and automatically handled in the FST pipeline:

  • all multichar symbols are tags
  • sequences of non-multichars are strings/word forms
  • there should be one and only one such string per line in the analysis cohort

Placement of string in the analysis is not considered in the hfst-tokenise output, exactly because of prefixing languages.

Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward.
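The three assumptions listed above can be sketched as a small splitter. This is an illustration only, not how hfst-tokenise is implemented. The symbol set passed in is hypothetical, since the real multichar symbols come from the FST alphabet, and `split_analysis` is a made-up name.

```python
def split_analysis(analysis, multichar_symbols):
    """Split an analysis string into (strings, tags).

    Every multichar symbol is a tag; every maximal run of other
    characters is a string (word form or lemma).
    """
    # Try longer symbols first so that overlapping symbols resolve greedily.
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    strings, tags, current = [], [], ""
    i = 0
    while i < len(analysis):
        match = next((s for s in symbols if analysis.startswith(s, i)), None)
        if match:
            if current:
                strings.append(current)
                current = ""
            tags.append(match)
            i += len(match)
        else:
            current += analysis[i]
            i += 1
    if current:
        strings.append(current)
    return strings, tags


if __name__ == "__main__":
    # Hypothetical symbol inventory, chosen to fit the analysis quoted in this thread.
    SYMBOLS = {"TA+", "+Gram/Dem", "+Pron", "+Abs", "+Rel", "+Pl"}
    strings, tags = split_analysis("TA+una+Gram/Dem+Pron+Abs+Pl", SYMBOLS)
    # The third assumption: exactly one string per analysis.
    assert len(strings) == 1
    print(strings, tags)
```

Because the string is found by elimination rather than by position, a prefix symbol in front of it does not affect which part is taken as the string.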

@TinoDidriksen
Member

We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, AA and TA, and they are definitely not the baseform.
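A rough sketch of the reverse direction, going from a CG reading back to an FST-style string for generation. It is not the actual kal-generate logic. It assumes that prefixes appear as `Prefix/…` tags, that tags in angle brackets such as `<W:0.0>` are dropped, and that the target format is the `+`-separated form shown in the hfst-lookup output earlier in this thread. The name `cg_to_fst` is hypothetical.

```python
import re


def cg_to_fst(reading):
    """Convert '"una" Prefix/TA Gram/Dem Pron' into 'TA+una+Gram/Dem+Pron'."""
    m = re.match(r'\s*"([^"]*)"\s*(.*)$', reading)
    if not m:
        raise ValueError(f"not a CG reading: {reading!r}")
    baseform, rest = m.groups()
    prefixes, tags = [], []
    for tag in rest.split():
        if tag.startswith("<") and tag.endswith(">"):
            continue  # assumption: weights and similar bracketed tags are not generated
        if tag.startswith("Prefix/"):
            prefixes.append(tag[len("Prefix/"):])
        else:
            tags.append(tag)
    # Prefixes go back in front of the baseform, all other tags keep their order.
    return "+".join(prefixes + [baseform] + tags)


if __name__ == "__main__":
    print(cg_to_fst('    "una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>'))
```

The order of the remaining tags is preserved, which matters for generation as noted earlier in the thread.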

A decade ago it used to be analyzed as "TA" una Gram/Dem Pron Abs Pl because we simply defined the first tag as baseform, but this caused other issues because the actual baseform was left as a tag. hfst-tokenise fixed that issue and we could very easily work around prefixes.

CG has always had a somewhat strict stream format. CG-2 required the first tag to be either [baseform] or "baseform". CG-3 changed this to only support "baseform", in order to allow more mixed content.
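To make the effect of that rule concrete, here is a simplified line classifier. It is a sketch built only from what is said and shown in this thread, not a reimplementation of the CG-3 stream parser: a cohort line starts with `"<`, a reading is an indented line whose first token starts with `"` (optionally preceded by `;`, as in the trace output above), and anything else counts as text. The name `classify` is hypothetical.

```python
def classify(line):
    """Classify one line of a CG stream as 'cohort', 'reading' or 'text'."""
    if line.startswith('"<'):
        return "cohort"
    # In the trace output above, removed readings carry a leading ';'.
    body = line[1:] if line.startswith(";") else line
    if body[:1].isspace() and body.lstrip().startswith('"'):
        return "reading"
    return "text"


if __name__ == "__main__":
    stream = [
        '"<taakku>"',
        '    TA "una" Gram/Dem Pron Abs Pl <W:0.0>',
        '"<marluk>"',
        '\t"marluk" Num Abs Pl <W:0.0>',
        ';\t"marluk" Num Rel Pl <W:0.0> REMOVE:2385:tidlig0020A',
    ]
    for line in stream:
        print(classify(line), repr(line))
```

Under these assumptions the `TA "una" …` line is classified as text, which matches the symptom reported at the top of the issue.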

@snomos
Member Author

snomos commented Dec 18, 2024

I see that both kal-tokenise and kal-generate are perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be ok to replace that part of the scripts with some simple (Rust/C/whatever) code to move prefix tags back and forth as needed, and leave the rest for now?

In the end I would like to have all the functionality of the perl scripts encoded in one of FST/CG/compiled binary, but I suggest we start with the prefix tags and see how that works.

@TinoDidriksen
Member

Sure.

There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe, before needing to port all the Perl and Python parts to C++. But I guess my yule project could be to port it all.

@snomos
Member Author

snomos commented Dec 18, 2024

What is the semantic tag module, and how is it used?

@flammie
Contributor

flammie commented Dec 19, 2024

> I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe,

I think I implemented most of it along the way some time ago, but it needs testing (and development), since it's the least portable and standardised code.

@TinoDidriksen
Member

> What is the semantic tag module, and how is it used?

Almost all of our semantic tags come from our online dictionary interface, Katersat, not from the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics and provide translations and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development.

We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation.
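As an illustration of that last step, here is a sketch of adding dictionary-derived tags to a reading. It does not use the katersat module's real interface. It assumes a plain mapping from baseform to a list of tags and appends the tags at the end of the reading. The name `add_semantic_tags` and the tag `Sem/Example` are placeholders, not real Katersat data.

```python
import re


def add_semantic_tags(reading, semantics):
    """Append the tags stored for the reading's baseform, if any."""
    m = re.match(r'(\s+"([^"]*)")(.*)$', reading)
    if not m:
        return reading  # not a reading line; leave it alone
    head, baseform, rest = m.groups()
    extra = "".join(f" {tag}" for tag in semantics.get(baseform, []))
    return f"{head}{rest}{extra}"


if __name__ == "__main__":
    # Placeholder lookup table standing in for data extracted from the dictionary.
    semantics = {"marluk": ["Sem/Example"]}
    print(add_semantic_tags('\t"marluk" Num Abs Pl <W:0.0>', semantics))
```

Keeping the lookup outside the FST means the table can be regenerated from the dictionary without recompiling the analyser, which fits the workflow described above.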
