'taakku' — all analyses lost in grammar checker #5

snomos · 2024-12-18T12:36:51Z

Compare this:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram0-morph.mode
"<taakku>"
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
:
…

In this first step everything is correct. But in the next step something strange is happening:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram1-blanktag.mode
    TA "una" Gram/Dem Pron Abs Pl <W:0.0>
    TA "una" Gram/Dem Pron Rel Pl <W:0.0>
"<taakku>"
:
…

The word form has been moved to after the analyser output. I have no idea why or how. In any case, this leads to the analyses getting lost later, leaving the bare word form:

echo "taakku marluk inuunerminni taama pilluartigisimanngisaannarput" | modes/trace-kalgram.mode 
"<taakku>"
: 
"<marluk>"
	"marluk" Num Abs Pl <W:0.0>
;	"marluk" Num Rel Pl <W:0.0> REMOVE:2385:tidlig0020A
;	"marluk" Orth/Alt N Abs Sg <W:0.0> REMOVE:2189:0001P
: 
…

Any idea, @TinoDidriksen @unhammer @flammie ?
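Editorial sketch: to spot this symptom without reading the whole trace, one could scan the CG stream for cohorts that end up with no readings. The function name `cohorts_without_readings` is hypothetical, and the line classification (a cohort starts with `"<`, a live reading is indented and starts with `"`, a removed reading starts with `;`) is an assumption based only on the output shown above.

```python
def cohorts_without_readings(stream: str) -> list[str]:
    """Return the word forms of cohorts that carry no live reading."""
    empty = []
    current, has_reading = None, False
    for line in stream.splitlines():
        if line.startswith('"<'):
            # A new cohort starts; report the previous one if it had no readings.
            if current is not None and not has_reading:
                empty.append(current)
            current, has_reading = line.strip(), False
        elif line[:1] in (" ", "\t") and line.lstrip().startswith('"'):
            # Indented line starting with a quoted baseform: a live reading.
            has_reading = True
        # Everything else (removed readings starting with ';', blanks, text) is ignored.
    if current is not None and not has_reading:
        empty.append(current)
    return empty
```

Fed the final trace output above, this would report only the `taakku` cohort, since `marluk` still has one live reading.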

TinoDidriksen · 2024-12-18T12:44:29Z

TA "una" Gram/Dem Pron Abs Pl <W:0.0> is not a valid CG-reading, as it doesn't start with ". Thus it is treated as text and moved out of the cohort it was in.

Our own pipe's kal-tokenise moves such non-baseform prefixes, yielding "una" Prefix/TA Gram/Dem Pron Abs Pl <W:0.0>. It also performs other needed corrections.
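Editorial sketch: the prefix move described here can be illustrated in a few lines. This is not the kal-tokenise implementation, which does more than this; the function name `move_prefix_tags` and the regular expression are hypothetical, and only the `Prefix/` naming is taken from the example output above.

```python
import re

# Indented reading whose quoted baseform is preceded by one or more bare tags.
# Groups: indentation, leading tags, quoted baseform, remainder of the line.
READING_WITH_PREFIX = re.compile(r'^(\s+)((?:[^\s"]+\s+)+)("[^"]*")(.*)$')


def move_prefix_tags(line: str) -> str:
    """Rewrite `TA "una" ...` as `"una" Prefix/TA ...`; leave other lines alone."""
    m = READING_WITH_PREFIX.match(line)
    if m is None:
        return line  # cohort line, valid reading, or plain text
    indent, prefixes, baseform, rest = m.groups()
    moved = " ".join(f"Prefix/{p}" for p in prefixes.split())
    return f"{indent}{baseform} {moved}{rest}"
```

Applied to the reading from the first trace, this gives a line that starts with `"`, so it would no longer be treated as text by the rule described above.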

snomos · 2024-12-18T12:50:58Z

I hadn't realised that the CG reading syntax requires a reading to start with ". This is quite problematic when the input is FST analyses of prefix-heavy languages. What do you do then? In a grammar checker context, tag order matters, because it has to be retained for word form generation at the end of the processing.

flammie · 2024-12-18T12:54:45Z

the analysis in xerox format is:

$ echo taakku | hfst-lookup src/fst/analyser-gt-desc.hfstol 
taakku	TA+una+Gram/Dem+Pron+Abs+Pl	0,000000
taakku	TA+una+Gram/Dem+Pron+Rel+Pl	0,000000

I can guess that hfst-tokenise needs to do some guesswork to work out which parts of this analysis are lemma and which are tags, based on common practices that don't include this kind of combination. I'm not sure what the correct lemma/tags are here either.

snomos · 2024-12-18T13:01:59Z

The assumption for hfst-tokenise is very simple, and automatically handled in the FST pipeline:

all multichar symbols are tags
sequences of non-multichars are strings/word forms
there should be one and only one such string per line in the analysis cohort

Placement of string in the analysis is not considered in the hfst-tokenise output, exactly because of prefixing languages.

Accented chars in Unicode using combining diacritics are always automatically converted to a sequence of symbols (in the FST sense), both to support the criteria above and to make parsing of input text simple and straightforward.
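Editorial sketch: the three criteria above can be written down as a small classifier over an already segmented analysis. This is not hfst-tokenise code; the function name `split_analysis` is hypothetical, and the symbol segmentation used in the usage note (`TA+`, `+Pron`, and so on) is an assumption about how the analyser's multichar symbols are defined.

```python
def split_analysis(symbols: list[str]) -> tuple[str, list[str]]:
    """Split FST symbols into (string, tags).

    Multichar symbols are tags; runs of single-character symbols are strings.
    Exactly one such string is expected, wherever it sits in the analysis.
    """
    strings, tags, run = [], [], []
    for sym in symbols:
        if len(sym) == 1:
            run.append(sym)  # part of a string; combining marks are single symbols too
        else:
            if run:
                strings.append("".join(run))
                run = []
            tags.append(sym)
    if run:
        strings.append("".join(run))
    if len(strings) != 1:
        raise ValueError(f"expected exactly one string, found {len(strings)}")
    return strings[0], tags
```

Under that segmentation assumption, `split_analysis(["TA+", "u", "n", "a", "+Gram/Dem", "+Pron", "+Abs", "+Pl"])` yields the string `una` plus five tags, regardless of the prefix tag coming first.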

TinoDidriksen · 2024-12-18T13:03:00Z

We have kal-generate to move them back and turn CG into FST for generation. Greenlandic only has 2 prefixes, AA and TA, and they are definitely not the baseform.

A decade ago it used to be analyzed as "TA" una Gram/Dem Pron Abs Pl because we simply defined the first tag as baseform, but this caused other issues because the actual baseform was left as a tag. hfst-tokenise fixed that issue and we could very easily work around prefixes.

CG has always had a somewhat strict stream format. CG-2 required the first tag to be either [baseform] or "baseform". CG-3 changed this to only support "baseform", in order to allow more mixed content.
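Editorial sketch: the reverse step mentioned at the start of this comment (moving prefixes back and turning a CG reading into an FST string for generation) might look roughly like this. It is not kal-generate; the function name `reading_to_fst` is hypothetical, the `+`-joined target format is assumed from the hfst-lookup output quoted earlier in the thread, and dropping `<...>` tags such as the weight is an assumption.

```python
import re

# Indented reading: quoted baseform followed by whitespace-separated tags.
READING = re.compile(r'^\s+"([^"]*)"\s*(.*)$')


def reading_to_fst(line: str) -> str:
    """Turn `"una" Prefix/TA Gram/Dem ...` into `TA+una+Gram/Dem+...`."""
    m = READING.match(line)
    if m is None:
        raise ValueError(f"not a CG reading: {line!r}")
    baseform, rest = m.groups()
    prefixes, tags = [], []
    for tag in rest.split():
        if tag.startswith("<") and tag.endswith(">"):
            continue  # skip angle-bracket tags such as the weight <W:0.0>
        if tag.startswith("Prefix/"):
            prefixes.append(tag[len("Prefix/"):])  # goes back in front of the baseform
        else:
            tags.append(tag)
    return "+".join(prefixes + [baseform] + tags)
```

Because the remaining tags are kept in their original order, the result for the `taakku` reading matches the analysis string that the analyser produced in the first place.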

snomos · 2024-12-18T13:41:47Z

I see that both kal-tokenise and kal-generate are Perl scripts. That is not very portable for standalone grammar checkers. I understand that there is more to these scripts than just moving prefix tags, but would it be ok to replace that part of the scripts with some simple (Rust/C/whatever) code to move prefix tags back and forth as needed, and leave the rest for now?

In the end I would like to have all the functionality of the Perl scripts encoded in one of FST/CG/compiled binary, but I suggest we start with the prefix tags and see how that works.

TinoDidriksen · 2024-12-18T14:14:13Z

Sure.

There are other things we need in the final pipe, though, such as the https://github.com/Oqaasileriffik/katersat semantic tags module. I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe, before needing to port all the Perl and Python parts to C++. But I guess my yule project could be to port it all.

snomos · 2024-12-18T21:02:26Z

What is the semantic tag module, and how is it used?

flammie · 2024-12-19T00:11:13Z

> I would prefer to get <sh> implemented in libdivvun so we can test the actual pipe,

I think I implemented most of it along the way some time ago, but it needs testing (and development), since it's the least portable and standardised code.

TinoDidriksen · 2024-12-19T12:46:37Z

> What is the semantic tag module, and how is it used?

Almost all of our semantic tags come from our online dictionary interface, Katersat, and not the FST. Katersat is easier for everyone to work with. Student helpers and others can easily be taught how to tag semantics, provide translations, and explanations in Katersat, without needing to know how to change the FST. It's also vastly faster during development.

We then extract those semantic tags from Katersat and apply them to the output of the FST, so the pipe can make use of them for disambiguation.
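Editorial sketch: the idea of applying externally stored semantic tags to analyser output can be illustrated as below. This does not reflect the actual katersat module. The function name `add_semantic_tags`, the lookup keyed on baseform alone, the tag name `Sem/Hypothetical`, and appending the tags at the end of the reading are all assumptions made for illustration.

```python
import re

# Indented reading: group 1 is indentation plus quoted baseform,
# group 2 is the baseform itself, group 3 is the rest of the line.
READING = re.compile(r'^(\s+"([^"]*)")(.*)$')


def add_semantic_tags(line: str, sem_tags: dict[str, list[str]]) -> str:
    """Append the tags listed for a reading's baseform; leave other lines alone."""
    m = READING.match(line)
    if m is None:
        return line
    head, baseform, rest = m.groups()
    extra = sem_tags.get(baseform, [])
    if not extra:
        return line
    return f"{head}{rest} {' '.join(extra)}"
```

With a lookup such as `{"marluk": ["Sem/Hypothetical"]}`, only readings whose baseform is `marluk` gain the extra tag; cohort lines and other readings pass through unchanged.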

snomos added the bug label on Dec 18, 2024
