Erroneous Annotation in Word Alignment #1

twadada · 2020-04-07T00:48:26Z

Hi, I guess the following 4 sentences in 'all/alignment.gr-it.txt' are annotated erroneously:

ndìtimo {##} mi sono vestito {##} 1-1
èstasa {##} arrivai {##} 1-1
klèete {##} si piange {##} 1-1
nditònta {##} essendosi vestito {##} 1-1

Each of these Griko sentences contains only one word, yet the Griko index in the annotation is 1 rather than 0. So it is unclear whether the intended alignment is 0-0, 0-1, 1-0, or 1-1, or whether the annotation is simply wrong.
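A minimal sketch of how such lines could be found automatically, assuming the file format is `source {##} target {##} i-j ...` with zero-based indices and the Griko index first. The function names `parse_line` and `out_of_range_links` are hypothetical and not part of the dataset's tooling:

```python
def parse_line(line):
    """Split 'src {##} tgt {##} i-j ...' into source tokens, target tokens, and index pairs."""
    src, tgt, links = [part.strip() for part in line.split("{##}")]
    pairs = [tuple(int(k) for k in link.split("-")) for link in links.split()]
    return src.split(), tgt.split(), pairs


def out_of_range_links(line):
    """Return the alignment pairs whose indices fall outside either sentence."""
    src, tgt, pairs = parse_line(line)
    return [(i, j) for i, j in pairs
            if not (0 <= i < len(src) and 0 <= j < len(tgt))]


# The Griko side has one token, so index 1 is out of range.
print(out_of_range_links("ndìtimo {##} mi sono vestito {##} 1-1"))  # [(1, 1)]
print(out_of_range_links("èstasa {##} arrivai {##} 0-0"))           # []
```

Running this over every line of the alignment file and printing the lines with a non-empty result should surface any other annotations of this kind.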

antonisa · 2020-04-07T03:51:49Z

You are right.
All four sentences are actually correct translations of each other.
My guess would be that all of these should be one-to-many alignments, so that:
ndìtimo {##} mi sono vestito {##} 0-0 0-1 0-2
èstasa {##} arrivai {##} 0-0
klèete {##} si piange {##} 0-0 0-1
nditònta {##} essendosi vestito {##} 0-0 0-1

Feel free to correct them if you use the data.
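For anyone applying the correction locally, here is a rough sketch that rebuilds a line under the one-to-many guess above: the single Griko word is linked to every Italian token. The helper name `align_single_source_word` is made up for illustration, and the guess itself may not hold for other sentences:

```python
def align_single_source_word(src, tgt):
    """Link the only source word (index 0) to every target token."""
    if len(src.split()) != 1:
        raise ValueError("expected exactly one source word")
    links = " ".join(f"0-{j}" for j in range(len(tgt.split())))
    # Doubled braces in the f-string produce the literal '{##}' separator.
    return f"{src} {{##}} {tgt} {{##}} {links}"


print(align_single_source_word("ndìtimo", "mi sono vestito"))
# ndìtimo {##} mi sono vestito {##} 0-0 0-1 0-2
```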

The alignments were done automatically, of course, so there must be a bug somewhere.
I'll try to take a look, although the rest of the alignments look good to me.

twadada · 2020-04-07T04:18:23Z

Thanks for your quick response!

> The alignments were done automatically, of course, so there must be a bug somewhere.

Oh, were they done automatically? Do you mean they were generated by a word alignment tool such as GIZA++?

I was actually thinking about using the word alignments for model evaluation. The resource paper says this data includes 'gold-standard word-level alignment information for every utterance, including annotated silences', so I thought these alignments were gold annotations. Is using the data for evaluation not recommended? If so, it might be clearer to mention that in the README.

antonisa · 2020-04-07T13:47:46Z

Oh my, I apologize!
I thought that we were in this repo which actually had Giza++ produced alignments... My bad!

You are right, the ones here were first done by Giza++ but then corrected by hand!
These must have slipped through the cracks -- I'll update the file.

twadada · 2020-04-08T00:28:57Z

Thank you so much! So should I omit the 4 sentences, or use the corrections that you just provided?

I also had a look at the other repo with POS tags, but the paper says the orthography is different from the data in this repo, meaning they have a different vocab I assume. So basically it is not ideal to combine them to learn cross-lingual models such as NMT and GIZA++, is that right?

twadada · 2020-04-20T03:54:46Z

Hi, are you planning to update the file? I'm interested in using the dataset, and I would appreciate it if you could tell me whether I can use the current version (with or without the 4 sentences I mentioned above) or should wait for the update. Thanks!

antonisa · 2020-04-20T13:48:27Z

Hi, I fixed the 4 sentences in the file two weeks ago -- sorry I didn't notify you here!
You can use the current version with all sentences.

antonisa · 2020-04-20T13:49:52Z

Regarding this:

> I also had a look at the other repo with POS tags, but the paper says the orthography is different from the data in this repo, meaning they have a different vocab I assume. So basically it is not ideal to combine them to learn cross-lingual models such as NMT and GIZA++, is that right?

The orthography is not wildly different (mostly has to do with accents and the way they were tokenized during pre-processing). I think it might still be beneficial to combine them, that way you'd have more data (just a little bit more noise in them).
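As an illustration only, one way to reduce accent-related mismatches before combining the two datasets would be to strip combining marks from both sides. This is a sketch under the assumption that the differences really are mostly accents; it does not address the tokenization differences, and it has not been checked against the actual files:

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("ndìtimo èstasa klèete nditònta"))
# nditimo estasa kleete nditonta
```

Note that this discards stress information, so it is better suited to building a shared vocabulary for training than to producing text for display.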

twadada · 2020-04-21T04:32:56Z

Yeah, I actually noticed you had updated the file, but I was wondering if there might be a further update, since you said the one-to-many alignments were your guess.

Thank you so much for dealing with this issue so quickly!
