Erroneous Annotation in Word Alignment #1

Open
twadada opened this issue Apr 7, 2020 · 8 comments

@twadada

twadada commented Apr 7, 2020

Hi, I guess the following 4 sentences in 'all/alignment.gr-it.txt' are annotated erroneously:

ndìtimo {##} mi sono vestito {##} 1-1
èstasa {##} arrivai {##} 1-1
klèete {##} si piange {##} 1-1
nditònta {##} essendosi vestito {##} 1-1

Each of these Griko sentences contains only one word, yet the alignment index starts from 1 rather than 0. So it's unclear whether the annotation is meant to be 0-0, 0-1, 1-0, or 1-1, or whether it is simply erroneous.
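
A minimal sketch of how such entries could be detected automatically. It assumes the line layout shown above (source, target, and links separated by '{##}'), whitespace tokenization, and 0-based source-target index pairs; the function names are hypothetical and not part of the repository.

```python
def parse_alignment_line(line):
    """Split 'src {##} tgt {##} i-j ...' into source tokens, target tokens, and index pairs."""
    src, tgt, links = (part.strip() for part in line.split("{##}"))
    pairs = [tuple(int(i) for i in link.split("-")) for link in links.split()]
    return src.split(), tgt.split(), pairs


def out_of_range_links(line):
    """Return the links whose indices do not point at an existing token (assuming 0-based indices)."""
    src_tokens, tgt_tokens, pairs = parse_alignment_line(line)
    return [
        (s, t)
        for s, t in pairs
        if not (0 <= s < len(src_tokens) and 0 <= t < len(tgt_tokens))
    ]


for line in [
    "ndìtimo {##} mi sono vestito {##} 1-1",
    "èstasa {##} arrivai {##} 1-1",
]:
    print(line, "->", out_of_range_links(line))
```

Under these assumptions, every link in a one-word Griko sentence whose source index is 1 gets flagged, which matches the four lines listed above.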

@antonisa
Owner

antonisa commented Apr 7, 2020

You are right.
All four sentences are actually correct translations of each other.
My guess would be that all of these should be one-to-many alignments, so that:
ndìtimo {##} mi sono vestito {##} 0-0 0-1 0-2
èstasa {##} arrivai {##} 0-0
klèete {##} si piange {##} 0-0 0-1
nditònta {##} essendosi vestito {##} 0-0 0-1

Feel free to correct them if you use the data.

The alignments were done automatically, of course, so there must be a bug somewhere.
I'll try to take a look, although the rest of the alignments look good to me.

@twadada
Author

twadada commented Apr 7, 2020

Thanks for your quick response!

> The alignments were done automatically, of course, so there must be a bug somewhere.

Oh was it done automatically? Do you mean it is generated by a word alignment tool such as GIZA++?

I was actually thinking about using the word alignments for model evaluation. The resource paper says this data includes 'gold-standard word-level alignment information for every utterance, including annotated silences', so I thought these alignments were gold annotations. Is it not recommended to use the data for evaluation? If not, maybe it would be clearer to mention that in the README.
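
For context, a minimal sketch of what such an evaluation could look like: it compares a model's predicted links for one sentence against the reference links and reports precision, recall, and alignment error rate (AER). It assumes every reference link counts as a "sure" link, in which case AER reduces to 1 - 2|A∩G| / (|A| + |G|). The function name and the predicted links are hypothetical, and the reference links are the one-to-many guess quoted earlier in this thread.

```python
def alignment_scores(predicted, gold):
    """Precision, recall, and AER for one sentence, treating all gold links as sure links."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = len(predicted) + len(gold)
    aer = 1.0 - 2.0 * overlap / total if total else 0.0
    return precision, recall, aer


# Reference links for 'ndìtimo {##} mi sono vestito' as guessed above (0-0 0-1 0-2).
gold = {(0, 0), (0, 1), (0, 2)}
# Hypothetical model output that misses one link.
predicted = {(0, 1), (0, 2)}

precision, recall, aer = alignment_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} AER={aer:.2f}")
```

A single out-of-range reference link such as 1-1 can never be matched by a valid prediction, so it lowers recall and raises AER regardless of model quality, which is why the four lines matter for evaluation.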

@antonisa
Owner

antonisa commented Apr 7, 2020

Oh my, I apologize!
I thought that we were in this repo which actually had Giza++ produced alignments... My bad!

You are right, the ones here were first done by Giza++ but then corrected by hand!
These must have slipped through the cracks -- I'll update the file.

@twadada
Author

twadada commented Apr 8, 2020

Thank you so much! So should I omit the 4 sentences, or use the corrections that you just provided?

I also had a look at the other repo with POS tags, but the paper says the orthography is different from the data in this repo, meaning they have a different vocab I assume. So basically it is not ideal to combine them to learn cross-lingual models such as NMT and GIZA++, is that right?

@twadada
Author

twadada commented Apr 20, 2020

Hi, are you planning to update the file? I'm interested in using the dataset, and I would appreciate it if you could tell me whether I can use the current version (with or without the 4 sentences I mentioned above) or should wait for the update. Thanks!

@antonisa
Owner

Hi, I fixed the 4 sentences in the file two weeks ago -- sorry I didn't notify you here!
You can use the current version with all sentences.

@antonisa
Owner

Regarding this:

> I also had a look at the other repo with POS tags, but the paper says the orthography is different from the data in this repo, meaning they have a different vocab I assume. So basically it is not ideal to combine them to learn cross-lingual models such as NMT and GIZA++, is that right?

The orthography is not wildly different (the differences mostly concern accents and the way the text was tokenized during pre-processing). I think it might still be beneficial to combine them: you'd have more data, just with a little more noise in it.
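
One way to gauge how much the two vocabularies differ is to compare them with accents removed. The following is only a sketch under the assumption that the accent differences are diacritics that Unicode can decompose; the function names and the second vocabulary are hypothetical, and stripping accents is lossy, so it may not be appropriate for the final training data.

```python
import unicodedata


def strip_accents(token):
    """Remove combining diacritics from a token, e.g. 'ndìtimo' becomes 'nditimo'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def shared_after_normalization(vocab_a, vocab_b):
    """Return the accent-stripped forms that occur in both vocabularies."""
    return {strip_accents(w) for w in vocab_a} & {strip_accents(w) for w in vocab_b}


# Words from this thread versus a hypothetical unaccented spelling of the same words.
vocab_here = {"ndìtimo", "èstasa", "klèete"}
vocab_other = {"nditimo", "estasa"}
print(sorted(shared_after_normalization(vocab_here, vocab_other)))
```

If most types match after this kind of normalization, the two datasets probably share enough vocabulary for combining them to help.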

@twadada
Author

twadada commented Apr 21, 2020

Yeah, I actually noticed that you had updated the file, but I was wondering whether there might be further updates, since you said the one-to-many alignments were your guess.

Thank you so much for dealing with this issue so quickly!


2 participants