Erroneous Annotation in Word Alignment #1
Comments
You are right. Feel free to correct them if you use the data. The alignments were done automatically, of course, so there must be a bug somewhere.
Thanks for your quick response!
Oh, was it done automatically? Do you mean it was generated by a word alignment tool such as GIZA++? I was actually thinking about using the word alignments for model evaluation. The resource paper says this data includes 'gold-standard word-level alignment information for every utterance, including annotated silences', so I thought these alignments were gold annotations. Is using the data for evaluation not recommended? If so, it might be clearer to mention that in the README.
Oh my, I apologize! You are right, the ones here were first done by GIZA++ but then corrected by hand!
Thank you so much! So should I omit the 4 sentences, or use the corrections that you just provided? I also had a look at the other repo with POS tags, but the paper says its orthography differs from the data in this repo, which I assume means the two have different vocabularies. So basically it is not ideal to combine them to train cross-lingual models such as NMT or GIZA++, is that right?
Hi, are you planning to update the file? I'm interested in using the dataset, and I would appreciate it if you could tell me whether I can use the current version (with or without the 4 sentences I mentioned above) or should wait for the update. Thanks!
Hi, I fixed the 4 sentences in the file two weeks ago -- sorry I didn't notify you here!
Regarding combining this data with the POS-tagged repo:
The orthography is not wildly different (the differences mostly have to do with accents and the way the text was tokenized during pre-processing). I think it might still be beneficial to combine them; that way you'd have more data, just with a little more noise in it.
Yeah, I actually noticed you had updated the file, but I was wondering whether there might be a further update, since you said the fix was your guess (one-to-many alignments). Thank you so much for dealing with this issue so quickly!
Hi, I guess the following 4 sentences in 'all/alignment.gr-it.txt' are annotated erroneously:
ndìtimo {##} mi sono vestito {##} 1-1
èstasa {##} arrivai {##} 1-1
klèete {##} si piange {##} 1-1
nditònta {##} essendosi vestito {##} 1-1
Each of these Griko sentences contains only one word, yet the alignment index is 1 rather than 0. So it is unclear whether the annotation is meant to be 0-0, 0-1, 1-0, or 1-1, or whether it is in fact completely erroneous.
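For anyone who wants to scan the rest of the file for the same problem, here is a minimal sketch of a range check. It assumes the three-field `{##}`-separated layout shown above, whitespace tokenization, 0-based indices, and that the first number in each pair refers to the Griko side; none of this is confirmed by the maintainers, and the function names are just illustrative.

```python
def parse_alignment_line(line, sep="{##}"):
    """Split one line into (source tokens, target tokens, index pairs).

    Assumed layout: "<source> {##} <target> {##} i-j i-j ...".
    """
    parts = [p.strip() for p in line.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    src = parts[0].split()
    tgt = parts[1].split()
    pairs = []
    for tok in parts[2].split():
        i, j = tok.split("-")
        pairs.append((int(i), int(j)))
    return src, tgt, pairs


def out_of_range_pairs(src, tgt, pairs):
    """Return the pairs whose indices fall outside the sentences (0-based assumed)."""
    return [
        (i, j)
        for i, j in pairs
        if not (0 <= i < len(src) and 0 <= j < len(tgt))
    ]


if __name__ == "__main__":
    # The four lines reported in this issue.
    reported = [
        "ndìtimo {##} mi sono vestito {##} 1-1",
        "èstasa {##} arrivai {##} 1-1",
        "klèete {##} si piange {##} 1-1",
        "nditònta {##} essendosi vestito {##} 1-1",
    ]
    for line in reported:
        src, tgt, pairs = parse_alignment_line(line)
        bad = out_of_range_pairs(src, tgt, pairs)
        if bad:
            print(f"out-of-range pairs {bad} in: {line}")
```

Under these assumptions all four lines are flagged, because the Griko side has a single token and so index 1 does not exist. The check only catches indices that are out of range; it cannot tell whether an in-range alignment is linguistically correct.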