Some Ideas
- We could get features (for classifiers) by doing MT into several other target languages.
  - Say if we wanted to translate into Spanish, we could run MT (say Joshua?) to get translations of the target sentence into de/fr/it/nl, and extract features from those other translations.
  - The clear other thing to do: run Joshua into the target language and see if that's informative. Maybe we just take that answer, or maybe we include it as a feature. It would probably be a very informative feature.
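A rough sketch of what those cross-language features could look like. `translate` here is only a placeholder for whichever MT backend we end up calling (Joshua, Apertium, an online API), not a real function:

```python
# Placeholder sketch: bag-of-words features from translations of the sentence
# into several pivot languages. `translate(sentence, lang)` stands in for
# whatever MT system we actually use.
def pivot_features(sentence, translate, pivot_langs=("de", "fr", "it", "nl")):
    feats = {}
    for lang in pivot_langs:
        for token in translate(sentence, lang).lower().split():
            feats["%s=%s" % (lang, token)] = 1.0
    return feats
```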
- Not just Joshua -- we could also try Apertium. Would that help? It's easy to experiment with, at least.
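If we do try Apertium, it's just a local command-line pipeline, so something like this should be enough to experiment with (assuming the `apertium` tool and an en-es pair are installed; the exact pair name depends on which packages are available):

```python
# Assumes a locally installed Apertium and an en-es language pair.
import subprocess

def apertium_translate(sentence, pair="en-es"):
    result = subprocess.run(
        ["apertium", pair],
        input=sentence,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```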
- Could we try to get answers for all five target words jointly? This would be a clear application for MRFs: we would try to encourage the five variables to make sense together.
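A minimal sketch of what joint decoding could look like: with only five variables and short candidate lists, brute-force enumeration over all combinations is cheap, so we don't even need proper MRF inference to start. The `unary` and `pairwise` scoring functions are placeholders (e.g. classifier scores and some compatibility measure):

```python
from itertools import product

def joint_decode(candidates, unary, pairwise):
    """candidates: one candidate list per target word.
    unary(i, c): score of candidate c for word i (e.g. from the classifier).
    pairwise(i, j, ci, cj): how well ci and cj "make sense together"."""
    best, best_score = None, float("-inf")
    for assignment in product(*candidates):
        score = sum(unary(i, c) for i, c in enumerate(assignment))
        score += sum(
            pairwise(i, j, assignment[i], assignment[j])
            for i in range(len(assignment))
            for j in range(i + 1, len(assignment))
        )
        if score > best_score:
            best, best_score = assignment, score
    return best
```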
- Fairly important thing to consider: when do we want to just put the one best (for the one-best evaluation), and when do we want to try several options?
- For the classifiers: make sure to do regularization! Try different regularization approaches: the L1 and L2 norms, maybe both.
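If we go with scikit-learn (just an assumption; any toolkit with regularized logistic regression would do), switching between L1 and L2 is a one-argument change, so it's easy to compare both:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_classifier(penalty="l2", C=1.0):
    # liblinear supports both L1 and L2 penalties; C is the inverse
    # regularization strength, so it needs tuning (e.g. by cross-validation).
    return make_pipeline(
        DictVectorizer(),
        LogisticRegression(penalty=penalty, C=C, solver="liblinear"),
    )

# clf = build_classifier(penalty="l1", C=0.5)
# clf.fit(train_feature_dicts, train_labels)   # hypothetical training data
```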
- Use bag-of-words features for the whole sentence as well as surrounding-context features. Do parsing and chunking: be able to decide what the syntactic head is for all the words in the sentence.
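A sketch of the lexical part of that (the feature names are made up; head and chunk features from a parser would be added the same way):

```python
def context_features(tokens, target_index, window=3):
    feats = {}
    for tok in tokens:                              # whole-sentence bag of words
        feats["bow=%s" % tok.lower()] = 1.0
    for offset in range(-window, window + 1):       # positional context window
        if offset == 0:
            continue
        pos = target_index + offset
        if 0 <= pos < len(tokens):
            feats["ctx[%+d]=%s" % (offset, tokens[pos].lower())] = 1.0
    return feats
```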
- How many different classifiers can we use on the same sentence? Could training on a WordNet-tagged corpus help too? This is like another parallel corpus: it maps from English to WordNet senses. That corpus is available here: http://wordnet.princeton.edu/glosstag.shtml
- Another thing to try for getting MT translations: if we don't train Joshua, we could use multiple online MT APIs and take the 'correct' translation by vote. (We may need to do word alignment first? And would this go badly if the majority of the translators perform poorly?)
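The voting step itself is trivial once each system contributes one candidate for the target word (this sketch assumes the word alignment has already been done; ties are broken arbitrarily):

```python
from collections import Counter

def vote(candidates):
    """candidates: one candidate translation per MT system (None if missing)."""
    counts = Counter(c for c in candidates if c)
    return counts.most_common(1)[0][0] if counts else None

# vote(["coche", "coche", "auto"])  ->  "coche"
```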
- If we use an MRF to get several answers simultaneously, we could train "transition probabilities" by aligning the different target-language corpora and learning which ones tend to correlate...
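One cheap way to get those pairwise scores, sketched below: count how often two candidate translations co-occur in the same sentence on the other-language side of the aligned corpora, and use a log-count (later maybe PMI, or learned weights) as the MRF's pairwise potential. `aligned_sentences` is assumed to be an iterable of token lists:

```python
import math
from collections import Counter

def cooccurrence_scores(aligned_sentences):
    pair_counts = Counter()
    for tokens in aligned_sentences:
        vocab = sorted(set(t.lower() for t in tokens))
        for i, a in enumerate(vocab):
            for b in vocab[i + 1:]:
                pair_counts[(a, b)] += 1
    # Simple log-count potential; could be swapped for PMI or learned weights.
    return {pair: math.log(1 + c) for pair, c in pair_counts.items()}
```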
- The COLEUR system used a graphical model. There is nothing fancy about the model, but their idea is the same as solving WSD for one word and its surrounding words at the same time.
- We can lemmatize the target-language text with TreeTagger. There's a Python interface to TreeTagger here: https://github.com/miotto/treetagger-python
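Rough usage, assuming the interface shown in that repo's README (the constructor arguments and output format may differ between versions, so treat this as a guess until we try it):

```python
from treetagger import TreeTagger   # the miotto/treetagger-python wrapper

def lemmatize(sentence, language="spanish"):
    tagger = TreeTagger(language=language)
    # tag() is expected to return (word, POS, lemma) triples.
    return [lemma for word, pos, lemma in tagger.tag(sentence)]
```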
Questions:
- Would it be helpful to use other corpora for training, since Europarl covers just one domain? Do we know where the test sentences come from? What is their domain? If other corpora would be useful, how can we get other translations for building the parallel corpora?
Other ideas: When I was reading the Sentiment Analysis book, someone mentioned doing WSD for sentiment words. Sentiment words sometimes have objective senses, so WSD on them helps reduce false positives. A topic to work on later... Would it hurt or help to include sentiment words?