alexr edited this page Mar 27, 2013 · 9 revisions
  • we could get features (for classifiers) by doing MT into several other target languages.

    • Say we wanted to translate into Spanish: we could run MT (Joshua, say?) to get translations of the sentence into de/fr/it/nl, and extract features from those other translations
    • The other clear thing to do: run Joshua into the target language and see if that's informative. Maybe we just take that answer, or maybe we include it as a feature. It would probably be a very informative feature.
  • Not just Joshua -- we could try Apertium? Would that help? It's easy to experiment with, at least.
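To make the pivot-language idea concrete, here's a minimal sketch of turning translations into several pivot languages into classifier features; the translations and the feature-name scheme are hypothetical stand-ins for real MT output (Joshua, Apertium, ...):

```python
# Bag-of-words features over translations of the same sentence into
# several pivot languages.  The translations below are hypothetical
# stand-ins for real MT output.

def pivot_features(translations):
    """translations: dict of language code -> translated sentence.
    Returns a feature dict with language-prefixed word keys."""
    feats = {}
    for lang, sentence in translations.items():
        for token in sentence.lower().split():
            feats["%s_%s" % (lang, token)] = 1
    return feats

translations = {
    "de": "Er setzte sich auf die Bank",
    "fr": "Il s'est assis sur le banc",
}
feats = pivot_features(translations)
# "de_bank" and "fr_banc" now point the English word "bank" toward
# the bench sense rather than the financial one.
```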

  • Could we try to get answers for all five target words, jointly? This would be a clear application for MRFs: we would try to encourage the five variables to make sense together.
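A toy sketch of what the joint decoding could look like: per-sense classifier scores as unary potentials, pairwise compatibility scores as the MRF edges, and brute-force enumeration for exact inference (cheap with only five variables). All the scores below are invented for illustration:

```python
import itertools

# Joint sense assignment as a tiny MRF: per-word classifier scores are
# the unary potentials, pairwise compatibility scores are the edges.
# With five-ish variables, exact inference by enumeration is cheap.

unary = {
    "bank": {"riverbank": 0.4, "institution": 0.6},
    "interest": {"curiosity": 0.5, "finance": 0.5},
}
pairwise = {   # (w1, w2) -> {(sense1, sense2): compatibility score}
    ("bank", "interest"): {
        ("institution", "finance"): 1.0,
        ("riverbank", "curiosity"): 0.2,
    },
}

def best_joint(unary, pairwise):
    """Brute-force the highest-scoring joint sense assignment."""
    words = sorted(unary)
    best, best_score = None, float("-inf")
    for combo in itertools.product(*(sorted(unary[w]) for w in words)):
        assign = dict(zip(words, combo))
        score = sum(unary[w][assign[w]] for w in words)
        for (w1, w2), table in pairwise.items():
            score += table.get((assign[w1], assign[w2]), 0.0)
        if score > best_score:
            best, best_score = assign, score
    return best
```

Here the edge score pulls "interest" toward its finance sense once "bank" prefers the institution sense, even though the unary scores for "interest" are tied.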

  • Fairly important thing to consider: when do we want to just put the one best (for the one-best evaluation), and when do we want to try several options?
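One possible policy, sketched with made-up scores: always emit the argmax for the one-best evaluation, and also keep any candidate within a margin of the top score for the several-options setting:

```python
# Selection policy sketch: the argmax for one-best, plus every
# candidate scoring within `margin` of the top when several answers
# are allowed.  Scores and margin are invented.

def select(scored, margin=0.1):
    """scored: dict of candidate translation -> probability."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    one_best = ranked[0][0]
    top_score = ranked[0][1]
    shortlist = [c for c, s in ranked if top_score - s <= margin]
    return one_best, shortlist

scores = {"banco": 0.45, "orilla": 0.40, "ribera": 0.15}
one_best, shortlist = select(scores)
```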

  • For the classifiers...

    • make sure to do regularization! Try different regularization approaches: L1 and L2 norm?
    • Try using MegaM as well as the built-in NLTK maxent.
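To illustrate what the regularization buys us, here's a pure-Python toy logistic-regression trainer with an L2 penalty (an L1 penalty would need a subgradient or proximal step instead); in practice we'd use NLTK's maxent or MegaM rather than this loop:

```python
import math

# Toy logistic-regression trainer with an L2 penalty, to show the
# regularization mechanics.  lam > 0 shrinks the weights toward zero.

def train_logreg(data, n_feats, lam=0.1, lr=0.5, iters=200):
    """data: list of (feature_vector, label) pairs, label in {0, 1}."""
    w = [0.0] * n_feats
    for _ in range(iters):
        # gradient of the (lam/2)*||w||^2 penalty term
        grad = [lam * wi for wi in w]
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(n_feats):
                grad[j] += (p - y) * x[j]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Trivial toy data: feature 0 signals class 1, feature 1 signals class 0.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 5
w = train_logreg(data, 2)
```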
  • Use bag-of-words features for the whole sentence as well as surrounding-context features.
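A sketch of combining the two feature families for a target word at index i (the feature-name scheme is made up):

```python
# Whole-sentence bag-of-words features plus positional
# surrounding-context features for the target word at index i.

def extract_features(tokens, i, window=2):
    feats = {}
    for tok in tokens:                       # sentence-level bag of words
        feats["bow_%s" % tok.lower()] = 1
    for off in range(-window, window + 1):   # positional context window
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats["ctx%+d_%s" % (off, tokens[j].lower())] = 1
    return feats

toks = "He sat on the bank of the river".split()
feats = extract_features(toks, 4)            # 4 = index of "bank"
```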

  • Do parsing and chunking: be able to decide what the syntactic head is for all the words in the sentence.
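Until a real parser/chunker is wired in, here's a rough stand-in: a tiny heuristic base-NP chunker over POS tags, with the rightmost noun taken as the chunk head:

```python
# Heuristic base-NP chunker: greedily collects spans matching the
# classic DT? JJ* NN+ pattern, then takes the rightmost noun as the
# head.  A real chunker or parser would replace this.

def np_chunks(tagged):
    """tagged: list of (word, POS) pairs; returns a list of NP chunks."""
    chunks, cur = [], []
    for word, tag in tagged:
        noun = tag.startswith("NN")
        adjish = tag == "DT" or tag.startswith("JJ")
        # extend the current chunk with nouns, or with DT/JJ while
        # no noun has been seen yet in this chunk
        extends = bool(cur) and (
            noun or (adjish and not any(t.startswith("NN") for _, t in cur)))
        if extends:
            cur.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in cur):
                chunks.append(cur)   # close a completed NP
            cur = [(word, tag)] if (noun or adjish) else []
    if any(t.startswith("NN") for _, t in cur):
        chunks.append(cur)
    return chunks

def head(chunk):
    """Rightmost noun of a base NP as its syntactic head."""
    return [w for w, t in chunk if t.startswith("NN")][-1]

tagged = [("He", "PRP"), ("sat", "VBD"), ("on", "IN"),
          ("the", "DT"), ("muddy", "JJ"), ("river", "NN"), ("bank", "NN")]
chunks = np_chunks(tagged)
```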

  • How many different classifiers can we use on the same sentence? Could training on a WordNet-tagged corpus help too? This is like another parallel corpus: it maps from English to WordNet senses. That corpus is available here: http://wordnet.princeton.edu/glosstag.shtml

  • Another thing to try for getting MT translations: if we don't train Joshua, we could use multiple online MT APIs and take the 'correct' translation by vote. (Maybe we need to do word alignment first? And would this go badly if the majority of the translators perform poorly?)
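The voting step might look like this; the candidate translations are hypothetical stand-ins for word-aligned MT-API output, and weighting votes by system quality is one way to soften the bad-majority worry:

```python
from collections import Counter

# Majority vote over the translations of the target word produced by
# several MT systems (assumed already word-aligned to the target).
# Weighted votes let a known-good system outvote several weak ones.

def vote(candidates, weights=None):
    """candidates: one translation per MT system."""
    if weights is None:
        weights = [1.0] * len(candidates)
    tally = Counter()
    for cand, w in zip(candidates, weights):
        tally[cand] += w
    return tally.most_common(1)[0][0]

winner = vote(["banco", "banco", "orilla"])        # unweighted majority
weighted = vote(["banco", "orilla"], [1.0, 2.0])   # quality-weighted
```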

  • If we use an MRF to get several answers simultaneously, we could train "transition probabilities" by aligning the different target-language corpora and learning which ones tend to correlate...
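A sketch of estimating those compatibility scores from counts over aligned data (the observed pairs below are made up):

```python
import math
from collections import Counter

# Turn counts of jointly observed translation choices (from aligned
# target-language corpora) into smoothed log-probability scores that
# could serve as the MRF edge potentials.

def compatibility(aligned_pairs, smoothing=1.0):
    """aligned_pairs: list of (choice_for_w1, choice_for_w2) pairs.
    Returns a dict pair -> log P(pair), with add-k smoothing."""
    counts = Counter(aligned_pairs)
    total = sum(counts.values()) + smoothing * len(counts)
    return {pair: math.log((c + smoothing) / total)
            for pair, c in counts.items()}

pairs = ([("institution", "finance")] * 8
         + [("riverbank", "curiosity")] * 2)
scores = compatibility(pairs)
```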

  • The COLEUR system used a graphical model. There's nothing fancy about the model, but their idea is the same: solve WSD for one word and its surrounding words at the same time.

  • We can lemmatize the target language text with TreeTagger. There's a Python interface to TreeTagger here: https://github.com/miotto/treetagger-python


Questions:

  • Would it be helpful to use other corpora for training, since Europarl is just one domain? Do we know where the test sentences come from, and what their domain is? If other corpora would be useful, how can we get other translations for building the parallel corpora?

Other ideas: when I was reading a sentiment analysis book, someone mentioned WSD for sentiment words. Sentiment words sometimes have objective senses, so WSD for them helps reduce false positives. A topic to work on later... Would it hurt or help to include sentiment words?


Things that we've done


Things that we could maybe do later

We could easily build up more L1 classifiers from any sense-tagged corpus in our source language.

For example, the sense-tagged corpus distributed with WordNet: http://wordnet.princeton.edu/glosstag.shtml

or the new Google/Amherst wikipedia-tagged corpus, WikiLinks: http://googleresearch.blogspot.com/2013/03/learning-from-big-data-40-million.html

We could even use the output of other monolingual WSD systems as L2 features. Any monolingual WSD system, in fact.
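As a stand-in for a real monolingual WSD system, here's a tiny simplified-Lesk scorer whose predicted sense becomes an L2 feature; the glosses and sense keys are abridged illustrations, not from any particular sense inventory:

```python
# Simplified-Lesk stand-in for an external monolingual WSD system:
# pick the sense whose gloss shares the most words with the sentence,
# then expose that prediction as a single classifier feature.

GLOSSES = {   # abridged, illustrative glosses
    "bank%institution": "a financial organization that accepts deposits",
    "bank%riverside": "sloping land beside a body of water river",
}

def lesk_sense(context_tokens, glosses):
    """Sense whose gloss has the largest word overlap with the context."""
    ctx = set(t.lower() for t in context_tokens)
    return max(glosses, key=lambda s: len(ctx & set(glosses[s].split())))

def wsd_feature(context_tokens, glosses):
    """The external system's answer, packaged as an L2 feature."""
    return {"mono_wsd=%s" % lesk_sense(context_tokens, glosses): 1}

toks = "He sat on the bank of the river".split()
feats = wsd_feature(toks, GLOSSES)
```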

We could do active learning to get labels for the same sentence in multiple target languages: ask a human "here's a sentence in English, what's the best translation into target languages X and Y?" for the ones we're unsure about. This would be really easy to crowdsource.

This would even be pretty easy to crowdsource for Spanish-to-Guarani, if we have people handy who speak Spanish and any other language... (gn, en...)


Things that are probably bad ideas