alexr edited this page Mar 27, 2013 · 9 revisions
  • we could get features (for classifiers) by doing MT into several other target languages.

    • Say we wanted to translate into Spanish: we could run MT (Joshua, say?) to get translations of the sentence into de/fr/it/nl, and extract features from those other translations
    • The other clear thing to do: run Joshua into the target language and see if that's informative. Maybe we just take that answer, or maybe we include it as a feature. It would probably be a very informative feature.
  • Not just Joshua -- we could try Apertium? Would that help? It's easy to experiment with, at least.
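To make the pivot-language idea concrete, here's a minimal sketch of turning translations into several pivot languages into classifier features; the translations and the feature-name scheme are hypothetical stand-ins for real MT output (Joshua, Apertium, ...):

```python
# Bag-of-words features over translations of the same sentence into
# several pivot languages.  The translations below are hypothetical
# stand-ins for real MT output.

def pivot_features(translations):
    """translations: dict of language code -> translated sentence.
    Returns a feature dict with language-prefixed word keys."""
    feats = {}
    for lang, sentence in translations.items():
        for token in sentence.lower().split():
            feats["%s_%s" % (lang, token)] = 1
    return feats

translations = {
    "de": "Er setzte sich auf die Bank",
    "fr": "Il s'est assis sur le banc",
}
feats = pivot_features(translations)
# "de_bank" and "fr_banc" now point the English word "bank" toward
# the bench sense rather than the financial one.
```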

  • Could we try to get answers for all five target words, jointly? This would be a clear application for MRFs: we would try to encourage the five variables to make sense together.
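A toy sketch of what the joint decoding could look like: per-sense classifier scores as unary potentials, pairwise compatibility scores as the MRF edges, and brute-force enumeration for exact inference (cheap with only five variables). All the scores below are invented for illustration:

```python
import itertools

# Joint sense assignment as a tiny MRF: per-word classifier scores are
# the unary potentials, pairwise compatibility scores are the edges.
# With five-ish variables, exact inference by enumeration is cheap.

unary = {
    "bank": {"riverbank": 0.4, "institution": 0.6},
    "interest": {"curiosity": 0.5, "finance": 0.5},
}
pairwise = {   # (w1, w2) -> {(sense1, sense2): compatibility score}
    ("bank", "interest"): {
        ("institution", "finance"): 1.0,
        ("riverbank", "curiosity"): 0.2,
    },
}

def best_joint(unary, pairwise):
    """Brute-force the highest-scoring joint sense assignment."""
    words = sorted(unary)
    best, best_score = None, float("-inf")
    for combo in itertools.product(*(sorted(unary[w]) for w in words)):
        assign = dict(zip(words, combo))
        score = sum(unary[w][assign[w]] for w in words)
        for (w1, w2), table in pairwise.items():
            score += table.get((assign[w1], assign[w2]), 0.0)
        if score > best_score:
            best, best_score = assign, score
    return best
```

Here the edge score pulls "interest" toward its finance sense once "bank" prefers the institution sense, even though the unary scores for "interest" are tied.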

  • Fairly important thing to consider: when do we want to just put the one best (for the one-best evaluation), and when do we want to try several options?
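One possible policy, sketched with made-up scores: always emit the argmax for the one-best evaluation, and also keep any candidate within a margin of the top score for the several-options setting:

```python
# Selection policy sketch: the argmax for one-best, plus every
# candidate scoring within `margin` of the top when several answers
# are allowed.  Scores and margin are invented.

def select(scored, margin=0.1):
    """scored: dict of candidate translation -> probability."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    one_best = ranked[0][0]
    top_score = ranked[0][1]
    shortlist = [c for c, s in ranked if top_score - s <= margin]
    return one_best, shortlist

scores = {"banco": 0.45, "orilla": 0.40, "ribera": 0.15}
one_best, shortlist = select(scores)
```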

  • For the classifiers...

    • make sure to do regularization! Try different regularization approaches: L1 and L2 norm?
    • Try using MegaM as well as the built-in NLTK maxent.
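To illustrate what the regularization buys us, here's a pure-Python toy logistic-regression trainer with an L2 penalty (an L1 penalty would need a subgradient or proximal step instead); in practice we'd use NLTK's maxent or MegaM rather than this loop:

```python
import math

# Toy logistic-regression trainer with an L2 penalty, to show the
# regularization mechanics.  lam > 0 shrinks the weights toward zero.

def train_logreg(data, n_feats, lam=0.1, lr=0.5, iters=200):
    """data: list of (feature_vector, label) pairs, label in {0, 1}."""
    w = [0.0] * n_feats
    for _ in range(iters):
        # gradient of the (lam/2)*||w||^2 penalty term
        grad = [lam * wi for wi in w]
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(n_feats):
                grad[j] += (p - y) * x[j]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Trivial toy data: feature 0 signals class 1, feature 1 signals class 0.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 5
w = train_logreg(data, 2)
```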
  • Use bag-of-words features for the whole sentence as well as surrounding-context features.
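A sketch of combining the two feature families for a target word at index i (the feature-name scheme is made up):

```python
# Whole-sentence bag-of-words features plus positional
# surrounding-context features for the target word at index i.

def extract_features(tokens, i, window=2):
    feats = {}
    for tok in tokens:                       # sentence-level bag of words
        feats["bow_%s" % tok.lower()] = 1
    for off in range(-window, window + 1):   # positional context window
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats["ctx%+d_%s" % (off, tokens[j].lower())] = 1
    return feats

toks = "He sat on the bank of the river".split()
feats = extract_features(toks, 4)            # 4 = index of "bank"
```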

  • Do parsing and chunking: be able to decide what the syntactic head is for all the words in the sentence.
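Until a real parser/chunker is wired in, here's a rough stand-in: a tiny heuristic base-NP chunker over POS tags, with the rightmost noun taken as the chunk head:

```python
# Heuristic base-NP chunker: greedily collects spans matching the
# classic DT? JJ* NN+ pattern, then takes the rightmost noun as the
# head.  A real chunker or parser would replace this.

def np_chunks(tagged):
    """tagged: list of (word, POS) pairs; returns a list of NP chunks."""
    chunks, cur = [], []
    for word, tag in tagged:
        noun = tag.startswith("NN")
        adjish = tag == "DT" or tag.startswith("JJ")
        # extend the current chunk with nouns, or with DT/JJ while
        # no noun has been seen yet in this chunk
        extends = bool(cur) and (
            noun or (adjish and not any(t.startswith("NN") for _, t in cur)))
        if extends:
            cur.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in cur):
                chunks.append(cur)   # close a completed NP
            cur = [(word, tag)] if (noun or adjish) else []
    if any(t.startswith("NN") for _, t in cur):
        chunks.append(cur)
    return chunks

def head(chunk):
    """Rightmost noun of a base NP as its syntactic head."""
    return [w for w, t in chunk if t.startswith("NN")][-1]

tagged = [("He", "PRP"), ("sat", "VBD"), ("on", "IN"),
          ("the", "DT"), ("muddy", "JJ"), ("river", "NN"), ("bank", "NN")]
chunks = np_chunks(tagged)
```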

  • How many different classifiers can we use on the same sentence? Could training on a WordNet-tagged corpus help too? This is like another parallel corpus: it maps from English to WordNet senses. That corpus is available here: http://wordnet.princeton.edu/glosstag.shtml

  • Another thing to try for getting MT translations: if we don't train Joshua, we could use multiple online MT APIs and take the 'correct' translation by vote. (Maybe we need to do word alignment first? And would this go badly if the majority of the translators perform poorly?)
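The voting step might look like this; the candidate translations are hypothetical stand-ins for word-aligned MT-API output, and weighting votes by system quality is one way to soften the bad-majority worry:

```python
from collections import Counter

# Majority vote over the translations of the target word produced by
# several MT systems (assumed already word-aligned to the target).
# Weighted votes let a known-good system outvote several weak ones.

def vote(candidates, weights=None):
    """candidates: one translation per MT system."""
    if weights is None:
        weights = [1.0] * len(candidates)
    tally = Counter()
    for cand, w in zip(candidates, weights):
        tally[cand] += w
    return tally.most_common(1)[0][0]

winner = vote(["banco", "banco", "orilla"])        # unweighted majority
weighted = vote(["banco", "orilla"], [1.0, 2.0])   # quality-weighted
```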

  • If we use an MRF to get several answers simultaneously, we could train "transition probabilities" by aligning the different target-language corpora and learning which ones tend to correlate...
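A sketch of estimating those compatibility scores from counts over aligned data (the observed pairs below are made up):

```python
import math
from collections import Counter

# Turn counts of jointly observed translation choices (from aligned
# target-language corpora) into smoothed log-probability scores that
# could serve as the MRF edge potentials.

def compatibility(aligned_pairs, smoothing=1.0):
    """aligned_pairs: list of (choice_for_w1, choice_for_w2) pairs.
    Returns a dict pair -> log P(pair), with add-k smoothing."""
    counts = Counter(aligned_pairs)
    total = sum(counts.values()) + smoothing * len(counts)
    return {pair: math.log((c + smoothing) / total)
            for pair, c in counts.items()}

pairs = ([("institution", "finance")] * 8
         + [("riverbank", "curiosity")] * 2)
scores = compatibility(pairs)
```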

  • The COLEUR system used a graphical model. There's nothing fancy about the model, but their idea is the same: solve WSD for one word and its surrounding words at the same time.

  • We can lemmatize the target language text with TreeTagger. There's a Python interface to TreeTagger here: https://github.com/miotto/treetagger-python


Questions:

  • Would it be helpful to use other corpora for training, since Europarl is just one domain? Do we know where the test sentences come from, and what their domain is? If other corpora would be useful, how can we get other translations for building the parallel corpora?

Other ideas: when I was reading a sentiment analysis book, someone mentioned WSD for sentiment words. Sentiment words sometimes have objective senses, so WSD for them helps reduce false positives. A topic to work on later... Would it hurt or help to include sentiment words?


Things that we've done


Things that we could maybe do later

We could easily build up more L1 classifiers from any sense-tagged corpus in our source language.

For example, the sense-tagged corpus distributed with WordNet: http://wordnet.princeton.edu/glosstag.shtml

or the new Google/Amherst wikipedia-tagged corpus, WikiLinks: http://googleresearch.blogspot.com/2013/03/learning-from-big-data-40-million.html

We could even use the output of other monolingual WSD systems as L2 features. Any monolingual WSD system, in fact.
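As a stand-in for a real monolingual WSD system, here's a tiny simplified-Lesk scorer whose predicted sense becomes an L2 feature; the glosses and sense keys are abridged illustrations, not from any particular sense inventory:

```python
# Simplified-Lesk stand-in for an external monolingual WSD system:
# pick the sense whose gloss shares the most words with the sentence,
# then expose that prediction as a single classifier feature.

GLOSSES = {   # abridged, illustrative glosses
    "bank%institution": "a financial organization that accepts deposits",
    "bank%riverside": "sloping land beside a body of water river",
}

def lesk_sense(context_tokens, glosses):
    """Sense whose gloss has the largest word overlap with the context."""
    ctx = set(t.lower() for t in context_tokens)
    return max(glosses, key=lambda s: len(ctx & set(glosses[s].split())))

def wsd_feature(context_tokens, glosses):
    """The external system's answer, packaged as an L2 feature."""
    return {"mono_wsd=%s" % lesk_sense(context_tokens, glosses): 1}

toks = "He sat on the bank of the river".split()
feats = wsd_feature(toks, GLOSSES)
```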

We could do active learning to get labels for the same sentence in multiple target languages: ask a human "here's a sentence in English, what's the best translation into target languages X and Y?" for the ones we're unsure about. This would be really easy to crowdsource.

This would even be pretty easy to crowdsource for Spanish-to-Guarani, if we have people handy who speak Spanish and any other language... (gn, en...)


Things that are probably bad ideas