Project title

This project explores unsupervised sense clustering of translations, a technique from machine translation that is used to improve or evaluate word translations.

Motivation, method, hypotheses

The initial problem is that a single L1 word often has several possible translations in L2, depending on its context. The translation has to be picked carefully from these candidates so that it matches the context.

This problem is not restricted to machine translation. L2 language learners cannot have the same deep understanding of and intuition for the language as an L1 speaker. They will often simply translate a word from their native language into the target language, not knowing that the result sounds strange to a native speaker (because a different one of the possible translations should have been used in that particular context).

From humans to machines: Knowing that humans make such mistakes quite often, one would not expect a statistical machine translation model to handle such fine-grained distinctions out of the box.

Method: soft clustering

  • Monolingual features: PMI of a word with its neighbor
  • Bilingual features: PMI of a word and its translations

The evaluation will be a combination of manual evaluation and comparison to WordNet sense clusters.
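
The PMI features named in the method can be illustrated with a small sketch. It uses the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from raw pair counts. The function names and the toy pairs are assumptions made for illustration; they are not taken from the project code.

```python
import math
from collections import Counter


def neighbor_pairs(tokens):
    """Monolingual pairs: each word together with its right-hand neighbor."""
    return list(zip(tokens, tokens[1:]))


def pmi_table(pairs):
    """PMI for every observed (x, y) pair, estimated from relative frequencies."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    # p(x, y) / (p(x) * p(y)) simplifies to n * total / (count(x) * count(y)).
    return {
        (x, y): math.log2(n * total / (x_counts[x] * y_counts[y]))
        for (x, y), n in pair_counts.items()
    }


# Hypothetical (word, co-occurring item) pairs. For the bilingual feature the
# second element would be a translation instead of a neighbor.
toy_pairs = [("bank", "money"), ("bank", "money"), ("bank", "river"), ("shore", "river")]
table = pmi_table(toy_pairs)
for pair, score in sorted(table.items()):
    print(pair, round(score, 3))
```

The same `pmi_table` function serves both feature types; only the way the pairs are collected differs.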

Task Adjustment

After some problems with the initial project, the task was reduced to a more rudimentary form. The adjusted task was to separate two concepts via hard clustering (instead of soft clustering), e.g. K-Means, where each concept is realized as a list of closely related words (e.g. a list of clothing items vs. a list of body parts). A bag-of-words (BOW) approach was used for the features. Both the monolingual and the bilingual features were still tested. For the monolingual features, a BOW over a window of 5 tokens was used. For the bilingual features, a BOW over the whole aligned sentence was used, because no word alignment was available. I would argue, however, that the difference between a 5-token window BOW and a whole-sentence BOW is not significant for the corpus that was used, the "OpenSubtitles" corpus. Since it contains transcriptions from movies and series, its sentences are rather short (at least shorter than in corpora of written text).
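
The adjusted setup can be sketched roughly as follows. This is not the code from "BOWApproach.py": the function names, the toy sentences, the reading of "window of 5 tokens" as 5 tokens on each side, and the tiny K-Means seeded with the first k points are all assumptions made to keep the example self-contained.

```python
from collections import Counter


def window_bow(sentences, target, window=5):
    """Count tokens up to `window` positions left and right of `target`."""
    bag = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                bag.update(tokens[max(0, i - window):i])
                bag.update(tokens[i + 1:i + 1 + window])
    return bag


def aligned_sentence_bow(sentence_pairs, target):
    """Count all tokens of the aligned L2 sentence when `target` is in the L1 sentence."""
    bag = Counter()
    for l1_tokens, l2_tokens in sentence_pairs:
        if target in l1_tokens:
            bag.update(l2_tokens)
    return bag


def kmeans(points, k=2, iterations=20):
    """Plain Lloyd iterations, seeded with the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy monolingual corpus (invented for illustration).
sentences = [s.split() for s in [
    "we eat a fresh apple today",
    "we eat a fresh banana",
    "you wear a new shirt today",
    "you wear a new jacket",
]]
words = ["apple", "shirt", "banana", "jacket"]
bags = {w: window_bow(sentences, w) for w in words}
vocab = sorted(set().union(*bags.values()))
vectors = [[bags[w][t] for t in vocab] for w in words]
labels = dict(zip(words, kmeans(vectors, k=2)))
print(labels)

# Toy aligned pair (invented): the whole L2 sentence becomes the context.
pairs = [("we eat an apple".split(), "wir essen einen Apfel".split())]
print(aligned_sentence_bow(pairs, "apple"))
```

On short sentences like these, the 5-token window already covers the whole sentence, which is the intuition behind the claim about subtitle data above.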

Results

Surprisingly, the bilingual features on their own performed almost as well as the monolingual features, even though the subset of the corpus used for the bilingual features was only about 100 MB, while the part used for the monolingual features was about 2 GB. A combination of both did not yield significantly better results.

The two best concepts, which were clustered perfectly into two groups, were a list of fruits and a list of words related to living space:

Fruits: 'apple', 'banana', 'oranges', 'watermelons', 'strawberries', 'grape', 'peach', 'cherry', 'pear', 'plum', 'melon', 'lemon', 'coconut', 'lime',

Home space: 'office', 'home', 'building', 'house', 'apartment', 'city', 'town', 'village',

The other concepts were clothing, transportation, cities, furniture, relatives, and body parts. None of them was separated perfectly by the clustering; the results ranged from reasonably good separation to essentially random cluster assignments.
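
The separation quality was judged by inspection in this project. As an illustration only, one common way to put a number on it is cluster purity, sketched below. The `purity` function and the example assignments are assumptions and were not part of the original evaluation.

```python
from collections import Counter


def purity(cluster_ids, concept_labels):
    """Fraction of words that belong to the majority concept of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, concept_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical gold concepts for four words and two possible clusterings.
concepts = ["fruit", "fruit", "home", "home"]
perfect = purity([0, 0, 1, 1], concepts)  # each cluster holds one concept
mixed = purity([0, 1, 0, 1], concepts)    # each cluster is half fruit, half home
print(perfect, mixed)
```

With two clusters and two equally sized concepts, purity cannot fall below 0.5, so values close to 0.5 correspond to the "random" end of the range described above.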

How to run:

The file to run is "BOWApproach.py". The paths in the call to the "readFile" method and in the call to the "readAlignedCorpus" method need to be adjusted. Each of the paths is expected to point to a file (not to a directory containing files). The data I used (a small subset of the OpenSubtitles corpus) can be found here: https://drive.google.com/open?id=14udNeygNpYXzsA0lLnOOoI4FbM6pmyhO

In order to test other concepts, the lists of strings in the above-mentioned methods need to be replaced with the target lists. Make sure that the lists are the same in both methods. The result is printed to the console and also shown as a plot.

Relevant literature

A short list of literature (articles/books/blog posts/...). We will pick some of the listed papers for further class discussion.

Mohit Bansal, John DeNero, Dekang Lin. Unsupervised Translation Sense Clustering.
Michael Denkowski. A Survey of Techniques for Unsupervised Word Sense Induction. (Chapter 5 is translation related)
Marianna Apidianaki, Yifan He. An algorithm for cross-lingual sense-clustering tested in a MT evaluation setting. International Workshop on Spoken Language Translation (IWSLT-2010), Dec 2010, Paris, France. pp. 219–226.

Available data, tools, resources

Data for dictionaries: Bilingual Dictionaries for Offline Use (as an alternative to the handcrafted dictionaries described in the Unsupervised Translation Sense Clustering paper)

dictUtil.py extracts a plain dictionary from the above source, which also contains POS tags, descriptions, use cases and other metadata.
Example item from en-de dict:
permanent --> {'permanent', 'unbefristet', 'Dauerwelle', 'beständig', 'dauerhaft', 'ständig', 'Permanente'}

Corpora from: http://opus.nlpl.eu/

Project members

  • Johannes (Joapfel)