
2017, ACL, Data Augmentation for Low-Resource Neural Machine Translation #85

Open
Sepideh-Ahmadian opened this issue Sep 26, 2024 · 2 comments
Labels
literature-review Summary of the paper related to the work

Comments

@Sepideh-Ahmadian (Member)

Paper
Data Augmentation for Low-Resource Neural Machine Translation

Introduction
This research focuses on the challenges faced by low-resource languages in the neural machine translation (NMT) domain. NMT, as introduced by Bahdanau et al. (2015), Sutskever et al. (2014), and Cho et al. (2014), follows a sequence-to-sequence architecture. In this architecture, the encoder processes the source language and creates a representation of it, while the decoder—using the same structure (hidden LSTM states and attention mechanisms)—generates the target language. Training such a network requires various words to appear in diverse contexts.
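The attention step in this encoder-decoder architecture can be illustrated with a short NumPy sketch. This is a generic dot-product (Luong-style) attention over encoder hidden states, not the exact model from any of the cited papers; all shapes and values are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score each source position against the
    current decoder state, normalize to a distribution, and return the
    weighted context vector over the encoder states."""
    scores = encoder_states @ decoder_state   # (src_len,)
    weights = softmax(scores)                 # attention distribution
    context = weights @ encoder_states        # (hidden_dim,)
    return weights, context

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))  # 5 source positions, hidden dim 8
dec = rng.normal(size=8)       # current decoder hidden state
w, c = attention_context(dec, enc)
print(w.sum())  # ~1.0: weights form a distribution over source positions
```

The decoder combines this context vector with its own hidden state to predict the next target word.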

Main Problem
The quality of NMT heavily depends on the size of the training data. To reliably estimate the model's parameters, many sentence pairs containing words in varied contexts are necessary. However, data annotation is time-consuming, so data augmentation is a viable alternative.

Illustrative Example
English: I had been told that you would [not/voluntarily] be speaking today.
German: Mir wurde signalisiert, Sie würden heute [nicht/freiwillig] sprechen.

Input:
A sentence in the source language.

Output:
A sentence in the target language that is grammatically correct and fluent (the authors evaluate translation quality with BLEU) but not necessarily identical in meaning.

Motivation
The paper proposes simple data augmentation methods (inspired by those used in computer vision) to improve neural machine translation. The methods rely on a weak notion of label preservation and focus on low-frequency words: they synthetically create new sentence contexts for rare words, giving the model a foundation for generating those words in varied contexts during translation.
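Since the review's "Proposed Methods" section is not included, here is only a toy illustration of the general idea of rare-word substitution in parallel data, applied to the example above. The substitution table and word alignment are hypothetical stand-ins; the paper itself selects plausible substitutions rather than using a fixed table.

```python
import random

def augment_pair(src_tokens, tgt_tokens, alignment, substitutions, rng):
    """Toy targeted substitution: pick one source position whose word has
    a rare-word replacement, and swap both sides consistently using the
    word alignment so the pair stays (weakly) meaning-preserving."""
    candidates = [i for i, w in enumerate(src_tokens) if w in substitutions]
    if not candidates:
        return None
    i = rng.choice(candidates)
    new_src, new_tgt = list(src_tokens), list(tgt_tokens)
    rare_src, rare_tgt = substitutions[src_tokens[i]]
    new_src[i] = rare_src
    new_tgt[alignment[i]] = rare_tgt
    return new_src, new_tgt

# Hypothetical substitution table and source->target alignment.
subs = {"not": ("voluntarily", "freiwillig")}
src = "I had been told that you would not be speaking today".split()
tgt = "Mir wurde signalisiert Sie würden heute nicht sprechen".split()
align = {7: 6}  # src position of "not" -> tgt position of "nicht"
aug = augment_pair(src, tgt, align, subs, random.Random(0))
print(" ".join(aug[0]))
# I had been told that you would voluntarily be speaking today
```

The augmented pair is added to the training bitext alongside the original, so the rare word is observed in a new context on both sides.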
Related Works and Their Gaps
Sennrich et al. (2016a) proposed back-translating sentences from monolingual data and augmenting the bitext with the resulting pseudo-parallel corpus.
The authors of the present paper assert that their proposed method performs better.
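Back-translation as described above can be sketched in a few lines: target-side monolingual sentences are translated back into the source language by a reverse (target-to-source) model, and each synthetic source is paired with its real target. The reverse model here is a hypothetical dictionary stand-in for a trained system.

```python
def back_translate(monolingual_tgt, reverse_model):
    """Back-translation (Sennrich et al., 2016a): pair each real
    target-side sentence with a synthetic source produced by a
    reverse (target->source) translation model."""
    return [(reverse_model(t), t) for t in monolingual_tgt]

# Hypothetical stand-in for a trained German->English model.
toy_reverse = {"Guten Morgen": "Good morning", "Danke": "Thank you"}
pseudo_parallel = back_translate(["Guten Morgen", "Danke"], toy_reverse.get)
print(pseudo_parallel[0])  # ('Good morning', 'Guten Morgen')
```

The pseudo-parallel pairs are then mixed into the training bitext.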

Contribution of This Paper
Compared to paraphrasing, their method introduces more information into the model. They claim their method outperforms back-translation.

Proposed Methods
Not included

Experiments
Model: A 4-layer attention-based encoder-decoder model (Luong et al., 2015), with 2-layer LSTMs running in the forward and backward directions.
Dataset: WMT’15

Implementation
Not mentioned

Gaps in This Work
This article addresses neural machine translation and the challenges of handling low-resource languages. However, it has some limitations:
- The method is limited to rare words. Its practicality depends on whether it can be applied to real-world data, such as posts on social media platforms like Twitter and Instagram.
- Since rare words are used infrequently, the method does not capture the complexity of common words, which may carry different meanings in different contexts (or be used as slang on social media).
- Is the method applicable across language families? The approach may yield better results for languages within the same family than for languages from different families.

Sepideh-Ahmadian self-assigned this Sep 26, 2024
Sepideh-Ahmadian added the literature-review label Sep 26, 2024
@hosseinfani (Member)

Hi @Sepideh-Ahmadian
How did they define "rare" words and find them?

@Sepideh-Ahmadian (Member, Author)

Hello @hosseinfani,
Their definition of rare words is based on Marton et al., "Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases." That work focused on machine translation, motivated by the challenge of translating low-frequency words.
Marton et al. proposed enhancing the dataset with paraphrases. To identify rare words, they construct "monolingual distributional profiles (DPs)", also known as word association profiles or co-occurrence vectors, for out-of-vocabulary words and phrases in the source language. They then generate paraphrase candidates from phrases that co-occur in similar contexts and assign similarity scores to these candidates.
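The distributional-profile construction can be sketched as follows: count the neighbors of a word within a context window across a corpus, then compare profiles with cosine similarity to find phrases that occur in similar contexts. The window size, corpus, and scoring here are illustrative assumptions, not Marton et al.'s exact setup.

```python
from collections import Counter
import math

def distributional_profile(word, corpus, window=2):
    """Co-occurrence vector (distributional profile) for `word`:
    counts of neighboring tokens within `window` positions."""
    profile = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        profile[sent[j]] += 1
    return profile

def cosine(p, q):
    # Cosine similarity between two sparse count vectors.
    dot = sum(p[k] * q.get(k, 0) for k in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the mat",
]]
p_cat = distributional_profile("cat", corpus)
p_dog = distributional_profile("dog", corpus)
print(round(cosine(p_cat, p_dog), 2))  # 0.82
```

Words or phrases whose profiles score highly against a rare word's profile become its paraphrase candidates.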
