2018, NAACL, Noising and Denoising Natural Language: Diverse Back Translation for Grammar Correction #84

Open
Sepideh-Ahmadian opened this issue Sep 26, 2024 · 2 comments
Labels
literature-review Summary of the paper related to the work

Sepideh-Ahmadian commented Sep 26, 2024

Paper
Noising and Denoising Natural Language: Diverse Back Translation for Grammar Correction

Introduction
This research proposes a solution to data sparsity in grammar correction: there are too few parallel pairs of noisy (ungrammatical) and clean sentences. Because grammar correction is commonly framed as translating noisy text into clean text, the shortage of such pairs is a bottleneck in developing these translation-style models. Here, "noising" means adding grammatical errors to clean sentences, and "denoising" means refining (correcting) the noisy sentences with a trained model.

Main Problem
Grammar correction needs a large parallel corpus of noisy and clean sentences. The paper alleviates this problem by generating synthetic noisy data from clean data. The generation method is inspired by back-translation in machine translation.
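To make the data flow concrete, here is a minimal sketch of back-translation-style synthesis: clean sentences go in, and (noisy, clean) training pairs come out. The names `toy_noise` and `synthesize_pairs` are hypothetical, and the rule-based noiser is only a stand-in for the paper's trained noising model, which this summary does not describe.

```python
import random


def toy_noise(sentence: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a learned clean-to-noisy model.

    Drops one article if the sentence has any; otherwise swaps two
    adjacent words. A real system would use a trained neural model here.
    """
    tokens = sentence.split()
    articles = [i for i, t in enumerate(tokens) if t.lower() in {"a", "an", "the"}]
    if articles:
        del tokens[rng.choice(articles)]
    elif len(tokens) >= 2:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)


def synthesize_pairs(clean_sentences, noise_fn, seed=0):
    """Return (noisy, clean) pairs; the clean side is the training target."""
    rng = random.Random(seed)
    return [(noise_fn(s, rng), s) for s in clean_sentences]


pairs = synthesize_pairs(["I read the book yesterday.", "She likes tea."], toy_noise)
for noisy, clean in pairs:
    print(f"{noisy!r} -> {clean!r}")
```

The resulting pairs would then be used to train the denoising (correction) model in the noisy-to-clean direction, which is the reverse of the direction used to generate them.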

Illustrative Example
Clean version: "Day after day, I get up at 8 o'clock"
Synthesized noisy version: "I got up at 8 o'clock day after day."

Input
Noisy sentence (having grammatical mistakes)

Output
Clean and grammatically correct sentence

Motivation
The authors were motivated by the need to overcome the data sparsity issue in grammar correction. Grammar correction systems often require a large corpus of parallel noisy and clean sentence pairs, which are hard to come by. The motivation was to generate synthetic noisy sentences from clean ones, which would allow training neural models for grammar correction without the need for extensive manually curated data.

Related works and their gaps
The paper addresses the lack of realistic, diverse error types in previous methods for synthesizing noisy data (Brockett et al., 2006; Felice, 2016). Previous approaches often generated unrealistic noise or were limited to local context windows (Linzen et al., 2016; Sennrich et al., 2015). The authors aim to generate more realistic and diverse noisy sentences through neural sequence transduction and back-translation techniques.

Contribution of this paper
The paper’s main contributions include:
Proposing a neural sequence transduction model for generating synthetic noisy data for grammar correction.
Introducing several beam search noising procedures to produce diverse and realistic noisy sentences.
Demonstrating that the synthesized data improves grammar correction performance, nearly matching the performance of models trained on large parallel corpora of real noisy data.
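As a rough illustration of what "beam search noising" can look like, the sketch below perturbs hypothesis scores with a random penalty during beam search, so that the decoder no longer always returns the single most likely output. This is an assumed, simplified form for illustration and may not match the paper's exact procedures. `toy_next_log_probs` is a hypothetical three-step model, and `beta` is an assumed noise-strength parameter (with `beta=0` this is ordinary beam search).

```python
import math
import random


def toy_next_log_probs(prefix):
    """Hypothetical toy model: a likely and an unlikely token at each step."""
    steps = [("I", "me"), ("get", "got"), ("up", "on")]
    if len(prefix) >= len(steps):
        return {"</s>": 0.0}
    likely, unlikely = steps[len(prefix)]
    return {likely: math.log(0.8), unlikely: math.log(0.2)}


def noised_beam_search(next_log_probs, beam_size=2, max_len=5, beta=0.0,
                       seed=0, eos="</s>"):
    """Beam search whose hypothesis scores are perturbed by random penalties."""
    rng = random.Random(seed)
    beams = [((), 0.0)]  # (tokens, cumulative noised score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, log_p in next_log_probs(tokens).items():
                # Assumed noising step: subtract a random penalty in [0, beta].
                penalty = beta * rng.random()
                candidates.append((tokens + (tok,), score + log_p - penalty))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses that reached max_len without eos
    return max(finished, key=lambda h: h[1])[0]


print(noised_beam_search(toy_next_log_probs, beta=0.0))
print(noised_beam_search(toy_next_log_probs, beta=5.0, seed=1))
```

Larger values of `beta` let lower-probability hypotheses survive pruning more often, which is one way to trade fidelity for diversity in the synthesized noisy sentences.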

Proposed methods
Not included

Experiments
The model is evaluated on the CoNLL 2013 and 2014 datasets for grammar correction and the JFLEG test set, which evaluates fluency in grammar correction.
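For context, CoNLL-2014 results are conventionally reported as F0.5 over correction edits, which weights precision more heavily than recall, while JFLEG is scored with GLEU. The sketch below computes F-beta from sets of edits. It is a simplification: the official M2 scorer also handles edit alignment and multiple annotators, and the `(start, end, replacement)` edit tuples are a hypothetical representation.

```python
def f_beta(system_edits, gold_edits, beta=0.5):
    """F-beta over sets of edits; beta < 1 favors precision over recall."""
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    true_positives = len(system_edits & gold_edits)
    precision = true_positives / len(system_edits) if system_edits else 1.0
    recall = true_positives / len(gold_edits) if gold_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical edits as (start_token, end_token, replacement) tuples.
gold = {(1, 2, "get"), (6, 6, ",")}
system = {(1, 2, "get")}
print(round(f_beta(system, gold), 4))
```

Here the system proposes one edit and it is correct (precision 1.0), but it misses one of the two gold edits (recall 0.5).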

Implementation
Not mentioned

Gaps of this work
I believe that, because the training dataset is limited, the synthesized noisy data may not capture all real-world grammatical errors. Therefore, the model may not perform well across different domains.

@Sepideh-Ahmadian Sepideh-Ahmadian added the literature-review Summary of the paper related to the work label Sep 26, 2024
@Sepideh-Ahmadian Sepideh-Ahmadian self-assigned this Sep 26, 2024
@hosseinfani

@Sepideh-Ahmadian
I had an idea of fixing grammatical errors, or any other type of error, in a sentence using back-translation in an unsupervised way. This is the same idea, right?

@Sepideh-Ahmadian

@hosseinfani, the purpose of this research is to generate additional data in the machine translation domain by creating a corpus of correct and noisy sentence pairs.
I think we should do some digging into the grammar correction literature.
