Ruchit tripathi patch 1 #34

Open · wants to merge 2 commits into master
18 changes: 11 additions & 7 deletions README.md
# Multilingual Neural Machine Translation System for TV News

_This is my [Google Summer of Code 2018](https://summerofcode.withgoogle.com/projects/#6685973346254848) Project with [the Distributed Little Red Hen Lab](http://www.redhenlab.org/)._

The aim of this project is to build a Multilingual Neural Machine Translation System, which would be capable of translating Red Hen Lab's TV News Transcripts from different source languages to English.

The system uses Reinforcement Learning (the Advantage Actor-Critic algorithm) on top of a neural encoder-decoder architecture, and it outperforms plain Neural Machine Translation trained with maximum log-likelihood. Our system achieves close to state-of-the-art results on the standard WMT (Workshop on Machine Translation) test datasets.
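As a rough illustration of the idea (a toy sketch, not the project's actual training code), the actor-critic objective pushes up the probability of tokens that earned more reward than the critic expected. All numbers below are made up:

```python
import math

def actor_critic_loss(log_probs, rewards, values):
    """Toy per-token advantage actor-critic loss for one generated sequence.

    log_probs: log-probability the actor assigned to each emitted token
    rewards:   per-token reward (e.g. incremental BLEU improvement)
    values:    critic's estimate of expected reward at each step
    """
    actor_loss = 0.0
    critic_loss = 0.0
    for log_p, r, v in zip(log_probs, rewards, values):
        advantage = r - v                 # how much better than the critic expected
        actor_loss += -log_p * advantage  # reinforce tokens with positive advantage
        critic_loss += (r - v) ** 2       # regress the critic toward observed reward
    return actor_loss, critic_loss

# Toy example: a three-token sequence
a_loss, c_loss = actor_critic_loss(
    log_probs=[math.log(0.5), math.log(0.2), math.log(0.9)],
    rewards=[0.3, 0.1, 0.6],
    values=[0.2, 0.2, 0.5],
)
```

In real training the rewards come from a sequence-level metric such as BLEU, and both networks are updated jointly, as described in the actor-critic paper linked above.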

This project is inspired by the approaches mentioned in the paper [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086).

I maintain a GSoC blog; please refer to it for all my GSoC blog posts about the progress made so far.
Blog link: [GSoC Blog](https://vikrant97.github.io/gsoc_blog/)

The following languages are supported as source languages; their language codes are listed below:
1) **German - de**
The target language is English (en).

### Installation & Setup Instructions on CASE HPC

* To get the pipeline working on CASE HPC, just copy the directory named **nmt** from the home directory of my HPC account, i.e. **/home/vxg195**, and then follow the instructions described for training & translation.

* The **nmt** directory will contain the following subdirectories:
* singularity
* test.$src-$tgt.$src.processed
* test.$src-$tgt.$tgt.processed
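The `$src-$tgt` naming scheme can be expanded programmatically. A small sketch of my own (using de-en as the example pair, and assuming the train and valid splits follow the same pattern as the test files shown above):

```python
def data_files(src, tgt):
    """Expand the $split.$src-$tgt.$lang.processed naming scheme."""
    pair = f"{src}-{tgt}"
    return [
        f"{split}.{pair}.{lang}.processed"
        for split in ("train", "valid", "test")  # assumed split names
        for lang in (src, tgt)
    ]

print(data_files("de", "en"))
# → ['train.de-en.de.processed', 'train.de-en.en.processed',
#    'valid.de-en.de.processed', 'valid.de-en.en.processed',
#    'test.de-en.de.processed', 'test.de-en.en.processed']
```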

* The **models** directory consists of trained models for the respective language pairs and follows the same subdirectory structure as the **data** directory. For example, **models/de-en** will contain the trained models for the **German-English** language pair.

* The following commands were used to install dependencies for the project:
```bash
$ git clone https://github.com/RedHenLab/Neural-Machine-Translation.git
$ virtualenv myenv
$ source myenv/bin/activate
$ pip install -r Neural-Machine-Translation/requirements.txt
```

* **Note** that the virtual environment (myenv) created with the virtualenv command above should use **Python 2**.

## Data Preparation and Preprocessing
Please note that these data preparation steps have to be done manually as we are
* test.$src-$tgt.$src
* test.$src-$tgt.$tgt

2) Now create an empty directory named $src-$tgt in the Neural-Machine-Translation/subword_nmt directory. Copy the file named "prepare_data.sh" into the language subdirectory for which we need to prepare the dataset. Then use the following commands to process the dataset for training:
```bash
bash prepare_data.sh $src $tgt
```
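The script lives under subword_nmt, which suggests the preprocessing applies byte-pair encoding (BPE) to split rare words into subword units. As a rough, self-contained illustration of that idea only (not the actual script, and with a hypothetical merge list), applying learned BPE merges to one word looks like this:

```python
def apply_bpe(word, merges):
    """Greedily apply an ordered list of BPE merge operations to one word."""
    symbols = list(word)
    for a, b in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # merge adjacent symbols that match the learned pair
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges learned from a training corpus
print(apply_bpe("lower", [("l", "o"), ("lo", "w"), ("e", "r")]))
# → ['low', 'er']
```

In the real pipeline the merge list is learned from the training corpora, so the source and target vocabularies stay small and open-vocabulary words remain translatable.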
For evaluation, generate a translation of any source test corpus. Now, we need to compute the BLEU score of the hypothesis against the reference:
perl scripts/multi-bleu.perl $reference-file < $hypothesis-file
```
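multi-bleu.perl computes corpus-level BLEU. For intuition about what the metric measures, here is a simplified, smoothed single-sentence version in Python (an illustration only, not a replacement for the script):

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    times a brevity penalty, for a single sentence pair."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clip each n-gram's count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        # floor the overlap at 1 (crude smoothing) to avoid log(0)
        log_prec += math.log(max(overlap, 1) / total) / max_n
    # penalize hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))
# → 1.0 (identical sentences score perfectly)
```

The Moses script additionally aggregates n-gram counts over the whole corpus before taking the geometric mean, which is why single-sentence scores can differ from corpus scores.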


## Acknowledgements

* [Google Summer of Code 2018](https://summerofcode.withgoogle.com/)
* [An Actor-Critic Algorithm for Sequence Prediction](https://arxiv.org/pdf/1607.07086)
* [Europarl](http://www.statmt.org/europarl/)
* [Moses](https://github.com/moses-smt/mosesdecoder)