This repository contains the implementation of my summer thesis.
You'll need `python3`, `pytorch`, `spacy`, `numpy`, `pyemd`, `bayesian-optimisation`, `rouge`, and `pyTelegramBotAPI`. You can install them via `pip3` (but you'll want to make use of CUDA for training, so you might want to look into a CUDA-enabled PyTorch build too).

To make life easier, I've added a `requirements.txt` that'll let you install everything necessary (after installing Python 3.6+):

```bash
pip install -e .
```
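Training benefits heavily from a GPU, so it's worth verifying that PyTorch can actually see CUDA before going any further. Here's a quick sanity check; note that `en_core_web_sm` is just spaCy's usual English default and is an assumption on my part, the code may expect a different model:

```bash
# Should print True if PyTorch was built with CUDA and a GPU is visible
python3 -c "import torch; print(torch.cuda.is_available())"

# spaCy models ship separately from the library itself;
# en_core_web_sm is assumed here, swap in whichever model the code loads
python3 -m spacy download en_core_web_sm
```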
I've also set up scripts that deal with downloading all the machine-translation datasets needed to train your model. You'll need `wget`, however.

```bash
cd datasets
./init_enfr_dataset.sh
./political_data.sh
```
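If you're not sure whether `wget` is installed, a quick check before running the scripts (the install command assumes a Debian/Ubuntu system):

```bash
# Prints the path if wget is on PATH; otherwise installs it (Debian/Ubuntu)
command -v wget || sudo apt-get install -y wget
```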
That being said, you'll have to compose your own newspaper dataset, since I'm near certain that releasing such a dataset is not allowed by the newspapers, for a variety of reasons ranging from ethical to legal. In the root directory there is a zip called `newspapers`. Run the Jupyter notebook `downloader_manual.ipynb` to get the article dataset (you'll need to submit your login for The Times in `times.json` first).

If `times.json` isn't created automatically, use the following format:
```json
{
    "action": "login",
    "username": "TIMES_USERNAME",
    "password": "PASSWORD",
    "s": 1,
    "rememberMe": "on"
}
```
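If you'd rather create the file from the shell, a heredoc works fine (substitute your own credentials):

```bash
# Write times.json in the current directory; replace the placeholder values
cat > times.json <<'EOF'
{
    "action": "login",
    "username": "TIMES_USERNAME",
    "password": "PASSWORD",
    "s": 1,
    "rememberMe": "on"
}
EOF
```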
After running the notebook, run `convert_express.sh`, which will move the dataset to the base directory.
Once the datasets are downloaded, `cd base/scripts` and run the following:

```bash
cd prod
./train_nmt_models.sh
./train_pol_st_models.sh
./build_pub_corpus.sh
./train_pub_st_models.sh
./train_pub_naturalness_models.sh
```
It'll probably take quite a while to train the models.
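Since these runs can take hours, it may be worth detaching them from your terminal; this is just a convenience sketch, adjust the log path to taste:

```bash
# Run the full pipeline in the background, logging to train.log;
# the run survives the terminal closing (e.g. an SSH disconnect)
nohup sh -c '
  ./train_nmt_models.sh &&
  ./train_pol_st_models.sh &&
  ./build_pub_corpus.sh &&
  ./train_pub_st_models.sh &&
  ./train_pub_naturalness_models.sh
' > train.log 2>&1 &
tail -f train.log
```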