This folder contains our new machine comprehension dataset as well as scripts to run experiments on it.
Tested with Ubuntu 22, Python 3.10, and an NVIDIA A40 GPU.
- Create a Python virtual environment, either `venv` or `conda`;
- Install the necessary dependencies with `pip install -r requirements.txt`, or `python3 -m pip install -r requirements.txt` if using `conda`;
- Additionally, install PyTorch with your preferred configuration: https://pytorch.org/
- Activate the virtual environment;
- Run the `./train.sh` script, which will automatically fine-tune the `mBERT` and `XLM-R` models on the downstream task;
- Output will be saved in the `./output` folder.
The dataset is based on MCTest. The `MCTest` folder has the original dataset and information from the original authors.
Important note: Our translators identified an error with the original English `mc160.dev.17` story: namely, in question 3, the correct answer (B) incorrectly says "pink" flowers rather than "yellow" flowers. We have fixed the English version included in this repo, and our translations have also been corrected to reflect the text of the original story.
The `MC160.dev` set has been translated into Mauritian Creole and Haitian Creole by professional translators. This consists of 30 stories with a total of 120 questions (4 multiple-choice questions per story).
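As a quick sanity check of these counts, one of the translated files can be loaded directly. The sketch below is a hedged illustration: the field names (`questions`, and treating the file as a list of story records) are assumptions about the json layout, not a documented schema.

```python
import json

# Hedged sanity check: we assume the file is a list of story records, each
# holding a "questions" list. These field names are guesses, not a schema.
with open("MCTestHat1/mc160.dev.json", encoding="utf-8") as f:
    stories = json.load(f)

print(len(stories), "stories")                                 # expected: 30
print(sum(len(s["questions"]) for s in stories), "questions")  # expected: 120
```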
Notably, we have two distinct translations for Haitian:

- `MCTestHat1/mc160.dev.json` is a direct translation, matching the English.
- `MCTestHat2/mc160.dev.json` is a localized translation, with names, places, and activities adjusted to be more relevant to Haitian people.
See `MCTest/CreoleTranslations` for the original `.txt` translations, as well as the `.tsv` file formats.
NB: Once this data has been uploaded to the MIT-Ayiti website, we will remove it from GitHub and instead provide a download script to fetch it from the MIT-Ayiti platform.
First, we convert the translated `.txt` files into `.tsv` to match the original English `.tsv` files (see `./MCTest/CreoleTranslations`). Then we convert these to `.json` format, as we found it tidier to work with (these are the `./MCTest*X*` directories, where X = {160, Hat1, Hat2, Mar, 500}). The `./MCTest160` dir is the original train, dev, and test data in English. The `Hat1`, `Hat2`, and `Mar` datasets have the translated `mc160.dev.json` files, for Haitian and Mauritian. The `./MCTest500` dir is the English MC500 dataset, as json.

Code for this (though not tidied up...) can be found in `preproc.py`.
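For orientation, here is a minimal sketch of what the `.tsv` to `.json` step looks like. It is not the actual `preproc.py` logic: the assumed column layout (id, properties, story text, then question/option groups) and the output field names are illustrative guesses.

```python
import csv
import json

def tsv_to_json(tsv_path: str, json_path: str) -> None:
    """Hypothetical .tsv -> .json conversion; see preproc.py for the real code."""
    stories = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Assumed layout: id, properties, story text, then repeated
            # groups of one question followed by its four answer options.
            story_id, _properties, text, *qa = row
            questions = [
                {"question": qa[i], "options": qa[i + 1 : i + 5]}
                for i in range(0, len(qa), 5)
            ]
            stories.append({"id": story_id, "story": text, "questions": questions})
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(stories, f, ensure_ascii=False, indent=2)

# Example invocation (paths are illustrative):
tsv_to_json("MCTest/CreoleTranslations/mc160.dev.tsv", "MCTestMar/mc160.dev.json")
```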
All experiments here run with Huggingface Transformers under PyTorch.
To install all required packages, you can run `pip install -r requirements.txt`.
It's recommended to do this in a virtual environment or within a container; we ran our code in a container based on the Docker image `docker://nvcr.io/nvidia/pytorch:22.08-py3`. For reference, we include our full Python environment as `requirements.pip-freeze.txt`.
Please see `train.sh` and `evaluate.sh` for concrete examples of how to train and evaluate the models.
The experiments use separate scripts for mBERT (`run_mbert.py`) and XLM-R (`run_xlmr.py`), as the latter does not make use of `token_type_ids`. We use the Huggingface `AutoModelForMultipleChoice` to instantiate the models.
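To illustrate what that instantiation looks like, below is a minimal, self-contained sketch of scoring a single question with `AutoModelForMultipleChoice`. It is not the training code in `run_mbert.py`: the checkpoint name, toy inputs, and batching are our own assumptions.

```python
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

# Minimal sketch, not the actual run_mbert.py training code: score one
# question with four answer options. Checkpoint and inputs are toy choices.
model_name = "bert-base-multilingual-cased"  # mBERT; use "xlm-roberta-base" for XLM-R
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

story = "James went to the store and bought yellow flowers."
question = "What color were the flowers?"
options = ["pink", "yellow", "blue", "red"]

# Each option is paired with (story + question). For mBERT the tokenizer
# also emits token_type_ids; the XLM-R tokenizer omits them.
encoded = tokenizer(
    [f"{story} {question}"] * len(options),
    options,
    padding=True,
    return_tensors="pt",
)
# AutoModelForMultipleChoice expects (batch, num_choices, seq_len) tensors.
inputs = {k: v.unsqueeze(0) for k, v in encoded.items()}

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print("predicted option:", options[logits.argmax(dim=-1).item()])
```

Note that the multiple-choice classification head is randomly initialized here, so the prediction is meaningless before fine-tuning; the sketch only demonstrates input shaping.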
As this data is translated from Microsoft's MCTest dataset, it inherits the same license as the original.