
Tatoeba task

Created for the Creole Suite by Marcell Fekete. Accuracy scores are reported out of 100.

Getting Started

Environment Setup

Tested with Ubuntu 22 and Python 3.10.

Create a Python virtual environment with either venv or conda.
Install the necessary dependencies with pip install -r requirements.txt, or with python3 -m pip install -r requirements.txt if using conda.
Additionally, install PyTorch with your preferred configuration (see https://pytorch.org/).

Generating Results

Activate your Python environment.
Run bash ./run_experiments.sh with tatoeba_task/ as your working directory.
Results are stored in the experiments/ folder.

The script downloads the necessary data and runs the experiments for the models bert-base-multilingual-cased, xlm-roberta-base, and google/mt5-base, as well as random (the random baseline). By default, python3 is used to run the Python code; change this in the script if your interpreter has a different name.

Analysis

Run the plot_distributions.py script with the input folder data/ and output folder ./plots/length as arguments. This will plot sentence lengths per language pair.
Run the create_tables.py script with the input folder experiments/ as an argument.
Run the analysis.py script to calculate tokenizer fertility and token overlap between source and target sentences and to plot them. The default output folder is ./plots/analysis/.
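The two quantities reported by analysis.py can be illustrated with a minimal sketch. The README does not state the exact definitions used, so this assumes fertility is the average number of subword tokens per whitespace-separated word and overlap is the Jaccard similarity of the source and target token sets. The tokenizer `toy_subword_tokenize` is a hypothetical stand-in for a real model tokenizer, and none of these function names come from the repository.

```python
from typing import Callable, Iterable, List

WordTokenizer = Callable[[str], List[str]]


def toy_subword_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size character chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def tokenize_sentence(sentence: str, tokenize_word: WordTokenizer) -> List[str]:
    """Split on whitespace, then split every word into subword tokens."""
    tokens: List[str] = []
    for word in sentence.split():
        tokens.extend(tokenize_word(word))
    return tokens


def fertility(sentences: Iterable[str], tokenize_word: WordTokenizer) -> float:
    """Assumed definition: subword tokens produced per whitespace word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        n_tokens += len(tokenize_sentence(sentence, tokenize_word))
    return n_tokens / n_words if n_words else 0.0


def token_overlap(source: str, target: str, tokenize_word: WordTokenizer) -> float:
    """Assumed definition: Jaccard similarity of the two token sets."""
    src = set(tokenize_sentence(source, tokenize_word))
    tgt = set(tokenize_sentence(target, tokenize_word))
    union = src | tgt
    return len(src & tgt) / len(union) if union else 0.0
```

A fertility close to 1 means the tokenizer keeps most words whole, while higher values mean words are split into many pieces. To try this with a real model, the toy function could be swapped for the `tokenize` method of a tokenizer loaded for one of the models listed above.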

Contents

Folders

./data/ folder: contains the test samples as TSV files
./experiments/ folder: contains experiment outputs per language model (and for the random baseline)
./plots/length/ folder: contains bar plots of the length distribution of the test samples
./plots/analysis/ folder: contains plots of tokenizer fertility and token overlap between source and target sentences per language pair and per language model
./tables/ folder: contains the aggregated results of the experiments with accuracy and average cosine similarity scores per language
./tatoeba/ folder: created when the download_tatoeba.sh or run_experiments.sh script is run; it contains the data from the Tatoeba Challenge repository arranged in one folder per language
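As an illustration of how the TSV test samples relate to the length plots, the sketch below reads sentence pairs and counts sentence lengths. The column layout is not documented here, so it assumes two tab-separated columns (source sentence, then target sentence) and measures length in whitespace-separated words. The function names and the sample text are made up for this example and are not taken from the repository.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, Tuple


def read_pairs(handle: Iterable[str]) -> List[Tuple[str, str]]:
    """Read (source, target) pairs, assuming two tab-separated columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def length_histogram(sentences: Iterable[str]) -> Counter:
    """Map sentence length (in whitespace words) to its frequency."""
    return Counter(len(sentence.split()) for sentence in sentences)


# Made-up sample standing in for one file from ./data/.
sample = "one two three\tuno dos\nfour\tcuatro cinco\n"
pairs = read_pairs(io.StringIO(sample))
source_hist = length_histogram(src for src, _ in pairs)
target_hist = length_histogram(tgt for _, tgt in pairs)
```

For a real file, `io.StringIO(sample)` would be replaced by a file opened with `encoding="utf-8"`, and the resulting counters could be passed to any plotting library to draw bar plots.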

Scripts

create_tables.py: aggregates experimental results with a default output in ./tables/
download_tatoeba.sh: downloads data from the Tatoeba Challenge repository and arranges it in the right format for further processing with a default output in ./tatoeba/
prepare_data.py: creates test samples conforming with the Tatoeba task with a default output in ./data/
run_experiments.sh: downloads and prepares the data from the Tatoeba Challenge repository and carries out the sentence pair retrieval task, by default for bert-base-multilingual-cased, xlm-roberta-base, google/mt5-base and random (for random baseline), with outputs in ./experiments/. It does this by calling the other scripts.
run_tatoeba.py: carries out the sentence pair retrieval task with a default output in ./experiments/
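The sentence pair retrieval task can be sketched as nearest-neighbour search over sentence embeddings. This is a minimal illustration, not the code in run_tatoeba.py: it assumes the source and target lists are aligned so that index i in one is the translation of index i in the other, and that the reported average cosine similarity is taken over the best match for each source sentence. How the embeddings are produced (for example by pooling model outputs) is left out, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(source_vecs: List[Vector], target_vecs: List[Vector]) -> Tuple[float, float]:
    """Return (accuracy out of 100, mean cosine similarity of the best matches).

    Assumes source_vecs[i] and target_vecs[i] embed a translation pair.
    """
    if not source_vecs or not target_vecs:
        return 0.0, 0.0
    correct = 0
    best_scores: List[float] = []
    for i, src in enumerate(source_vecs):
        scores = [cosine(src, tgt) for tgt in target_vecs]
        # Index of the most similar target sentence (first one wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == i)
        best_scores.append(scores[best])
    n = len(source_vecs)
    return 100.0 * correct / n, sum(best_scores) / n
```

For example, `retrieve([[1, 0], [0, 1]], [[2, 0], [0, 3]])` matches both pairs correctly, because cosine similarity ignores vector length. A baseline that picks targets uniformly at random would be expected to score about 100 / N for N candidate sentences.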
