Experiment runner for the paper "Cross-Lingual Emotion Lexicon Induction using Representation Alignment in Low-Resource Settings".
- Clone this repo.
- Clone fastText, multilingual-nlm and vecmap.
- Install the Python packages `numpy`, `cupy`, `torch` and `pot`.
Some conventions:
- Languages are referred to by their 3-letter ISO code.
- Each Bible file should be named `<iso>.txt`; e.g. the Spanish Bible would be `spa.txt`.
- Each line in a Bible file should contain one sentence/verse. The text should be pre-processed to be lowercased and contain space-separated words (no punctuation, unless it is in the middle of a word, e.g. hyphenation). See the excerpt below.
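For example, the first lines of a preprocessed `spa.txt` might look like this (a hypothetical excerpt, lowercased with punctuation stripped):

```
en el principio creó dios los cielos y la tierra
y la tierra estaba desordenada y vacía
```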
The three algorithms used for embedding alignment:

`fb` (uses the fastText clone):

```
python align.py \
    --langs <space-separated list of languages> \
    --bible_dir <directory with Bibles> \
    --align_dir <directory to save aligned embeddings> \
    --emb_dir <directory to save initial unaligned fastText embeddings> \
    --num_gpus <number of GPUs to use> \
    --algorithm fb \
    --fasttext_dir <path to fastText clone>
```
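For instance, a hypothetical run aligning English and Spanish (all paths here are illustrative placeholders):

```
# Assumes ../fastText is a fastText clone and data/bibles holds eng.txt and spa.txt.
python align.py \
    --langs eng spa \
    --bible_dir data/bibles \
    --align_dir runs/aligned_fb \
    --emb_dir runs/fasttext_embs \
    --num_gpus 1 \
    --algorithm fb \
    --fasttext_dir ../fastText
```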
`nlm` (uses the multilingual-nlm clone):

```
python align.py \
    --langs <space-separated list of languages> \
    --bible_dir <directory with Bibles> \
    --align_dir <directory to save aligned embeddings> \
    --num_gpus <number of GPUs to use> \
    --algorithm nlm \
    --nlm_dir <path to multilingual-nlm clone> \
    --nlm_preproc_dir <directory to save preprocessed artifacts such as the vocabulary> \
    --nlm_preprocess <flag to run preprocessing; can be removed after running once> \
    --nlm_modified <flag to run the modified version instead of the original>
```
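A hypothetical first run of the modified `nlm` variant (paths illustrative; `--nlm_preprocess` can be dropped on later runs once the preprocessed files exist):

```
# Assumes ../multilingual-nlm is a multilingual-nlm clone.
python align.py \
    --langs eng spa \
    --bible_dir data/bibles \
    --align_dir runs/aligned_nlm \
    --num_gpus 1 \
    --algorithm nlm \
    --nlm_dir ../multilingual-nlm \
    --nlm_preproc_dir runs/nlm_preproc \
    --nlm_preprocess \
    --nlm_modified
```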
`vecmap` (uses the vecmap and fastText clones):

```
python align.py \
    --langs <space-separated list of languages> \
    --bible_dir <directory with Bibles> \
    --sid_bible_dir <directory with Bibles with first column having sentence ID> \
    --align_dir <directory to save aligned embeddings> \
    --emb_dir <directory to save initial unaligned fastText embeddings> \
    --num_gpus <number of GPUs to use> \
    --algorithm vecmap \
    --vecmap_dir <path to vecmap clone> \
    --fasttext_dir <path to fastText clone>
```
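A hypothetical run of the sentence-ID `vecmap` variant (paths illustrative):

```
# Assumes data/sid_bibles holds the sentence-ID versions of the same Bibles.
python align.py \
    --langs eng spa \
    --bible_dir data/bibles \
    --sid_bible_dir data/sid_bibles \
    --align_dir runs/aligned_vecmap \
    --emb_dir runs/fasttext_embs \
    --num_gpus 1 \
    --algorithm vecmap \
    --vecmap_dir ../vecmap \
    --fasttext_dir ../fastText
```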
The sentence-ID Bibles are like the normal Bibles, except that each line is prefixed with a sentence ID followed by a tab. Translations of the same sentence should have the same sentence ID across languages; see the example below. Omit the `--sid_bible_dir` argument to run the original vecmap algorithm.
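For example, matching lines from hypothetical `eng.txt` and `spa.txt` sentence-ID Bibles (the IDs are made up; the separator after the ID is a tab):

```
01001001	in the beginning god created the heavens and the earth
```

```
01001001	en el principio creó dios los cielos y la tierra
```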
To run the lexicon induction and evaluation:

```
python eval.py \
    --langs <space-separated list of languages> \
    --align_dir <directory with aligned embeddings from previous step> \
    --exp_id <experiment ID> \
    --trans_dir <directory with ground-truth word translations> \
    --emo_lex_dir <directory with ground-truth emotion lexicons> \
    --nns_dir <directory to save derived word translations to> \
    --reports_dir <directory to save evaluation reports and induced emotion lexicons>
```
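For instance, evaluating the `fb` alignments from the hypothetical run above:

```
# exp_id distinguishes runs that share the same nns_dir and reports_dir.
python eval.py \
    --langs eng spa \
    --align_dir runs/aligned_fb \
    --exp_id fb_run1 \
    --trans_dir data/translations \
    --emo_lex_dir data/emo_lexicons \
    --nns_dir runs/nns \
    --reports_dir runs/reports
```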
- The path provided to `nns_dir` and `reports_dir` is suffixed with `exp_id`, so multiple runs can share the same paths and differ only in experiment ID.
- The ground-truth word translation files should be named `<src_iso>_<tgt_iso>.txt`, e.g. `spa_eng.txt` for Spanish-to-English translations. They should be tab-separated files, with the first column being a word in the source language and the second column its translation into the target language (see the example after this list).
- The ground-truth emotion lexicons should be named `<iso>.txt`. They can be obtained from the NRC EIL webpage (an illustrative line is shown below).
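For example, a few hypothetical lines of `spa_eng.txt` (columns separated by a tab):

```
gato	cat
perro	dog
casa	house
```

And an illustrative line of an emotion lexicon `eng.txt`, assuming the tab-separated word/emotion/intensity layout of the NRC EIL distribution (the score here is made up):

```
happy	joy	0.85
```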
To run only the lexicon induction and skip the evaluation:
```
python eval.py \
    --langs <space-separated list of languages> \
    --align_dir <directory with aligned embeddings from previous step> \
    --exp_id <experiment ID> \
    --nns_dir <directory to save derived word translations to> \
    --reports_dir <directory to save evaluation reports and induced emotion lexicons> \
    --skip_eval
```
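For example, inducing lexicons from the same hypothetical `fb` alignments without evaluating them:

```
# Induction-only run; no ground-truth directories are needed.
python eval.py \
    --langs eng spa \
    --align_dir runs/aligned_fb \
    --exp_id fb_run1_induce \
    --nns_dir runs/nns \
    --reports_dir runs/reports \
    --skip_eval
```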