Data and scripts for reproducing the results from the ACL 2016 paper. The vectors used for the experiments can be found here.
Simply run:

```sh
python qvec_cca2.py --in_vectors ~/mydir/en1.vectors
```
To test whether the difference between two models is statistically significant, run the following over the pair:

```sh
python stat_signf.py ~/mydir/en1.vectors ~/mydir/vulic_vectors/en2.vectors data/word-sim/EN-SIMLEX-999.txt
```
This will output something like:

```
Vectors read from: ~/mydir/en1.vectors
Vectors read from: ~/mydir/en2.vectors
==================================================
Num Pairs Not found Rho
==================================================
999 90 0.1234
999 41 0.5678
999 129 0.8429
(1.376206182800332, 0.014533569095982752)
```

where the second element of the final tuple is the p-value (here, 0.014).
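The columns above are the number of word pairs in the similarity benchmark, the number of pairs with missing vectors, and the Spearman correlation for each model. The exact significance test implemented in `stat_signf.py` is not documented here; as a minimal sketch, one standard way to compare two correlations is Fisher's r-to-z test (the function name and all numbers below are illustrative, not taken from the repository):

```python
import math

def compare_correlations(r1, r2, n1, n2):
    """Fisher r-to-z test for the difference between two
    independent correlation coefficients.

    Returns (z statistic, two-sided p-value)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transform of each rho
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided normal tail probability
    return z, p

# Illustrative numbers only:
z, p = compare_correlations(0.5, 0.3, 100, 100)
```

Note that tests for *dependent* correlations (both models scored against the same benchmark) adjust the standard error further; this sketch shows only the basic transform-and-compare idea.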
The dictionaries used for evaluation are provided in `en.*.dict`. The format is:

```
en1 en2 <tab> fr1 fr2 fr3
```
where `en1` and `en2` are entries on the English side which all share `fr1`, `fr2`, `fr3` as possible translations. For example:

```
dangerous hazardous unsafe risky dangereux dangereuse
```
If you count the English-side tokens of each dictionary you should get 1510 (fr), 1425 (de), 1610 (zh), and 1024 (sv).
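As a sanity check, the dictionary format above can be parsed and counted with a few lines of Python (a sketch assuming tab-separated sides with space-separated entries, as described; the function names are illustrative):

```python
def read_dict(path):
    """Parse an en.*.dict file: each line holds English entries
    and foreign translations, the two sides tab-separated and
    each side space-separated."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            en_side, fr_side = line.rstrip("\n").split("\t")
            entries.append((en_side.split(), fr_side.split()))
    return entries

def count_english_tokens(entries):
    # Should match the figures quoted above (e.g. 1510 for fr),
    # assuming each English-side word counts as one token.
    return sum(len(en) for en, _ in entries)
```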
The evaluation code is provided under `my-evaluation`. First set up the dependencies using `mvn compile` and `mvn dependency:copy-dependencies`.
Then run:

```sh
sh run.sh evaluateBiDict de ~/mydir/bicvm_vectors/bicvm.en-de.en.200 ~/mydir/bicvm_vectors/bicvm.en-de.de.200
```
This should print the top-10 accuracy and MRR.
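For reference, top-10 accuracy and MRR over ranked translation candidates follow the standard definitions; a minimal Python sketch (not the Java implementation in `my-evaluation`, whose exact conventions may differ):

```python
def top10_and_mrr(ranked_candidates, gold):
    """ranked_candidates: one list of candidate translations per
    source word, best first. gold: one set of acceptable
    translations per source word.
    Returns (top-10 accuracy, mean reciprocal rank)."""
    hits, rr_sum = 0, 0.0
    for candidates, answers in zip(ranked_candidates, gold):
        # rank of the first correct candidate, or None if absent
        rank = next((i + 1 for i, c in enumerate(candidates)
                     if c in answers), None)
        if rank is not None and rank <= 10:
            hits += 1
        if rank is not None:
            rr_sum += 1.0 / rank   # unfound words contribute 0
    n = len(gold)
    return hits / n, rr_sum / n
```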
Code to extract dictionaries from word alignments for performing CCA is also provided in `my-evaluation`. Try running:

```sh
run.sh WriteDictForCCA parallelFile(uniq.en-es) alignFile(tr.*.intersect) outfile(*.dict) minCount(0-5) limit(-1)
```
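The core of such an extraction can be sketched in a few lines, assuming a parallel file of tab-separated sentence pairs and an alignment file of `i-j` index pairs per line (the fast_align/GIZA++-style format suggested by `tr.*.intersect`); these file-format assumptions and the function name are mine, not taken from the repository:

```python
from collections import Counter

def extract_dict(parallel_lines, align_lines, min_count=1):
    """Count word pairs linked by the alignments and keep the
    pairs seen at least min_count times."""
    pair_counts = Counter()
    for sent, align in zip(parallel_lines, align_lines):
        src, tgt = sent.rstrip("\n").split("\t")
        src_toks, tgt_toks = src.split(), tgt.split()
        for link in align.split():
            i, j = map(int, link.split("-"))   # i-th source word <-> j-th target word
            pair_counts[(src_toks[i], tgt_toks[j])] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_count}
```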
Download the code for the Klementiev et al. paper from here.
You will also need to procure the RCV2 Multilingual Corpus.
For en --> L (train on English and test on language L): the English train split is the same as the one provided by Klementiev et al., and likewise for de's test split. The test splits for fr, sv, and zh are in the file `test-en-l2.txt` (there should be a distribution of 10 C, 300 E, 600 G, 900 M).
For L --> en: the train files for fr, sv, and zh are in the file `train-l2-en.txt`. For de, it is the same as before. The test files (en) are also the same as before.
You will need to get the universal dependencies treebank v1.2 from here.
You will also need to install the parser released here.
An example training script is provided in the file `example.sh`. It trains on the English and German treebanks using the embeddings trained by a particular model, and uses the config file provided under `config/config.cfg`. Note that you may need to change the embedding dimension `embedding_size` to match your embeddings.
### Reference
```
@inproceedings{bicompare:16,
  author    = {Upadhyay, Shyam and Faruqui, Manaal and Dyer, Chris and Roth, Dan},
  title     = {Cross-lingual Models of Word Embeddings: An Empirical Comparison},
  booktitle = {Proc. of ACL},
  year      = {2016},
  url       = {http://arxiv.org/abs/1604.00425}
}
```