See `requirements.txt`. Some packages, such as blocks and fuel, should be installed with pip directly from the GitHub repositories:

```
pip install git+https://github.com/mila-udem/blocks.git@stable -r https://raw.githubusercontent.com/mila-udem/blocks/stable/requirements.txt
pip install git+https://github.com/mila-udem/fuel.git@stable
```
This code is heavily based on the `dict_based_learning` repo.
We directly include the files of several pieces of software, slightly modified:
- Word Embeddings Benchmark, which we have prepackaged into the archive because our modified version includes more datasets and reads our specific model files.
- Retrofitting, where we correct a minor bug and add more options.
We also include the WordNet dictionary (definitions only) in `data/dict_wn.json`, along with its license in `data/wordnet_LICENSE`.
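For a quick look at the data, the file can be loaded with the standard `json` module. This is a minimal sketch; we assume here that the file maps each word to a list of definitions, which you should verify on the file itself:

```python
import json

# Minimal sketch: inspect the bundled WordNet definitions. We assume the file
# maps each word to a list of definitions (tokenized or not); verify on the
# file itself.
with open('data/dict_wn.json') as f:
    dictionary = json.load(f)

print(len(dictionary), 'entries')
word = next(iter(dictionary))
print(word, '->', dictionary[word])
```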
- Run `./build_split_dict.sh` to build the split dictionary.
- Run `./build_full_dict.sh` to build the full dictionary.
In order to use pretrained embeddings, you need `.npy` archives that will be loaded as input embeddings into the model and frozen (not trained). Additionally, you will need a custom vocabulary. For that purpose, you can modify and use two different scripts, `build_pretrained_archive.sh` and `build_pretrained_w2v_defs.sh`. The first one includes words that have definitions but do not appear in definitions, while the second one does not.
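For reference, here is a minimal sketch of what building such an archive amounts to. The two scripts above are the authoritative recipe; the file names (`vocab.txt`, `embeddings.npy`) and the vocabulary format (one token per line, row `i` of the matrix matching token `i`) are assumptions to check against them:

```python
import numpy as np

# Illustrative sketch only: build a .npy archive of pretrained vectors aligned
# with a custom vocabulary. 'vocab.txt' and 'embeddings.npy' are hypothetical
# names; check build_pretrained_archive.sh for the real formats and paths.
dim = 300
pretrained = {}  # token -> np.ndarray of shape (dim,); load e.g. from word2vec

with open('vocab.txt') as f:          # assumed: one token per line
    vocab = [line.strip() for line in f]

rng = np.random.RandomState(0)
rows = [pretrained.get(tok, rng.normal(scale=0.1, size=dim)) for tok in vocab]
matrix = np.asarray(rows, dtype=np.float32)  # row i corresponds to token i
np.save('embeddings.npy', matrix)            # loaded as frozen input embeddings
```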
Once you have the custom vocabulary, you can create configurations for the new models in `dictlearn/s2s_configs.py`. We give the configurations for the full dump experiment, the (very similar) dictionary data with a word2vec pretrained archive, and the full dictionary experiment without any pretraining.
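Configurations are plain Python dictionaries selected by name. The sketch below is purely illustrative: only `proximity_coef`, which is used later in this README, is a key taken from the code; the other names are hypothetical, so copy an existing configuration from `dictlearn/s2s_configs.py` rather than this:

```python
# Purely illustrative: configurations are selected by name, and the plain
# auto-encoder (Hill-style) corresponds to proximity_coef = 0 (see below).
# All keys except 'proximity_coef' are hypothetical; copy a real configuration.
configs = {}

base = {
    'proximity_coef': 0.0,               # 0: plain auto-encoder (Hill-style)
    'vocab_path': 'data/vocab.txt',      # hypothetical key
    'emb_path': None,                    # hypothetical key: no pretrained archive
}

cpae_pretrained = dict(base)
cpae_pretrained.update({
    'proximity_coef': 1.0,               # > 0 adds the consistency penalty (CPAE)
    'emb_path': 'data/embeddings.npy',   # frozen pretrained input embeddings
})
configs['cpae_pretrained'] = cpae_pretrained
```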
See `run.sh` and the corresponding configuration names in `dictlearn/s2s_configs.py` for how to run one specific experiment.
Once your model is trained, you can use it to generate embeddings for all the words that have a definition. Use `evaluate_embeddings.sh` to generate and evaluate embeddings. It is not fully automatic (it requires the right `.tar` archive that contains the trained model), so please read it to make sure that the filenames are consistent with the number of epochs you have trained, etc. The script generates the scores on the dev and test sets. The notebook `notebooks/eval_embs.ipynb` shows how to do model selection.
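The underlying logic is simple even without the notebook; here is a generic sketch of it. The names and scores are made up for illustration, and the actual inputs are whatever `evaluate_embeddings.sh` produced:

```python
# Generic model-selection sketch: choose on the dev score only, then report
# the corresponding test score. Names and numbers are made up for illustration.
def select(results):
    best = max(results, key=lambda r: r['dev'])
    return best['name'], best['test']

results = [
    {'name': 'cpae_epoch10', 'dev': 0.41, 'test': 0.39},
    {'name': 'cpae_epoch20', 'dev': 0.44, 'test': 0.40},
]
print(select(results))  # -> ('cpae_epoch20', 0.4)
```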
There is a distinct script to evaluate the one-shot learning abilities of the model: see `analyze_one_shot.sh`.
- Hill's model is recovered (with shared embeddings between the encoder and decoder, and an L2 distance instead of cosine) when `c['proximity_coef'] = 0` for the configuration `c`, so you can use the same code as for AE and CPAE to run that model.
- To do retrofitting, see `preparation_retrofitting/README.md`.
- To use dict2vec, see `preparation_dict2vec/README.md`.
In order to export definitions in a word2vec-readable format (using the naive concatenation scheme described in the paper), you can use `bin/export_definitions.py`.
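`bin/export_definitions.py` is the reference implementation; the sketch below only illustrates the idea, assuming the dictionary schema described above and that "word2vec readable" means one plain-text pseudo-sentence per line:

```python
import json

# Sketch of the naive concatenation scheme: emit one pseudo-sentence per
# definition, the defined word followed by its definition tokens, so a plain
# word2vec implementation can train on the resulting file. Schema assumptions
# as above; bin/export_definitions.py is the reference implementation.
with open('data/dict_wn.json') as f:
    dictionary = json.load(f)

with open('definitions_corpus.txt', 'w') as out:
    for word, definitions in dictionary.items():
        for definition in definitions:
            tokens = definition if isinstance(definition, list) else definition.split()
            out.write(word + ' ' + ' '.join(tokens) + '\n')
```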
If you are looking for something that is not described here, please look at the scripts in `bin/`; there might be something undocumented that can help you.