This repository contains the code and sample data for the experiments related to our NAACL 2019 paper "Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions".
```
pip install -r ./requirements.txt
```
To check that you have the proper dependencies and to see a sample output, simply run `main.py`. The results (including the predicted labels and the evaluation performance) will be saved in the directory `results/{model settings}`.
Please follow the steps below to run your own model:
The models are trained using pre-trained ELMo embeddings. To obtain representations for your inputs, follow the instructions at ELMo For Many Langs. Save the output h5py files (as `ELMo_EN`, `ELMo_FR`, etc.) and place them in the `./embeddings` folder. A quick way to sanity-check such a file is sketched below.
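The following snippet is a minimal sanity check, not part of the repo: it peeks into the bundled sample embedding file. The layout assumed here (one HDF5 dataset per sentence) is an assumption about the ELMo For Many Langs export and may differ for your configuration.

```python
import h5py
import numpy as np

# Illustrative sanity check: open a precomputed embedding file before
# relying on it. Assumes one dataset per sentence, which is an assumption
# about the export format -- verify against your own files.
with h5py.File("./embeddings/ELMo_EN_SAMPLE", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} stored sentences")
    first = np.array(f[keys[0]])
    # Usually one (num_tokens, embedding_dim) matrix per sentence.
    print(keys[0], first.shape)
```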
In the interest of readability, we provide an overview of the structure of our project:
```
.
├── README.md
├── corpus.py
├── corpus_reader.py
├── data
│   ├── EN_SAMPLE
│   │   ├── dev.cupt
│   │   ├── test.cupt
│   │   └── train.cupt
│   └── bin
│       ├── evaluate_v1.py
│       └── tsvlib.py
├── embeddings
│   └── ELMo_EN_SAMPLE
├── evaluation.py
├── main.py
├── models
│   ├── layers.py
│   └── tag_models.py
├── preprocessing.py
├── requirements.txt
├── results
└── train_test.py
```
- `corpus.py` and `corpus_reader.py`: read the data in the CoNLL format and contain methods that are used by the preprocessor (an illustrative reader for this format is sketched after this list).
- `embeddings`: pre-trained embedding files, named `ELMo_{EN|FR|FA|DE}`. It contains a sample file (`ELMo_EN_SAMPLE`) for the trial run.
- `evaluation.py`: contains the script for evaluation.
- `preprocessing.py`: prepares the data in the proper format and loads the ELMo embeddings.
- `layers.py`: self-attention, GCN, and highway layers are defined here.
- `tag_models.py`: our models are all defined here. This part depends on `layers.py`.
- `train_test.py`: contains the functions for training and testing the models.
- `main.py`: the main script connecting all the other scripts. You can specify the main variables (language, epochs, model, ...) here.
- `data`: contains sample data (in `EN_SAMPLE`) with 5, 2, and 4 sentences in train, dev, and test respectively for a trial run. To obtain the data used in the experiments, download the train, test, and dev files for each language from PARSEME's GitLab page.
- `bin`: contains the evaluation code of the PARSEME shared task on identifying VMWEs.
- `requirements.txt`: contains the names and versions of the required dependencies.
- `results`: results appear here after `main.py` is run.
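For reference, `.cupt` is the PARSEME extension of CoNLL-U: tab-separated columns per token, with the MWE annotation in the final column, `#`-prefixed comment lines, and blank lines between sentences. Below is a minimal, illustrative reader for the bundled sample data; the repo's `corpus_reader.py` is the authoritative implementation.

```python
def read_cupt(path):
    """Yield each sentence as a list of (token, mwe_tag) pairs.

    Illustrative only: .cupt files are CoNLL-U plus a final
    PARSEME:MWE column; '*' marks tokens outside any MWE, while
    entries such as '1:VID' mark VMWE membership.
    """
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):
                continue
            if not line:
                if sentence:
                    yield sentence
                sentence = []
                continue
            cols = line.split("\t")
            # cols[1] is the surface form; cols[-1] the MWE annotation.
            sentence.append((cols[1], cols[-1]))
    if sentence:
        yield sentence

# Example on the bundled sample data:
for sent in read_cupt("./data/EN_SAMPLE/train.cupt"):
    print(sent[:3])
```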
If you only care about the model architecture, have a look at `./models/tag_models.py`. Self-attention and GCN are defined in the separate file `./models/layers.py`. A toy sketch of these two streams and the gating that combines them follows.
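To make the two components concrete, here is a minimal NumPy sketch of a single graph-convolution step over a dependency parse and of a sigmoid gate mixing it with an attention stream, in the spirit of the paper's combined model. This is our illustration, not the code in `layers.py`; the adjacency construction, normalisation, and gate parameterisation are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                        # toy sentence length and hidden size

def gcn_layer(A, X, W):
    """One graph convolution: aggregate syntactic neighbours, then ReLU."""
    return np.maximum(A @ X @ W, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Adjacency from dependency arcs (plus self-loops); a toy parse here.
A = np.eye(n)
for head, dep in [(1, 0), (1, 2), (3, 2), (3, 4)]:
    A[head, dep] = A[dep, head] = 1.0
A /= A.sum(axis=1, keepdims=True)  # simple row normalisation

X = rng.normal(size=(n, d))        # token representations (e.g. from ELMo)
W = rng.normal(size=(d, d))
H_gcn = gcn_layer(A, X, W)

# Stand-in for a multi-head self-attention output of the same shape.
H_att = rng.normal(size=(n, d))

# Gating: a learned sigmoid gate mixes the two streams elementwise.
W_g = rng.normal(size=(2 * d, d))
g = sigmoid(np.concatenate([H_gcn, H_att], axis=1) @ W_g)
H = g * H_gcn + (1.0 - g) * H_att
print(H.shape)                     # (5, 8)
```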
```bibtex
@inproceedings{Rohanian2019,
  author    = {Omid Rohanian and
               Shiva Taslimipoor and
               Samaneh Kouchaki and
               Le An Ha and
               Ruslan Mitkov},
  title     = {Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions},
  booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  year      = {2019},
  pages     = {2692--2698},
  url       = {https://www.aclweb.org/anthology/N19-1275/},
  abstract  = {We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both, through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.}
}
```