Authors: Erwan Moreau, Ashjan Alsulaimani and Alfredo Maldonado
- Shared task website
- Shared task data (gitlab): https://gitlab.com/parseme/sharedtask-data
- Our paper will be published in August, link coming soon (if I don't forget!).
This repository contains two distinct systems for detecting verbal multiword expressions (MWEs) in text. This short description assumes that the reader is familiar with the task; if not, please see the links above.

- `dep-tree`: this system attempts to exploit the dependency tree structure of the sentences in order to identify MWEs. This is achieved by training a tree-structured CRF model which takes into account conditional dependencies between the nodes of the tree (node-parent and possibly node-next-sibling). The system is also trained to predict MWE categories. The tree-structured CRF software used is XCRF.
- `seq`: a robust sequential method which can work with only lemmas and morphosyntactic tags. It uses the Wapiti CRF sequence labeling software.
- libxml2 must be installed to compile the dep-tree system, including the source libraries (header files):
  - on Ubuntu the most convenient way is to install the package `libxml2-dev`: `sudo apt install libxml2-dev`.
- CRF++ must be installed and accessible via `PATH`.
- Wapiti must be installed and accessible via `PATH` (a quick way to check the tools is shown below).
- The shared task data can be downloaded or cloned from https://gitlab.com/parseme/sharedtask-data
- XCRF is also required but provided in this repository
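A quick sanity check for the requirements above (illustrative commands, not part of the repository):

```
# check that the CRF++ and Wapiti binaries are on the PATH
command -v crf_learn crf_test wapiti || echo "a CRF tool is missing"
# check that the libxml2 headers are installed
pkg-config --modversion libxml-2.0 || echo "libxml2 headers not found"
```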
From the main directory run:

```
source setup-path.sh
```

This will compile the code if needed and add the relevant directories to `PATH`. You can add this to your `.bashrc` file in order to have the `PATH` set up whenever you open a new session.
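For instance (the repository location below is just an example; adjust the path to wherever you cloned it):

```
# set up the PATH in every new shell session, assuming the repository is in ~/adapt-vmwe18
echo 'cd ~/adapt-vmwe18 && source setup-path.sh && cd - >/dev/null' >> ~/.bashrc
```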
From the directory `dep-tree`:
```
train-test-class-method.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/dev.cupt conf/minimal.conf model output
```
- `sharedtask-data` is the directory containing the official shared task data, as its name suggests (see link above); replace it with the appropriate path.
- `-l` ("learn" option): perform training from the specified file.
- `-a` ("apply" option): perform testing on the specified file.
- `conf/minimal.conf` is the configuration file to use (see "Configuration files" in the Details section below).
- `model` will contain the model at the end of the process.
- `output` is the "work directory"; at the end of the testing process it contains:
  - the predictions, stored in `<work dir>/predictions.cupt`;
  - the evaluation results, stored in `<work dir>/eval.out` if `-e` is used (see below).
- if option `-a` is supplied, `-e <training file>` can be used to perform evaluation, as in the example below. The training file is required in order for the evaluation script (provided by the organizers) to count the cases seen in the training data.
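For example, to train on the French data, test on the dev set and evaluate the predictions in a single run (this simply combines the options documented above, in the same order as the `seq` example later in this README):

```
train-test-class-method.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/dev.cupt -e sharedtask-data/1.1/FR/train.cupt conf/minimal.conf model output
```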
To run the script from a different directory, one has to provide the path to the XCRF directory in the following way:
```
train-test-class-method.sh -o '-x dep-tree/xcrf-1.0/' -l sharedtask-data/1.1/FR/train.cupt dep-tree/conf/minimal.conf model-dir output-dir
```
CAUTION: RAM ISSUES. XCRF requires a lot of memory. Depending on the amount of training data, the number of features and the "clique level" option, it might crash even with as much as 64GB. Memory options can be passed to the Java VM (XCRF is implemented in Java) through option `-o`:
```
train-test-class-method.sh -o "-j '-Xms32g -Xmx32g' -x /path/to/xcrf-1.0/" ...
```
Scripts are provided to allow batch processing. In order to train and test the system each time with a distinct config file and dataset, the script `process-multiple-datasets.sh` can be used to generate the commands to run. This way the tasks can be started in parallel or in any convenient way, ideally on a cluster.
```
# generate a few config files
mkdir configs; echo dep-tree/conf/basic.multi-conf | expand-multi-config.pl configs/
# generate the command to train and test for each dataset and each config file
process-multiple-datasets.sh sharedtask-data/1.1/ configs results >tasks
# split to run 10 processes in parallel
split -d -l 6 tasks batch.
# run
for f in batch.*; do (bash $f &); done
```
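Alternatively, if GNU parallel happens to be available (it is not required by this repository), the generated tasks file can be run directly with a fixed number of workers:

```
# run the generated commands, at most 10 at a time
parallel -j 10 < tasks
```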
From the directory `seq`:
```
seq-train-test.sh -l sharedtask-data/1.1/FR/train.cupt -a sharedtask-data/1.1/FR/test.cupt -e sharedtask-data/1.1/FR/train.cupt conf/example.conf model output
```
- `sharedtask-data` is the directory containing the official shared task data, as its name suggests (see link above); replace it with the appropriate path.
- `-l` ("learn" option): perform training from the specified file.
- `-a` ("apply" option): perform testing on the specified file.
- `-e <training file>`: perform evaluation (only possible if option `-a` is supplied). The training file is required in order for the evaluation script (provided by the organizers) to count the cases seen in the training data.
- `conf/example.conf` is the configuration file to use (see "Configuration files" in the Details section below).
- `model` will contain the model at the end of the process.
- `output` is the "work directory"; at the end of the testing process it contains:
  - the predictions, stored in `<work dir>/predictions.cupt`;
  - the evaluation results, stored in `<work dir>/eval.out` if `-e` is used.
Generating multiple configuration files:
```
crf-generate-multi-config.pl 3-4:8:5:1:1:C >seq.multi-conf
echo "labels=IO BIO BILOU" >>seq.multi-conf
mkdir configs; echo seq.multi-conf | expand-multi-config.pl configs/
```
- Columns 3 and 4 represent the lemma and POS tag, respectively.
- Alternatively, each column (feature) can be given separately, e.g. `crf-generate-multi-config.pl 3:6:2:1:1:C 4:8:5:1:1:C >seq.multi-conf`.
- This allows combining different patterns for each column, so it might improve the model, but at the cost of multiplying the number of combinations (hence increasing the computation time).
Generating the commands and executing the tasks in parallel:
```
seq-multi.sh -e -t test.cupt sharedtask-data/1.1/ configs/ expe >tasks
# split to run processes in parallel
split -d -l 700 tasks batch.
# run
for f in batch.*; do (bash $f &); done
```
The scripts are meant to be used with configuration files which contain values for the parameters. Examples can be found in the directory `conf`. Additionally, a batch of configuration files can be generated, e.g.:
```
# generates a set of config files (written to directory 'configs')
mkdir configs; echo dep-tree/conf/large.multi-conf | expand-multi-config.pl configs/
```
In order to generate a different set of configurations, either customize the values that a parameter can take in `conf/options.multi-conf` or use the `-r` option to generate a random subset of config files, e.g.:
```
# generate a random 50 config files
mkdir configs; echo dep-tree/conf/large.multi-conf | expand-multi-config.pl -r 50 configs
```
The two approaches work with a sequential labelling scheme, as opposed to the numbering of the expressions by sentence used in the `cupt` format provided in the shared task. Scripts are provided to convert between the two formats:
```
cupt-to-bio-labels IO sharedtask-data/1.1/FR/train.cupt fr-train.io
```
- Note that the conversion entails a simplification, i.e. a loss of information: in the case of overlapping or nested expressions, the program discards one of the expressions (the shortest).
  - By default it adds the tokens corresponding to the discarded expression as if they belonged to the preserved expression. Alternatively, if `-k` is supplied, the tokens of the shortest expression are not added to the other.
- The labelling scheme must be one of the following (for a concrete illustration, see the example after this list):
  - `IO`: only marks tokens as belonging to an expression or not;
  - `BIO`: special mark `B` for the start of an expression, which allows the detection of multiple expressions per sentence;
  - `BILOU`: a more sophisticated labelling scheme, with `L` for the last token and `U` for unit (single-token) expressions.
- Categories:
  - "joint": by default the program keeps the categories of the expressions as a suffix, thus generating a number of distinct labels up to three times the number of categories (if using BIO).
  - "indep": option `-c <category>` makes the program focus on a single category of expressions and ignore the others. This allows the training of independent models, one per category.
  - "none": option `-i` makes the program ignore categories and process all the expressions as if they all belonged to the same category.
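As an illustration (a constructed example, not actual output of the scripts), here is how a sentence containing the two-token verb-particle expression "gave up" would be labelled under each scheme, with categories ignored (`-i`):

```
# token        IO   BIO   BILOU
He             O    O     O
gave           I    B     B
up             I    I     L
immediately    O    O     O
```

With joint categories the labels would instead carry the category as a suffix, e.g. something like `B-VPC.full` (the exact suffix format is up to the conversion script).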
```
bio-to-cupt-labels fr-train.io fr-train.cupt
```
- See also `merge-independent-labels` in order to merge categories back together after predicting them independently.
Scripts are provided to collect the results from a large experiment with multiple datasets and configurations, and store them in a tab-separated file whose columns represent the parameters in the config file. The resulting file can then be analyzed more easily.
```
echo -e "configs\tresults" | collect-results-multiple-experiments.sh results.tsv
```
- `configs` is the directory containing the set of config files used in the experiment (see above).
- `results` is the output directory, which must contain a file `results/<dataset>/<config>/eval.out` generated at the end of the training+testing process for each case.
Example: extracting the best performance by language (based on results collected from the sequential approach experiments):

```
for f in sharedtask-data/1.1/??; do l=$(basename $f); cat results.tsv | grep $l | grep Tok | sort -k 14,14n | tail -n 1 | cut -f 2,3,4,8,9,13,14; done
```
Copyright (C) 2018 Trinity College Dublin, Adapt Centre, Erwan Moreau, Ashjan Alsulaimani and Alfredo Maldonado
adapt-vmwe18 is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.