Thanks for interest in our code for Treebank Embedding Vectors for Out-of-Domain Dependency Parsing. The code for the first step, training all required multi-treebank parsing models, is ready. If you want to replicate the work asap you can start training these models now. This is independent of the remaining code and it easily takes a week. We aim to complete cleaning and making the code more easy to use by the end of August 2020.
For questions, please open an issue in this repository.
If you use this code please cite the paper linked above.
The scripts in this repository currently assume the following:
- This repository is located in
~/tbemb/tbev-prediction
. We added code to support settingPRJ_DIR
to an alternative location but this has not been tested. Please let us know if you used this variable to run in a different location successfully or if you encounter problems. python2
,python3
andpython
executables are inPATH
andpython
is Python 2. If necessary, create symlinks or wrapper scripts with these names in a new folder and point to this folder at the start ofPATH
.- ELMoForManyLangs is in
~/tbemb/ELMoForManyLangs
. An alternative location can be configured intbev-prediction/config/locations.sh
. - UUParser with our multi-treebank extension is in
~/tbemb/uuparser
. An alternative location can be configured intbev-prediction/config/locations.sh
.
TODO:
- re-construct what Python environments are needed and when they need to be activated
https://github.com/jowagner/uuparser/tree/tbemb
cd ~/tbemb
git clone ...
cd uuparser
git checkout tbemb
The script uuparser-tbemb-create-virtualenv.sh
creates a virtualenv
Python environment for
running this parser. A list of Python dependencies can be found in it.
TODO: The list of Python packages
seems to be bigger than needed. From memory, uuparser-tbemb only needs
numpy
, dynet
and cython
.
To replicate all development results of the paper, ELMo-derived sentence representation are needed. For the winning model, this dependency can be skipped.
https://github.com/jowagner/ud-combination
TODO: add instructions how to put / symlink this into the expected location
Alternatively, a dummy script that uncompresses an input .conllu.gz
file to
a .conllu
file using the same command line as our combiner should also work
as we here only explore scenarios with 1 system output.
Only needed for visualising the candidate vectors and LAS in weight space
as in Figures 1 and 2 in the paper.
Can be skipped if not using render-graph.py
.
https://github.com/stefankoegl/kdtree You can install this as a python module or simply place it in our scripts folder:
cd ~/tbemb
git clone [email protected]:stefankoegl/kdtree.git
cd tbev-prediction/scripts
ln -s ~/tbemb/kdtree/kdtree.py
In our ACL 2020 paper, we used version 0.15 of kdtree, which further
requires the file bounded_priority_queue.py
.
It is recommended to place the UD folder in the project folder or any other folder from which you plan to run the experiment, e.g. via a symlink:
mkdir workdir
cd workdir
ln -s $HOME/data/ud-treebanks-v2.3/
The name of the symlink must not contain whitespace as we use split() to parse intermediate outputs.
Below, we run scripts with ./<scriptname>
but they can also be called
from other locations such as workdir
.
It is recommended to create a symlink to the scripts
folder or to
add it to the PATH
variable.
If replicating preliminary experiments with the 5 genres of the English Web Treebank:
- Split EWT into genres:
split-en-ewt.py
If using a newer version than UD v2.3:
- Add new treebank names and codes to
config/tbnames.tsv
-
Train grammars for all treebank combinations of interest and for the number of seeds needed for the k-NN experiments. In the ACL 2020 paper, we use 9 seeds and explore all treebank combinations with 3 of the 4 usable UD v2.3 treebanks for Czech, English and French in development. For final testing, we use the combination of all usable non-PUD treebanks. The choice of seeds is not critical as an exact reproduction of parser training is not possible because parallel training on GPUs randomises the order of numeric operations. (Models for the same seed tend to make the same predictions for a few epochs but then start to diverge noticeable.)
gen_train_multi-subset-3.py
: Writes a.tfm
task-farming file with one command per line training all multi-treebank models needed for development, i.e. training on each combination of three treebanks of the four treebanks of each development language. Add option--epochs 20
or lower if you are pressed for time (the Czech models involvingcs_pdt
take quite long) and are ok with less accurate models (default is to train for 30 epochs). UUParser picks the best model from all trained epochs according to development data. Typical usage:gen_train_multi-subset-3.py
- Run the
.tfm
file generated by the above command with task-farming or job arrays (check your cluster documentation), or just withbash
(you may then want to append&
to each line and insertwait
everyn
lines and at the end of the file to keepn
CPU cores busy) in a suitable Python environment, see above. (If you do not have access to strong server CPUs but have a GPU you may want to change the wrapper script to set dynet to use your GPU.) - You can append the
.tfm
file for training parsing models needed for testing later (see below) now as the choice of parsing models does not change during development.
-
Choose weight vectors to try and generate the task list for parsing:
run_gen_tasks_for_dev_data_points.sh
: Chooses the candidate treebank embedding vectors as weighted averages of the fixed vectors and writes taskfarming files for each development language. The weight space is restricted as in the ACL 2020 paper. Remove option--tab-tasks
from the script's call togen_tasks.py
to obtain shell commands in the.tfm
file, rather than tab-separated lists of command arguments. Note that the option--seed
ofgen_tasks.py
, which is called in this script, was not used in the ACL 2020 experiments, making small deviations in the candidate set of treebank vectors unavoidable. Seegen_tasks.py --help
andexplore-candidate-point-decay.sh
for options to change the vector sampling. The option--skip-indomain-parsing
may sound right for out-of-domain experiment but this option was not used in the ACL 2020 experiment as in-domain results were used as training data for the k-NN models. TODO: produce more clear log output: log all points and clearly mark rejected points
-
Parse both training data and dev data with the selected tbemb weights:
ichec-test-all.job
runs workers in te-worker/. The workers calltest-m-bist-multi-en_ewt-subsets-3.sh
ortest-uuparser-multi-en_ewt-subsets-3.sh
. (These scripts have an alternative linefor DATASET in dev ; do
that excludes parsing of training data. This is not suitable for the k-NN experiment that requires parse results for the training data.) If you want to parse with a different parser or with a different parser setting, e.g. elmo, please create a new wrapper script and update variablescript
ingen_tasks.py
(or manually update the task file(s)).
On grove, we can use xmlrpc-based taskfarming:
grove-worker-parsing-for-data-points-t12.job
(there is a comment how to start the master)
-
Collect LAS summary table:
collect_ewt_results.py
. This can be done in parallel for each data set, seeichec-collect-results.job
. -
Create graphs to verify that results make sense:
render-graph.py
,update-graphs.sh
ln -s $HOME/tbemb/data-points/ data-points
../concat-treebanks.sh
../run_concat.sh
get_elmo_sentence_features.py
get_similarity_matrix.py
run_get_elmo_sentence_features.sh
get_tfidf_sentence_features.py
run_get_tfidf_sentence_features.sh
run_get_length_and_punct_features.py
ln -s $HOME/tbemb/sent-rep/length-and-punct/ length-and-punct
gen_tbemb_sampling.py
test-tbemb-sampling-template.sh
preproc-parse-sampling-task.py
preproc-parse-sampling-uuparser-task.py
grove-gen-tbemb-sampling-1-of-6.job
grove-gen-tbemb-sampling-2.job
grove-gen-tbemb-sampling-3.job
grove-gen-tbemb-sampling-5.job
grove-gen-tbemb-sampling-4.job
grove-gen-tbemb-sampling-6.job
Checking progress:
for I in tb*samp*txt ; do
echo ; echo ; echo == $I ==
fgrep "== Scenario" -A 7 $I | tail -n 9
fgrep "Duration of gen_tbemb_sampling for scenario" $I | tail -n 1
done
For PUD and other final testing, use --test-type test
with gen_tbemb_sampling.py
and add the test treebanks to the collection. Treebanks ending in _pud
are
automatically excluded from model tbid triplets and from k-NN learning.
Compile task farming file. (Ok to make preliminary runs while
gen_tbemb_sampling.py
is still running
but must be repeated with new task farming file when finished.
Example shows how to filter the task farming file by the tasks
of a previous run.)
find te-parse/ -type f | ./filter_tfm.py all-te-parse-31a-part-0923.tfm | sort > all-te-parse-31a-part-2359.tfm
cat all-te-parse-31a-part-31a-3001.tfm | xargs -d'\n' chmod 755
Task farming master:
xmlrpc_master.py --port 8743 --secret-from-file secret.txt --show-task all-te-parse-31a-part-2359.tfm
Task farming workers:
grove-parse-sampling-uuparser-worker-t12.job
and variants (not all up to date):
grove-parse-sampling-worker-t12.job
grove-parse-sampling-single.job
grove-parse-sampling-worker-n0128d.job
grove-parse-sampling-worker-t10.job
grove-parse-sampling-worker-gpu.job
grove-parse-sampling-worker-t12.job
grove-parse-sampling-manual-t6.sh
The component parsers are run with the parse_*.sh
scripts.
gen_tbemb_combine.py
:
Compile task farming file. (Ok to make preliminary runs while parsing is still running or unfinished but must be repeated with new task farming file when finished. The script automatically detects finished combine and eval tasks of previous runs.)
source ~/tbemb/dynet-cpu-py27/bin/activate
find tbemb-sampling/ | \
./gen_tbemb_combine.py \
--treebank-dir /data/ud-treebanks-v2.3 \
dev \
> combine-31a-001.tfm
Replace dev
with test
when testing on test data, e.g. PUD.
For the second run, if still also parsing, it's a good idea to shuffle the tasks to spread the I/O load more evenly between combining and evaluating:
find tbemb-sampling/ | \
./gen_tbemb_combine.py \
--treebank-dir /data/ud-treebanks-v2.3 \
dev | \
shuf > combine-31a-002.tfm
Task farming master:
xmlrpc_master.py --port 8544 \
--secret-from-file secret.txt --show-task \
combine-31a-001.tfm
Task farming workers:
grove-combine-and-eval-worker-t12.job
Repeat until task file is empty (at step 003 if not overlapping with parsing).
find tbemb-sampling/ | ./collect_sampling_results.py > results.tsv
Condense:
analyse-1-3-6-for-10-sets.sh
run_analyse.sh
LaTeX table:
find tbemb-sampling/ | \
./select_model.py --median \
> selected-median-models-with-dev-results.tsv
find tbemb-sampling/ | \
./filter_test_results.py \
selected-median-models-with-dev-results.tsv \
> dev-results-by-model.tsv
./latex_test_table.py --dev dev-results-by-model.tsv
Test results:
./run_select-seed.sh > pud-best-of-7-seeds.tsv
find tbemb-sampling/ | ./collect_sampling_results.py \
--model-seeds-from pud-best-of-7-seeds.tsv \
> pud-results-with-best-dev-seed.tsv
rm tbemb-sampling
ln -s /data/results/indom-dev/ \
tbemb-sampling
find tbemb-sampling/ | \
./select_model.py \
> selected-best-models-with-dev-results.tsv
rm tbemb-sampling
ln -s /data/results/indom-test/ \
tbemb-sampling
find tbemb-sampling/ | \
./filter_test_results.py \
selected-best-models-with-dev-results.tsv \
> test-results-by-model.tsv
./latex_test_table.py test-results-by-model.tsv
- Train multi-treebanks models:
-
gen_pud_training.py
: Writes a.tfm
task-farming file with one command per line training all multi-treebank models needed for PUD testing, i.e. training on the combination of all treebanks with training data for each test language. Supports the same options asgen_train_multi-subset-3.py
. Expects as input the output ofassess-pud-situation.sh
. Typical usage:assess-pud-situation.sh ud-treebanks-v2.3 > pud-situation.txt gen_pud_training.py --treebank-folder ud-treebanks-v2.3 <pud-situation.txt >pud-training.tsv
-
See development steps for running the
.tfm
file. If running in parallel with fewer workers than tasks consider moving Czech and Russian tasks to the top of the file to avoid long idle times of some of the workers near the end of the job. -
run_gen_tasks_for_pud_data_points.sh
: callsgen_tasks.py
with options needed for PUD test set experiments, e.g. adjusting the number of samples to the dimensionality of the treebank vector weight space and switching off the box clipping used in development. For the highest dimensionality 4, each language takes about half an hour.
-
This research was funded by the ADAPT Centre for Digital Content Technology under the SFI Research Centres Programme (Grant 13/RC/2106) and the European Regional Development Fund.