
Treebank embedding vector prediction for out-of-domain dependency parsing


Treebank Embedding Vector Prediction for Out-of-Domain Dependency Parsing

Thank you for your interest in our code for Treebank Embedding Vectors for Out-of-Domain Dependency Parsing. The code for the first step, training all required multi-treebank parsing models, is ready. If you want to replicate the work as soon as possible, you can start training these models now: this step is independent of the remaining code and easily takes a week. We aim to finish cleaning the code and making it easier to use by the end of August 2020.

For questions, please open an issue in this repository.

If you use this code please cite the paper linked above.

Dependencies (and Installation Suggestions)

The scripts in this repository currently assume the following:

  • This repository is located in ~/tbemb/tbev-prediction. We added code to support setting PRJ_DIR to an alternative location but this has not been tested. Please let us know if you used this variable to run in a different location successfully or if you encounter problems.
  • python2, python3 and python executables are in PATH and python is Python 2. If necessary, create symlinks or wrapper scripts with these names in a new folder and point to this folder at the start of PATH.
  • ELMoForManyLangs is in ~/tbemb/ELMoForManyLangs. An alternative location can be configured in tbev-prediction/config/
  • UUParser with our multi-treebank extension is in ~/tbemb/uuparser. An alternative location can be configured in tbev-prediction/config/
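Before running any of the scripts, it may help to confirm that the three interpreter names listed above resolve in PATH. The following is a minimal sketch, not part of this repository; the function name find_interpreters is made up for illustration, and the check does not verify that python is actually Python 2.

```python
import shutil

# Executable names the scripts expect to find in PATH (taken from the list above).
REQUIRED = ("python", "python2", "python3")


def find_interpreters(names=REQUIRED):
    """Map each expected executable name to its resolved path, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_interpreters().items():
        print(f"{name}: {path or 'NOT FOUND in PATH'}")
```

Any name reported as missing can be provided via a symlink or wrapper script in a folder placed at the start of PATH, as described above.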


  • TODO: reconstruct which Python environments are needed and when they need to be activated

UUParser with Our Multi-Treebank Extension

cd ~/tbemb
git clone ...
cd uuparser
git checkout tbemb

The script creates a virtualenv Python environment for running this parser. A list of Python dependencies can be found in it.

TODO: The list of Python packages seems to be bigger than needed. From memory, uuparser-tbemb only needs numpy, dynet and cython.


To replicate all development results of the paper, ELMo-derived sentence representations are needed. For the winning model, this dependency can be skipped.

Linear Tree Combiner

TODO: add instructions how to put / symlink this into the expected location

Alternatively, a dummy script that uncompresses an input .conllu.gz file to a .conllu file, using the same command line as our combiner, should also work, as we only explore scenarios with a single system output here.


Only needed for visualising the candidate vectors and LAS in weight space as in Figures 1 and 2 in the paper. Can be skipped if you do not need these visualisations. You can install this as a Python module or simply place it in our scripts folder:

cd ~/tbemb
git clone git@github.com:stefankoegl/kdtree.git
cd tbev-prediction/scripts
ln -s ~/tbemb/kdtree/

In our ACL 2020 paper, we used version 0.15 of kdtree, which further requires the file

Prepare Treebanks

It is recommended to place the UD folder in the project folder or any other folder from which you plan to run the experiment, e.g. via a symlink:

mkdir workdir
cd workdir
ln -s $HOME/data/ud-treebanks-v2.3/

The name of the symlink must not contain whitespace as we use split() to parse intermediate outputs.
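To illustrate why whitespace in the path is a problem, here is a small sketch of how Python's split() behaves on a whitespace-separated record. The record layout and the score value are made up for this example and are not the actual intermediate format of the scripts.

```python
def parse_record(line):
    """Split a whitespace-separated record into fields, as str.split() does."""
    return line.split()


# Hypothetical record: a treebank path followed by a made-up score.
good = parse_record("ud-treebanks-v2.3/UD_English-EWT 85.3")
# Same record, but the folder name contains spaces.
bad = parse_record("ud treebanks v2.3/UD_English-EWT 85.3")

print(len(good))  # 2 fields: path and score
print(len(bad))   # 4 fields: the path has been broken into three pieces
```

A consumer expecting two fields would misread the second record, which is why the symlink name should be free of whitespace.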

Below, we run scripts with ./<scriptname> but they can also be called from other locations such as workdir. It is recommended to create a symlink to the scripts folder or to add it to the PATH variable.

If replicating preliminary experiments with the 5 genres of the English Web Treebank:

  • Split EWT into genres:

If using a newer version than UD v2.3:

  • Add new treebank names and codes to config/tbnames.tsv

Obtaining Data Points (tbweights --> LAS)

  1. Train grammars for all treebank combinations of interest and for the number of seeds needed for the k-NN experiments. In the ACL 2020 paper, we use 9 seeds and explore all treebank combinations with 3 of the 4 usable UD v2.3 treebanks for Czech, English and French in development. For final testing, we use the combination of all usable non-PUD treebanks. The choice of seeds is not critical as an exact reproduction of parser training is not possible because parallel training on GPUs randomises the order of numeric operations. (Models for the same seed tend to make the same predictions for a few epochs but then start to diverge noticeably.)

    • Writes a .tfm task-farming file with one command per line training all multi-treebank models needed for development, i.e. training on each combination of three treebanks of the four treebanks of each development language. Add option --epochs 20 or lower if you are pressed for time (the Czech models involving cs_pdt take quite long) and are ok with less accurate models (default is to train for 30 epochs). UUParser picks the best model from all trained epochs according to development data. Typical usage:
    • Run the .tfm file generated by the above command in a suitable Python environment (see above), using task-farming or job arrays (check your cluster documentation) or simply bash. With bash, you may want to append & to each line and insert wait every n lines and at the end of the file to keep n CPU cores busy. (If you do not have access to strong server CPUs but have a GPU, you may want to change the wrapper script to set dynet to use your GPU.)
    • You can already append the .tfm file for training the parsing models needed for testing (see below), as the choice of parsing models does not change during development.
  2. Choose weight vectors to try and generate the task list for parsing:

    • Chooses the candidate treebank embedding vectors as weighted averages of the fixed vectors and writes task-farming files for each development language. The weight space is restricted as in the ACL 2020 paper. Remove the option --tab-tasks from the script's call to obtain shell commands in the .tfm file rather than tab-separated lists of command arguments. Note that the option --seed of the script called here was not used in the ACL 2020 experiments, making small deviations in the candidate set of treebank vectors unavoidable. See --help for options to change the vector sampling. The option --skip-indomain-parsing may sound right for out-of-domain experiments, but it was not used in the ACL 2020 experiments as in-domain results were used as training data for the k-NN models. TODO: produce clearer log output: log all points and clearly mark rejected points
  3. Parse both training data and dev data with the selected tbemb weights: ichec-test-all.job runs workers in te-worker/. The workers call one of two wrapper scripts. (These scripts contain an alternative line, for DATASET in dev ; do, that excludes parsing of training data. This is not suitable for the k-NN experiment, which requires parse results for the training data.) If you want to parse with a different parser or a different parser setting, e.g. elmo, please create a new wrapper script and update the variable script accordingly (or manually update the task file(s)).
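The candidate vectors of step 2 are weighted averages of the fixed treebank vectors. The sketch below shows the arithmetic only; it is not the repository's sampling code. The function name, the two-dimensional vectors and the weights are made up, and the check that the weights sum to 1 is an assumption of this example.

```python
def weighted_average(fixed_vectors, weights):
    """Return sum_i weights[i] * fixed_vectors[i] as a plain list of floats."""
    if len(fixed_vectors) != len(weights):
        raise ValueError("need exactly one weight per fixed vector")
    # Assumption of this sketch: weights of a weighted average sum to 1.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dim = len(fixed_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, fixed_vectors))
        for d in range(dim)
    ]


# Made-up 2-dimensional fixed vectors for three treebanks.
fixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidate = weighted_average(fixed, [0.5, 0.25, 0.25])
print(candidate)  # [0.75, 0.5]
```

Each candidate is thus fully described by its weight vector, which is why results can be plotted in weight space.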

On grove, we can use xmlrpc-based taskfarming: grove-worker-parsing-for-data-points-t12.job (there is a comment how to start the master)
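If no task-farming tool is available, a .tfm file containing one shell command per line can also be run with a small worker pool. This is a minimal sketch under that assumption, not one of the repository's scripts; read_tasks and run_tasks are hypothetical names, and files written with --tab-tasks (tab-separated arguments) are not covered.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_tasks(path):
    """Read a task file with one shell command per line, skipping empty lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_tasks(commands, workers=4):
    """Run shell commands with at most `workers` in parallel.

    Returns the exit codes in the same order as the commands.
    """
    def run(command):
        return subprocess.run(command, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

Typical use would be run_tasks(read_tasks("some-file.tfm"), workers=n), with n set to the number of CPU cores to keep busy. Non-zero exit codes in the result point to tasks that need to be rerun.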

  4. Collect LAS summary table: This can be done in parallel for each data set, see ichec-collect-results.job.

  5. Create graphs to verify that results make sense:

ln -s $HOME/tbemb/data-points/ data-points
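As a rough check of the training workload in step 1 of this section, the number of models per development language follows from choosing 3 of 4 treebanks and training each combination with 9 seeds. The sketch below counts them; the English treebank IDs are illustrative and the seed values are hypothetical.

```python
from itertools import combinations

# Illustrative IDs for four English treebanks; adjust to your own setup.
treebanks = ["en_ewt", "en_gum", "en_lines", "en_partut"]
# Nine seeds as in the setup described above; the actual values are hypothetical.
seeds = range(1, 10)

# All ways to pick three of the four treebanks.
triples = list(combinations(treebanks, 3))
# One training job per (treebank triple, seed) pair.
jobs = [(triple, seed) for triple in triples for seed in seeds]

print(len(triples))  # 4 combinations
print(len(jobs))     # 36 training jobs for this language
```

With three development languages, this amounts to 3 × 36 = 108 models.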

Prepare Sentence Representations for Similarity Measure

../ ../

ln -s $HOME/tbemb/sent-rep/length-and-punct/ length-and-punct

Predict Weights and Generate Parsing Tasks (on Dev)

  • grove-gen-tbemb-sampling-1-of-6.job
  • grove-gen-tbemb-sampling-2.job
  • grove-gen-tbemb-sampling-3.job
  • grove-gen-tbemb-sampling-5.job
  • grove-gen-tbemb-sampling-4.job
  • grove-gen-tbemb-sampling-6.job

Checking progress:

for I in tb*samp*txt ; do
    echo ; echo ; echo == $I ==
    fgrep "== Scenario" -A 7 "$I" | tail -n 9
    fgrep "Duration of gen_tbemb_sampling for scenario" "$I" | tail -n 1
done

For PUD and other final testing, use --test-type test with and add the test treebanks to the collection. Treebanks ending in _pud are automatically excluded from model tbid triplets and from k-NN learning.
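The exclusion rule for PUD treebanks can be pictured as a simple suffix filter on treebank IDs. This is a sketch of the rule as described above, not the repository's implementation; the function names and the example IDs are chosen for illustration.

```python
def is_pud(tbid):
    """True for treebank IDs that end in _pud."""
    return tbid.endswith("_pud")


def training_candidates(tbids):
    """Keep only treebanks that may appear in model triplets and k-NN learning."""
    return [tbid for tbid in tbids if not is_pud(tbid)]


print(training_candidates(["en_ewt", "en_pud", "en_gum"]))  # ['en_ewt', 'en_gum']
```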


Compile task farming file. (It is ok to make preliminary runs while the task generation above is still running, but this step must be repeated with a new task farming file when it has finished. The example shows how to filter the task farming file by the tasks of a previous run.)

find te-parse/ -type f | ./ all-te-parse-31a-part-0923.tfm | sort > all-te-parse-31a-part-2359.tfm
cat all-te-parse-31a-part-31a-3001.tfm | xargs -d'\n' chmod 755
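The idea of filtering a new task list by the tasks of a previous run can be sketched as a set difference followed by sorting. This is only an illustration of the principle; filter_new_tasks is a hypothetical name, the task paths are made up, and the repository's filter script may work differently.

```python
def filter_new_tasks(candidate_tasks, previous_tasks):
    """Return the sorted tasks that did not already appear in a previous run."""
    seen = set(previous_tasks)
    return sorted(task for task in set(candidate_tasks) if task not in seen)


# Made-up task paths for illustration.
previous = ["te-parse/task-001.sh", "te-parse/task-002.sh"]
current = ["te-parse/task-003.sh", "te-parse/task-001.sh", "te-parse/task-002.sh"]
print(filter_new_tasks(current, previous))  # ['te-parse/task-003.sh']
```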

Task farming master: --port 8743 --secret-from-file secret.txt --show-task all-te-parse-31a-part-2359.tfm

Task farming workers: grove-parse-sampling-uuparser-worker-t12.job and variants (not all up to date):

  • grove-parse-sampling-worker-t12.job
  • grove-parse-sampling-single.job
  • grove-parse-sampling-worker-n0128d.job
  • grove-parse-sampling-worker-t10.job
  • grove-parse-sampling-worker-gpu.job

The component parsers are run with the parse_*.sh scripts.

Combine and Evaluate

Compile task farming file. (Ok to make preliminary runs while parsing is still running or unfinished but must be repeated with new task farming file when finished. The script automatically detects finished combine and eval tasks of previous runs.)

source ~/tbemb/dynet-cpu-py27/bin/activate
find tbemb-sampling/ | \
    ./  \
        --treebank-dir /data/ud-treebanks-v2.3 \
        dev  \
        > combine-31a-001.tfm

Replace dev with test when testing on test data, e.g. PUD.

For the second run, if still also parsing, it's a good idea to shuffle the tasks to spread the I/O load more evenly between combining and evaluating:

find tbemb-sampling/ | \
    ./ \
        --treebank-dir /data/ud-treebanks-v2.3  \
        dev  | \
    shuf > combine-31a-002.tfm

Task farming master: --port 8544 \
    --secret-from-file secret.txt --show-task \

Task farming workers: grove-combine-and-eval-worker-t12.job

Repeat until task file is empty (at step 003 if not overlapping with parsing).

Summary of Results

find tbemb-sampling/ | ./ > results.tsv


LaTeX table:

find tbemb-sampling/ | \
    ./ --median  \
        > selected-median-models-with-dev-results.tsv
find tbemb-sampling/ | \
    ./  \
        selected-median-models-with-dev-results.tsv  \
        > dev-results-by-model.tsv
./ --dev dev-results-by-model.tsv

Test results:

./ > pud-best-of-7-seeds.tsv
find tbemb-sampling/ | ./ \
    --model-seeds-from pud-best-of-7-seeds.tsv       \
    > pud-results-with-best-dev-seed.tsv

Model Selection and Testing

rm tbemb-sampling
ln -s /data/results/indom-dev/ tbemb-sampling
find tbemb-sampling/ | \
    ./  \
        > selected-best-models-with-dev-results.tsv
rm tbemb-sampling
ln -s /data/results/indom-test/ tbemb-sampling
find tbemb-sampling/ | \
    ./  \
        selected-best-models-with-dev-results.tsv  \
        > test-results-by-model.tsv

LaTeX Table

./ test-results-by-model.tsv

Testing on Parallel UD Treebanks (PUD)

  1. Train multi-treebank models:
    • Writes a .tfm task-farming file with one command per line training all multi-treebank models needed for PUD testing, i.e. training on the combination of all treebanks with training data for each test language. Supports the same options as Expects as input the output of Typical usage: ud-treebanks-v2.3 > pud-situation.txt --treebank-folder ud-treebanks-v2.3 <pud-situation.txt >pud-training.tsv
    • See development steps for running the .tfm file. If running in parallel with fewer workers than tasks, consider moving Czech and Russian tasks to the top of the file to avoid long idle times of some of the workers near the end of the job.

    • calls with options needed for PUD test set experiments, e.g. adjusting the number of samples to the dimensionality of the treebank vector weight space and switching off the box clipping used in development. For the highest dimensionality 4, each language takes about half an hour.
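The advice above on moving Czech and Russian tasks to the top of the task file can be automated with a stable reordering. The sketch below assumes that the command lines contain treebank IDs starting with cs_ or ru_; this marker convention, the function name and the example commands are assumptions of the example.

```python
def prioritise(tasks, markers=("cs_", "ru_")):
    """Move tasks mentioning any marker to the front, keeping relative order."""
    def is_slow(task):
        return any(marker in task for marker in markers)

    slow = [task for task in tasks if is_slow(task)]
    rest = [task for task in tasks if not is_slow(task)]
    return slow + rest


# Made-up task lines for illustration.
tasks = ["train en_ewt", "train cs_pdt", "train fr_gsd", "train ru_syntagrus"]
print(prioritise(tasks))
```

Starting the longest-running jobs first reduces the time during which only a few workers are still busy at the end of the job.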


This research was funded by the ADAPT Centre for Digital Content Technology under the SFI Research Centres Programme (Grant 13/RC/2106) and the European Regional Development Fund.

