This repository contains code which can be used to replicate or extend experiments described in the paper "On the Stability of System Rankings at WMT" by Rebecca Knowles, published at the Sixth Conference on Machine Translation (WMT21).
This code relies on SciPy (https://www.scipy.org) and NumPy (https://numpy.org/).
It has been tested with Python versions 3.6.9 and 3.5.6, with SciPy versions 1.6.2, 1.5.2, and 1.1, and with NumPy versions 1.21.2 and 1.15.2.
Running with SciPy version 1.7.1 produces different significance clusters than those reported in the Findings papers and this paper.
To replicate the tables from the paper, do the following:
First, run scripts/get_data.sh to download and extract data (to the data/ directory).
Next, run scripts/run_all_rankings.sh to generate all rankings required to replicate the tables in the paper (see rankings/ for output; see scripts/run_ranking.sh to understand filenames). Note that you may wish to edit scripts/run_all_rankings.sh to run the scripts/run_ranking.sh jobs in parallel and/or submit them to a compute cluster; if you run it as-is, it will take quite some time to generate all rankings.
Finally, run scripts/run_compare_rankings.sh to generate the values from Tables 2 and 3 and Figure 1 (these should match scripts/reference_tables.txt).
scripts/get_ranking.py produces an output file containing a WMT-style system ranking for a single language pair (in one year, with data collected through one interface). It provides options for removing arbitrary sets of systems from all computations or merely from the computation of significance clusters and the final ranking. It also provides the option to degrade the scores of human/reference translations in the data.
scripts/compare_rankings.py produces comparisons between a given pair of ranking variations (each generated by scripts/get_ranking.py and, if generated by the bash scripts provided, containing the same set of systems). In the example code, it uses the file scripts/pairs.txt to do this over all language pairs used in the paper.
This code uses human annotation data released by WMT organizers to replicate and experiment with modifications to the system ranking process. It relies on the existing processed files for removal of annotators who did not pass quality assurance; it does not compute those values itself. It averages duplicates, computes z-scores, averages raw scores, averages z-scores, computes rankings and significance clusters, and outputs rankings. In most cases, it exactly replicates the system rankings as described in the WMT News Task Findings papers (2018-2020). Appendix A of the paper provides more detail on the instances where that is not the case.
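The scoring steps listed above can be illustrated with a minimal sketch. This is not the code in scripts/get_ranking.py: the function name rank_systems, the input tuple layout, the use of the population standard deviation, and the exact order of the averaging steps are all assumptions made for illustration, and significance clustering is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_systems(judgments):
    """Hypothetical sketch: judgments is a list of
    (annotator, system, segment_id, raw_score) tuples."""
    judgments = list(judgments)

    # Per-annotator mean and standard deviation over that annotator's scores.
    # Assumption: population standard deviation.
    by_annotator = defaultdict(list)
    for annotator, _system, _segment, score in judgments:
        by_annotator[annotator].append(score)
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    # Standardize each raw score by its annotator's statistics (z-score),
    # grouping repeated judgments of the same (system, segment) together.
    per_item = defaultdict(list)
    for annotator, system, segment, score in judgments:
        mu, sigma = stats[annotator]
        z = (score - mu) / sigma if sigma > 0 else 0.0
        per_item[(system, segment)].append((score, z))

    # Average duplicates so each (system, segment) contributes one value.
    per_system = defaultdict(list)
    for (system, _segment), pairs in per_item.items():
        raw_avg = mean(raw for raw, _ in pairs)
        z_avg = mean(z for _, z in pairs)
        per_system[system].append((raw_avg, z_avg))

    # Average raw scores and z-scores per system; rank by average z-score.
    rows = [
        (system, mean(raw for raw, _ in items), mean(z for _, z in items))
        for system, items in per_system.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each returned row is (system, average raw score, average z-score), ordered from highest to lowest average z-score.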
Code is also provided to compare two sets of rankings; it expects rankings in the format produced by scripts/get_ranking.py.
Note that if you intend to use this code beyond the bash wrapper provided, you should check whether the two rankings you are comparing contain different sets of systems (this should never occur with the scripts provided to replicate the paper's rankings).
The -v/--verbose flag in scripts/compare_rankings.py does output that information.
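If you compare rankings outside the provided scripts, a check along the following lines may help. It is only a sketch: the ranking file format is not described here, so the hypothetical helper system_set_difference takes two in-memory lists of system names rather than reading files produced by scripts/get_ranking.py.

```python
def system_set_difference(ranking_a, ranking_b):
    """Hypothetical helper: return the systems that appear in only one
    of two rankings, each given as a list of system names."""
    systems_a, systems_b = set(ranking_a), set(ranking_b)
    only_in_a = sorted(systems_a - systems_b)
    only_in_b = sorted(systems_b - systems_a)
    return only_in_a, only_in_b


# Example with made-up system names.
only_a, only_b = system_set_difference(
    ["sys1", "sys2", "sys3"],
    ["sys2", "sys3", "sys4"],
)
if only_a or only_b:
    print("Rankings cover different systems:", only_a, only_b)
```

Two rankings cover the same systems exactly when both returned lists are empty.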
This code was written following the description in the Findings papers regarding how rankings are produced. In writing it, we also referenced code from the following repositories related to WMT ranking production:
https://github.com/ygraham/da-wmt16
https://github.com/ygraham/direct-assessment
https://github.com/ygraham/crowd-alone
Our code represents both an incomplete reimplementation (we do not perform quality assurance or work directly with the raw data) and an extension (we provide ways of modifying the rankings to test hypotheses about task composition) of these and the official WMT rankings.
Multilingual Text Processing / Traitement multilingue de textes
Digital Technologies Research Centre / Centre de recherche en technologies numériques
National Research Council Canada / Conseil national de recherches Canada
Copyright 2021, Sa Majesté la Reine du Chef du Canada / Her Majesty in Right of Canada
Published under the GPL v3.0 License (see LICENSE).
If you use this code, you may wish to cite:
@inproceedings{knowles-2021-stability,
title = "On the Stability of System Rankings at {WMT}",
author = "Knowles, Rebecca",
booktitle = "Proceedings of the Sixth Conference on Machine Translation",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.statmt.org/wmt21/pdf/2021.wmt-1.56.pdf",
}
You may also wish to cite the WMT findings papers for the data used: https://aclanthology.org/W18-6401.bib, https://aclanthology.org/W19-5301.bib, https://aclanthology.org/2020.wmt-1.1.bib