Coreference_Evaluation
Pradhan et al. published "Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation" (ACL 2014), describing their Perl-based scoring tool, also known as scorer.pl. The neleval package reimplements these measures (MUC, B-cubed, Entity CEAF, Mention CEAF, and the pairwise coreference and non-coreference measures that constitute BLANC) with a number of efficiency improvements, particularly to CEAF. These improvements are especially valuable in the cross-document coreference evaluation setting.
The slow part of calculating CEAF is identifying the maximal linear-sum assignment between key and response entities, using the Hungarian Algorithm or a variant thereof. Our implementation is much faster because:
- scorer.pl manipulates Perl arrays and may be O(n^4), though this is unverified, where n is the number of key and response entities. We use an O(n^3) implementation built on vectorised NumPy operations, which was recently adopted into scipy. Even before further optimisations, this resulted in an order of magnitude or more runtime improvement over scorer.pl.
- Our n is much smaller in practice. We only perform the Hungarian Algorithm on each strongly connected component of the assignment graph, and explicitly eliminate trivial portions of the assignment problem (where there is no confusion with other entities). So our time complexity is O(n^3) where n is the number of entities in the largest component, rather than the total number of entities in the evaluation. These optimisations are particularly valuable in cross-document coref evaluation because the number of entities is large relative to the number of confusions.
- We have also made some efficient choices elsewhere in processing, such as determining entity overlaps using scipy.sparse matrix multiplication.
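The ideas in the list above can be sketched as follows. This is a minimal illustration and not neleval's actual code: the helper names overlap_matrix and max_assignment_score are hypothetical, entities are assumed to be Python sets of mention identifiers, and the per-component loop is written for clarity, not speed. It builds the key/response overlap counts with a sparse matrix product, splits the assignment graph into connected components, and runs scipy's linear_sum_assignment only on components that contain real confusion.

```python
import numpy as np
from scipy import sparse
from scipy.optimize import linear_sum_assignment
from scipy.sparse.csgraph import connected_components


def overlap_matrix(key, response):
    """Shared-mention counts for every (key entity, response entity) pair."""
    mentions = sorted({m for ent in key + response for m in ent})
    index = {m: i for i, m in enumerate(mentions)}

    def indicator(entities):
        # Sparse entity-by-mention membership matrix.
        rows = [e for e, ent in enumerate(entities) for _ in ent]
        cols = [index[m] for ent in entities for m in ent]
        data = np.ones(len(rows), dtype=int)
        shape = (len(entities), len(mentions))
        return sparse.csr_matrix((data, (rows, cols)), shape=shape)

    # (key x mention) @ (mention x response) = overlap counts
    return (indicator(key) @ indicator(response).T).tocsr()


def max_assignment_score(overlap):
    """Best one-to-one alignment score, solved component by component."""
    n_key = overlap.shape[0]
    # Bipartite graph: key nodes first, then response nodes.
    graph = sparse.bmat([[None, overlap], [overlap.T, None]], format="csr")
    n_comp, labels = connected_components(graph, directed=False)
    total = 0
    for c in range(n_comp):
        members = np.flatnonzero(labels == c)
        ks = members[members < n_key]
        rs = members[members >= n_key] - n_key
        if len(ks) == 0 or len(rs) == 0:
            continue  # entity with no counterpart: contributes nothing
        if len(ks) == 1 and len(rs) == 1:
            total += overlap[ks[0], rs[0]]  # trivial case, no confusion
            continue
        block = overlap[ks][:, rs].toarray()
        rows, cols = linear_sum_assignment(block, maximize=True)
        total += block[rows, cols].sum()
    return int(total)


key = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
response = [{"a", "b"}, {"c", "d", "e"}, {"g"}]
score = max_assignment_score(overlap_matrix(key, response))
```

In this toy input only the first two key and response entities are confused with each other, so the assignment solver sees a 2x2 block instead of the full 3x3 matrix.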
Both our implementation and scorer.pl support φ3 and φ4 of Luo's 2005 paper introducing CEAF. Our mention_ceaf = ceafm = φ3. Our entity_ceaf = ceafe = φ4.
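As a reminder of what the two similarity functions compute, here is a small sketch. It assumes the definitions from Luo (2005): φ3 is the number of mentions shared by a key and a response entity, and φ4 is that count normalised by entity sizes. The function names phi3, phi4 and ceaf are illustrative only and are not part of neleval's API, and the full similarity matrix is solved directly here without any of the optimisations described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def phi3(key_entity, response_entity):
    """Mention-based similarity: number of shared mentions."""
    return len(key_entity & response_entity)


def phi4(key_entity, response_entity):
    """Entity-based similarity: shared mentions, normalised by size."""
    shared = len(key_entity & response_entity)
    return 2 * shared / (len(key_entity) + len(response_entity))


def ceaf(key, response, phi):
    """Precision, recall and F1 under the best one-to-one alignment."""
    sim = np.array([[phi(k, r) for r in response] for k in key], dtype=float)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    best = sim[rows, cols].sum()
    # Each side is normalised by its self-similarity.
    precision = best / sum(phi(r, r) for r in response)
    recall = best / sum(phi(k, k) for k in key)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
mention_scores = ceaf(key, response, phi3)
entity_scores = ceaf(key, response, phi4)
```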
Note that we do not directly report BLANC, although we facilitate calculation of both its components, using pairwise and pairwise_negative aggregates (see our list-measures command), according to Luo et al. 2015's extension of the metric to system mentions.
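For readers who want a single BLANC figure, the two components can be combined along these lines. This is a sketch under stated assumptions: it takes BLANC to be the mean of the F-scores over coreference links and non-coreference links, it ignores the boundary cases (for example, a key with no coreference links at all) that the published definition treats specially, and the helper names links, prf and blanc are hypothetical. In practice the precision and recall values would come from the pairwise and pairwise_negative aggregates instead of being recomputed from partitions.

```python
from itertools import combinations


def links(entities):
    """Return (coreference links, non-coreference links) for a partition."""
    mentions = sorted(m for ent in entities for m in ent)
    coref = {frozenset(p) for ent in entities
             for p in combinations(sorted(ent), 2)}
    all_pairs = {frozenset(p) for p in combinations(mentions, 2)}
    return coref, all_pairs - coref


def prf(key_links, response_links):
    """Precision, recall and F1 over two sets of links."""
    correct = len(key_links & response_links)
    p = correct / len(response_links) if response_links else 0.0
    r = correct / len(key_links) if key_links else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def blanc(key, response):
    """Mean of the coreference and non-coreference link F-scores."""
    key_coref, key_non = links(key)
    resp_coref, resp_non = links(response)
    f_coref = prf(key_coref, resp_coref)[2]
    f_non = prf(key_non, resp_non)[2]
    return (f_coref + f_non) / 2


key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
blanc_f = blanc(key, response)
```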
We have empirically verified that our metric implementations are equivalent to those of scorer.pl. If the COREFSCORER environment variable points to a local copy of scorer.pl, our system will cross-check the results automatically. (This will, however, be extremely slow for large CEAF calculations.)
We provide the prepare-conll-coref command to import CoNLL shared task-formatted annotations. We have validated that our metrics match those produced by Pradhan et al.'s reference implementation for the CoNLL 2011 runs.