Coreference resolution is the task of finding all mentions (noun phrases) in a text that refer to the same entity (e.g. a person or a location; see also the NER doc).
Typically, an entity is first introduced in a document by its name (e.g. Dronning Margrethe II) and later referred to by pronouns (e.g. hun) or by expressions or titles (e.g. Hendes Majestæt, Danmarks dronning, etc.).
The goal of the coreference resolution task is to find all these references and link them through a common ID.
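As a rough illustration of what linking references through a common ID means, the sketch below attaches a hand-assigned entity ID to each mention and groups the mentions by that ID. The IDs, the second entity (København) and the `group_mentions` helper are made up for this example; they do not reflect the model's output format.

```python
from collections import defaultdict

def group_mentions(mentions):
    """Group (mention text, entity ID) pairs by entity ID (hypothetical helper)."""
    groups = defaultdict(list)
    for text, entity_id in mentions:
        groups[entity_id].append(text)
    return dict(groups)

# Hand-assigned IDs for illustration only: ID 0 is the queen, ID 1 is a second entity.
mentions = [
    ("Dronning Margrethe II", 0),
    ("hun", 0),
    ("Hendes Majestæt", 0),
    ("København", 1),
]
entities = group_mentions(mentions)
```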
Model | Train Data | License | Trained by | Tags | DaNLP |
---|---|---|---|---|---|
XLM-R | Dacoref | GPLv2 | Maria Jung Barrett | Generic QIDs | ✔️ |
If you want to read more about coreference resolution and the DaNLP model, we also have a blog post (in Danish).
Coreference resolution is an important subtask in NLP. It is used in particular for information extraction (e.g. for building a knowledge graph, see our tutorial) and can help with other NLP tasks such as machine translation (e.g. to apply the right gender or number), text summarization, and dialog systems.
The XLM-R Coref model is based on the pre-trained XLM-RoBERTa, a transformer-based multilingual masked language model (Conneau et al. 2020), and fine-tuned on the Dacoref dataset. Fine-tuning was done with the PyTorch-based implementation from AllenNLP 1.3.0.
The XLM-R Coref model can be loaded with the `load_xlmr_coref_model()` method.
Please note that it can take at most 512 tokens as input at a time. Longer texts should be split beforehand, for example using sentence boundary detection (e.g. with the spaCy model).
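One way to stay under the limit is to group consecutive tokenized sentences into chunks and run the model on each chunk separately. The sketch below is only an illustration: `chunk_sentences` is a hypothetical helper, not part of DaNLP, and it counts word tokens, whereas the limit most likely applies to XLM-R subword tokens, so a budget well below 512 is the safer choice. Coreference links that cross chunk boundaries are lost with this approach.

```python
from typing import List

def chunk_sentences(sentences: List[List[str]], max_tokens: int = 400) -> List[List[List[str]]]:
    """Group consecutive tokenized sentences into chunks of at most max_tokens tokens.

    Hypothetical helper. A single sentence longer than max_tokens is kept
    as its own chunk rather than being split.
    """
    chunks = []
    current = []
    current_len = 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + len(sentence) > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
    if current:
        chunks.append(current)
    return chunks

doc = [["Lotte", "arbejder", "med", "Mads", "."], ["Hun", "er", "tandlæge", "."]]
# A tiny budget, used here only to force a split in this two-sentence example.
chunks = chunk_sentences(doc, max_tokens=5)
```

Each chunk is itself a list of tokenized sentences, so it has the same shape as the document passed to `predict` or `predict_clusters`.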
from danlp.models import load_xlmr_coref_model
# load the coreference model
coref_model = load_xlmr_coref_model()
# a document is a list of tokenized sentences
doc = [["Lotte", "arbejder", "med", "Mads", "."], ["Hun", "er", "tandlæge", "."]]
# apply coreference resolution to the document and get a list of features (see below)
preds = coref_model.predict(doc)
# apply coreference resolution to the document and get a list of clusters
clusters = coref_model.predict_clusters(doc)
The `preds` variable is a dictionary with the following entries:
- `top_spans`: list of indices of all references (spans) in the document
- `antecedent_indices`: list of antecedent indices
- `predicted_antecedents`: list of indices of the antecedent span (from `top_spans`), i.e. the previous reference
- `document`: list of token indices for the whole document
- `clusters`: list of clusters (indices of tokens)

The most relevant entry is the list of clusters. Each cluster contains the indices of the references (spans) that refer to the same entity. To make this easier, we provide the `predict_clusters` function, which returns a list of the clusters with the references and their IDs in the document.
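To read the raw `clusters` entry, it can help to map span indices back to tokens. The sketch below assumes that each span is an inclusive `[start, end]` pair of token indices into the flattened document, which is the convention used by AllenNLP's coreference output; this is an assumption about the format, not something stated on this page. `spans_to_text` is a hypothetical helper, and the `clusters` value is written by hand for illustration, not actual model output.

```python
from typing import List

def spans_to_text(doc: List[List[str]], clusters: List[List[List[int]]]) -> List[List[str]]:
    """Replace each [start, end] span (assumed inclusive) with the tokens it covers."""
    # Flatten the sentences so that span indices address the whole document.
    tokens = [token for sentence in doc for token in sentence]
    return [
        [" ".join(tokens[start:end + 1]) for start, end in cluster]
        for cluster in clusters
    ]

doc = [["Lotte", "arbejder", "med", "Mads", "."], ["Hun", "er", "tandlæge", "."]]
# Hand-written example: token 0 ("Lotte") and token 5 ("Hun") in one cluster.
clusters = [[[0, 0], [5, 5]]]
mentions = spans_to_text(doc, clusters)
print(mentions)  # [['Lotte', 'Hun']]
```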
See detailed scoring of the benchmarks in the example folder.
The benchmarks were run on the test part of the Dacoref dataset.
Model | Precision | Recall | F1 | Mention Recall | Sentences per second (CPU*) |
---|---|---|---|---|---|
XLM-R | 69.86 | 59.17 | 64.02 | 88.01 | ~1 |
*Sentences per second was measured on a MacBook Pro with an Apple M1 chip.
The evaluation script `coreference_benchmarks.py` can be found here.