Skip to content

Latest commit

 

History

History
73 lines (45 loc) · 4.19 KB

coreference.md

File metadata and controls

73 lines (45 loc) · 4.19 KB

Coreference Resolution

Coreference resolution is the task of finding all mentions (noun phrases) that refer to the same entity (e.g. a person, a location etc, see also the NER doc) in a text.

Typically, in a document, entities are first introduced by their name (e.g. Dronning Margrethe II) and later refered by pronouns (e.g. hun) or expressions/titles (e.g. Hendes Majestæt, Danmarks dronning, etc). The goal of the coreference resolution task is to find all these references and link them through a common ID.

Model Train Data License Trained by Tags DaNLP
XLM-R Dacoref GPLv2 Maria Jung Barrett Generic QIDs ✔️

If you want to read more about coreference resolution and the DaNLP model, we also have a blog post (in Danish).

Use cases

Coreference resolution is an important subtask in NLP. It is used in particular for information extraction (e.g. for building a knowledge graph, see our tutorial) and could help with other NLP tasks such as machine translation (e.g. in order to apply the right gender or number) or text summarization, or in dialog systems.

Models

🔧 XLM-R {#xlmr}

The XLM-R Coref model is based on the pre-trained XLM-Roberta, a transformer-based multilingual masked language model (Conneau et al. 2020), and finetuned on the Dacoref dataset. The finetuning has been done using the pytorch-based implementation from AllenNLP 1.3.0..

The XLM-R Coref model can be loaded with the load_xlmr_coref_model() method. Please note that it can maximum take 512 tokens as input at a time. For longer text sequences split before hand, for example using sentence boundary detection (e.g. by using the spacy model.)

from danlp.models import load_xlmr_coref_model

# load the coreference model
coref_model = load_xlmr_coref_model()

# a document is a list of tokenized sentences
doc = [["Lotte", "arbejder", "med", "Mads", "."], ["Hun", "er", "tandlæge", "."]]

# apply coreference resolution to the document and get a list of features (see below)
preds = coref_model.predict(doc)

# apply coreference resolution to the document and get a list of clusters
clusters = coref_model.predict_clusters(doc)

The preds variable is a dictionary including the following entries :

  • top_spans : list of indices of all references (spans) in the document
  • antecedent_indices : list of antecedents indices
  • predicted_antecedents : list of indices of the antecedent span (from top_spans), i.e. previous reference
  • document : list of tokens' indices for the whole document
  • clusters : list of clusters (indices of tokens) The most relevant entry to use is the list of clusters. One cluster contains the indices of references (spans) that refer to the same entity. To make it easier, we provide the predict_clusters function that returns a list of the clusters with the references and their ids in the document.

📈 Benchmarks

See detailed scoring of the benchmarks in the example folder.

The benchmarks has been performed on the test part of the Dacoref dataset.

Model Precision Recall F1 Mention Recall Sentences per second (CPU*)
XLM-R 69.86 59.17 64.02 88.01 ~1

*Sentences per second is based on a Macbook Pro with Apple M1 chip.

The evaluation script coreference_benchmarks.py can be found here.