This repository was created in December 2022 as a result of a project undertaken for the Journalism AI 2022 Fellowship. The motivation behind the project can be found on this Notion page.
The project was conducted by:
- Alet Law and Tinashe Munyuki from the Daily Maverick
- Dimitri Tokmetzis and Heleen Emanuel from Follow the Money
- Luis Flores, Michel Schammel and Anna Vissens from The Guardian, the main developers of the codebase
This Python repository provides a template of the code required to train a spaCy entity linker model for PERSON text mentions.
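Once trained and packaged (see the end of this README), the model can be loaded like any other spaCy pipeline. A minimal usage sketch, where the package name el_person_linker is a placeholder for whatever name you give the packaged model:

```python
import spacy

# Load the packaged entity-linker pipeline ("el_person_linker" is a
# placeholder for the name chosen when packaging the model).
nlp = spacy.load("el_person_linker")

doc = nlp("The leaked files name Jane Doe as a director of the shell company.")

# Each PERSON mention is resolved to a knowledge-base ID, or "NIL" when
# the linker finds no suitable entity.
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.kb_id_)
```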
- /src: code to pre-process datasets and train an EL model
- /prodigy: custom recipe to annotate training/test datasets
- /data: dataset directory. All dataset files should be placed in this folder.
Our code uses two open-access Knowledge Bases (KBs) of potential interest to investigative reporting, namely Open Sanctions and LittleSis. The complete datasets can be freely obtained from either source in JSON format.
The code includes pre-processing steps for each KB to transform the raw JSON data into pandas dataframes. Each entity in the processed dataset has an individual ID, an alias and a description field. Please be aware that the final dataset still contains redundant entities, i.e. two or more IDs referring to the same "real-world" entity.
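To illustrate the target shape (this is not the exact code in /src, and the raw field names are assumptions), a JSON-lines KB export can be flattened into that three-column dataframe roughly as follows:

```python
import json

import pandas as pd


def kb_json_to_dataframe(path: str) -> pd.DataFrame:
    """Flatten raw KB records into the id / alias / description table used
    for training. The input field names ("id", "name", "summary") and the
    JSON-lines layout are illustrative assumptions, not the real KB schema.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            rows.append({
                "id": record["id"],
                "alias": record.get("name", ""),
                "description": record.get("summary", ""),
            })
    return pd.DataFrame(rows, columns=["id", "alias", "description"])
```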
The documents originally used in this work were article paragraphs acquired through the Guardian Content API. This dataset can be replaced with any text containing named entities that can be linked to the KBs.
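For reference, article text of this kind can be retrieved from the Guardian Content API along these lines (a free API key is required; the query and field choices are only an example):

```python
import requests

API_KEY = "YOUR_GUARDIAN_API_KEY"  # register on the Guardian Open Platform


def fetch_article_texts(query: str, pages: int = 1) -> list[str]:
    """Fetch plain-text article bodies from the Guardian Content API.
    The query and field choices here are illustrative only."""
    texts = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://content.guardianapis.com/search",
            params={
                "q": query,
                "show-fields": "bodyText",
                "page": page,
                "api-key": API_KEY,
            },
            timeout=30,
        )
        resp.raise_for_status()
        for result in resp.json()["response"]["results"]:
            texts.append(result.get("fields", {}).get("bodyText", ""))
    return texts
```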
The text dataset should be used to create annotation files via Prodigy. Please refer to the /prodigy README.md for more information.
Annotations for the training and test datasets were manually generated using Prodigy.
The custom Prodigy recipe used in this task is included in the /prodigy folder.
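The annotations are exported from Prodigy as JSONL (e.g. via prodigy db-out). A rough sketch of reading them back in Python is shown below; the "spans" and "accept" fields assumed here follow a typical choice-style entity-linking recipe and may not match the custom recipe in /prodigy exactly:

```python
import json


def load_annotations(path: str) -> list[dict]:
    """Read a Prodigy JSONL export. The field layout assumed here
    ("spans" for mention offsets, "accept" for the chosen KB id) is an
    illustration of a choice-style EL recipe, not a guaranteed schema."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            if task.get("answer") != "accept":
                continue  # skip rejected/ignored tasks
            examples.append({
                "text": task["text"],
                "spans": task.get("spans", []),    # mention offsets + label
                "kb_ids": task.get("accept", []),  # KB entity id(s) chosen by the annotator
            })
    return examples
```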
The code requires the DVC data-versioning package to be installed in the project environment. Running the command dvc repro will run the code end-to-end.
The entity linker model trained in this repository uses a custom function to generate a pool of candidates for each text mention. To package the model after training, run the following CLI command:
python -m spacy package --force <trained_model_input_directory> <packaged_model_output_directory> --code scripts/custom_functions.py
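For context, scripts/custom_functions.py is what registers that custom candidate function so spaCy can resolve it when the packaged model is loaded. A minimal sketch of such a registered candidate generator is shown below; the registry name and the lower-cased fallback lookup are illustrative assumptions, not the repository's actual implementation:

```python
from typing import Iterable

from spacy.kb import Candidate, KnowledgeBase
from spacy.tokens import Span
from spacy.util import registry


@registry.misc("person_candidate_generator.v1")  # hypothetical registry name
def create_candidate_generator():
    def get_candidates(kb: KnowledgeBase, mention: Span) -> Iterable[Candidate]:
        # Illustrative strategy: exact alias lookup, with a lower-cased
        # fallback when the surface form is not in the KB.
        candidates = kb.get_alias_candidates(mention.text)
        if not candidates:
            candidates = kb.get_alias_candidates(mention.text.lower())
        return candidates

    return get_candidates
```

In the training config such a function would be referenced from the entity_linker component, e.g. via a [components.entity_linker.get_candidates] block whose @misc key points at the registered name.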