FRWIKI dataset for Entity Linking

This repository contains scripts to build an Entity Linking dataset from Wikipedia. It is configured to work with the French Wikipedia, but it should work with other languages too after minor changes.

HTML pages are scraped from the Wikipedia website, then cleaned to keep only the text. Links between pages are used to annotate named entities.

How it works

Following the work done in Pointer Sentinel Mixture Models, the dataset is built from featured and good Wikipedia pages, mainly because scraping the whole website would be unreasonable. Pywikibot is used to get the HTML pages listing good and featured articles (Bons articles and Articles de qualité in French).

# From the get_data script

# Download pages listing good and featured articles
python core/pwb.py listpages -cat:Wikipédia:Bons_articles/Justification_de_leur_promotion -save:$CAT_DIR
python core/pwb.py listpages -cat:Catégorie:Wikipédia:Articles_de_qualité/Justification_de_leur_promotion -save:$CAT_DIR

# Build a file listing titles of good featured articles
python list_good_pages.py "$CAT_DIR" "$OUT_DIR/list-good-pages.txt"
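
A title list like list-good-pages.txt could also be produced directly through the Pywikibot Python API, as in the sketch below. This is an illustration, not the repository's list_good_pages.py, and the category name and output path are assumptions.

# Hypothetical alternative to list_good_pages.py: collect article titles
# straight from the category with the Pywikibot API.
import pywikibot

site = pywikibot.Site("fr", "wikipedia")
cat = pywikibot.Category(site, "Catégorie:Wikipédia:Bons articles")  # assumed category name

with open("list-good-pages.txt", "w", encoding="utf-8") as f:
    for page in cat.articles():
        f.write(page.title() + "\n")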

Good and featured pages are then downloaded, cleaned and scanned to detect all links to other Wikipedia pages. Those links are wrapped in [E][/E] tags. For instance, the following HTML link:

<a href="/wiki/Paris" title="Paris">la ville lumière</a>

will be replaced by [E=Paris]la ville lumière[/E] in the cleaned document. Everything that is not text is removed, as well as some sections, such as the references section.
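
The link rewriting can be sketched with BeautifulSoup as below. This illustrates the tagging scheme only; it is not the repository's clean_html_pages.py, and the function name is made up.

# Illustration of the [E=Title]...[/E] tagging scheme: internal /wiki/ links
# become entity tags, other markup is dropped and only the text is kept.
from bs4 import BeautifulSoup

def tag_entities(html):
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/wiki/"):
            title = href[len("/wiki/"):].replace("_", " ")
            a.replace_with("[E={}]{}[/E]".format(title, a.get_text()))
        else:
            a.replace_with(a.get_text())  # keep the text of non-wiki links
    return soup.get_text()

print(tag_entities('<a href="/wiki/Paris" title="Paris">la ville lumière</a>'))
# [E=Paris]la ville lumière[/E]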

# From the get_data script

# Get the list of all pages to download by extracting links from good pages
python get_pages_list.py "$OUT_DIR/list-good-pages.txt" "$OUT_DIR/list-all-pages.txt" "$OUT_DIR/list-all-pages.csv" $HTML_DIR --compress gzip

# Download html pages
python download_html_pages.py "$OUT_DIR/list-all-pages.csv" "$SCRAP_DIR/all-pages-paths.csv" "$SCRAP_DIR/all-pages-paths-errors.csv" $HTML_DIR --compress gzip

# Clean html pages
python clean_html_pages.py "$SCRAP_DIR/all-pages-paths.csv" $PAGES_DIR "$SCRAP_DIR/frwiki.csv" "$SCRAP_DIR/frwiki-errors.csv" --compress gzip
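
As a rough picture of what download_html_pages.py is assumed to do for a single page, the sketch below fetches the rendered HTML and stores it gzip-compressed. The URL pattern, file naming and helper name are assumptions.

# Assumed behaviour for one page: fetch the HTML and store it gzip-compressed.
import gzip
import pathlib
import urllib.parse

import requests

def download_page(title, out_dir="html"):
    url = "https://fr.wikipedia.org/wiki/" + urllib.parse.quote(title.replace(" ", "_"))
    html = requests.get(url, timeout=30).text
    path = pathlib.Path(out_dir) / (title.replace("/", "_") + ".html.gz")
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(html)
    return path

download_page("Paris")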

Wikidata features of all the downloaded pages are then extracted from a Wikidata dump, which must be downloaded beforehand. Wikidata features include the QID, labels, descriptions and aliases. Types are also suggested, but one should probably not rely on them since they are based on quick-and-dirty rules. Suggested types are: GEOLOC, PERSON, DATE, ORG and OTHER.

# From the get_data script

# Retrieve Wikidata properties for each page
python get_wikidata_properties_from_dump.py $WIKIDATA_DUMP_PATH "$SCRAP_DIR/frwiki.csv" "$SCRAP_DIR/wikidata.csv"
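
The Wikidata JSON dump is a single JSON array with one entity per line, so it can be streamed without loading it all in memory. The sketch below shows how the French label, description and aliases could be pulled out of each entity; it is a simplified illustration, not the repository's script.

# Simplified reading of a wikidata-*.json.bz2 dump: one JSON entity per line,
# wrapped in a JSON array.
import bz2
import json

def iter_entities(dump_path):
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def french_features(entity):
    return {
        "qid": entity["id"],
        "label": entity.get("labels", {}).get("fr", {}).get("value", ""),
        "description": entity.get("descriptions", {}).get("fr", {}).get("value", ""),
        "aliases": [a["value"] for a in entity.get("aliases", {}).get("fr", [])],
    }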

Finally, all the extracted data are gathered into one CSV file whose columns are:

  • qid: QID of the page (the Wikidata id).
  • title: Wikipedia title of the page.
  • path: Path of the cleaned page on disk.
  • url: URL to the page.
  • wikipedia_description: Description extracted from Wikipedia. It corresponds to the first paragraph of the page.
  • wikidata_description: Description extracted from Wikidata.
  • label: Label extracted from Wikidata.
  • aliases: Aliases extracted from Wikidata.
  • type: Suggested type of the entity, guessed from Wikidata properties (but, seriously, do not rely on it).

# From the get_data script

# Build the final dataset
python build_final_dataset.py "$SCRAP_DIR/frwiki.csv" "$SCRAP_DIR/wikidata.csv" "$SCRAP_DIR/final-dataset.csv"
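
Once built, the final CSV can be inspected with pandas, for instance:

# Peek at the final dataset; column names are the ones listed above.
import pandas as pd

df = pd.read_csv("final-dataset.csv")  # adjust the path to your $SCRAP_DIR
print(df[["qid", "title", "label", "type"]].head())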

Steps to build the dataset from scratch

First, download a copy of the Wikidata JSON dump. Instructions can be found here. Run the file get_data.ps1 (or run_data.sh on Unix) to download the data required to build the dataset. This will download a copy of the Pywikibot repository.

Then, use the datasets module from HuggingFace to load the dataset described in frwiki_good_articles_el.py.
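
A minimal loading example, assuming a version of datasets that still supports local loading scripts and that the script exposes a "train" split:

# Minimal sketch: load the dataset through its local loading script.
from datasets import load_dataset

dataset = load_dataset("frwiki_good_articles_el.py")
print(dataset)              # available splits
print(dataset["train"][0])  # first example, assuming a "train" split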

Why don’t you rely on the XML Wikipedia dump?

Because it is way too complicated: there are no easy-to-use tools to properly parse MediaWiki documents. Wikitextprocessor seems really promising but is not yet capable of parsing non-English Wikipedia dumps.
