polish-wikipedia-graph-dataset

This repository contains a dataset of over 75 000 Polish Wikipedia pages assigned to specific science fields, together with the links between these pages. The dataset can be used as a simple classification task in NLP, especially as a benchmark for graph-based methods.

wiki_pages.csv

File with article information. Columns:

  • title - article title,
  • text - article text,
  • category - one of the 7 main Wikipedia categories related to science fields, chosen as the one closest to the article's own categories in the scraped category tree.

Article categories:

  • Astronomia - astronomy,
  • Biologia - biology,
  • Matematyka - math,
  • Psychologia - psychology,
  • Fizyka - physics,
  • Informatyka - computer science,
  • Chemia - chemistry.
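
As a rough illustration of how the file could be read, the sketch below counts articles per category using only the Python standard library. It assumes that wiki_pages.csv is comma-separated, UTF-8 encoded, and starts with a header row naming the three columns above; none of this is confirmed by the repository description. The function names `read_pages` and `count_categories` are made up for this example.

```python
import csv
from collections import Counter

# Article texts can be long, so raise the csv module's per-field size limit.
csv.field_size_limit(10**9)


def read_pages(lines):
    """Yield one dict per article, keyed by title, text and category.

    `lines` is an open text file or any iterable of CSV lines.
    Assumption: the first line is a header row with the column names.
    """
    yield from csv.DictReader(lines)


def count_categories(pages):
    """Return a Counter mapping each category to its number of articles."""
    return Counter(page["category"] for page in pages)


# Hypothetical usage, assuming the file sits in the working directory:
# with open("wiki_pages.csv", encoding="utf-8", newline="") as f:
#     for category, n in count_categories(read_pages(f)).most_common():
#         print(category, n)
```

Checking the class balance this way before training is useful, since the seven categories are unlikely to be equally represented.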

annotations.csv

File with links between pages. The first column is the source article title and the second column is the target article title. Note that the file includes links to pages that are not present in wiki_pages.csv.
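
Because some links point outside the dataset, graph-based methods will usually need the edge list restricted to known articles first. The following is a minimal sketch of that filtering step, assuming annotations.csv is comma-separated and UTF-8 encoded. Whether the file has a header row is not stated here, so the sketch does not skip one. The function names `read_links` and `keep_internal_links` are hypothetical.

```python
import csv


def read_links(lines):
    """Yield (source_title, target_title) pairs from the first two columns.

    `lines` is an open text file or any iterable of CSV lines.
    Rows with fewer than two columns are ignored.
    """
    for row in csv.reader(lines):
        if len(row) >= 2:
            yield row[0], row[1]


def keep_internal_links(links, known_titles):
    """Keep only links whose source and target are both known titles."""
    known = set(known_titles)
    return [(src, dst) for src, dst in links if src in known and dst in known]


# Hypothetical usage, with `titles` collected from wiki_pages.csv:
# with open("annotations.csv", encoding="utf-8", newline="") as f:
#     edges = keep_internal_links(read_links(f), titles)
```

The resulting list of title pairs can be passed to a graph library of your choice. If the file does turn out to have a header row, drop the first pair before filtering.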