polifonia-project/meetups_corpus_collection

component-id: meetups-corpus-collection
name: MEETUPS Corpus collection
description: This is a tool to download the Wikipedia pages of people in the music scene in Europe
type: Software
release-date: 20/07/2022
release-number: v1.0
project: polifonia-project
resource:
work-package: WP4
pilot: MEETUPS
licence: Apache-2.0
release link:
contributors:
related-components:
    informed-by: meetups-corpus

MEETUPS Corpus collection

Collecting Wikipedia pages of people in the music scene in Europe

MEETUPS Corpus collection is a Python tool, developed with the PyCharm IDE, that collects the Wikipedia pages of music authors in Europe and saves them as plain-text (.txt) files. Refer to the Meetups Pilot for use and implementation.

  • Uses the "wikipedia" library to download only the text of each Wikipedia page
  • Processes the list of files in chunks of 100
  • The process can be stopped and restarted at any time, because it keeps track of the last downloaded item
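
The chunked, resumable download described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (`chunked`, `fetch_page_text`, `download_all`) are hypothetical, and the resume logic is a swapped-in simplification that skips any page whose output file already exists instead of tracking the last downloaded item. It assumes the `wikipedia` package is installed and that pages are looked up by page ID.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_page_text(page_id: str) -> str:
    """Return the plain text of a Wikipedia page, looked up by page ID."""
    import wikipedia  # third-party package: pip install wikipedia
    return wikipedia.page(pageid=page_id).content


def download_all(
    page_ids: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], str] = fetch_page_text,
    chunk_size: int = 100,
) -> int:
    """Download every page not yet on disk; return how many files were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for chunk in chunked(list(page_ids), chunk_size):
        for page_id in chunk:
            target = out / f"{page_id}.txt"
            if target.exists():
                # Assumption: an existing file means the page was already
                # downloaded, so a restarted run picks up where it stopped.
                continue
            try:
                text = fetch(page_id)
            except Exception as err:  # e.g. missing page or network error
                print(f"skipped {page_id}: {err}")
                continue
            target.write_text(text, encoding="utf-8")
            written += 1
    return written
```

Passing the fetch function as a parameter keeps the loop testable without network access; calling `download_all(ids, "dataset")` a second time writes only the pages that are still missing.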

Information on installation and setup

  • Prerequisites:
    • A CSV file listing the authors' Wikipedia page IDs, stored in the sparqlQueryResults/ directory
    • Python 3.9
  • Install the wikipedia library:
    • pip install wikipedia
  • To execute:
    • Download the project and run the init.py file
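
As an illustration of the CSV prerequisite, the sketch below reads page IDs from such a file using only the standard library. The function name `load_page_ids` and the column name `wikiPageID` are assumptions made for the example; the actual header in the project's CSV files may differ.

```python
import csv
from typing import List


def load_page_ids(csv_path: str, column: str = "wikiPageID") -> List[str]:
    """Return the unique, non-empty values of `column`, in file order."""
    ids: List[str] = []
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # `column` is a hypothetical header name; adjust it to the real file.
            value = (row.get(column) or "").strip()
            if value and value not in seen:
                seen.add(value)
                ids.append(value)
    return ids
```

Duplicates are dropped because the same author may appear in more than one query result.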

Details of dataset

The authors' names and dbo:wikiPageID values are retrieved with SPARQL queries against the DBpedia SPARQL endpoint, https://dbpedia.org/sparql

Query filters:

Categories: <http://dbpedia.org/resource/Category:Music_people>
            <http://dbpedia.org/resource/Category:People
Location:
            sparqlQueryResults/query.sparql
Query results:
            sparqlQueryResults/Q<1>_sparql.csv
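
For orientation, a query of this kind might look like the sketch below. It is not the content of query.sparql: the query shape, the use of `dct:subject` to match a category, the English-label filter, the paging, and the helper names `build_query` and `run_query` are all assumptions made for illustration. Only the endpoint URL, the `dbo:wikiPageID` property, and the `Category:Music_people` resource come from the description above.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_uri: str, limit: int = 100, offset: int = 0) -> str:
    """Build a SPARQL query for people in one category (assumed query shape)."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?person ?name ?id WHERE {{
  ?person dct:subject <{category_uri}> ;
          dbo:wikiPageID ?id ;
          rdfs:label ?name .
  FILTER (lang(?name) = "en")
}}
ORDER BY ?id
LIMIT {limit} OFFSET {offset}
""".strip()


def run_query(query: str, endpoint: str = ENDPOINT) -> str:
    """Send the query over HTTP GET and return the result as CSV text."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={"Accept": "text/csv"})
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    q = build_query("http://dbpedia.org/resource/Category:Music_people", limit=5)
    print(run_query(q))  # requires network access
```

The `ORDER BY` clause is there so that paging with `LIMIT` and `OFFSET` returns stable, non-overlapping slices.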

Dataset:

Location:
            dataset/
Format:
            Text files .txt
Name convention:
            <Author_wikiPageID>.txt
Total biographies collected: 
            33,309 authors' Wikipedia pages
Summary of total biographies collected: 
            sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1,002

To select random biographies, run sampleBiographies.py.
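
Random selection from the dataset directory can be sketched as below. This is not the content of sampleBiographies.py: the function name `sample_biographies` and the optional `seed` parameter are hypothetical, and the sketch only assumes that biographies are stored as .txt files in a single directory.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_biographies(dataset_dir: str, k: int, seed: Optional[int] = None) -> List[Path]:
    """Pick `k` distinct .txt files from `dataset_dir` without replacement."""
    # Sorting first makes the result reproducible for a given seed.
    files = sorted(Path(dataset_dir).glob("*.txt"))
    if k > len(files):
        raise ValueError(f"requested {k} files but only {len(files)} are available")
    return random.Random(seed).sample(files, k)
```

For example, `sample_biographies("dataset", 1002, seed=0)` would draw a fixed sample of 1,002 files that can be regenerated later with the same seed.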

Acknowledgements

This work was supported by the EU’s Horizon Europe research and innovation programme within the Polifonia project (grant agreement N. 101004746).
