| component-id | name | description | type | release-date | release-number | project | resource | work-package | pilot | licence | release link | contributors | related-components |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| meetups-corpus-collection | MEETUPS Corpus collection | This is a tool to download the Wikipedia pages of people in the music scene in Europe | Software | 20/07/2022 | v1.0 | polifonia-project |  |  |  |  |  |  |  |
MEETUPS Corpus collection is a tool written in Python (developed with the PyCharm IDE). It collects the Wikipedia pages of music authors in Europe and stores them as plain-text (.txt) files. Refer to the Meetups Pilot for use and implementation.
- Uses the `wikipedia` library to download only the text of each Wikipedia page
- Processes the list of files in chunks of 100
- The process can be stopped and restarted at any time, because it keeps track of the last downloaded item
- Prerequisites:
  - A CSV file listing the authors' Wikipedia page IDs, stored in the `sparqlQueryResults/` directory
  - Python 3.9
  - The `wikipedia` library, installed with `pip install wikipedia`
- To execute:
  - Download the project and run the init.py file
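The chunked, resumable download described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the `wikiPageID` column name, and the default paths are assumptions. Where the tool tracks the last downloaded item, this sketch resumes by skipping any ID whose `.txt` file already exists. The only `wikipedia` call used is `wikipedia.page(pageid=...)` with its `content` attribute.

```python
import csv
from pathlib import Path
from typing import Callable, Iterator


def read_page_ids(csv_path: Path, column: str = "wikiPageID") -> list[str]:
    """Read page IDs from a CSV file (the column name is an assumption)."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle) if row.get(column)]


def chunks(items: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_text(page_id: str) -> str:
    """Fetch the plain text of one page with the third-party `wikipedia` package."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it

    return wikipedia.page(pageid=page_id).content


def download_all(
    page_ids: list[str],
    out_dir: Path,
    fetch: Callable[[str], str] = fetch_text,
    chunk_size: int = 100,
) -> int:
    """Write <page_id>.txt for every ID not yet on disk; return the number written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for chunk in chunks(page_ids, chunk_size):
        for page_id in chunk:
            target = out_dir / f"{page_id}.txt"
            if target.exists():
                continue  # already downloaded in an earlier run
            try:
                text = fetch(page_id)
            except Exception as error:  # e.g. missing or ambiguous page
                print(f"skipped {page_id}: {error}")
                continue
            target.write_text(text, encoding="utf-8")
            written += 1
    return written
```

Under these assumptions, a run could look like `download_all(read_page_ids(Path("sparqlQueryResults/results.csv")), Path("dataset"))`, where `results.csv` is a placeholder file name. Interrupting and re-running the call only fetches the pages that are still missing.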
SPARQL queries retrieve the authors' names and `dbo:wikiPageID` values from the DBpedia SPARQL endpoint https://dbpedia.org/sparql
- Query filters:
  - Categories:
    - <http://dbpedia.org/resource/Category:Music_people>
    - <http://dbpedia.org/resource/Category:People
- Location: `sparqlQueryResults/query.sparql`
- Query results: `sparqlQueryResults/Q<1>_sparql.csv`
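As an illustration of the kind of request described above, the following sketch builds a SPARQL query and sends it to the DBpedia endpoint. It is not the project's `query.sparql`: the query shape (direct category membership via `dct:subject`, English `rdfs:label` as the name), the function names, and the use of the endpoint's `format=text/csv` parameter are all assumptions made for this example.

```python
import csv
import io
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_uri: str, limit: int = 10000, offset: int = 0) -> str:
    """Build a query for resources that are direct members of one category."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?person ?name ?wikiPageID WHERE {{
  ?person dct:subject <{category_uri}> ;
          rdfs:label ?name ;
          dbo:wikiPageID ?wikiPageID .
  FILTER (lang(?name) = "en")
}}
ORDER BY ?wikiPageID
LIMIT {limit} OFFSET {offset}
""".strip()


def parse_csv(text: str) -> list[dict[str, str]]:
    """Turn a CSV result document into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def run_query(query: str, endpoint: str = ENDPOINT) -> list[dict[str, str]]:
    """Send the query with an HTTP GET and parse the CSV response (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "text/csv"})
    with urllib.request.urlopen(f"{endpoint}?{params}", timeout=60) as response:
        return parse_csv(response.read().decode("utf-8"))
```

Calling `run_query(build_query("http://dbpedia.org/resource/Category:Music_people", limit=100))` would return rows with `person`, `name`, and `wikiPageID` keys, assuming the endpoint is reachable. Paging with `limit` and `offset` keeps each response small; the `ORDER BY` clause makes the pages stable between requests.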
Dataset:
- Location: `dataset/`
- Format: text files (`.txt`)
- Naming convention: `<Author_wikiPageID>.txt`
- Total biographies collected: 33,309 authors' Wikipedia pages
- Summary of the biographies collected: `sparqlQueryResults/TOTAL_download_biography.csv`
- Meetups pilot sample: 1.002
  - Random selection of biographies: `sampleBiographies.py`
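Selecting a random subset of the collected biographies could look like the sketch below. It does not reproduce `sampleBiographies.py`; the function name, the output directory, and the fixed seed are assumptions chosen so that the same sample can be drawn again.

```python
import random
import shutil
from pathlib import Path


def sample_biographies(
    dataset_dir: Path, sample_dir: Path, size: int, seed: int = 0
) -> list[Path]:
    """Copy `size` randomly chosen .txt files from dataset_dir into sample_dir."""
    files = sorted(dataset_dir.glob("*.txt"))  # sorted so the seed is reproducible
    if size > len(files):
        raise ValueError(f"requested {size} files but only {len(files)} exist")
    chosen = random.Random(seed).sample(files, size)
    sample_dir.mkdir(parents=True, exist_ok=True)
    for path in chosen:
        shutil.copy2(path, sample_dir / path.name)
    return chosen
```

For example, `sample_biographies(Path("dataset"), Path("sample"), size=10)` copies ten biographies into a hypothetical `sample/` directory. Sampling without replacement guarantees that no biography appears twice.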
This work was supported by the EU’s Horizon Europe research and innovation programme within the Polifonia project (grant agreement N. 101004746).