MEETUPS Corpus collection

component-id: meetups-corpus-collection
name: MEETUPS Corpus collection
description: This is a tool to download the Wikipedia pages of people in the music scene in Europe
type: Software
release-date: 20/07/2022
release-number: v1.0
project: polifonia-project
resource: https://github.com/polifonia-project/meetups_corpus_collection/
work-package: WP4
pilot: MEETUPS
licence: Apache-2.0
release link: https://github.com/polifonia-project/meetups_corpus_collection/releases/tag/v1.0
contributors: https://github.com/albamoralest
related-components:
  informed-by: meetups-corpus

MEETUPS Corpus collection

Collecting Wikipedia pages of people in the music scene in Europe

MEETUPS Corpus collection is a tool written in Python and developed with the PyCharm IDE. It collects the Wikipedia pages of music authors in Europe and stores them as plain-text (.txt) files. Refer to the Meetups Pilot for use and implementation.

- Uses the "wikipedia" library to download only the text of each Wikipedia page
- Processes the list of files in chunks of 100
- The process can be stopped and restarted at any time, because it keeps track of the last downloaded item
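
The chunked, resumable behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the checkpoint file, and its format (a single integer counting the items already processed) are all assumptions. Only `wikipedia.page(pageid=...)` and its `.content` attribute come from the "wikipedia" library.

```python
from pathlib import Path
from typing import Callable, Iterator, List


def chunks(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_wikipedia(page_id: str) -> str:
    """Return the plain text of one page via the `wikipedia` package."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it
    return wikipedia.page(pageid=page_id).content


def download_all(
    page_ids: List[str],
    out_dir: Path,
    checkpoint: Path,  # hypothetical file holding the number of items already done
    fetch: Callable[[str], str] = fetch_with_wikipedia,
    chunk_size: int = 100,
) -> int:
    """Download every page once, resuming after the last recorded item."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = int(checkpoint.read_text()) if checkpoint.exists() else 0
    for chunk in chunks(page_ids[done:], chunk_size):
        for page_id in chunk:
            text = fetch(page_id)
            (out_dir / f"{page_id}.txt").write_text(text, encoding="utf-8")
            done += 1
            # Record progress after each file so an interrupted run can resume here.
            checkpoint.write_text(str(done))
    return done
```

Passing `fetch` as a parameter keeps the loop testable offline; a real run would leave the default in place and supply the page IDs read from the CSV file in sparqlQueryResults/.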

Information on installation and setup

Pre-requirements:
- A CSV file with the list of authors' Wikipedia page IDs, stored in the sparqlQueryResults/ directory
- Python 3.9
Install the wikipedia library:
- pip install wikipedia
To execute:
- Download the project and execute the __init__.py file

Details of dataset

SPARQL queries retrieve the authors' names and dbo:wikiPageID values from the DBpedia SPARQL endpoint https://dbpedia.org/sparql

Query filters:

Categories: <http://dbpedia.org/resource/Category:Music_people>
            <http://dbpedia.org/resource/Category:People
Location:
            sparqlQueryResults/query.sparql
Query results:
            sparqlQueryResults/Q<1>_sparql.csv
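
For orientation, a query of this kind might look like the sketch below. It is not the content of sparqlQueryResults/query.sparql: the use of `dct:subject` to match the category, the `rdfs:label` name with an English-language filter, and the `format=text/csv` request parameter are all assumptions made for illustration. The code only builds the query text and request URL and does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_iri: str) -> str:
    """Return an illustrative query for names and page IDs in one category."""
    return f"""PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name ?pageId WHERE {{
  ?author dct:subject <{category_iri}> ;
          rdfs:label ?name ;
          dbo:wikiPageID ?pageId .
  FILTER (lang(?name) = "en")
}}"""


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL (CSV output format is assumed)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "text/csv"})


query = build_query("http://dbpedia.org/resource/Category:Music_people")
url = build_request_url(query)
```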

Dataset:

Location:
            dataset/
Format:
            Text files .txt
Name convention:
            <Author_wikiPageID>.txt
Total biographies collected:
            33,309 authors' Wikipedia pages
Summary total biographies collected: 
            sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1,002

To select random biographies, use sampleBiographies.py
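
Random selection of this kind can be sketched as below. This is not the content of sampleBiographies.py; the function name, the optional seed, and the sorting step are assumptions chosen so that a sample can be reproduced.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_biographies(dataset_dir: Path, k: int, seed: Optional[int] = None) -> List[Path]:
    """Pick k distinct .txt biographies at random from dataset_dir."""
    # Sort first so that the same seed always yields the same sample.
    files = sorted(Path(dataset_dir).glob("*.txt"))
    rng = random.Random(seed)
    # random.sample raises ValueError if k exceeds the number of files.
    return rng.sample(files, k)
```

For example, `sample_biographies(Path("dataset"), 1002, seed=0)` would draw a sample the size of the Meetups pilot sample, assuming the dataset/ directory holds at least that many files.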

Acknowledgements

This work was supported by the EU’s Horizon 2020 research and innovation programme within the Polifonia project (grant agreement N. 101004746).
