GSoC_2016_Progress_Federica
The project focuses on the extraction of relevant but hidden data that lies inside lists in Wikipedia pages. The information is unstructured and thus cannot be easily used to form semantic statements and be integrated into the DBpedia ontology. Hence, the main task consists in creating a tool which can take one or more Wikipedia pages containing lists as input and then construct appropriate mappings to be inserted in a DBpedia dataset. The extractor must work well on a given domain and be extensible enough to generalize to others.
### Last Updates

[18th June] Obtaining first results from writer pages in Italian and English. I used the Wikidata API to reconcile URIs and a SPARQL query to find the equivalent resource on DBpedia, as suggested by my mentor. I also used regexes to extract relevant information from the unstructured text contained in list elements. There is still much to do to refine this solution, but I think I'm on the right track :)
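A minimal sketch of the kind of regex extraction mentioned above; the pattern and the list-item shape (`* ''Title'' (Year)`) are my own illustrative assumptions, not the actual patterns used in the extractor:

```python
import re

# Hypothetical pattern for a bibliography item like "* ''Neuromancer'' (1984)":
# the title sits between wikitext quote marks, the year in parentheses.
ITEM_RE = re.compile(r"'{2,}(?P<title>.+?)'{2,}.*?\((?P<year>\d{4})\)")

def parse_list_item(raw_item):
    """Return (title, year) for a matching list item, or None."""
    match = ITEM_RE.search(raw_item)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))

print(parse_list_item("* ''Neuromancer'' (1984)"))  # ('Neuromancer', 1984)
print(parse_list_item("* An item with no year"))    # None
```

Returning `None` on non-matching items lets the caller skip list entries that carry no extractable statement instead of raising.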
[14th June] Working on the Mapper module. I'm using the DBpedia Lookup service to retrieve the URI represented by a given string. Unfortunately it only works for English, and after some tests I realized that it can't be used for section titles since the accuracy is too low. I'm now considering a new approach, different from my original idea.
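The Lookup step above could be sketched as follows; the JSON response shape (`{"results": [{"uri": ...}]}`) is an approximation of the service's output, and picking the first hit is exactly the step whose accuracy proved too low for section titles:

```python
import json
from urllib.parse import urlencode

LOOKUP = "http://lookup.dbpedia.org/api/search/KeywordSearch"

def lookup_url(keyword, max_hits=5):
    # Build the KeywordSearch request URL (the service is English-only).
    return LOOKUP + "?" + urlencode({"QueryString": keyword,
                                     "MaxHits": max_hits})

def first_uri(response_text):
    # Naively take the top-ranked hit; assumed response shape.
    results = json.loads(response_text).get("results", [])
    return results[0]["uri"] if results else None

sample = '{"results": [{"uri": "http://dbpedia.org/resource/William_Gibson"}]}'
print(lookup_url("William Gibson"))
print(first_uri(sample))  # http://dbpedia.org/resource/William_Gibson
```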
[9th June] Parser module completed. Successfully tested on the resources List of works of William Gibson, the English William Gibson page, and the Italian William Gibson page. Now proceeding with further testing and figuring out how to implement the next modules.
[6th June] Currently working on cleaning and parsing as lists the data obtained from [JSONpedia](http://jsonpedia.org/frontend/index.html). Starting from page https://en.wikipedia.org/wiki/List_of_works_of_William_Gibson.
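The fetching and list-gathering steps could look roughly like this; the REST path pattern mirrors the calls shown on the JSONpedia frontend, and the `"@type": "list"` node shape is my assumption about the service output, not a verified API detail:

```python
from urllib.parse import quote

JSONPEDIA = "http://jsonpedia.org/annotate/resource/json/"

def jsonpedia_url(lang, title):
    # Assumed path pattern: /annotate/resource/json/<lang>:<title>
    return "%s%s:%s" % (JSONPEDIA, lang, quote(title))

def collect_lists(node, found=None):
    # Recursively gather every sub-structure typed as a list.
    if found is None:
        found = []
    if isinstance(node, dict):
        if node.get("@type") == "list":
            found.append(node)
        for value in node.values():
            collect_lists(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_lists(item, found)
    return found

print(jsonpedia_url("en", "List_of_works_of_William_Gibson"))
sample = {"wikitext-json": [{"@type": "section",
                             "content": [{"@type": "list",
                                          "content": ["item"]}]}]}
print(len(collect_lists(sample)))  # 1
```

Walking the structure recursively rather than at a fixed depth keeps the extractor robust to lists nested inside arbitrary section levels.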
[30th May]
- There are 29581 English Wikipedia pages about writers, containing 90717 lists.
- There are 1177 Italian pages about writers, containing 3232 lists.
- There are 181790 English pages about directors, containing 52114 lists.
- There are 6326 Italian pages about directors, containing 24957 lists.

Writers seem an interesting domain, and I'm going to start from a writer and his lists of works (currently working on William Gibson). Other good candidates are directors and actors (with their lists of featured movies and awards).
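Counts like those above could come from COUNT queries against the public DBpedia endpoint; `dbo:Writer` is a real ontology class, but the exact queries used by statistics.py may differ from this sketch:

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: how many distinct resources are typed as writers.
COUNT_WRITERS = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(DISTINCT ?writer) AS ?n)
WHERE { ?writer a dbo:Writer . }
"""

def query_url(query):
    # Build a GET request URL asking the endpoint for JSON results.
    return ENDPOINT + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"})

print(query_url(COUNT_WRITERS))
```

Counting lists per page, by contrast, requires fetching and parsing the page bodies, since list structure is not exposed in the RDF data.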
[23rd May] START OF CODING. Figuring out how to include and use JSONpedia in my project, as the online web service is often unavailable due to crawlers. I will use the online web service for now (http://jsonpedia.org/frontend/index.html).
[22nd May] END OF BONDING PERIOD. I have been in constant discussion with my mentors, receiving feedback, and made some preliminary analysis via the Python script (statistics.py) available in the Table Extractor; I am ready to start coding. I have also installed the DBpedia extraction framework and performed an extraction on Italian wiki pages to gain a better understanding of the framework.
[19th May] Discussion with all co-mentors about suitable wiki domains for lists. We decided to further examine filmographies, bibliographies and related contexts such as lists of nominations and awards.
[18th May] Contributing with papalinis to statistics.py from https://github.com/dbpedia/table-extractor to do some domain analysis useful for both the table and the list extractor. Will be improved shortly.
[11th May] I am currently analyzing the occurrences of lists in various Wikipedia pages and querying SPARQL endpoints to choose a suitable domain to start with.
Mentors:
- Marco Fossati
- Claudia Diamantini
- Domenico Potena
- Emanuele Storti