A framework for extracting RDFa, JSON-LD, Microdata and text content from webpages. It relies on the RDF4J framework.
(Note: this documentation was written 6 years after the initial development.)
The project contains the following modules:
- `microdata-parser`: code for parsing Microdata markup, initially copied from the Any23 project, if I remember correctly.
- `extractor-core`: this is where the actual extraction happens. The main interface is `fr.sparna.rdf.extractor.DataExtractor`. There are 4 implementations:
  - `fr.sparna.rdf.extractor.jsonld.JsonLDExtractor`: JSON-LD 1.0 only (remember this was developed in 2018). It relies on the RDF4J JSON-LD parser (see the sketch after this list).
  - `fr.sparna.rdf.extractor.rdfa.RdfaExtractor`: relies on the Semargl RDFa parser.
  - `fr.sparna.rdf.extractor.microdata.MicrodataExtractor`: relies on the `microdata-parser` module.
  - `fr.sparna.rdf.extractor.content.ContentExtractor`: extracts the actual text of the page, removing HTML markup and trying to strip menus, etc. It relies on the boilerpipe library. This is not useful for the ELI project and was developed for other purposes.

  In addition, the resulting triples are filtered and post-processed by a set of filters in the `fr.sparna.rdf.handler` package.
- `extractor-cli`: a command-line interface to run the extractor, coupled with a simple crawler implementation. The idea is to crawl a set of webpages and pass each webpage's content to the extractor.
- `extractor-server`: the encapsulation of the extractor in an API (e.g. something like `curl --header "Accept: application/ld+json" http://localhost:8080/extractor-server/api/v1/extract?uri=http://sparna.fr`). The idea is that this server can connect to an underlying RDF repository and store the resulting triples in named graphs.
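To give an idea of what the JSON-LD extraction boils down to, here is a minimal, self-contained sketch using the RDF4J Rio parser. This is not the actual `JsonLDExtractor` code; the class name, the JSON-LD snippet and the base URI are made up for illustration.

```java
import java.io.StringReader;

import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

// Requires the rdf4j-rio-jsonld module on the classpath.
public class JsonLdParsingSketch {

    public static void main(String[] args) throws Exception {
        // A JSON-LD island as it could appear in a <script type="application/ld+json">
        // element of a crawled page (made-up example data, inline context to stay offline).
        String jsonLd = "{ \"@context\": { \"name\": \"http://schema.org/name\" },"
                + " \"@id\": \"http://example.org/page\","
                + " \"name\": \"Example page\" }";

        // Parse the JSON-LD into an RDF4J Model; the base URI would normally be
        // the URL of the page the snippet was extracted from.
        Model model = Rio.parse(new StringReader(jsonLd), "http://example.org/page", RDFFormat.JSONLD);

        // Print the resulting triples as Turtle on standard output.
        Rio.write(model, System.out, RDFFormat.TURTLE);
    }
}
```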
- Try `java -jar extractor-cli-onejar-{version}-onejar.jar --help` to get the help message.
- There are 2 commands: `crawl` and `list`.
  - `crawl` was an attempt to run a free-form crawler, relying on Crawler4J, with additional decide rules taken from Heritrix. The main class is `fr.sparna.rdf.extractor.cli.crawl.ExtractorCrawler`. This is really "poor man's crawling" and I don't think it should be reused; more robust crawlers should be used for free crawling.
  - `list` (processList) simply reads a text file containing one URL per line and applies the extraction to each of those URLs, one by one. This was originally applied to lists of ELI URIs sent by the Member States.
Here is a complete command-line example:
```
java -Xms512M -Xmx2048M -jar extractor-cli-1.0-SNAPSHOT-onejar.jar list \
  --input random-elis-test.txt \
  --output output \
  --exclude processed-urls.log \
  --namespaces eli,http://data.europa.eu/eli/ontology# \
    xsd,http://www.w3.org/2001/XMLSchema# \
    ev,http://eurovoc.europa.eu/ \
    corp,http://publications.europa.eu/resource/authority/corporate-body/ \
    lang,http://publications.europa.eu/resource/authority/language/ \
    m-app,http://www.iana.org/assignments/media-types/application/ \
    res-oj,http://publications.europa.eu/resource/oj/ \
    res-celex,http://publications.europa.eu/resource/celex/
```
The file `random-elis-test.txt` contains the following URIs:

```
http://data.europa.eu/eli/reg/2003/20/oj
http://data.europa.eu/eli/reg/2002/41/oj
http://data.europa.eu/eli/dec/2002/95/oj
http://data.europa.eu/eli/dir/1998/91/oj
http://data.europa.eu/eli/dec/1983/59/oj
http://data.europa.eu/eli/reg/2002/43/oj
http://data.europa.eu/eli/dir/2001/5/oj
http://data.europa.eu/eli/dec/1982/39/oj
http://data.europa.eu/eli/dir/1997/53/oj
```
Note: somewhere (I can't remember where exactly) the successfully processed URLs are logged into `processed-urls.log`, so that if the process fails for any reason, this file can be passed back (via `--exclude`) to the next run and the URLs already processed are not extracted a second time. A minimal sketch of this resume logic follows.
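This is a sketch of that loop under stated assumptions, not the actual CLI code; the `extract(String url)` method is a hypothetical stand-in for the real extraction of one page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UrlListProcessingSketch {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("random-elis-test.txt");   // one URL per line
        Path exclude = Paths.get("processed-urls.log");   // URLs already processed in a previous run

        // Load the URLs already processed, if the log exists.
        Set<String> alreadyProcessed = new HashSet<>();
        if (Files.exists(exclude)) {
            alreadyProcessed.addAll(Files.readAllLines(exclude, StandardCharsets.UTF_8));
        }

        // Process each URL from the input list, skipping the ones already done,
        // and append every success to the log so that a failed run can be resumed.
        List<String> urls = Files.readAllLines(input, StandardCharsets.UTF_8);
        for (String url : urls) {
            if (url.isBlank() || alreadyProcessed.contains(url)) {
                continue;
            }
            extract(url); // hypothetical stand-in for the actual extraction of one page
            Files.writeString(exclude, url + System.lineSeparator(),
                    StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    // Placeholder for the real extraction; in the project this would go through a DataExtractor.
    private static void extract(String url) {
        System.out.println("Extracting " + url);
    }
}
```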
The folder `extractor-cli/src/test/resources/URI-lists` contains the URI list files that were used to extract the datasets from Portugal, Eur-Lex, Ireland, Denmark and Italy for the 2018 Datathon. The resulting datasets were published on the EU ODP, see https://data.europa.eu/data/datasets?locale=en&minScoring=0&query=ELI&page=1
The `extractor-server` module persists the result of the extraction in a triplestore. The set of triples from each page is kept in a separate named graph. The named graph is identified by the original URL of the page and is described with a `dcterms:modified` triple giving the date of insertion, and a `dcterms:isPartOf` triple giving the domain name of the page. This makes it easy to e.g. select, delete or query all named graphs coming from a known domain/website. See the class `fr.sparna.rdf.extractor.RepositoryManagementListener`, which is responsible for this. A sketch of this named-graph bookkeeping is shown below.
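As an illustration of that pattern (not the actual `RepositoryManagementListener` code), here is a minimal RDF4J sketch that stores the triples extracted from one page in a named graph identified by the page URL, and records `dcterms:modified` and `dcterms:isPartOf` metadata about that graph. The in-memory repository, the example model and the choice to keep the metadata in the default graph are assumptions made for the sake of a runnable example.

```java
import java.util.Date;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.LinkedHashModel;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.model.vocabulary.DCTERMS;
import org.eclipse.rdf4j.model.vocabulary.RDF;
import org.eclipse.rdf4j.model.vocabulary.RDFS;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

public class NamedGraphStorageSketch {

    public static void main(String[] args) {
        ValueFactory vf = SimpleValueFactory.getInstance();

        // The page that was extracted; its URL identifies the named graph.
        String pageUrl = "http://data.europa.eu/eli/reg/2003/20/oj";
        IRI graph = vf.createIRI(pageUrl);
        IRI domain = vf.createIRI("http://data.europa.eu");

        // Stand-in for the triples produced by the extractors for this page.
        Model extracted = new LinkedHashModel();
        extracted.add(vf.createIRI(pageUrl), RDF.TYPE, RDFS.RESOURCE);

        // In-memory repository for the example; the real server connects to an
        // actual underlying triplestore.
        Repository repo = new SailRepository(new MemoryStore());
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            // Store the page triples in the named graph identified by the page URL.
            conn.add(extracted, graph);
            // Describe the named graph itself: insertion date and owning domain,
            // kept here in the default graph.
            conn.add(graph, DCTERMS.MODIFIED, vf.createLiteral(new Date()));
            conn.add(graph, DCTERMS.IS_PART_OF, domain);
        }
        repo.shutDown();
    }
}
```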
- Part of the `extractor-core` module could be reused, keeping only the RDFa and JSON-LD extractors. The JSON-LD extractor would need to be upgraded to JSON-LD 1.1. The code is well documented and, I think, rather robust.
- You may want to run the extraction service as a microservice, so `extractor-server` could be useful too.