
Releases: dice-group/Squirrel

Squirrel 0.4

14 Dec 15:33
324a1a3

This release includes several improvements to the worker component:

  • Faster decompression of fetched files

  • RDFAnalyzer - Detects the serialization of the fetched file instead of trying all parsers by brute force

  • CkanFetcher - Paginates fetched data in order to improve performance

  • Added an abstract analyzer that uses the new TripleEncoder class. This class provides the method encodeTriple, which escapes special characters in triples. All analyzers use it.
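The release notes do not show the TripleEncoder itself, so the following is only a rough sketch of the idea, assuming an N-Triples-style escaping of special characters; the class name and method signature are illustrative, not the actual Squirrel API.

```java
// Rough sketch of the character-encoding idea behind TripleEncoder.
// This is NOT the actual Squirrel class: the real encodeTriple method
// operates on whole triples, while this sketch only escapes one literal.
public class TripleEncoderSketch {

    // Escapes the characters that must not appear raw inside an
    // N-Triples literal: backslash, double quote, and control characters.
    public static String encodeLiteral(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```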

Also, significant changes were made to the frontier component:

  • Added the UriFilterConfigurator class. This class allows the user to combine multiple filters for focused crawling. The UriFilterConfigurator requires at least one KnownUriFilter.

  • Added the DepthFilter class. This filter enables depth-limited crawling.

  • Recrawling of outdated URIs.
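To illustrate how combining multiple filters can work, here is a small sketch of a composite filter; the interface and class names below are assumptions for illustration, not the actual Squirrel API.

```java
import java.net.URI;
import java.util.List;

// Illustrative sketch of combining several URI filters the way a
// UriFilterConfigurator-style class could. Names are hypothetical.
interface UriFilterSketch {
    boolean isUriGood(URI uri);
}

class CompositeUriFilterSketch implements UriFilterSketch {
    private final List<UriFilterSketch> filters;

    CompositeUriFilterSketch(List<UriFilterSketch> filters) {
        if (filters.isEmpty()) {
            // Mirrors the rule that at least one filter must be configured.
            throw new IllegalArgumentException("At least one filter is required");
        }
        this.filters = filters;
    }

    // A URI is accepted only if every configured filter accepts it,
    // which is how focused crawling narrows the frontier.
    @Override
    public boolean isUriGood(URI uri) {
        for (UriFilterSketch f : filters) {
            if (!f.isUriGood(uri)) {
                return false;
            }
        }
        return true;
    }
}
```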

Squirrel 0.3

11 Jan 15:41
f5e798c

This release includes several new worker components:

  • SparqlDatasetFetcher - Fetches RDF graphs from SPARQL endpoints that are described as a dcat:Dataset

  • MicrodataParser - An analyzer that extracts microdata from HTML pages as triples

  • MicroformatMF2JParser - An analyzer that extracts microformats from HTML pages as triples

  • RDFaSemarglParser - An analyzer that extracts RDFa triples from structured documents (https://github.com/semarglproject/semargl)

Also, this release includes a simplified build of the front end.
You can use the build-squirrel script to build the project and create the Docker images. Check the file
docker-compose-web.yml and run the web image with:

docker-compose -f docker-compose-web.yml up web

Squirrel 0.2

31 Oct 11:27
9981001

Overall:

This release includes several performance improvements and new features.

First, the project has been split into several modules. This dramatically reduces class loading when some parts of Squirrel are not started.

The modules are:

- squirrel.api :

  • Contains the core classes of Squirrel

- squirrel.deduplication :

  • The deduplication component computes hash values (org.dice_research.squirrel.deduplication.hashing.HashValue) for the triples found at newly crawled URIs, stores those hash values in the KnownUriFilter, and compares the newly computed hash values with the hash values of all old triples. This way, duplicate data can be detected and eliminated.

- squirrel.frontier :

  • The classes related to the frontier

- squirrel.worker :

  • The classes related to the worker

- squirrel.web :

  • The web front end

- squirrel.web-api :

  • Functionality specific to the front end

- SquirrelWebService :

  • The web service that handles communication between the front end and the frontier
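The hash-based deduplication described for squirrel.deduplication can be sketched roughly as follows; the class below and its use of Java's built-in hashCode are stand-ins for the real HashValue machinery, not the actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of hash-based deduplication: remember a hash of the
// triples found at each URI and flag content whose hash was already
// seen under a different URI. Java's hashCode() stands in for the
// real HashValue computation in squirrel.deduplication.
class DedupSketch {
    private final Map<String, Integer> uriToHash = new HashMap<>();
    private final Set<Integer> knownHashes = new HashSet<>();

    // Returns true if the triples of this URI duplicate content that
    // was already registered for another URI.
    boolean isDuplicate(String uri, Set<String> triples) {
        int hash = triples.hashCode();
        boolean duplicate = knownHashes.contains(hash) && !uriToHash.containsKey(uri);
        uriToHash.put(uri, hash);
        knownHashes.add(hash);
        return duplicate;
    }
}
```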

New Components:

Fetcher:

  • SparqlBasedFetcher - Located under squirrel.worker, it allows you to fetch URIs from a SPARQL endpoint. Please check docker-compose-sparql.yml for the required environment variables.

Sink:

  • SparqlBasedSink - Located under squirrel.worker, it allows you to use a SPARQL endpoint as a sink. Please check docker-compose-sparql.yml for the required environment variables.

Build Notes:

Run mvn clean install and then run the Makefile to build.
To run Squirrel: docker-compose -f docker-compose-file up

Squirrel 0.1

31 Aug 12:38

This is the first stable release of Squirrel.
It includes several implementations of Sinks, Analyzers and Collectors.

The docker-compose file is configured with the Frontier and 3 more workers.
The spring-config/context.xml contains the implementations used by the workers.
(it is possible to define individual config files for each worker, see the docker-compose file)

In the following, we will briefly list the worker modules that are available in this release:

Fetcher

  • HTTPFetcher - Fetches data from HTTP sources.

  • FTPFetcher - Fetches data from FTP sources.

  • Note: The fetchers are not managed as Spring beans in this release yet, since only two are available. The worker will try to fetch data with both.

Analyzer

Analyzes the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. Any implementation should be passed to the constructor of this class, as in the example below:

<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>

Also, if you want to implement your own analyzer, you must implement the method isEligible(), which checks whether the analyzer meets the conditions for calling the analyze method.
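A minimal sketch of the isEligible() idea is shown below. The real Squirrel Analyzer interface receives the crawled URI and the fetched data, so the single String parameter here is a simplification for illustration only.

```java
// Illustrative sketch of an analyzer's isEligible() check; the class
// name and the String parameter are hypothetical simplifications of
// the real Squirrel Analyzer interface.
class TurtleAnalyzerSketch {

    // This analyzer only handles files that look like Turtle,
    // judged by the file extension.
    boolean isEligible(String fileName) {
        return fileName != null && fileName.toLowerCase().endsWith(".ttl");
    }
}
```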

Collectors

Collect new URIs found during the analysis process and serialize them before they are sent to the Frontier.

  • SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
  • SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.

Sink

Responsible for persisting the collected RDF data.

  • FileBasedSink - Persists the triples in NT files.
  • InMemorySink - Persists the triples only in memory, not on disk (mainly used for testing purposes).
  • HdtBasedSink - Persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
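As a simplified illustration of what a FileBasedSink-style sink does, the sketch below serializes each triple as one N-Triples line; the real sink writes to files and handles further concerns such as per-URI graphs, so this in-memory version is only a conceptual stand-in.

```java
// Minimal sketch of a FileBasedSink-style sink: serialize each triple
// as one N-Triples statement. The real sink writes to NT files; this
// hypothetical version only collects the output in a string.
class NtSinkSketch {
    private final StringBuilder out = new StringBuilder();

    // Appends one triple as an N-Triples statement.
    void addTriple(String subject, String predicate, String object) {
        out.append(subject).append(' ')
           .append(predicate).append(' ')
           .append(object).append(" .\n");
    }

    // Returns everything written so far.
    String serialize() {
        return out.toString();
    }
}
```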