
Releases: dice-group/Squirrel

Squirrel 0.4

14 Dec 15:33
324a1a3

This release includes several improvements to the worker component:

  • Faster decompression of fetched files

  • RDFAnalyzer - Detects the serialization of the fetched file instead of trying all parsers by brute force

  • CkanFetcher - Paginates fetched data in order to improve performance

  • Added an abstract analyzer that uses the new TripleEncoder class. This class provides the method encodeTriple, which escapes special characters in triples. All analyzers use it.
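The release notes do not show the TripleEncoder itself, so the following is only a rough sketch of the idea, assuming an N-Triples-style escaping of special characters; the class name and method signature are illustrative, not the actual Squirrel API.

```java
// Rough sketch of the character-encoding idea behind TripleEncoder.
// This is NOT the actual Squirrel class: the real encodeTriple method
// operates on whole triples, while this sketch only escapes one literal.
public class TripleEncoderSketch {

    // Escapes the characters that must not appear raw inside an
    // N-Triples literal: backslash, double quote, and control characters.
    public static String encodeLiteral(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```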

Also, significant changes were made to the frontier component:

  • Added the UriFilterConfigurator class. This class allows the user to combine multiple filters for focused crawling. The UriFilterConfigurator requires at least one KnownUriFilter.

  • Added the DepthFilter class. This filter enables depth-limited crawling.

  • Recrawling of outdated URIs.
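To illustrate how combining multiple filters can work, here is a small sketch of a composite filter; the interface and class names below are assumptions for illustration, not the actual Squirrel API.

```java
import java.net.URI;
import java.util.List;

// Illustrative sketch of combining several URI filters the way a
// UriFilterConfigurator-style class could. Names are hypothetical.
interface UriFilterSketch {
    boolean isUriGood(URI uri);
}

class CompositeUriFilterSketch implements UriFilterSketch {
    private final List<UriFilterSketch> filters;

    CompositeUriFilterSketch(List<UriFilterSketch> filters) {
        if (filters.isEmpty()) {
            // Mirrors the rule that at least one filter must be configured.
            throw new IllegalArgumentException("At least one filter is required");
        }
        this.filters = filters;
    }

    // A URI is accepted only if every configured filter accepts it,
    // which is how focused crawling narrows the frontier.
    @Override
    public boolean isUriGood(URI uri) {
        for (UriFilterSketch f : filters) {
            if (!f.isUriGood(uri)) {
                return false;
            }
        }
        return true;
    }
}
```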

Squirrel 0.3

11 Jan 15:41
f5e798c

This release includes several new worker components:

  • SparqlDatasetFetcher - Fetches RDF graphs from SPARQL endpoints that are described as a dcat:Dataset

  • MicrodataParser - An analyzer that extracts microdata from HTML pages as triples

  • MicroformatMF2JParser - An analyzer that extracts microformats from HTML pages as triples

  • RDFaSemarglParser - An analyzer that extracts RDFa triples from structured documents (https://github.com/semarglproject/semargl)

Also, this release includes a simplified build of the front end.
You can use the build-squirrel script to build the project and create the Docker images. Check the file
docker-compose-web.yml and run the web image with:

docker-compose -f docker-compose-web.yml up web

Squirrel 0.2

31 Oct 11:27
9981001

Overall:

This release includes several performance improvements and new features.

First, the project has been split into several modules. This dramatically reduces class loading when some parts of Squirrel are not started.

The modules are:

- squirrel.api :

  • Contains the core classes of Squirrel

- squirrel.deduplication :

  • The deduplication component computes hash values (org.dice_research.squirrel.deduplication.hashing.HashValue) for the triples found at newly crawled URIs, stores those hash values in the KnownUriFilter, and compares the newly computed hash values with the hash values of all old triples. This way, duplicate data can be detected and eliminated.

- squirrel.frontier :

  • The classes related to the frontier

- squirrel.worker :

  • The classes related to the worker

- squirrel.web :

  • The web front end

- squirrel.web-api :

  • Functionality specific to the front end

- SquirrelWebService :

  • The web service that handles communication between the front end and the frontier
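The hash-based deduplication described for squirrel.deduplication can be sketched roughly as follows; the class below and its use of Java's built-in hashCode are stand-ins for the real HashValue machinery, not the actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of hash-based deduplication: remember a hash of the
// triples found at each URI and flag content whose hash was already
// seen under a different URI. Java's hashCode() stands in for the
// real HashValue computation in squirrel.deduplication.
class DedupSketch {
    private final Map<String, Integer> uriToHash = new HashMap<>();
    private final Set<Integer> knownHashes = new HashSet<>();

    // Returns true if the triples of this URI duplicate content that
    // was already registered for another URI.
    boolean isDuplicate(String uri, Set<String> triples) {
        int hash = triples.hashCode();
        boolean duplicate = knownHashes.contains(hash) && !uriToHash.containsKey(uri);
        uriToHash.put(uri, hash);
        knownHashes.add(hash);
        return duplicate;
    }
}
```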

New Components:

Fetcher:

  • SparqlBasedFetcher - Located under squirrel.worker, it allows you to fetch URIs from a SPARQL endpoint. Please check docker-compose-sparql.yml for the required environment variables.

Sink:

  • SparqlBasedSink - Located under squirrel.worker, it allows you to use a SPARQL endpoint as a sink. Please check docker-compose-sparql.yml for the required environment variables.

Build Notes:

Run mvn clean install and then run the Makefile to build.
To run Squirrel: docker-compose -f docker-compose-file up

Squirrel 0.1

31 Aug 12:38

This is the first stable release of Squirrel.
It includes several implementations of Sinks, Analyzers and Collectors.

The docker-compose file is configured with the Frontier and 3 more workers.
The spring-config/context.xml contains the implementations used by the workers.
(it is possible to define individual config files for each worker, see the docker-compose file)

In the following, we will briefly list the worker modules that are available in this release:

Fetcher

  • HTTPFetcher - Fetches data from HTTP sources.

  • FTPFetcher - Fetches data from FTP sources.

  • Note: The fetchers are not managed as Spring beans in this release yet, since only two are available. The worker will try to fetch data with both.

Analyzer

Analyzes the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. Any implementation should be passed to the constructor of this class, as in the example below:

<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>

Also, if you want to implement your own analyzer, you must implement the method isEligible(), which checks whether the analyzer meets the conditions for calling the analyze method.
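A minimal sketch of the isEligible() idea is shown below. The real Squirrel Analyzer interface receives the crawled URI and the fetched data, so the single String parameter here is a simplification for illustration only.

```java
// Illustrative sketch of an analyzer's isEligible() check; the class
// name and the String parameter are hypothetical simplifications of
// the real Squirrel Analyzer interface.
class TurtleAnalyzerSketch {

    // This analyzer only handles files that look like Turtle,
    // judged by the file extension.
    boolean isEligible(String fileName) {
        return fileName != null && fileName.toLowerCase().endsWith(".ttl");
    }
}
```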

Collectors

Collect new URIs found during the analysis process and serialize them before they are sent to the Frontier.

  • SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
  • SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.

Sink

Responsible for persisting the collected RDF data.

  • FileBasedSink - Persists the triples in NT files.
  • InMemorySink - Persists the triples only in memory, not on disk (mainly used for testing purposes).
  • HdtBasedSink - Persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
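As a simplified illustration of what a FileBasedSink-style sink does, the sketch below serializes each triple as one N-Triples line; the real sink writes to files and handles further concerns such as per-URI graphs, so this in-memory version is only a conceptual stand-in.

```java
// Minimal sketch of a FileBasedSink-style sink: serialize each triple
// as one N-Triples statement. The real sink writes to NT files; this
// hypothetical version only collects the output in a string.
class NtSinkSketch {
    private final StringBuilder out = new StringBuilder();

    // Appends one triple as an N-Triples statement.
    void addTriple(String subject, String predicate, String object) {
        out.append(subject).append(' ')
           .append(predicate).append(' ')
           .append(object).append(" .\n");
    }

    // Returns everything written so far.
    String serialize() {
        return out.toString();
    }
}
```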