Squirrel 0.1
This is the first stable release of Squirrel.
It includes several implementations of Sinks, Analyzers and Collectors.
The docker-compose file is configured with the Frontier and 3 workers.
The spring-config/context.xml file contains the implementations used by the workers.
(It is possible to define an individual config file for each worker; see the docker-compose file.)
In the following, we will briefly list the worker modules that are available in this release:
Fetcher
- HTTPFetcher - Fetches data from HTTP sources.
- FTPFetcher - Fetches data from FTP sources.
- Note: The fetchers are not yet managed as Spring beans in this release, since only two are available. The worker will try to fetch data with both.
Analyzer
Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. The implementations to be used should be passed to the constructor of this class, as in the example below:
<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1" >
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>
Also, if you want to implement your own analyzer, you must implement the method isEligible(), which checks whether the analyzer meets the condition for its analyze method to be called.
- RDFAnalyzer - Analyses RDF formats.
- HTMLScraperAnalyzer - Analyses and scrapes HTML data based on Jsoup selector syntax (see: https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to)
- HDTAnalyzer - Analyses HDT binary RDF format.
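To illustrate the isEligible() check described above, here is a minimal sketch in plain Java. It does not use Squirrel's actual interfaces: SketchAnalyzer, NTriplesSketchAnalyzer and SketchAnalyzerManager are hypothetical stand-ins, and the signatures (a content-type string in, a list of triple lines out) are assumptions made only for this example. The real analyzer interface may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyzer contract; NOT Squirrel's real interface.
interface SketchAnalyzer {
    // Decides whether this analyzer should be asked to analyze the data.
    boolean isEligible(String contentType);

    // Extracts triples (here: plain N-Triples lines) from the fetched content.
    List<String> analyze(String content);
}

// Hypothetical analyzer that only accepts N-Triples content.
class NTriplesSketchAnalyzer implements SketchAnalyzer {
    @Override
    public boolean isEligible(String contentType) {
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("application/n-triples");
    }

    @Override
    public List<String> analyze(String content) {
        List<String> triples = new ArrayList<>();
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                triples.add(trimmed);
            }
        }
        return triples;
    }
}

// Hypothetical manager: calls analyze() only on analyzers that report eligibility.
class SketchAnalyzerManager {
    private final List<SketchAnalyzer> analyzers;

    SketchAnalyzerManager(List<SketchAnalyzer> analyzers) {
        this.analyzers = analyzers;
    }

    List<String> analyze(String contentType, String content) {
        List<String> result = new ArrayList<>();
        for (SketchAnalyzer analyzer : analyzers) {
            if (analyzer.isEligible(contentType)) {
                result.addAll(analyzer.analyze(content));
            }
        }
        return result;
    }
}
```

With this layout, an analyzer that is not eligible for a given input is simply skipped, so several analyzers can be registered side by side without interfering with each other.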
Collectors
Collects new URIs found during the analysis process and serializes them before they are sent to the Frontier.
- SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
- SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.
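The collect-serialize-hand-over cycle can be sketched roughly as follows. InMemoryUriCollectorSketch is a hypothetical class written for this example, not the SimpleUriCollector shipped with Squirrel, and the choice of UTF-8 bytes as the serialized form is an assumption; the real collectors may use a different serialization.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory collector; NOT Squirrel's SimpleUriCollector.
class InMemoryUriCollectorSketch {
    // URIs are kept in serialized form (assumed here: UTF-8 bytes of the URI string).
    private final List<byte[]> serialized = new ArrayList<>();

    // Serializes a URI found during analysis and stores it.
    void add(URI uri) {
        serialized.add(uri.toString().getBytes(StandardCharsets.UTF_8));
    }

    int size() {
        return serialized.size();
    }

    // Deserializes all stored URIs in insertion order and empties the collector,
    // e.g. right before the URIs are handed over to the Frontier.
    List<URI> drain() {
        List<URI> uris = new ArrayList<>();
        for (byte[] bytes : serialized) {
            uris.add(URI.create(new String(bytes, StandardCharsets.UTF_8)));
        }
        serialized.clear();
        return uris;
    }
}
```

A database-backed variant would keep the same two operations and only swap the in-memory list for a table.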
Sink
Responsible for persisting the collected RDF data.
- FileBasedSink - persists the triples in N-Triples (NT) files.
- InMemorySink - persists the triples only in memory, not on disk (mainly used for testing purposes).
- HdtBasedSink - persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
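As a rough picture of what a file-based sink does, the sketch below appends triples to a file, one N-Triples statement per line. NtFileSinkSketch and its addTriple method are hypothetical names chosen for this example and are not the API of FileBasedSink. The sketch also assumes that the three terms are already written in valid N-Triples syntax; it performs no escaping or validation.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical file sink; NOT Squirrel's FileBasedSink.
class NtFileSinkSketch implements AutoCloseable {
    private final BufferedWriter writer;

    // Opens the target file for appending, creating it if it does not exist.
    NtFileSinkSketch(Path file) throws IOException {
        this.writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Writes one statement. The terms are assumed to be valid N-Triples terms
    // already, e.g. "<http://example.org/s>" or "\"a literal\"".
    void addTriple(String subject, String predicate, String object) throws IOException {
        writer.write(subject + " " + predicate + " " + object + " .");
        writer.newLine();
    }

    // Flushes buffered statements and releases the file handle.
    @Override
    public void close() throws IOException {
        writer.close();
    }
}
```

Used with try-with-resources, the sink is closed automatically, which flushes any triples still held in the buffer.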