Squirrel 0.1

@gsjunior86 gsjunior86 released this 31 Aug 12:38
· 647 commits to develop since this release

This is the first stable release of Squirrel.
It includes several implementations of Sinks, Analyzers and Collectors.

The docker-compose file is configured to start the Frontier and three workers.
The spring-config/context.xml file declares the implementations used by the workers.
(It is possible to define an individual config file for each worker; see the docker-compose file.)

In the following, we will briefly list the worker modules that are available in this release:

Fetcher

  • HTTPFetcher - Fetches data from HTTP sources.

  • FTPFetcher - Fetches data from FTP sources.

  • Note: The fetchers are not yet managed as Spring beans in this release, since only two are available. The worker will try to fetch data with both.
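One way to picture "try both fetchers" is the minimal Java sketch below. It is an illustration only: the interface SketchFetcher, the two stand-in classes, and the rule "first fetcher that returns a result wins" are assumptions made for this example, not Squirrel's actual fetcher API or behaviour. No network access takes place; the fetchers return placeholder strings.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical interface; the real Squirrel fetcher API may differ.
interface SketchFetcher {
    // Returns the fetched content, or empty if this fetcher cannot handle the URI.
    Optional<String> fetch(URI uri);
}

// Stand-in for an HTTP fetcher: accepts only http/https URIs.
class SketchHttpFetcher implements SketchFetcher {
    @Override
    public Optional<String> fetch(URI uri) {
        String scheme = uri.getScheme();
        if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
            return Optional.of("http-content:" + uri); // placeholder, no real request
        }
        return Optional.empty();
    }
}

// Stand-in for an FTP fetcher: accepts only ftp URIs.
class SketchFtpFetcher implements SketchFetcher {
    @Override
    public Optional<String> fetch(URI uri) {
        if ("ftp".equalsIgnoreCase(uri.getScheme())) {
            return Optional.of("ftp-content:" + uri); // placeholder, no real request
        }
        return Optional.empty();
    }
}

class FetcherChainDemo {
    // Tries every fetcher in order and returns the first non-empty result.
    static Optional<String> fetchWithAll(List<SketchFetcher> fetchers, URI uri) {
        for (SketchFetcher fetcher : fetchers) {
            Optional<String> result = fetcher.fetch(uri);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<SketchFetcher> fetchers =
                Arrays.asList(new SketchHttpFetcher(), new SketchFtpFetcher());
        System.out.println(fetchWithAll(fetchers, URI.create("ftp://example.org/data.nt")));
    }
}
```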

Analyzer

Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the SimpleAnalyzerManager. Every implementation to be used must be passed to the constructor of this class, as in the example below:

<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>

If you want to implement your own analyzer, you must also implement the isEligible() method, which checks whether the analyzer meets the conditions for its analyze method to be called.
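The eligibility check can be illustrated with a small self-contained Java sketch. Everything here is hypothetical: the SketchAnalyzer interface, the use of a content type string as the argument of isEligible(), and the manager class are assumptions chosen for the example, and the real Squirrel interface and method signatures may look different. The sketch only shows the pattern of asking each analyzer whether it is eligible before calling analyze.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical analyzer contract; Squirrel's real interface may differ.
interface SketchAnalyzer {
    // Decides whether this analyzer should handle the given data.
    boolean isEligible(String contentType);

    // Extracts triples (here: plain strings) from the fetched data.
    List<String> analyze(String data);
}

// Example analyzer that only accepts N-Triples content.
class SketchNTriplesAnalyzer implements SketchAnalyzer {
    @Override
    public boolean isEligible(String contentType) {
        return "application/n-triples".equals(contentType);
    }

    @Override
    public List<String> analyze(String data) {
        List<String> triples = new ArrayList<>();
        for (String line : data.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines.
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                triples.add(trimmed);
            }
        }
        return triples;
    }
}

// Hypothetical manager: runs only the analyzers that report themselves eligible.
class SketchAnalyzerManager {
    private final List<SketchAnalyzer> analyzers;

    SketchAnalyzerManager(List<SketchAnalyzer> analyzers) {
        this.analyzers = analyzers;
    }

    List<String> analyze(String contentType, String data) {
        List<String> result = new ArrayList<>();
        for (SketchAnalyzer analyzer : analyzers) {
            if (analyzer.isEligible(contentType)) {
                result.addAll(analyzer.analyze(data));
            }
        }
        return result;
    }
}
```

With this layout, data with a content type that no analyzer accepts simply yields an empty result instead of being passed to an analyzer that cannot parse it.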

Collectors

Collects the new URIs found during the analysis process and serializes them before they are sent to the Frontier.

  • SimpleUriCollector - Serializes URIs and stores them in memory (mainly used for testing purposes).
  • SqlBasedUriCollector - Serializes URIs and stores them in an HSQLDB database.
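As a rough illustration of the in-memory variant, the Java sketch below serializes each URI to UTF-8 bytes, keeps the bytes in a list, and deserializes them again when the collected URIs are handed over. The class name SketchInMemoryUriCollector, its methods, and the choice of UTF-8 strings as the serialized form are assumptions made for this example; the format actually used by SimpleUriCollector is not described in these notes.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory collector; not the actual SimpleUriCollector.
class SketchInMemoryUriCollector {
    // Serialized URIs, kept only in memory.
    private final List<byte[]> serialized = new ArrayList<>();

    // Serializes the URI and stores the bytes.
    void addUri(URI uri) {
        serialized.add(uri.toString().getBytes(StandardCharsets.UTF_8));
    }

    int size() {
        return serialized.size();
    }

    // Deserializes all stored URIs and clears the collector,
    // e.g. right before the URIs are sent on.
    List<URI> drain() {
        List<URI> uris = new ArrayList<>();
        for (byte[] bytes : serialized) {
            uris.add(URI.create(new String(bytes, StandardCharsets.UTF_8)));
        }
        serialized.clear();
        return uris;
    }
}
```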

Sink

Responsible for persisting the collected RDF data.

  • FileBasedSink - persists the triples in N-Triples (NT) files.
  • InMemorySink - persists the triples only in memory, not on disk (mainly used for testing purposes).
  • HdtBasedSink - persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
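To give an idea of what persisting triples as NT amounts to, here is a minimal Java sketch that writes one N-Triples line per triple. The class SketchNtSink and its methods are hypothetical and not part of Squirrel. The sketch also assumes that subject, predicate and object are all IRIs; literals, blank nodes and character escaping are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sink that writes IRI-only triples in N-Triples syntax.
class SketchNtSink implements AutoCloseable {
    private final Writer out;

    SketchNtSink(Writer out) {
        this.out = out;
    }

    // Convenience factory for writing to a file on disk.
    static SketchNtSink toFile(Path path) {
        try {
            return new SketchNtSink(Files.newBufferedWriter(path, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a single triple as "<s> <p> <o> ." followed by a newline.
    void addTriple(String subject, String predicate, String object) {
        try {
            out.write("<" + subject + "> <" + predicate + "> <" + object + "> .\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing a java.io.StringWriter to the constructor instead of calling toFile gives an in-memory variant of the same sketch, which is convenient for tests.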