Merge pull request #80 from dice-group/develop
Develop - Next Release
Showing 24 changed files with 242 additions and 218 deletions.
# Squirrel - Crawler of Linked Data
Squirrel searches and collects Linked Data.

<img src="https://hobbitdata.informatik.uni-leipzig.de/squirrel/squirrel-logo.png" align="center" height="248" width="244">

## Introduction
Squirrel is a crawler for the linked web. It provides several tools to search and collect data from the heterogeneous content of the linked web.

## Build notes
You can build the project with a simple ``mvn clean install`` and then use the *makefile*:

```
$ make build dockerize
$ docker-compose build
$ docker-compose up
```

### ... or do it manually
1. ``mvn clean package shade:shade -U -DskipTests``
1. If you have a new version of Squirrel, e.g. version 0.3.0, you **can** execute ``mvn install:install-file -DgroupId=org.aksw.simba -DartifactId=squirrel -Dpackaging=jar -Dversion=0.3.0 -Dfile="target\original-squirrel.jar" -DgeneratePom=true -DlocalRepositoryPath=repository``
1. If you want to use the Web-Components, have a look at the Dependencies section in this file.
1. ``docker build -t squirrel .``
1. Execute a `.yml` file with ``docker-compose -f <file> up`` / ``docker-compose -f <file> down``

## Run
You can run Squirrel by using one of the docker-compose files:

```
$ docker-compose -f docker-compose-sparql.yml up
```

#### There are currently 3 yml-options
All yml files in the root folder crawl real existing data portals with the help of the [HtmlScraper](https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to):
- `docker-compose.yml`: file-sink based, without web
- `docker-compose-sparql.yml`: sparql-sink based (_JENA_), without web
- `docker-compose-sparql-web.yml`: sparql-sink based (_JENA_), with web, including the visualization of the crawled graph

Squirrel uses Spring context configuration to define the implementations of its components at runtime. You can check the default implementation file in `spring-config/sparqlStoreBased.xml` and define your own beans in it.

You can also define a different context for each of the workers: check the docker-compose file and change the implementation file in each worker's env variable. A minimal sketch of such a context file is shown below.
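
As an illustration, here is a minimal sketch of a context file that overrides a single component. The bean id `sinkBean` and the implementation's package path are assumptions made for this example, so check `spring-config/sparqlStoreBased.xml` for the names actually used:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- Assumed bean id and class path: swap in the sink implementation you need. -->
    <bean id="sinkBean" class="org.aksw.simba.squirrel.sink.impl.file.FileBasedSink" />

</beans>
```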

These are the components of Squirrel that can be customized:

#### Fetcher

* *HTTPFetcher* - Fetches data from HTML sources.
* *FTPFetcher* - Fetches data from FTP sources.
* *SparqlBasedFetcher* - Fetches data from SPARQL endpoints.

*Note*: The fetchers are not managed as Spring beans yet, since only three are available.
#### Analyzer
Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the `SimpleAnalyzerManager`. Any implementations should be passed in the constructor of this class, like in the example below:
```xml
<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
        </array>
    </constructor-arg>
</bean>
```
Also, if you want to implement your own analyzer, it is necessary to implement the method `isEligible()`, which checks whether the analyzer matches the condition to call the `analyze` method (a registration sketch follows the list below).

* *RDFAnalyzer* - Analyses RDF formats.
* *HTMLScraperAnalyzer* - Analyses and scrapes HTML data based on the Jsoup selector syntax (see: https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to).
* *HDTAnalyzer* - Analyses the HDT binary RDF format.
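
Registering your own analyzer should then only require adding its fully qualified class name to the manager's array. A minimal sketch, where `com.example.MyAnalyzer` is a hypothetical class implementing `isEligible()` and `analyze`:

```xml
<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
    <constructor-arg index="0" ref="uriCollectorBean" />
    <constructor-arg index="1">
        <array value-type="java.lang.String">
            <!-- Hypothetical custom analyzer: replace with your own class. -->
            <value>com.example.MyAnalyzer</value>
            <value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
        </array>
    </constructor-arg>
</bean>
```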

#### Collectors
Collects new URIs found during the analysis process and serializes them before they are sent to the Frontier (a wiring sketch follows the list below).

* *SimpleUriCollector* - Serializes URIs and stores them in memory (mainly used for testing purposes).
* *SqlBasedUriCollector* - Serializes URIs and stores them in an HSQLDB database.
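
Tying this to the analyzer example above: the `uriCollectorBean` it references could be defined along these lines. The package path is an assumption for this sketch, and any constructor arguments are omitted, so verify against the actual collector classes before using it:

```xml
<!-- Assumed package path: verify against the actual SqlBasedUriCollector class. -->
<bean id="uriCollectorBean" class="org.aksw.simba.squirrel.collect.SqlBasedUriCollector" />
```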

#### Sink
Responsible for persisting the collected RDF data.

* *FileBasedSink* - persists the triples in NT files.
* *InMemorySink* - persists the triples only in memory, not on disk (mainly used for testing purposes).
* *HdtBasedSink* - persists the triples in an HDT file (a compressed RDF format, see http://www.rdfhdt.org/).
* *SparqlBasedSink* - persists the triples in a SPARQL endpoint.

### Using a Sparql-Host
You can use a sparql-based triple store (the *SparqlBasedSink*) to store the crawled data. The necessary datasets in the database are not yet created automatically, so you have to create them by hand:
1. Run Squirrel as explained above.
2. Enter *localhost:3030* in your browser's address line.
3. Go to *manage datasets*.
4. Click *add new dataset*.
5. For *Dataset name*, enter *contentset*.
6. For *Dataset type*, select *Persistent – dataset will persist across Fuseki restarts*.
7. Go to step 4 again and do the same, **but this time with *"Metadata"* as the *Dataset name***.

## Dependencies
The [Squirrel-Webservice](https://github.com/phhei/Squirrel-Webservice) and the [SquirrelWebObject](https://github.com/phhei/SquirrelWebObject) are now included in this project, which makes it a multi-module Maven project. Hence, there are two pom.xml's in the root layer:
- `pom.xml`: the module bundle pom. If you execute ``mvn clean package``, this file will be used and, as a consequence, all submodules including _squirrel_ will be compiled and packed.
- `squirrel-pom.xml`: the pom for _squirrel_ itself.

If you want to run Squirrel with the **Webservice**, make sure that you already have the current Webservice Docker image. If not, execute:
1. ``mvn clean package`` _(only necessary if you want to compile each subproject (module) by itself)_
1. (``SquirrelWebObject\install.bat``)
1. ``SquirrelWebService\buildImage.bat``

---

Seed list:
https://mcloud.de/web/guest/suche/-/results/search/0
https://www.europeandataportal.eu/data/en/dataset?q=&page=1
https://portal.opengeoedu.de
https://opendatainception.io
https://dataportals.org/search
https://www.govdata.de/ckan/catalog/catalog.rdf