Commit

Merge pull request #80 from dice-group/develop
Develop - Next Release
gsjunior86 authored Oct 30, 2018
2 parents 9fdab67 + d24a67c commit 9981001
Showing 24 changed files with 242 additions and 218 deletions.
95 changes: 58 additions & 37 deletions README.md
@@ -1,57 +1,78 @@
# Squirrel
Squirrel searches and collects Linked Data
# Squirrel - Crawler of Linked Data

## Running with docker
## Introduction
Squirrel is a crawler for the Linked Web. It provides several tools for searching and collecting data
from its heterogeneous content.

### Using the Makefile...
<img src="https://hobbitdata.informatik.uni-leipzig.de/squirrel/squirrel-logo.png" align="center" height="248" width="244" >


## Build notes
You can build the project with a simple ***mvn clean install***
and then use the *Makefile*:

```
$ make build dockerize
$ docker-compose build
$ docker-compose up
```

![Squirrel logo](https://hobbitdata.informatik.uni-leipzig.de/squirrel/squirrel-logo.png)
### ... or do it manually
## Run
You can run Squirrel by using the docker-compose file:

```
$ docker-compose -f docker-compose-sparql.yml up
```

1. ``mvn clean package shade:shade -U -DskipTests``
1. If you have a new version of Squirrel, e.g. version 0.3.0, you **can** execute ``mvn install:install-file -DgroupId=org.aksw.simba -DartifactId=squirrel -Dpackaging=jar -Dversion=0.3.0 -Dfile="target\original-squirrel.jar" -DgeneratePom=true -DlocalRepositoryPath=repository``
1. If you want to use the Web-Components, have a look at the Dependencies section in this file
1. ``docker build -t squirrel .``
1. Execute a `.yml` file with ``docker-compose -f <file> up`` / ``down``
Squirrel uses Spring context configuration to define the implementations of its components at runtime.
You can check the default configuration file in `spring-config/sparqlStoreBased.xml` and define your own
beans in it.

#### There are currently 3 yml-options
You can also define a different context for each worker. Check the docker-compose file and change
the implementation file referenced in each worker's environment variable.
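
To make this concrete, here is a minimal, hypothetical sketch of how a worker could load its Spring XML context from an environment variable. The variable name `SPRING_CONFIG_FILE` and the class below are illustrative assumptions, not part of Squirrel; check the docker-compose file for the actual variable each worker reads.

```java
import org.springframework.context.support.FileSystemXmlApplicationContext;

public class WorkerContextExample {

    public static void main(String[] args) {
        // Hypothetical variable name; the real one is defined in the docker-compose file.
        String configFile = System.getenv().getOrDefault("SPRING_CONFIG_FILE",
                "spring-config/sparqlStoreBased.xml");

        // Load the worker's bean definitions from the XML file ...
        FileSystemXmlApplicationContext context = new FileSystemXmlApplicationContext(configFile);
        try {
            // ... and look up a bean declared there, e.g. the analyzer manager bean
            // shown in the XML example further below.
            Object analyzerManager = context.getBean("analyzerBean");
            System.out.println("Loaded bean: " + analyzerManager.getClass().getName());
        } finally {
            context.close();
        }
    }
}
```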

All yml files in the root folder crawl real, existing data portals with the help of the [HtmlScraper](https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to):
- `docker-compose.yml`: file-sink based, without web
- `docker-compose-sparql.yml`: sparql-sink based (_JENA_), without web
- `docker-compose-sparql-web.yml`: sparql-sink based (_JENA_), with web, including the visualization of the crawled graph!
These are the components of Squirrel that can be customized:

---
#### Fetcher

* *HTTPFetcher* - Fetches data from HTTP sources.
* *FTPFetcher* - Fetches data from FTP sources.
* *SparqlBasedFetcher* - Fetches data from SPARQL endpoints.

* *Note*: The fetchers are not managed as Spring beans yet, since only three are available.

#### Analyzer
Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the `SimpleAnalyzerManager`. Any implementation should be passed to the constructor of this class, as in the example below:
```xml
<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
<constructor-arg index="0" ref="uriCollectorBean" />
<constructor-arg index="1" >
<array value-type="java.lang.String">
<value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
<value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
<value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
</array>
</constructor-arg>
</bean>
```
Also, if you want to implement your own analyzer, it is necessary to implement the method `isEligible()`, which checks whether the analyzer matches the conditions for calling the `analyze` method.
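
As a rough illustration, a custom analyzer could look like the sketch below. The interface name, package paths, and method signatures (including the return type of `analyze`) are assumptions derived from the description and the XML example above; the sources in this repository partly use `org.dice_research.squirrel` packages instead, so check the `Analyzer` interface in the code base for the exact contract. The `CsvAnalyzer` class itself is purely hypothetical.

```java
import java.io.File;
import java.util.Iterator;

// Package names are assumptions based on the XML example above; the sources in
// this repository partly use org.dice_research.squirrel instead.
import org.aksw.simba.squirrel.analyzer.Analyzer;
import org.aksw.simba.squirrel.data.uri.CrawleableUri;
import org.aksw.simba.squirrel.sink.Sink;

public class CsvAnalyzer implements Analyzer {

    @Override
    public boolean isEligible(CrawleableUri curi, File data) {
        // Assumed signature: decide whether this analyzer should handle the fetched data.
        return data != null && data.getName().toLowerCase().endsWith(".csv");
    }

    @Override
    public Iterator<byte[]> analyze(CrawleableUri curi, File data, Sink sink) {
        // Assumed signature: parse the file, turn its rows into triples and hand them
        // to the sink. The actual parsing and return value are omitted in this sketch.
        return null;
    }
}
```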

## Dependencies
* *RDFAnalyzer* - Analyses RDF formats.
* *HTMLScraperAnalyzer* - Analyses and scrapes HTML data based on the Jsoup selector syntax (see: https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to)
* *HDTAnalyzer* - Analyses the HDT binary RDF format.

### Using a Sparql-Host
#### Collectors
Collect new URIs found during the analysis process and serialize them before they are sent to the Frontier.

You can use a SPARQL-based triple store to store the crawled data. If you want to use it, you have to do the following:
* *SimpleUriCollector* - Serializes URIs and stores them in memory (mainly used for testing purposes).
* *SqlBasedUriCollector* - Serializes URIs and stores them in an HSQLDB database.

So far, the necessary datasets in the database are not created automatically, so you have to create them by hand (or via Fuseki's admin API, as sketched after these steps):
1. Run Squirrel as explained above
2. Enter *localhost:3030* in your browser's address line
3. Go to *manage datasets*
4. Click *add new dataset*
5. For *Dataset name* paste *contentset*
6. For *Dataset type* select *Persistent – dataset will persist across Fuseki restarts*
7. Go to step 4 again and do the same, **but this time with *"Metadata"* as *"Dataset name"***
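
If you prefer to script these steps instead of clicking through the web UI, Fuseki also offers an HTTP administration endpoint. The sketch below is only an illustration: it assumes a Fuseki instance on `localhost:3030` whose admin API is reachable without authentication (newer Fuseki versions may require admin credentials), and it uses `dbType=tdb` as the persistent dataset type.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FusekiDatasetSetup {

    public static void main(String[] args) throws Exception {
        // Create the two datasets used by Squirrel, as in the manual steps above.
        createDataset("contentset");
        createDataset("Metadata");
    }

    private static void createDataset(String name) throws Exception {
        // POST to Fuseki's admin endpoint; dbType=tdb creates a persistent (TDB) dataset.
        URL url = new URL("http://localhost:3030/$/datasets");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        byte[] body = ("dbName=" + name + "&dbType=tdb").getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        System.out.println("Creating dataset " + name + ": HTTP " + con.getResponseCode());
    }
}
```
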
#### Sink
Responsible for persisting the collected RDF data.

### Further dependencies
* *FileBasedSink* - persists the triples in NT files.
* *InMemorySink* - persists the triples only in memory, not on disk (mainly used for testing purposes).
* *HdtBasedSink* - persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
* *SparqlBasedSink* - persists the triples in a SPARQL endpoint.
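
For intuition, the following standalone sketch shows roughly what a file-based sink does: it streams triples into an N-Triples file using Apache Jena. It is not Squirrel's actual `FileBasedSink` implementation, and the example triple and output file name are made up.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

public class FileSinkSketch {

    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("crawled-data.nt")) {
            // Stream triples straight into an N-Triples file, similar in spirit to
            // what a file-based sink does for every crawled URI.
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            writer.start();
            writer.triple(Triple.create(
                    NodeFactory.createURI("http://example.org/dataset/1"),
                    NodeFactory.createURI("http://purl.org/dc/terms/title"),
                    NodeFactory.createLiteral("Example dataset")));
            writer.finish();
        }
    }
}
```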

The [Squirrel-Webservice](https://github.com/phhei/Squirrel-Webservice) and the [SquirrelWebObject](https://github.com/phhei/SquirrelWebObject) are now included in this project, which makes it a multi-module Maven project. Hence, there are two pom.xml files in the root layer:
- `pom.xml`: the module bundle pom. If you execute ``mvn clean package``, this file is used and, as a consequence, all submodules including _squirrel_ itself are compiled and packaged
- `squirrel-pom.xml`: the pom for the _squirrel_ module

If you want to run Squirrel with the **Webservice**, make sure that you already have the current Webservice Docker image. If not, execute:
1. ``mvn clean package`` _(only necessary if you want to compile each subproject (module) for itself)_
1. (``SquirrelWebObject\install.bat``)
1. ``SquirrelWebService\buildImage.bat``
2 changes: 1 addition & 1 deletion docker-compose-sparql.yml
@@ -14,8 +14,8 @@ services:
container_name: frontier
environment:
- HOBBIT_RABBIT_HOST=rabbit
- SEED_FILE=/var/squirrel/seeds.txt
- URI_WHITELIST_FILE=/var/squirrel/whitelist.txt
- SEED_FILE=/var/squirrel/seeds.txt
- MDB_HOST_NAME=mongodb
- MDB_PORT=27017
- COMMUNICATION_WITH_WEBSERVICE=false
5 changes: 1 addition & 4 deletions seed/seeds.txt
@@ -1,6 +1,3 @@
https://mcloud.de/web/guest/suche/-/results/search/0
https://www.europeandataportal.eu/data/en/dataset?q=&page=1
https://portal.opengeoedu.de
https://opendatainception.io
https://dataportals.org/search

https://www.govdata.de/ckan/catalog/catalog.rdf
@@ -10,11 +10,14 @@
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.bson.Document;
import org.bson.conversions.Bson;
import org.dice_research.squirrel.data.uri.CrawleableUri;
import org.dice_research.squirrel.data.uri.UriType;
import org.dice_research.squirrel.deduplication.hashing.HashValue;
import org.dice_research.squirrel.deduplication.hashing.UriHashCustodian;
import org.dice_research.squirrel.frontier.impl.FrontierImpl;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -29,7 +32,7 @@
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeable {
public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeable,UriHashCustodian {

private static final Logger LOGGER = LoggerFactory.getLogger(MongoDBKnowUriFilter.class);

@@ -46,6 +49,10 @@ public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeabl
public static final String COLUMN_IP = "ipAddress";
public static final String COLUMN_TYPE = "type";
public static final String COLUMN_HASH_VALUE = "hashValue";
/**
* Used as a default hash value for URIs; it will be replaced by the real hash value as soon as it has been computed.
*/
private static final String DUMMY_HASH_VALUE = "dummyValue";

public MongoDBKnowUriFilter(String hostName, Integer port) {

@@ -61,7 +68,7 @@ public boolean isUriGood(CrawleableUri uri) {
if (cursor.hasNext()) {
LOGGER.debug("URI {} is not good", uri.toString());
Document doc = cursor.next();
Long timestampRetrieved = Long.parseLong(doc.get("timestamp").toString());
Long timestampRetrieved = Long.parseLong(doc.get(COLUMN_TIMESTAMP_LAST_CRAWL).toString());
cursor.close();
if ((System.currentTimeMillis() - timestampRetrieved) < recrawlEveryWeek) {
return false;
@@ -77,10 +84,8 @@ public boolean isUriGood(CrawleableUri uri) {
}

@Override
public void add(CrawleableUri uri, long timestamp) {
mongoDB.getCollection(COLLECTION_NAME)
.insertOne(crawleableUriToMongoDocument(uri).append("timestamp", timestamp));
LOGGER.debug("Adding URI {} to the known uri filter list", uri.toString());
public void add(CrawleableUri uri, long nextCrawlTimestamp) {
add(uri, System.currentTimeMillis(), nextCrawlTimestamp);
}

public Document crawleableUriToMongoDocument(CrawleableUri uri) {
@@ -122,10 +127,25 @@ public boolean knowUriTableExists() {

@Override
public void add(CrawleableUri uri, long lastCrawlTimestamp, long nextCrawlTimestamp) {
// TODO Add recrawling support
add(uri, System.currentTimeMillis());
mongoDB.getCollection(COLLECTION_NAME)
.insertOne(crawleableUriToMongoDocument(uri)
.append(COLUMN_TIMESTAMP_LAST_CRAWL, lastCrawlTimestamp)
.append(COLUMN_TIMESTAMP_NEXT_CRAWL, nextCrawlTimestamp)
.append(COLUMN_CRAWLING_IN_PROCESS, false)
.append(COLUMN_HASH_VALUE, DUMMY_HASH_VALUE)
);
LOGGER.debug("Adding URI {} to the known uri filter list", uri.toString());
}

@Override
public void addHashValuesForUris(List<CrawleableUri> uris) {
for (CrawleableUri uri : uris) {
// r.db(DATABASE_NAME).table(TABLE_NAME).filter(doc -> doc.getField(COLUMN_URI).eq(uri.getUri().toString())).
// update(r.hashMap(COLUMN_HASH_VALUE, ((HashValue) uri.getData(Constants.URI_HASH_KEY)).encodeToString())).run(connector.connection);
}
}


public void purge() {
mongoDB.getCollection(COLLECTION_NAME).drop();
}
@@ -135,42 +155,44 @@ public List<CrawleableUri> getOutdatedUris() {
// get all uris with the following property:
// (nextCrawlTimestamp has passed) AND (crawlingInProcess==false OR lastCrawlTimestamp is 3 times older than generalRecrawlTime)

long generalRecrawlTime = Math.max(FrontierImpl.DEFAULT_GENERAL_RECRAWL_TIME, FrontierImpl.getGeneralRecrawlTime());
long generalRecrawlTime = Math.max(FrontierImpl.DEFAULT_GENERAL_RECRAWL_TIME, FrontierImpl.getGeneralRecrawlTime());

        Bson filter = Filters.and(Filters.lte(COLUMN_TIMESTAMP_NEXT_CRAWL, System.currentTimeMillis()),
                Filters.or(
                        Filters.eq(COLUMN_CRAWLING_IN_PROCESS, false),
                        Filters.lte(COLUMN_TIMESTAMP_LAST_CRAWL, System.currentTimeMillis() - generalRecrawlTime * 3)
                ));




Iterator<Document> uriDocs = mongoDB.getCollection(COLLECTION_NAME).find(filter).iterator();



// List<CrawleableUri> urisToRecrawl = new ArrayList<>();
// while (uriDocs.hasNext()) {
// try {
// Document doc = uriDocs.next();
// String ipString = (String) row.get(COLUMN_IP);
// if (ipString.contains("/")) {
// ipString = ipString.split("/")[1];
// }
// urisToRecrawl.add(new CrawleableUri(new URI((String) row.get(COLUMN_URI)), InetAddress.getByName(ipString)));
// } catch (URISyntaxException | UnknownHostException e) {
// LOGGER.warn(e.toString());
// }
// }
//
// // mark that the uris are in process now
// for (CrawleableUri uri : urisToRecrawl) {
// r.db(DATABASE_NAME).table(TABLE_NAME).filter(doc -> doc.getField(COLUMN_URI).eq(uri.getUri().toString())).
// update(r.hashMap(COLUMN_CRAWLING_IN_PROCESS, true)).run(connector.connection);
// }
//

List<CrawleableUri> urisToRecrawl = new ArrayList<>();
while (uriDocs.hasNext()) {
try {
Document doc = uriDocs.next();
String ipString = (String) doc.get(COLUMN_IP);
if (ipString.contains("/")) {
ipString = ipString.split("/")[1];
}
urisToRecrawl.add(new CrawleableUri(new URI((String) doc.get(COLUMN_URI)), InetAddress.getByName(ipString)));
} catch (URISyntaxException | UnknownHostException e) {
LOGGER.warn(e.toString());
}
}

// mark that the uris are in process now
for (CrawleableUri uri : urisToRecrawl) {

BasicDBObject newDocument = new BasicDBObject();
newDocument.append("$set", new BasicDBObject().append(COLUMN_CRAWLING_IN_PROCESS, true));

BasicDBObject searchQuery = new BasicDBObject().append(COLUMN_URI, uri.getUri().toString());

mongoDB.getCollection(COLLECTION_NAME).updateMany(searchQuery, newDocument);

}

// cursor.close();
return null;
return urisToRecrawl;
}

@Override
@@ -179,4 +201,10 @@ public long count() {
return 0;
}

@Override
public Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison) {
// TODO Auto-generated method stub
return null;
}

}
@@ -28,6 +28,7 @@
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.DropIndexOptions;
import com.mongodb.client.model.Indexes;

@SuppressWarnings("deprecation")
@@ -164,15 +165,20 @@ protected List<CrawleableUri> getUris(IpUriTypePair pair) {

try {
while(uriDocs.hasNext()) {
listUris.add( serializer.deserialize( ((Binary) uriDocs.next().get("uri")).getData()) );

Document doc = uriDocs.next();

listUris.add( serializer.deserialize( ((Binary) doc.get("uri")).getData()) );

}

}catch (Exception e) {
LOGGER.error("Error while retrieving uri from MongoDBQueue",e);
}

// mongoDB.getCollection(COLLECTION_NAME).deleteOne(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
// mongoDB.getCollection(COLLECTION_URIS).deleteMany(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
mongoDB.getCollection(COLLECTION_NAME).deleteOne(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
mongoDB.getCollection(COLLECTION_URIS).deleteMany(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));


return listUris;
}
@@ -50,7 +50,7 @@ public class MongoDBKnownUriFilterTest {
public void setUp() throws Exception {


filter = new MongoDBKnowUriFilter(MongoDBBasedTest.DB_HOST_NAME,MongoDBBasedTest.DB_PORT);
filter = new MongoDBKnowUriFilter(MongoDBBasedTest.DB_HOST_NAME,27017);
filter.open();
MongoDBBasedTest.tearDownMDB();
MongoDBBasedTest.setUpMDB();
