Commit

Merge pull request #80 from dice-group/develop
Develop - Next Release
gsjunior86 authored Oct 30, 2018
2 parents 9fdab67 + d24a67c commit 9981001
Showing 24 changed files with 242 additions and 218 deletions.
95 changes: 58 additions & 37 deletions README.md
@@ -1,57 +1,78 @@
# Squirrel
Squirrel searches and collects Linked Data
# Squirrel - Crawler of Linked Data

## Running with docker
## Introduction
Squirrel is a crawler for the Linked Web. It provides several tools for searching and collecting data
from its heterogeneous content.

### Using the Makefile...
<img src="https://hobbitdata.informatik.uni-leipzig.de/squirrel/squirrel-logo.png" align="center" height="248" width="244" >


## Build notes
You can build the project with a simple ***mvn clean install***
and then use the *Makefile*:

```
$ make build dockerize
$ docker-compose build
$ docker-compose up
```

![Squirrel logo](https://hobbitdata.informatik.uni-leipzig.de/squirrel/squirrel-logo.png)
### ... or do it manually
## Run
You can run Squirrel by using the docker-compose file:

```
$ docker-compose -f docker-compose-sparql.yml up
```

1. ``mvn clean package shade:shade -U -DskipTests``
1. If you have a new version of Squirrel, e.g. version 0.3.0, you **can** execute ``mvn install:install-file -DgroupId=org.aksw.simba -DartifactId=squirrel -Dpackaging=jar -Dversion=0.3.0 -Dfile="target\original-squirrel.jar" -DgeneratePom=true -DlocalRepositoryPath=repository``
1. If you want to use the Web-Components, have a look at the Dependencies section in this file
1. ``docker build -t squirrel .``
1. Execute a `.yml` file with ``docker-compose -f <file> up`` / ``down``
Squirrel uses Spring context configuration to define the implementations of its components at runtime.
You can check the default configuration file in `spring-config/sparqlStoreBased.xml` and define your own
beans in it.

#### There are currently 3 yml-options
You can also define a different context for each worker. Check the docker-compose file and change
the implementation file referenced in each worker's environment variable.
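
To make this concrete, here is a minimal, hypothetical sketch of how a worker could load its Spring XML context from an environment variable. The variable name `SPRING_CONFIG_FILE` and the class below are illustrative assumptions, not part of Squirrel; check the docker-compose file for the actual variable each worker reads.

```java
import org.springframework.context.support.FileSystemXmlApplicationContext;

public class WorkerContextExample {

    public static void main(String[] args) {
        // Hypothetical variable name; the real one is defined in the docker-compose file.
        String configFile = System.getenv().getOrDefault("SPRING_CONFIG_FILE",
                "spring-config/sparqlStoreBased.xml");

        // Load the worker's bean definitions from the XML file ...
        FileSystemXmlApplicationContext context = new FileSystemXmlApplicationContext(configFile);
        try {
            // ... and look up a bean declared there, e.g. the analyzer manager bean
            // shown in the XML example further below.
            Object analyzerManager = context.getBean("analyzerBean");
            System.out.println("Loaded bean: " + analyzerManager.getClass().getName());
        } finally {
            context.close();
        }
    }
}
```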

All yml files in the root folder crawl real, existing data portals with the help of the [HtmlScraper](https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to):
- `docker-compose.yml`: file-sink based, without web
- `docker-compose-sparql.yml`: sparql-sink based (_JENA_), without web
- `docker-compose-sparql-web.yml`: sparql-sink based (_JENA_), with web, including the visualization of the crawled graph!
These are the components of Squirrel that can be customized:

---
#### Fetcher

* *HTTPFetcher* - Fetches data from HTTP sources.
* *FTPFetcher* - Fetches data from FTP sources.
* *SparqlBasedFetcher* - Fetches data from SPARQL endpoints.

* *Note*: The fetchers are not managed as Spring beans yet, since only three are available.

#### Analyzer
Analyses the fetched data and extracts triples from it. Note: the analyzer implementations are managed by the `SimpleAnalyzerManager`. Any implementation should be passed to the constructor of this class, as in the example below:
```xml
<bean id="analyzerBean" class="org.aksw.simba.squirrel.analyzer.manager.SimpleAnalyzerManager">
<constructor-arg index="0" ref="uriCollectorBean" />
<constructor-arg index="1" >
<array value-type="java.lang.String">
<value>org.aksw.simba.squirrel.analyzer.impl.HDTAnalyzer</value>
<value>org.aksw.simba.squirrel.analyzer.impl.RDFAnalyzer</value>
<value>org.aksw.simba.squirrel.analyzer.impl.HTMLScraperAnalyzer</value>
</array>
</constructor-arg>
</bean>
```
Also, if you want to implement your own analyzer, it is necessary to implement the method `isEligible()`, which checks whether the analyzer matches the conditions for calling the `analyze` method.
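
As a rough illustration, a custom analyzer could look like the sketch below. The interface name, package paths, and method signatures (including the return type of `analyze`) are assumptions derived from the description and the XML example above; the sources in this repository partly use `org.dice_research.squirrel` packages instead, so check the `Analyzer` interface in the code base for the exact contract. The `CsvAnalyzer` class itself is purely hypothetical.

```java
import java.io.File;
import java.util.Iterator;

// Package names are assumptions based on the XML example above; the sources in
// this repository partly use org.dice_research.squirrel instead.
import org.aksw.simba.squirrel.analyzer.Analyzer;
import org.aksw.simba.squirrel.data.uri.CrawleableUri;
import org.aksw.simba.squirrel.sink.Sink;

public class CsvAnalyzer implements Analyzer {

    @Override
    public boolean isEligible(CrawleableUri curi, File data) {
        // Assumed signature: decide whether this analyzer should handle the fetched data.
        return data != null && data.getName().toLowerCase().endsWith(".csv");
    }

    @Override
    public Iterator<byte[]> analyze(CrawleableUri curi, File data, Sink sink) {
        // Assumed signature: parse the file, turn its rows into triples and hand them
        // to the sink. The actual parsing and return value are omitted in this sketch.
        return null;
    }
}
```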

## Dependencies
* *RDFAnalyzer* - Analyses RDF formats.
* *HTMLScraperAnalyzer* - Analyses and scrapes HTML data based on the Jsoup selector syntax (see: https://github.com/dice-group/Squirrel/wiki/HtmlScraper_how_to)
* *HDTAnalyzer* - Analyses the HDT binary RDF format.

### Using a Sparql-Host
#### Collectors
Collect new URIs found during the analysis process and serialize them before they are sent to the Frontier.

You can use a SPARQL-based triple store to store the crawled data. If you want to use it, you have to do the following:
* *SimpleUriCollector* - Serializes URIs and stores them in memory (mainly used for testing purposes).
* *SqlBasedUriCollector* - Serializes URIs and stores them in an HSQLDB database.

So far, the necessary datasets in the database are not created automatically, so you have to create them by hand (or via Fuseki's admin API, as sketched after these steps):
1. Run Squirrel as explained above
2. Enter *localhost:3030* in your browser's address line
3. Go to *manage datasets*
4. Click *add new dataset*
5. For *Dataset name* paste *contentset*
6. For *Dataset type* select *Persistent – dataset will persist across Fuseki restarts*
7. Go to step 4 again and do the same, **but this time with *"Metadata"* as *"Dataset name"***
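
If you prefer to script these steps instead of clicking through the web UI, Fuseki also offers an HTTP administration endpoint. The sketch below is only an illustration: it assumes a Fuseki instance on `localhost:3030` whose admin API is reachable without authentication (newer Fuseki versions may require admin credentials), and it uses `dbType=tdb` as the persistent dataset type.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FusekiDatasetSetup {

    public static void main(String[] args) throws Exception {
        // Create the two datasets used by Squirrel, as in the manual steps above.
        createDataset("contentset");
        createDataset("Metadata");
    }

    private static void createDataset(String name) throws Exception {
        // POST to Fuseki's admin endpoint; dbType=tdb creates a persistent (TDB) dataset.
        URL url = new URL("http://localhost:3030/$/datasets");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        byte[] body = ("dbName=" + name + "&dbType=tdb").getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        System.out.println("Creating dataset " + name + ": HTTP " + con.getResponseCode());
    }
}
```
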
#### Sink
Responsible for persisting the collected RDF data.

### Further dependencies
* *FileBasedSink* - persists the triples in NT files.
* *InMemorySink* - persists the triples only in memory, not on disk (mainly used for testing purposes).
* *HdtBasedSink* - persists the triples in an HDT file (compressed RDF format - http://www.rdfhdt.org/).
* *SparqlBasedSink* - persists the triples in a SPARQL endpoint.
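
For intuition, the following standalone sketch shows roughly what a file-based sink does: it streams triples into an N-Triples file using Apache Jena. It is not Squirrel's actual `FileBasedSink` implementation, and the example triple and output file name are made up.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

public class FileSinkSketch {

    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("crawled-data.nt")) {
            // Stream triples straight into an N-Triples file, similar in spirit to
            // what a file-based sink does for every crawled URI.
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            writer.start();
            writer.triple(Triple.create(
                    NodeFactory.createURI("http://example.org/dataset/1"),
                    NodeFactory.createURI("http://purl.org/dc/terms/title"),
                    NodeFactory.createLiteral("Example dataset")));
            writer.finish();
        }
    }
}
```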

The [Squirrel-Webservice](https://github.com/phhei/Squirrel-Webservice) and the [SquirrelWebObject](https://github.com/phhei/SquirrelWebObject) are now included in this project, which makes it a multi-module Maven project. Hence, there are two pom.xml files in the root layer:
- `pom.xml`: the module bundle pom. If you execute ``mvn clean package``, this file is used and, as a consequence, all submodules including _squirrel_ itself are compiled and packaged
- `squirrel-pom.xml`: the pom for the _squirrel_ module

If you want to run Squirrel with the **Webservice**, make sure that you already have the current Webservice Docker image. If not, execute:
1. ``mvn clean package`` _(only necessary if you want to compile each subproject (module) for itself)_
1. (``SquirrelWebObject\install.bat``)
1. ``SquirrelWebService\buildImage.bat``
2 changes: 1 addition & 1 deletion docker-compose-sparql.yml
@@ -14,8 +14,8 @@ services:
container_name: frontier
environment:
- HOBBIT_RABBIT_HOST=rabbit
- SEED_FILE=/var/squirrel/seeds.txt
- URI_WHITELIST_FILE=/var/squirrel/whitelist.txt
- SEED_FILE=/var/squirrel/seeds.txt
- MDB_HOST_NAME=mongodb
- MDB_PORT=27017
- COMMUNICATION_WITH_WEBSERVICE=false
5 changes: 1 addition & 4 deletions seed/seeds.txt
@@ -1,6 +1,3 @@
https://mcloud.de/web/guest/suche/-/results/search/0
https://www.europeandataportal.eu/data/en/dataset?q=&page=1
https://portal.opengeoedu.de
https://opendatainception.io
https://dataportals.org/search

https://www.govdata.de/ckan/catalog/catalog.rdf
@@ -10,11 +10,14 @@
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.bson.Document;
import org.bson.conversions.Bson;
import org.dice_research.squirrel.data.uri.CrawleableUri;
import org.dice_research.squirrel.data.uri.UriType;
import org.dice_research.squirrel.deduplication.hashing.HashValue;
import org.dice_research.squirrel.deduplication.hashing.UriHashCustodian;
import org.dice_research.squirrel.frontier.impl.FrontierImpl;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -29,7 +32,7 @@
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;

public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeable {
public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeable,UriHashCustodian {

private static final Logger LOGGER = LoggerFactory.getLogger(MongoDBKnowUriFilter.class);

@@ -46,6 +49,10 @@ public class MongoDBKnowUriFilter implements KnownUriFilter, Cloneable, Closeabl
public static final String COLUMN_IP = "ipAddress";
public static final String COLUMN_TYPE = "type";
public static final String COLUMN_HASH_VALUE = "hashValue";
/**
* Used as a default hash value for URIs; it will be replaced by the real hash value as soon as it has been computed.
*/
private static final String DUMMY_HASH_VALUE = "dummyValue";

public MongoDBKnowUriFilter(String hostName, Integer port) {

@@ -61,7 +68,7 @@ public boolean isUriGood(CrawleableUri uri) {
if (cursor.hasNext()) {
LOGGER.debug("URI {} is not good", uri.toString());
Document doc = cursor.next();
Long timestampRetrieved = Long.parseLong(doc.get("timestamp").toString());
Long timestampRetrieved = Long.parseLong(doc.get(COLUMN_TIMESTAMP_LAST_CRAWL).toString());
cursor.close();
if ((System.currentTimeMillis() - timestampRetrieved) < recrawlEveryWeek) {
return false;
@@ -77,10 +84,8 @@ public boolean isUriGood(CrawleableUri uri) {
}

@Override
public void add(CrawleableUri uri, long timestamp) {
mongoDB.getCollection(COLLECTION_NAME)
.insertOne(crawleableUriToMongoDocument(uri).append("timestamp", timestamp));
LOGGER.debug("Adding URI {} to the known uri filter list", uri.toString());
public void add(CrawleableUri uri, long nextCrawlTimestamp) {
add(uri, System.currentTimeMillis(), nextCrawlTimestamp);
}

public Document crawleableUriToMongoDocument(CrawleableUri uri) {
@@ -122,10 +127,25 @@ public boolean knowUriTableExists() {

@Override
public void add(CrawleableUri uri, long lastCrawlTimestamp, long nextCrawlTimestamp) {
// TODO Add recrawling support
add(uri, System.currentTimeMillis());
mongoDB.getCollection(COLLECTION_NAME)
.insertOne(crawleableUriToMongoDocument(uri)
.append(COLUMN_TIMESTAMP_LAST_CRAWL, lastCrawlTimestamp)
.append(COLUMN_TIMESTAMP_NEXT_CRAWL, nextCrawlTimestamp)
.append(COLUMN_CRAWLING_IN_PROCESS, false)
.append(COLUMN_HASH_VALUE, DUMMY_HASH_VALUE)
);
LOGGER.debug("Adding URI {} to the known uri filter list", uri.toString());
}

@Override
public void addHashValuesForUris(List<CrawleableUri> uris) {
for (CrawleableUri uri : uris) {
// r.db(DATABASE_NAME).table(TABLE_NAME).filter(doc -> doc.getField(COLUMN_URI).eq(uri.getUri().toString())).
// update(r.hashMap(COLUMN_HASH_VALUE, ((HashValue) uri.getData(Constants.URI_HASH_KEY)).encodeToString())).run(connector.connection);
}
}


public void purge() {
mongoDB.getCollection(COLLECTION_NAME).drop();
}
@@ -135,42 +155,44 @@ public List<CrawleableUri> getOutdatedUris() {
// get all uris with the following property:
// (nextCrawlTimestamp has passed) AND (crawlingInProcess==false OR lastCrawlTimestamp is 3 times older than generalRecrawlTime)

long generalRecrawlTime = Math.max(FrontierImpl.DEFAULT_GENERAL_RECRAWL_TIME, FrontierImpl.getGeneralRecrawlTime());
long generalRecrawlTime = Math.max(FrontierImpl.DEFAULT_GENERAL_RECRAWL_TIME, FrontierImpl.getGeneralRecrawlTime());

        Bson filter = Filters.and(Filters.lte(COLUMN_TIMESTAMP_NEXT_CRAWL, System.currentTimeMillis()),
                Filters.or(
                        Filters.eq(COLUMN_CRAWLING_IN_PROCESS, false),
                        Filters.lte(COLUMN_TIMESTAMP_LAST_CRAWL, System.currentTimeMillis() - generalRecrawlTime * 3)
                ));




Iterator<Document> uriDocs = mongoDB.getCollection(COLLECTION_NAME).find(filter).iterator();



// List<CrawleableUri> urisToRecrawl = new ArrayList<>();
// while (uriDocs.hasNext()) {
// try {
// Document doc = uriDocs.next();
// String ipString = (String) row.get(COLUMN_IP);
// if (ipString.contains("/")) {
// ipString = ipString.split("/")[1];
// }
// urisToRecrawl.add(new CrawleableUri(new URI((String) row.get(COLUMN_URI)), InetAddress.getByName(ipString)));
// } catch (URISyntaxException | UnknownHostException e) {
// LOGGER.warn(e.toString());
// }
// }
//
// // mark that the uris are in process now
// for (CrawleableUri uri : urisToRecrawl) {
// r.db(DATABASE_NAME).table(TABLE_NAME).filter(doc -> doc.getField(COLUMN_URI).eq(uri.getUri().toString())).
// update(r.hashMap(COLUMN_CRAWLING_IN_PROCESS, true)).run(connector.connection);
// }
//

List<CrawleableUri> urisToRecrawl = new ArrayList<>();
while (uriDocs.hasNext()) {
try {
Document doc = uriDocs.next();
String ipString = (String) doc.get(COLUMN_IP);
if (ipString.contains("/")) {
ipString = ipString.split("/")[1];
}
urisToRecrawl.add(new CrawleableUri(new URI((String) doc.get(COLUMN_URI)), InetAddress.getByName(ipString)));
} catch (URISyntaxException | UnknownHostException e) {
LOGGER.warn(e.toString());
}
}

// mark that the uris are in process now
for (CrawleableUri uri : urisToRecrawl) {

BasicDBObject newDocument = new BasicDBObject();
newDocument.append("$set", new BasicDBObject().append(COLUMN_CRAWLING_IN_PROCESS, true));

BasicDBObject searchQuery = new BasicDBObject().append(COLUMN_URI, uri.getUri().toString());

mongoDB.getCollection(COLLECTION_NAME).updateMany(searchQuery, newDocument);

}

// cursor.close();
return null;
return urisToRecrawl;
}

@Override
@@ -179,4 +201,10 @@ public long count() {
return 0;
}

@Override
public Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison) {
// TODO Auto-generated method stub
return null;
}

}
@@ -28,6 +28,7 @@
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.DropIndexOptions;
import com.mongodb.client.model.Indexes;

@SuppressWarnings("deprecation")
@@ -164,15 +165,20 @@ protected List<CrawleableUri> getUris(IpUriTypePair pair) {

try {
while(uriDocs.hasNext()) {
listUris.add( serializer.deserialize( ((Binary) uriDocs.next().get("uri")).getData()) );

Document doc = uriDocs.next();

listUris.add( serializer.deserialize( ((Binary) doc.get("uri")).getData()) );

}

}catch (Exception e) {
LOGGER.error("Error while retrieving uri from MongoDBQueue",e);
}

// mongoDB.getCollection(COLLECTION_NAME).deleteOne(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
// mongoDB.getCollection(COLLECTION_URIS).deleteMany(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
mongoDB.getCollection(COLLECTION_NAME).deleteOne(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));
mongoDB.getCollection(COLLECTION_URIS).deleteMany(new Document("ipAddress",pair.ip.getHostAddress()).append("type", pair.type.toString()));


return listUris;
}
@@ -50,7 +50,7 @@ public class MongoDBKnownUriFilterTest {
public void setUp() throws Exception {


filter = new MongoDBKnowUriFilter(MongoDBBasedTest.DB_HOST_NAME,MongoDBBasedTest.DB_PORT);
filter = new MongoDBKnowUriFilter(MongoDBBasedTest.DB_HOST_NAME,27017);
filter.open();
MongoDBBasedTest.tearDownMDB();
MongoDBBasedTest.setUpMDB();
