Merge pull request #37 from peterbencze/development

New version
peterbencze · Jul 18, 2017 · 7492f5f
2 parents 59d925e + 42d95d4
commit 7492f5f

Showing 17 changed files with 1,011 additions and 707 deletions.
diff --git a/README.md b/README.md
@@ -11,7 +11,7 @@ Add the following dependency to your pom.xml:
 <dependency>
     <groupId>com.github.peterbencze</groupId>
     <artifactId>serritor</artifactId>
-    <version>1.1</version>
+    <version>1.2</version>
 </dependency>
 ```
 
@@ -26,38 +26,58 @@ See the [Wiki](https://github.com/peterbencze/serritor/wiki) page.
 BaseCrawler provides a skeletal implementation of a crawler to minimize the effort to create your own. First, create a class that extends BaseCrawler. In this class, you can customize the behavior of your crawler. There are callbacks available for every stage of crawling. Below you can find a sample implementation:
 ```java
 public class MyCrawler extends BaseCrawler {
-    
+
     public MyCrawler() {
-        config.addSeedAsString("http://yourspecificwebsite.com");
-        config.setFilterOffsiteRequests(true);
+        // Enable offsite request filtering
+        config.setOffsiteRequestFiltering(true);
+
+        // Add a crawl seed, this is where the crawling starts
+        CrawlRequest request = new CrawlRequestBuilder("http://example.com").build();
+        config.addCrawlSeed(request);
     }
 
     @Override
-    protected void onResponseComplete(HtmlResponse response) {
-        List<WebElement> links = response.getWebDriver().findElements(By.tagName("a"));
-        links.stream().forEach((WebElement link) -> crawlUrlAsString(link.getAttribute("href")));
+    protected void onResponseComplete(final HtmlResponse response) {
+        // Crawl every link that can be found on the page
+        response.getWebDriver().findElements(By.tagName("a"))
+                .stream()
+                .forEach((WebElement link) -> {
+                    CrawlRequest request = new CrawlRequestBuilder(link.getAttribute("href")).build();
+                    crawl(request);
+                });
     }
 
     @Override
-    protected void onNonHtmlResponse(NonHtmlResponse response) {
-        System.out.println("Received a non-HTML response from: " + response.getCurrentUrl());
+    protected void onNonHtmlResponse(final NonHtmlResponse response) {
+        System.out.println("Received a non-HTML response from: " + response.getCrawlRequest().getRequestUrl());
     }
-    
+
     @Override
-    protected void onUnsuccessfulRequest(UnsuccessfulRequest request) {
-        System.out.println("Could not get response from: " + request.getCurrentUrl());
+    protected void onUnsuccessfulRequest(final UnsuccessfulRequest request) {
+        System.out.println("Could not get response from: " + request.getCrawlRequest().getRequestUrl());
     }
 }
 ```
 That's it! In just a few lines you can make a crawler that extracts and crawls every URL it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use all the features that are provided by Selenium.
 
-By default, the crawler uses [HtmlUnitDriver](https://github.com/SeleniumHQ/selenium/wiki/HtmlUnitDriver) but you can also set your preferred WebDriver:
+By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
 ```java
-config.setWebDriver(new ChromeDriver());
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();
+
+    // Use HtmlUnit headless browser
+    myCrawler.start();
+}
 ```
+Of course, you can also use any other browser by specifying the corresponding WebDriver instance:
+```java
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();
 
-## Support
-The developers would like to thank [Precognox](http://precognox.com/) for the support.
+    // Use Google Chrome
+    myCrawler.start(new ChromeDriver());
+}
+```
 
 ## License
 The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
diff --git a/pom.xml b/pom.xml
@@ -3,7 +3,7 @@
     <modelVersion>4.0.0</modelVersion>
     <groupId>com.github.peterbencze</groupId>
     <artifactId>serritor</artifactId>
-    <version>1.1</version>
+    <version>1.2</version>
     <packaging>jar</packaging>
 
     <name>Serritor</name>
@@ -61,12 +61,17 @@
         <dependency>
             <groupId>org.seleniumhq.selenium</groupId>
             <artifactId>selenium-java</artifactId>
-            <version>3.0.1</version>
+            <version>3.4.0</version>
         </dependency>
         <dependency>
             <groupId>org.seleniumhq.selenium</groupId>
             <artifactId>htmlunit-driver</artifactId>
-            <version>2.23.2</version>
+            <version>2.27</version>
+        </dependency>
+        <dependency>
+            <groupId>com.google.guava</groupId>
+            <artifactId>guava</artifactId>
+            <version>22.0</version>
         </dependency>
     </dependencies>
 
@@ -115,7 +120,7 @@
             <plugin>
                 <groupId>org.sonatype.plugins</groupId>
                 <artifactId>nexus-staging-maven-plugin</artifactId>
-                <version>1.6.7</version>
+                <version>1.6.8</version>
                 <extensions>true</extensions>
                 <configuration>
                     <serverId>ossrh</serverId>