Merge pull request #37 from peterbencze/development
New version
peterbencze authored Jul 18, 2017
2 parents 59d925e + 42d95d4 commit 7492f5f
Showing 17 changed files with 1,011 additions and 707 deletions.
52 changes: 36 additions & 16 deletions README.md
@@ -11,7 +11,7 @@ Add the following dependency to your pom.xml:
 <dependency>
     <groupId>com.github.peterbencze</groupId>
     <artifactId>serritor</artifactId>
-    <version>1.1</version>
+    <version>1.2</version>
 </dependency>
 ```

@@ -26,38 +26,58 @@ See the [Wiki](https://github.com/peterbencze/serritor/wiki) page.
 BaseCrawler provides a skeletal implementation of a crawler to minimize the effort to create your own. First, create a class that extends BaseCrawler. In this class, you can customize the behavior of your crawler. There are callbacks available for every stage of crawling. Below you can find a sample implementation:
 ```java
 public class MyCrawler extends BaseCrawler {

     public MyCrawler() {
-        config.addSeedAsString("http://yourspecificwebsite.com");
-        config.setFilterOffsiteRequests(true);
+        // Enable offsite request filtering
+        config.setOffsiteRequestFiltering(true);
+
+        // Add a crawl seed, this is where the crawling starts
+        CrawlRequest request = new CrawlRequestBuilder("http://example.com").build();
+        config.addCrawlSeed(request);
     }

     @Override
-    protected void onResponseComplete(HtmlResponse response) {
-        List<WebElement> links = response.getWebDriver().findElements(By.tagName("a"));
-        links.stream().forEach((WebElement link) -> crawlUrlAsString(link.getAttribute("href")));
+    protected void onResponseComplete(final HtmlResponse response) {
+        // Crawl every link that can be found on the page
+        response.getWebDriver().findElements(By.tagName("a"))
+                .stream()
+                .forEach((WebElement link) -> {
+                    CrawlRequest request = new CrawlRequestBuilder(link.getAttribute("href")).build();
+                    crawl(request);
+                });
     }

     @Override
-    protected void onNonHtmlResponse(NonHtmlResponse response) {
-        System.out.println("Received a non-HTML response from: " + response.getCurrentUrl());
+    protected void onNonHtmlResponse(final NonHtmlResponse response) {
+        System.out.println("Received a non-HTML response from: " + response.getCrawlRequest().getRequestUrl());
     }

     @Override
-    protected void onUnsuccessfulRequest(UnsuccessfulRequest request) {
-        System.out.println("Could not get response from: " + request.getCurrentUrl());
+    protected void onUnsuccessfulRequest(final UnsuccessfulRequest request) {
+        System.out.println("Could not get response from: " + request.getCrawlRequest().getRequestUrl());
     }
 }
 ```
 That's it! In just a few lines you can make a crawler that extracts and crawls every URL it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use all the features that are provided by Selenium.

-By default, the crawler uses [HtmlUnitDriver](https://github.com/SeleniumHQ/selenium/wiki/HtmlUnitDriver) but you can also set your preferred WebDriver:
+By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
 ```java
-config.setWebDriver(new ChromeDriver());
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();
+
+    // Use HtmlUnit headless browser
+    myCrawler.start();
+}
 ```
+Of course, you can also use any other browsers by specifying a corresponding WebDriver instance:
+```java
+public static void main(String[] args) {
+    MyCrawler myCrawler = new MyCrawler();

-## Support
-The developers would like to thank [Precognox](http://precognox.com/) for the support.
+    // Use Google Chrome
+    myCrawler.start(new ChromeDriver());
+}
+```

 ## License
 The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
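
Note on the README text in this file: it says the crawler filters duplicate and offsite requests, but the diff does not show how Serritor does that. The following is only a rough, stand-alone sketch of the general idea, not Serritor's implementation. `RequestFilter` and `shouldCrawl` are hypothetical names introduced here for illustration, and the sketch assumes "offsite" means "host differs from a single allowed host". It uses JDK classes only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (NOT part of Serritor): accepts a URL only if it is
// on the allowed host and has not been offered before.
class RequestFilter {

    private final String allowedHost;
    private final Set<String> seenUrls = new HashSet<>();

    RequestFilter(String allowedHost) {
        this.allowedHost = allowedHost;
    }

    boolean shouldCrawl(String url) {
        if (url == null) {
            return false; // e.g. an <a> element without an href attribute
        }
        URI uri;
        try {
            // normalize() collapses "." and ".." path segments
            uri = new URI(url).normalize();
        } catch (URISyntaxException e) {
            return false; // not a parseable URI
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(allowedHost)) {
            return false; // offsite, relative, or non-hierarchical (mailto:, javascript:)
        }
        // Set.add returns false when the element was already present
        return seenUrls.add(uri.toString());
    }

    public static void main(String[] args) {
        RequestFilter filter = new RequestFilter("example.com");
        System.out.println(filter.shouldCrawl("http://example.com/page")); // true
        System.out.println(filter.shouldCrawl("http://example.com/page")); // false (duplicate)
        System.out.println(filter.shouldCrawl("http://other.org/page"));   // false (offsite)
    }
}
```

The null check is there because Selenium's `WebElement.getAttribute` returns null when the attribute is not set, which can happen for anchors without an `href`.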
13 changes: 9 additions & 4 deletions pom.xml
@@ -3,7 +3,7 @@
 <modelVersion>4.0.0</modelVersion>
 <groupId>com.github.peterbencze</groupId>
 <artifactId>serritor</artifactId>
-<version>1.1</version>
+<version>1.2</version>
 <packaging>jar</packaging>

 <name>Serritor</name>
@@ -61,12 +61,17 @@
     <dependency>
         <groupId>org.seleniumhq.selenium</groupId>
         <artifactId>selenium-java</artifactId>
-        <version>3.0.1</version>
+        <version>3.4.0</version>
     </dependency>
     <dependency>
         <groupId>org.seleniumhq.selenium</groupId>
         <artifactId>htmlunit-driver</artifactId>
-        <version>2.23.2</version>
+        <version>2.27</version>
     </dependency>
+    <dependency>
+        <groupId>com.google.guava</groupId>
+        <artifactId>guava</artifactId>
+        <version>22.0</version>
+    </dependency>
 </dependencies>

@@ -115,7 +120,7 @@
 <plugin>
     <groupId>org.sonatype.plugins</groupId>
     <artifactId>nexus-staging-maven-plugin</artifactId>
-    <version>1.6.7</version>
+    <version>1.6.8</version>
     <extensions>true</extensions>
     <configuration>
         <serverId>ossrh</serverId>