About

A simple crawler library for scraping the contents of websites, written in Java and targeting JDK 1.8. It is not intended for serious production use.

Usage

First, build the project locally with Maven: at the project root, run mvn clean install. Then add the following dependency to your pom.xml:

<dependency>
    <groupId>me.adeshina</groupId>
    <artifactId>site-crawler</artifactId>
    <version>1.0.0-SNAPSHOT</version>
</dependency>

To crawl a website, first configure the crawl, then obtain a crawler for the site and start it. For example:

import java.util.Set;

public class CrawlMySite {

    public void useCrawler() {

        CrawlConfig config = new CrawlConfig();
        config.maxPages(2);               // upper bound on the number of pages
        config.respectRobotsFile(false);  // do not honour robots.txt
        config.crawlDelaySeconds(1);      // delay between requests, in seconds

        // Only URLs in this domain will be visited
        String site = "https://twitter.com";

        SiteCrawler twitterCrawler = SiteCrawler.get(site, config);

        // Blocking call
        Set<WebPage> pages = twitterCrawler.start();
    }

}
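
The comment "Only URLs in this domain will be visited" implies some kind of same-site check on every discovered link. The sketch below shows one way such a check could work using only java.net.URI. It is an illustration, not the library's actual implementation: the class name SameSiteFilter and the method isSameSite are hypothetical, and it assumes "same domain" means an exact, case-insensitive host match (subdomains are treated as different sites).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of site-crawler's API.
public class SameSiteFilter {

    // Returns true when candidate has the same host as site.
    // Assumption: an exact, case-insensitive host match defines "same site".
    static boolean isSameSite(String site, String candidate) {
        try {
            String siteHost = new URI(site).getHost();
            String candidateHost = new URI(candidate).getHost();
            if (siteHost == null || candidateHost == null) {
                // Relative or opaque URIs (e.g. mailto:) carry no host.
                return false;
            }
            return siteHost.toLowerCase(Locale.ROOT)
                    .equals(candidateHost.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Malformed links are simply skipped.
            return false;
        }
    }

    public static void main(String[] args) {
        String site = "https://twitter.com";
        System.out.println(isSameSite(site, "https://twitter.com/about"));
        System.out.println(isSameSite(site, "https://example.com/"));
    }
}
```

A crawler built this way would call the check on each extracted link before queuing it, so the page limit set with maxPages is spent only on the target site.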

Dependencies

Jsoup
JUnit 5
Mockito

Vishal-34535/site-crawler
