Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and is capable of doing so in a distributed manner.
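At its core, the crawl frontier is driven through the FrontierManager API: you seed it with URLs, ask it for the next batch of requests, and report fetch results and extracted links back. A minimal sketch of that lifecycle follows; the method names come from the Frontera docs, but signatures have changed between releases, so treat this as illustrative (download and extract_links are hypothetical stand-ins for your fetcher and parser):

    from frontera import FrontierManager, Settings, Request

    settings = Settings()
    settings.set('BACKEND', 'frontera.contrib.backends.memory.FIFO')  # in-memory backend, fine for testing

    # The manager starts automatically by default (AUTO_START setting).
    frontier = FrontierManager.from_settings(settings)

    # Seed the frontier with the starting URLs.
    frontier.add_seeds([Request('https://en.wikipedia.org/wiki/Main_Page')])

    # Ask for the next batch, fetch it with your downloader of choice,
    # then report the results and any extracted links back.
    for request in frontier.get_next_requests(10):
        response = download(request)                                 # hypothetical fetch function
        frontier.page_crawled(response)
        frontier.links_extracted(request, extract_links(response))  # hypothetical link parser

    frontier.stop()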

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level storage logic is separated from crawling policy (a settings sketch follows this list).
  • Three run modes: single process, distributed spiders, distributed backend and spiders.
  • Transparent data flow, allowing you to integrate custom components easily using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • RDBMS and HBase backends.
  • Revisiting logic with RDBMS.
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.
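Both the backend and the transport are chosen through settings, so swapping RDBMS for HBase, or ZeroMQ for Kafka, is a configuration change rather than a code change. A sketch of a distributed-mode settings module (component paths as in the Frontera docs; the module name and partition counts are illustrative):

    # frontera_settings.py (hypothetical module name)
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'       # RDBMS backend
    # BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'         # or HBase

    MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'      # ZeroMQ transport (default)
    # MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'  # or Kafka

    SPIDER_FEED_PARTITIONS = 2  # one partition per spider process
    SPIDER_LOG_PARTITIONS = 1   # one partition per strategy worker
    MAX_NEXT_REQUESTS = 256     # batch size handed out to spiders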

Installation

$ pip install frontera
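To use Scrapy for fetching and parsing, Frontera plugs into Scrapy's scheduler and middleware slots. A sketch of the Scrapy settings additions (paths as in the Frontera docs, though they have moved between Frontera versions; 'myproject.frontera_settings' is a hypothetical module name):

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    # Point Scrapy at the Frontera settings module.
    FRONTERA_SETTINGS = 'myproject.frontera_settings'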

Documentation

Documentation is available online at https://frontera.readthedocs.io/ and in the docs directory of the repository.

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.