Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and is capable of doing so in a distributed manner.
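At its core, the crawl frontier is driven through the FrontierManager API: you seed it with URLs, ask it for the next batch of requests, and report fetch results and extracted links back. A minimal sketch of that lifecycle follows; the method names come from the Frontera docs, but signatures have changed between releases, so treat this as illustrative (download and extract_links are hypothetical stand-ins for your fetcher and parser):

    from frontera import FrontierManager, Settings, Request

    settings = Settings()
    settings.set('BACKEND', 'frontera.contrib.backends.memory.FIFO')  # in-memory backend, fine for testing

    # The manager starts automatically by default (AUTO_START setting).
    frontier = FrontierManager.from_settings(settings)

    # Seed the frontier with the starting URLs.
    frontier.add_seeds([Request('https://en.wikipedia.org/wiki/Main_Page')])

    # Ask for the next batch, fetch it with your downloader of choice,
    # then report the results and any extracted links back.
    for request in frontier.get_next_requests(10):
        response = download(request)                                 # hypothetical fetch function
        frontier.page_crawled(response)
        frontier.links_extracted(request, extract_links(response))  # hypothetical link parser

    frontier.stop()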

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level storage logic is separated from crawling policy (a settings sketch follows this list).
  • Three run modes: single process, distributed spiders, distributed backend and spiders.
  • Transparent data flow, allowing you to integrate custom components easily using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • RDBMS and HBase backends.
  • Revisiting logic with RDBMS.
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.
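Both the backend and the transport are chosen through settings, so swapping RDBMS for HBase, or ZeroMQ for Kafka, is a configuration change rather than a code change. A sketch of a distributed-mode settings module (component paths as in the Frontera docs; the module name and partition counts are illustrative):

    # frontera_settings.py (hypothetical module name)
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'       # RDBMS backend
    # BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'         # or HBase

    MESSAGE_BUS = 'frontera.contrib.messagebus.zeromq.MessageBus'      # ZeroMQ transport (default)
    # MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'  # or Kafka

    SPIDER_FEED_PARTITIONS = 2  # one partition per spider process
    SPIDER_LOG_PARTITIONS = 1   # one partition per strategy worker
    MAX_NEXT_REQUESTS = 256     # batch size handed out to spiders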

Installation

$ pip install frontera
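To use Scrapy for fetching and parsing, Frontera plugs into Scrapy's scheduler and middleware slots. A sketch of the Scrapy settings additions (paths as in the Frontera docs, though they have moved between Frontera versions; 'myproject.frontera_settings' is a hypothetical module name):

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    # Point Scrapy at the Frontera settings module.
    FRONTERA_SETTINGS = 'myproject.frontera_settings'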

Documentation

Documentation is available online at https://frontera.readthedocs.io/ and in the docs directory of the repository.

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.