Skip to content

Latest commit

 

History

History
41 lines (27 loc) · 940 Bytes

README.md

File metadata and controls

41 lines (27 loc) · 940 Bytes

README

About

The founders of Google used python to fetch pages to build their search engine.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python"

I recommend reading the thesis: http://infolab.stanford.edu/~backrub/google.html

This project reflects my curiosity to create a web crawler as a programming challege.

The stack:

  • Python 3.6
  • Flask
  • BeautifulSoup
  • PyMysql

Progress Log

2019/07/16

In progress...

2017/4/28

Start URLs: https://www.reddit.com, https://nytimes.com, http://dmoztools.net Crawl time: Approx 4 hours Unique URLS: 1,003,156 Unique Domains: 79,302

2017/4/27

Start URL: https://www.reddit.com Crawl time: Approx 4 hours Unique URLS: 84,477 Unique Domains: 2,747