=============
CrawlingBeast
=============
Back in 2007, in the early years of my PhD, I was enrolled in an
Information Retrieval course. Its main assignment, divided into three
"stages", was to build a full-blown search engine: from crawling, to
indexing, and finally to ranking and answering search queries.
The assignment had the following restrictions:
* The only external library we were allowed to use was libCURL
  (http://curl.haxx.se/libcurl/), an HTTP/networking library -- see the
  fetch sketch below. Everything else had to be coded by ourselves.
* It had to be able to crawl at least 100K pages in a single day (or
  something of that magnitude, IIRC).
This is the result of that "assignment". The main code is in C++,
although there is a Python prototype for the early stages (HTML
parsing, URL normalization, crawling).
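For reference, this is roughly what a single-page fetch through libCURL
looks like -- a minimal sketch rather than code from this repository,
with error handling cut to the bone::

    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    // Appends each chunk libcurl receives to a std::string.
    static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp)
    {
        std::string* body = static_cast<std::string*>(userp);
        body->append(data, size * nmemb);
        return size * nmemb;
    }

    int main()
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* handle = curl_easy_init();
        std::string body;

        curl_easy_setopt(handle, CURLOPT_URL, "http://example.com/");
        curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(handle, CURLOPT_WRITEDATA, &body);
        curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);

        CURLcode res = curl_easy_perform(handle);
        if (res != CURLE_OK)
            std::cerr << curl_easy_strerror(res) << "\n";
        else
            std::cout << "fetched " << body.size() << " bytes\n";

        curl_easy_cleanup(handle);
        curl_global_cleanup();
        return 0;
    }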
What pieces of reusable code you will find here
===============================================
* A lenient HTML push parser.
  It tries its best to parse malformed HTML as well as a browser
  would. It was greatly inspired by Python's BeautifulSoup
  (http://www.crummy.com/software/BeautifulSoup/) and by the parsing
  code of open-source browsers.
  It correctly handles HTML character entities, ignores JavaScript
  code and the contents of STYLE and SCRIPT tags, finds text in "alt"
  attributes, etc. A sketch of the push-parser idea follows this list.
* Character set detection and conversion utilities.
  The web is a jungle of Latin-1, Unicode, ISO-8859-X and whatnot. We
  try to detect whatever we are dealing with and convert it to UTF-8
  before processing.
* URL normalization code.
  Follows the RFCs as much as possible.
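To give a flavor of the push-parser interface mentioned above: the
parser is fed raw HTML in whatever chunks the network delivers and
reports what it finds through callbacks. The sketch below is
deliberately naive and uses hypothetical names -- the real parser is
lenient and far more involved::

    #include <iostream>
    #include <string>

    // Sketch of a push parser: it buffers partial tokens across chunk
    // boundaries and reports tags and text through virtual callbacks.
    // (Hypothetical names; not the classes used in this repository.)
    class HtmlPushParser {
    public:
        HtmlPushParser() : in_tag_(false) {}
        virtual ~HtmlPushParser() {}
        virtual void onTag(const std::string& inside) {}  // between '<' and '>'
        virtual void onText(const std::string& text) {}   // may arrive in pieces

        void feed(const std::string& chunk)
        {
            buf_ += chunk;
            for (;;) {
                if (!in_tag_) {
                    std::string::size_type lt = buf_.find('<');
                    if (lt == std::string::npos) {
                        if (!buf_.empty()) onText(buf_);
                        buf_.clear();
                        return;                           // wait for more data
                    }
                    if (lt > 0) onText(buf_.substr(0, lt));
                    buf_.erase(0, lt + 1);
                    in_tag_ = true;
                } else {
                    std::string::size_type gt = buf_.find('>');
                    if (gt == std::string::npos) return;  // tag split mid-chunk
                    onTag(buf_.substr(0, gt));
                    buf_.erase(0, gt + 1);
                    in_tag_ = false;
                }
            }
        }

    private:
        std::string buf_;
        bool in_tag_;
    };

    // Extracts plain text, ignoring every tag.
    struct TextDump : HtmlPushParser {
        void onText(const std::string& t) { std::cout << t; }
    };

    int main()
    {
        TextDump p;
        p.feed("<p>Hello <b>wor");   // a tag split across chunks is fine
        p.feed("ld</b>!\n");         // prints "Hello world!"
        return 0;
    }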
And that should be it. The rest of the code is not that reusable, but
"your mileage may vary".
What about style? Y U no KISS?
------------------------------
First, keep in mind that I wrote the Python prototype before moving on
to the C++ version, and that most of this code was written while I was
having a blast reading Stroustrup's "The C++ Programming Language".
So you will find that not only does the coding style change as the code
progresses from crawling to serving, but the constructions change as
well -- not necessarily for the better.
For instance, you will find an HTML parser wrapped inside an STL
iterator. Why? "Why not, it will be fun and elegant" -- or so I
thought. A silly idea wrapped in a dreadful interface, but at the time
it made perfect sense to abandon the KISS principle and do things "the
(supposed) C++ way".
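If you are curious, the idea looked roughly like this -- a
reconstruction with hypothetical names, not the actual class::

    #include <iostream>
    #include <string>

    // Tokens are either a tag ("<b>") or a run of text, produced
    // lazily: each operator++ parses just far enough to find the next.
    class HtmlTokens {
    public:
        explicit HtmlTokens(const std::string& html) : html_(html) {}

        class iterator {
        public:
            iterator(const std::string& html, std::string::size_type pos)
                : html_(html), pos_(pos) { parse(); }
            std::string operator*() const { return token_; }
            iterator& operator++() { pos_ = next_; parse(); return *this; }
            bool operator!=(const iterator& o) const { return pos_ != o.pos_; }
        private:
            void parse()
            {
                if (pos_ >= html_.size()) return;        // past the end
                if (html_[pos_] == '<') {
                    std::string::size_type gt = html_.find('>', pos_);
                    next_ = (gt == std::string::npos) ? html_.size() : gt + 1;
                } else {
                    std::string::size_type lt = html_.find('<', pos_);
                    next_ = (lt == std::string::npos) ? html_.size() : lt;
                }
                token_ = html_.substr(pos_, next_ - pos_);
            }
            const std::string& html_;
            std::string::size_type pos_;
            std::string::size_type next_;
            std::string token_;
        };

        iterator begin() const { return iterator(html_, 0); }
        iterator end() const { return iterator(html_, html_.size()); }

    private:
        std::string html_;
    };

    int main()
    {
        HtmlTokens page("<p>Hello <b>world</b>");
        for (HtmlTokens::iterator it = page.begin(); it != page.end(); ++it)
            std::cout << *it << "\n";   // <p>, "Hello ", <b>, world, </b>
        return 0;
    }

Iteration drives the parse, which reads nicely in a loop; but burying
parser state inside an iterator made the interface painful to use for
anything beyond a linear scan.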
What about tests?
-----------------
Some of this code has companion unit tests; in particular, the HTML
parsing and URL normalization code do. You can take a look at them to
see what sort of crazy stuff this code handles -- or fails to.
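As an illustration of the flavor of those tests, here is what a URL
normalization check might look like -- the function below is a toy
sketch (lowercasing scheme and host, dropping the default port), not
the repository's actual API or test harness::

    #include <cassert>
    #include <cctype>
    #include <string>

    // Toy normalization: lowercase scheme and host, drop an explicit
    // ":80" for http. Real code also resolves dot segments and
    // percent-encodings per the RFCs.
    std::string normalize(std::string url)
    {
        std::string::size_type scheme_end = url.find("://");
        if (scheme_end == std::string::npos) return url;
        for (std::string::size_type i = 0; i < scheme_end; ++i)
            url[i] = std::tolower(static_cast<unsigned char>(url[i]));
        std::string::size_type host_end = url.find('/', scheme_end + 3);
        if (host_end == std::string::npos) host_end = url.size();
        for (std::string::size_type i = scheme_end + 3; i < host_end; ++i)
            url[i] = std::tolower(static_cast<unsigned char>(url[i]));
        const std::string def = ":80";   // http's default port
        if (url.compare(0, 5, "http:") == 0 &&
            host_end >= def.size() &&
            url.compare(host_end - def.size(), def.size(), def) == 0)
            url.erase(host_end - def.size(), def.size());
        return url;
    }

    int main()
    {
        // Scheme and host are case-insensitive; the path is not.
        assert(normalize("HTTP://Example.COM:80/Path")
               == "http://example.com/Path");
        return 0;
    }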