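siren is a configuration-driven crawler system. Its configuration and parsing rules are written in YAML, so you can define a crawler with YAML syntax instead of writing a lot of code.
To use siren you need to know CSS selectors or XPath well enough to describe the content you want to extract. You should also know regular expressions well enough to do simple filtering and substitution.
To use siren well, you may also need to understand the robots.txt protocol. Respect the site owner's wishes and fetch data politely: be a gentleman (bian tai) of a crawler.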
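As an illustration of the robots.txt point above, here is a minimal sketch that checks whether a URL may be fetched, using Python's standard `urllib.robotparser`. It is not part of siren; the robots.txt content, the `siren` user-agent string, and the example.com URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
# "siren" as a user-agent name is an assumption for this sketch.
allowed = rules.can_fetch("siren", "http://example.com/book/1.htm")
blocked = rules.can_fetch("siren", "http://example.com/private/1.htm")
delay = rules.crawl_delay("siren")  # seconds to wait between requests, if declared
```

A crawler could skip any URL for which `can_fetch` returns `False` and use the declared crawl delay as a lower bound for its own request interval.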
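siren maintains a request queue. While the crawler runs, it takes one request off the queue at a time and checks it against the matching rules.
When a rule matches, the crawler performs an action: for example, downloading the URL and passing it to Python code, or parsing the downloaded HTML and then calling Python code.
What sets siren apart is a set of predefined handlers called parsers. Through configuration alone they can process results directly, with no Python code required.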
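The queue-and-match loop described above can be sketched roughly as follows. This is not siren's actual code: the `run` function, the `(regex, action)` pattern tuples, the first-match-wins rule, and the in-memory `PAGES` table standing in for real downloads are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import deque

# Hypothetical pages standing in for real HTTP downloads.
PAGES = {
    "http://example.com/index.htm": "1.htm 2.htm",
    "http://example.com/1.htm": "chapter one",
    "http://example.com/2.htm": "chapter two",
}


def fetch(url):
    """Pretend to download a URL by looking it up in PAGES."""
    return PAGES[url]


def on_index(url, body):
    """Action for the index page: emit no result, enqueue the listed pages."""
    base = url.rsplit("/", 1)[0]
    return None, [base + "/" + name for name in body.split()]


def on_chapter(url, body):
    """Action for a chapter page: emit the body, enqueue nothing."""
    return body, []


def run(start_url, patterns, fetch):
    """Pop requests from a queue, match each against the patterns, run the action."""
    queue = deque([start_url])
    results = []
    while queue:
        url = queue.popleft()
        for regex, action in patterns:
            if re.search(regex, url):
                result, new_urls = action(url, fetch(url))
                if result is not None:
                    results.append(result)
                queue.extend(new_urls)
                break  # first matching rule wins (an assumption of this sketch)
    return results


patterns = [(r"/index\.htm$", on_index), (r"/[0-9]+\.htm$", on_chapter)]
results = run("http://example.com/index.htm", patterns, fetch)
```

In siren the actions would be selected by the YAML configuration rather than passed in as Python callables; the sketch only shows the control flow.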
name: wenku8
timeout: 10
interval: 5
result: novel:result
output: output.txt
patterns:
  - name: main
    desc: table of content
    parsers:
      - css: a
        attr: href
        is: "[0-9]+\\.htm"
        call: node
  - name: node
    desc: node
    parsers:
      - css: div#title
        text: yes
        result: title
      - css: div#content
        html2text: yes
        result: content
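To make the first parser in the configuration above more concrete, the sketch below imitates what `css: a`, `attr: href`, `is: "[0-9]+\\.htm"` might do: collect the `href` of every `a` element and keep only the values matching the regex. It uses Python's standard `html.parser` as a stand-in, since siren's own implementation is not shown here. The names `AttrCollector` and `apply_parser` are hypothetical, and treating `is` as a regex search rather than a full match is an assumption.

```python
import re
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect one attribute from every occurrence of one tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value is not None:
                self.values.append(value)


def apply_parser(html, tag, attr, is_regex):
    """Rough equivalent of `css: <tag>` + `attr: <attr>` + `is: <regex>`."""
    collector = AttrCollector(tag, attr)
    collector.feed(html)
    # Assumption: `is` keeps a value when the regex is found anywhere in it.
    return [value for value in collector.values if re.search(is_regex, value)]


# Hypothetical table-of-contents page.
html = '<a href="1.htm">one</a><a href="about.html">about</a><a href="22.htm">two</a>'
links = apply_parser(html, "a", "href", r"[0-9]+\.htm")
```

In the configuration, each surviving link is then handed to the `node` pattern through `call: node`.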
For details, see config.
See the guide.
- do something
  - bilibili
  - bt.ktxp.com
  - jd
- regex
- js runner
- Store cookies in redis to speed up access.
- Queue loop prevention (in redis): maintain a list of URLs that have already been crawled.
- parser in css or xpath
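The loop-prevention item above could work along these lines. This is only a sketch of the idea, not a planned implementation: `SeenFilter` is a hypothetical name, and a plain Python set stands in for the redis set the item has in mind.

```python
class SeenFilter:
    """In-memory stand-in for a redis set of already-crawled URLs."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


seen = SeenFilter()
urls = [
    "http://example.com/1.htm",
    "http://example.com/2.htm",
    "http://example.com/1.htm",  # duplicate, should be dropped
]
to_crawl = [url for url in urls if seen.add_if_new(url)]
```

With redis, the same check could probably be a single `SADD` call, which reports how many members were newly added.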
Copyright (C) 2012 Shell Xu
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.