To get the English doc, please scroll down the page.
- A Scrapy crawler project based on Python 3. The main target site is 51Job; the secondary target is Lagou (拉勾网).
- The project is developed on Ubuntu 17.10 and Deepin; on macOS or other Linux derivatives a few commands may differ. Running it on Windows is not recommended and may produce strange errors, though of course I can't stop you if you want to try.
- The earliest commits were based on Python 2, but Python 2 does not handle Chinese encoding well and is losing support from many modules.
- For storage speed, the latest version uses MongoDB, and you can still export *.csv files.
- Incremental-style crawling is now supported: the id of the newest record in the database is compared with the id of the record currently being crawled, and if they match an exception is raised to stop the crawl. Note: Scrapy runs requests concurrently, so after the exception is raised the in-flight requests are closed one by one, which takes some time; you will therefore still see URLs being crawled in the terminal. This approach works best when the crawl runs at 0:00 every day, so no data updates are missed, and it suits crawls (such as articles) whose records match exactly.
- Deduplication code has been added to the pipeline; find_one() is used to save lookup time and the comparison field is job_id. Thanks to @Chen4089.
- Scheduled crawling via crontab is now supported. It is currently set to crawl at 0:00 every Monday, Wednesday and Friday and to write a <current date>.csv file to the data folder. Thanks to @Jlinka. When testing the scheduled crawl, run tail -f /var/log/cron.log in a terminal to watch it. Note that errors from the spider or the script will not appear in that log, so run sh startup.sh first and hand the job over to crontab only after it passes (a sample crontab entry is shown after this list). References: crontab 定时任务, 记录配置python爬虫定时任务crontab所踩过的坑
- A data visualization project based on Django and HighCharts is also provided; see JobDataViewer for details.
- Questions are welcome by email ([email protected]) or via an issue. If you like the project, please star it.
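For reference, a crontab entry matching that schedule would look roughly like the one below. The entry actually used by the project is kept in the ubuntucron file, so the path here is only a placeholder:

```
# m h dom mon dow  command
# run the crawl at 0:00 every Monday, Wednesday and Friday, appending output to cron.log
0 0 * * 1,3,5 sh /path/to/JobCrawler/startup.sh >> /var/log/cron.log 2>&1
```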
http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/overview.html
http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html#scrapy
The project uses Python 3.6. You can install it by following the article linked below, replacing 3.5 with 3.6 in its instructions: ubuntu14.04安装python3.5并且将其设置为python3默认启动
Virtualenv lets multiple Python versions coexist on the same machine. After installing Python 3 and pip, run the following in a terminal:
# Install
pip3 install virtualenv
# Create a virtual environment
virtualenv spider-env
# Activate the virtual environment
source spider-env/bin/activate
# Deactivate
deactivate
Because Scrapy depends on Python.h, run the following in a terminal before installing the dependencies:
sudo apt-get install libpython3.6-dev
Then install the dependencies; if this fails, try installing them one by one:
# in the JobCrawler/JobCrawler directory
pip install -r requirements.txt
Install MongoDB by following this link: Install MongoDB Community Edition
In a terminal, cd to the project root and type:
# -o job.csv is optional; add it to write the output to the given file
scrapy crawl jobCrawler -o job.csv
A Scrapy project for crawling job information from 51Job, based on Python 3.
(You can check out the earlier commits to get the Python 2 based version, but this is not recommended: Python 2 does not handle Chinese well and many modules no longer support it.)
In the latest version, the project uses MongoDB to save data.
http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/overview.html
http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/install.html#scrapy
job.csv contains job data about python crawled from 51job.
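If you just want a quick look at an exported file, a minimal sketch like this works (the column names come from the Scrapy fields, so they are not hard-coded here; utf-8 encoding is an assumption):

```python
# Peek at the first few rows of an exported job.csv (illustration only).
import csv

with open('job.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    print(next(reader))            # header row: the Scrapy field names
    for i, row in enumerate(reader):
        print(row)
        if i >= 4:                 # show only the first five data rows
            break
```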
sudo add-apt-repository ppa:fkrull/deadsnakes
sudo apt-get update
sudo apt-get install python3.6
To make Python 3.6 the default Python version, read this article.
# Install
pip3 install virtualenv
# Create
virtualenv spider-env
# Activate
source spider-env/bin/activate
# Quit
deactivate
Because Scrapy requires Python.h, type this command first:
sudo apt-get install libpython3.6-dev
Then type pip install -r requirements.txt.
If that fails, open requirements.txt and pip install the packages one by one.
Install MongoDB Community Edition
cd to the root directory of the project (the spider needs the file field.csv there), then type:
scrapy crawl jobCrawler
# if you want to output the result to a csv file, use this command instead:
scrapy crawl jobCrawler -o filename.csv
# you can change the spider name in /spider/spider.py
...
class JobSpider(Spider):
    # input spider name here
    name = 'jobCrawler'
    ...
    def parse(self, response):
        item = JobcrawlerItem()
        jobs = response.xpath('//*[@id="resultList"]/div[@class="el"]')
        for job in jobs:
            loader = JobItemLoader(item=JobcrawlerItem(), selector=job)
            # job_id field
            item['job_id'] = job.xpath('.//p/input/@value').extract()
            ...
But I haven't found any incompatibility with the old version.
Source code is in JobCrawler/spiders/entrance.py.
New Scrapy fields are also added for the new spider; source code is in JobCrawler/items.py.
Here are the bugs fixed:
- str-to-float error (caused by incorrect string splitting; see the sketch below)
- all spiders shared the same pipeline method
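The project's exact fix is not reproduced here, but the first bug is the classic "split a salary string, then call float()" failure. A defensive version looks something like the sketch below; the '0.8-1万/月' format and the parse_salary name are assumptions for illustration, not code taken from the repository:

```python
# Hypothetical guard against a str-to-float crash caused by bad string splitting.
def parse_salary(text):
    # e.g. '0.8-1万/月' -> (8000.0, 10000.0); returns None for unexpected formats
    parts = text.replace('万/月', '').split('-')
    if len(parts) != 2:
        return None                # unexpected format: skip instead of crashing
    try:
        low, high = (float(p) * 10000 for p in parts)
    except ValueError:
        return None
    return low, high
```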
MongoDB is a NoSQL database.
If you want to see the difference between MongoDB and MySQL, check out the past versions.
When MySQL was used for saving data, closing the spider took much longer than it does with MongoDB; see the past versions for details.
In past versions, the Scrapy field create_time was saved like 03-09, without the year, because the website does not provide it.
This bug caused other problems in the project I built recently.
It is now fixed with the following code in pipeline.py:
def process_item(self, item, spider):
    import datetime
    ...
    # sort out data
    ...
    # '%m-%d' parses with a default year of 1900, so set it to the current year
    day = ''.join(item['create_time'])
    day = datetime.datetime.strptime(day, '%m-%d')
    day = day.replace(datetime.date.today().year)
    item['create_time'] = day
    ...
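For reference, this is what the conversion does in plain Python (independent of the project): strptime with '%m-%d' falls back to the year 1900, and replace() swaps in the current year.

```python
import datetime

raw = '03-09'                                    # what the website provides
day = datetime.datetime.strptime(raw, '%m-%d')   # 1900-03-09 00:00:00 (default year)
day = day.replace(datetime.date.today().year)    # same month and day, current year
print(day)
```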
And configure MongoDB in settings.py:
...
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017        # port
MONGO_DB = "Spider"       # database name
MONGO_COLL = "jobinfo"    # collection name
# MONGO_USER = ""
# MONGO_PSW = ""
...
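The pipeline can pick these values up through Scrapy's from_crawler hook. Here is a minimal sketch, assuming pymongo is installed and using the setting names above; the MongoPipeline class name is made up for illustration (the project's own pipeline class is jobCrawlerPipeline):

```python
# pipelines.py (sketch only; the real pipeline in this repo may differ)
import pymongo


class MongoPipeline(object):
    def __init__(self, host, port, db_name, coll_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.coll_name = coll_name

    @classmethod
    def from_crawler(cls, crawler):
        # read the MONGO_* values defined in settings.py
        s = crawler.settings
        return cls(s.get('MONGO_HOST'), s.getint('MONGO_PORT'),
                   s.get('MONGO_DB'), s.get('MONGO_COLL'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.host, self.port)
        self.coll = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one document per crawled item
        self.coll.insert_one(dict(item))
        return item
```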
The project I built recently is a Django project for data visualization.
I need to reduce the computation in that project as much as possible so pages load faster when users visit.
Duplicate crawling is stopped by fetching the newest record from the database and comparing its datetime with the data currently being crawled; if they are equal, CloseSpider is raised.
Note that Scrapy schedules requests concurrently, so once you raise CloseSpider the requests already in flight still have to finish, and you may see a few more crawl messages in the terminal.
# spider.py
import datetime

from pymongo import MongoClient
from scrapy.exceptions import CloseSpider
...
    def parse(self, response):
        item = JobcrawlerItem()
        jobs = response.xpath('//*[@id="resultList"]/div[@class="el"]')
        # connect once per page instead of once per job
        client = MongoClient()
        db = client['Spider']
        coll = db.job
        for job in jobs:
            item['create_time'] = job.xpath('.//span[@class="t5"]/text()').extract()
            # '%m-%d' parses with a default year of 1900, so set the current year
            day = ''.join(item['create_time'])
            day = datetime.datetime.strptime(day, '%m-%d')
            day = day.replace(datetime.date.today().year)
            # stop as soon as the newest stored record is reached
            if coll.find_one(sort=[('_id', -1)])['create_time'] == day:
                raise CloseSpider("Duplicate Data")
            ...
Thanks to @Chen4089
# pipeline.py
class jobCrawlerPipeline(object):
    ...
    def process_item(self, item, spider):
        # from scrapy.exceptions import CloseSpider
        if spider.name == 'jobCrawler':
            self.coll = self.db['job']
            # only save the item if no record with the same job_id exists yet
            if self.coll.find_one({"job_id": item['job_id']}) is None:
                job_name = item['job_name']
                salary = item['salary']
                ...
Thanks to @Jlinka
Check the following files for details:
ubuntucron
startup.sh
run.py
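For reference, an entry script of the kind crontab needs usually looks like the sketch below. This is not the repository's actual run.py, just a minimal version that starts the jobCrawler spider and writes a <current date>.csv feed into data/ (FEED_URI and FEED_FORMAT are classic Scrapy feed settings):

```python
# Illustrative entry script (the real run.py in this repo may differ).
import datetime

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    settings = get_project_settings()
    # name the output file after today's date, e.g. data/2018-03-09.csv
    settings.set('FEED_FORMAT', 'csv')
    settings.set('FEED_URI', 'data/{}.csv'.format(datetime.date.today().isoformat()))
    process = CrawlerProcess(settings)
    process.crawl('jobCrawler')   # spider name defined in spider.py
    process.start()               # blocks until the crawl finishes


if __name__ == '__main__':
    main()
```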