[Crawled Count] in the crawl task list and [Article Count] in the site list do not match #17

Closed
nylqd opened this issue Feb 7, 2017 · 1 comment

nylqd commented Feb 7, 2017

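After a task stops because reachMax or exceedRatio was triggered, the [valid page count] logged in the CommonSpider onSuccess method, the [Crawled Count] in the crawl task list, and the [Article Count] in the site list all differ from one another.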

Log

Spider ID 5e21c6bd-4878-413c-b0fa-a46a5c3376ac has processed 31 pages, 6 valid pages, maximum pages to crawl 10, reachMax=false, exceedRatio=true, exiting.

Crawled Count

Task name Crawled Count Crawl status
www.163.com 9 STOP

Article Count

Site domain Article Count
www.163.com 7

@gsh199449
Owner

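Two situations can cause this:

  • The crawled data has not been written to storage yet. Wait a moment and the counts should catch up.

  • The data may have hit an error while being stored (for example a format problem or a connection failure to the storage backend), so it was never persisted and was dropped.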
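
The two cases above can be illustrated with a minimal sketch. It is not the project's code: the class `CountMismatchSketch` and all of its methods are hypothetical, and it only assumes that a "crawled" counter is incremented as soon as a page is accepted, while the stored count grows only when a later write succeeds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: none of these names come from the project.
class CountMismatchSketch {
    private int crawledCount = 0;                              // counter bumped at accept time
    private final Queue<String> pending = new ArrayDeque<>();  // pages waiting to be stored
    private final List<String> stored = new ArrayList<>();     // pages actually persisted

    // A valid page is counted immediately, but only queued for storage.
    void onValidPage(String page) {
        crawledCount++;
        pending.add(page);
    }

    // Writes up to `max` pending pages. A failed write drops the page,
    // and the crawled counter is never corrected.
    void flush(int max) {
        for (int i = 0; i < max && !pending.isEmpty(); i++) {
            String page = pending.poll();
            try {
                write(page);
            } catch (IllegalArgumentException e) {
                // Case 2: the page is lost because storage rejected it.
            }
        }
    }

    // Stand-in for a real store: an empty page simulates a format error.
    private void write(String page) {
        if (page.isEmpty()) {
            throw new IllegalArgumentException("malformed page");
        }
        stored.add(page);
    }

    int crawledCount() { return crawledCount; }
    int storedCount() { return stored.size(); }
    int pendingCount() { return pending.size(); }
}
```

In this sketch, accepting three pages (one of them malformed) and flushing only two leaves the crawled count at 3 and the stored count at 1, with one page still pending (case 1). After a full flush the stored count reaches 2 and stays there, because the malformed page was dropped (case 2).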
