A web crawler for a single WordPress site
Open source libraries used:
Apache HttpComponents 4.3
HTML Parser 2.0
MySQL Connector/J 5.1.27
UTF-8 encoding is used so that Chinese tags are stored correctly.
Requires a local XAMPP environment.
The program assumes that a database named "crawler" exists on the local MySQL server at localhost:3306 (the XAMPP default MySQL port).
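The database assumption above could be sketched roughly as follows. This is only an illustration, not the crawler's actual code: the class name CrawlerDb, the helper methods, and the "root" user with an empty password (the usual XAMPP default) are assumptions made for the example. The useUnicode and characterEncoding URL properties are standard Connector/J options for requesting UTF-8 on the connection.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper, not part of the original project.
class CrawlerDb {

    // Builds a Connector/J URL that asks for UTF-8 on the connection,
    // so that non-ASCII (e.g. Chinese) tags survive the round trip.
    public static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    // Opens a connection to the "crawler" database on localhost:3306.
    public static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl("localhost", 3306, "crawler"), user, password);
    }

    public static void main(String[] args) {
        System.out.println(jdbcUrl("localhost", 3306, "crawler"));
        // Assumption: XAMPP's default MySQL account ("root", empty password).
        try (Connection conn = open("root", "")) {
            System.out.println("Connected: " + !conn.isClosed());
        } catch (SQLException e) {
            // Reached when MySQL is not running, the database is missing,
            // or the Connector/J jar is not on the classpath.
            System.err.println("Could not connect: " + e.getMessage());
        }
    }
}
```

If the connection fails, check that MySQL is started in the XAMPP control panel and that the "crawler" database has been created before running the crawler.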
Per-article tag analysis will be added in the next update.
A detailed explanation is available on my blog:
http://johnhany.net/2013/11/web-crawler-using-java-and-mysql/
My blog is new and may still be unstable. If you have any trouble reaching it, please let me know and I will try to fix it.