Skip to content

LittleRedHat/moh

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

93 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

## 环境安装

安装虚拟环境 virtualenv venv
激活虚拟环境 source ./venv/bin/activate(mac/linux) windows上 venv/Scripts/activate
首次运行安装依赖包: pip install -r requirements.txt


## 爬虫运行
cd crawl
运行命令:
scrapy crawl moh -a domain=[NATION] -a debug=True
命令停止用ctrl+\
ctrl+z or ctrl+c无法终止scrapy,会导致scrapy在后台运行

##命令说明
domain对应国家的缩写,debug选项打开调试模式(此时不会将爬下来的网页写到数据库中,并会输出爬取得网页信息基本信息,用于查看时间,标题是否正确



## 爬取规则制定说明



## TODO
css 内部字体下载
过滤掉视频等不需要文件下载

 ## api 修改

添加title搜索和日期过滤 

示例如下
{
  "should": [
    [
      "cancer"
    ]
  ],
  "by": "title",
  "size": "20",
  "from": 0,
  "sort": "score",
  "filters": [
    {
      "name": "type",
      "value": [
        "html"
      ]
    },
    {
      "name": "publish",
      "value": [
        "2014-01-01",
        "2017-01-01"
      ]
    },
    {
      "name": "nation",
      "value": [
        "all"
      ]
    },
    {
      "name": "language",
      "value": [
        "all"
      ]
    }
  ]
}
by -> title 或者 content
日期过滤 -> 时间格式为 'yyyy-mm-dd'
{
    name:"date",
    "value":[开始时间,截止时间]
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published