DoubanSpider

A Scrapy project that scrapes data from Douban.

Prerequisites

  • PostgreSQL 11
  • Python 3.6 or later
  • Scrapy
  • Psycopg2
  • Pillow

Configuration

Install PostgreSQL

The following instructions are for Ubuntu 18.04 only; on other systems, please check the official documentation.

Add the repository source to /etc/apt/sources.list.d/pgdg.list:

deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main

Import the repository signing key and update the package index:

$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
$ sudo apt update

Install PostgreSQL

$ sudo apt install postgresql-11 libpq-dev

If you are in China, consider using the Tencent mirror.

Configure PostgreSQL

Create the user, database, and tables from the SQL file:

$ sudo -i -u postgres
$ psql -a -f /path/to/book_init.sql

Edit /etc/postgresql/11/main/pg_hba.conf to add an entry for the spider's user; append:

host    all             donotban        127.0.0.1/32            md5

Change the IP address if your spider runs on a different host rather than locally.

Edit /etc/postgresql/11/main/postgresql.conf and adjust these settings:

# recommended
client_encoding = 'UTF8'
# only affects postgres functions like now()
timezone = 'Asia/Shanghai'
# optional
default_transaction_isolation = SET_YOUR_LEVEL

Finally, restart PostgreSQL:

$ sudo service postgresql restart

Install Scrapy and Psycopg2

$ pip install scrapy psycopg2

Configure Psycopg2 Connection

Modify host (and any other parameters as needed) in douban/config/postgres.json:

{
    "dbname" : "donotban",
    "user" : "donotban",
    "password" : "pleasedonotban",
    "host" : "127.0.0.1",
    "port" : 5432
}
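These keys map one-to-one onto psycopg2 connection keyword arguments. A minimal sketch of loading the config and opening a connection (the `load_pg_config` helper name is illustrative, not part of the project):

```python
import json

def load_pg_config(path="douban/config/postgres.json"):
    """Read the connection parameters from the project's JSON config."""
    with open(path) as f:
        return json.load(f)

# The keys map directly onto psycopg2.connect() keyword arguments:
#   import psycopg2
#   conn = psycopg2.connect(**load_pg_config())
```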

Configure Scrapy Proxy

In douban/settings.py

# API that returns a proxy address
# note that the JSON key in middlewares.py also needs to be adjusted to match your API
PROXY_API = "http://127.0.0.1:5010/get/"
# static proxy address
PROXY_URL = "http://username:password@yourproxyaddress:port"
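Different pool APIs return different JSON shapes, which is why the key in middlewares.py must match your API. A sketch of the lookup under that assumption (the `"proxy"` key and the helper names here are hypothetical):

```python
import json
from urllib.request import urlopen

def proxy_from_payload(payload, key="proxy"):
    """Turn a pool-API JSON payload into a proxy URL Scrapy accepts.
    The "proxy" key is an assumption -- use the key your API returns."""
    return "http://" + payload[key]

def fetch_proxy(api_url):
    """Query the pool API (e.g. PROXY_API) and return one proxy URL."""
    with urlopen(api_url) as resp:
        return proxy_from_payload(json.loads(resp.read().decode()))

# A downloader middleware would then set, per request:
#   request.meta["proxy"] = fetch_proxy(spider.settings["PROXY_API"])
```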

If you are using the Luminati service, edit the JSON file at douban/config/luminati.json. A Luminati middleware is provided; its default behavior is to change the exit IP on each request.

{
    "username": "username",
    "password": "password",
    "country": "cn"
}

Set up Email

Edit douban/config/email.json so that the spider sends an email to the receiver when it terminates:

{
    "sender" : "[email protected]",
    "receiver" : "[email protected]",
    "password" : "authcode",
    "server" : "smtp.example.com",
    "port" : 465
}

Start

# -s JOBDIR=/dir/ saves the spider state so a paused crawl can be resumed
$ nohup scrapy crawl book -L INFO -s JOBDIR=/dir/to/save/ > douban_spider.log 2>&1 &
