A Scrapy project that scrapes data from Douban.
- PostgreSQL 11
- Python 3.6 or later
- Scrapy
- Psycopg2
- Pillow
The following instructions are for Ubuntu 18.04 only; for other systems, please check the official documentation.
Add the repository source to /etc/apt/sources.list.d/pgdg.list:
deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main
Import the repository signing key and update the package index:
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
$ sudo apt update
Install PostgreSQL
$ sudo apt install postgresql-11 libpq-dev
If you are in China, consider using the Tencent mirror.
Create the user, database, and table from the SQL file:
$ sudo -i -u postgres
$ psql -a -f /path/to/book_init.sql
Edit /etc/postgresql/11/main/pg_hba.conf to add an entry for our user; append:
host all donotban 127.0.0.1/32 md5
Change the IP address if your spider runs on a different host rather than locally.
Edit /etc/postgresql/11/main/postgresql.conf and modify a few settings:
# recommended
client_encoding = 'UTF8'
# only affects postgres functions like now()
timezone = 'Asia/Shanghai'
# optional
default_transaction_isolation = SET_YOUR_LEVEL
Finally, restart PostgreSQL:
$ sudo service postgresql restart
$ pip install scrapy psycopg2 pillow
Modify the host and, if needed, other parameters in douban/config/postgres.json:
{
"dbname" : "donotban",
"user" : "donotban",
"password" : "pleasedonotban",
"host" : "127.0.0.1",
"port" : 5432
}
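A quick way to confirm that this config and the pg_hba.conf entry work is a small psycopg2 connection check. The script below is not part of the project; it is only a minimal sketch, assuming the repository layout above.

# verify_db.py -- minimal connectivity check, not part of this repo
import json

import psycopg2

with open("douban/config/postgres.json") as f:
    cfg = json.load(f)

# the JSON keys match psycopg2.connect() keyword arguments,
# so the dict can be unpacked directly
conn = psycopg2.connect(**cfg)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()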
In douban/settings.py, configure the proxy settings:
# API that returns proxy address
# note that the JSON key in middlewares.py also needs to be adjusted to match your API's response
PROXY_API = "http://127.0.0.1:5010/get/"
# static proxy address
PROXY_URL = "http://username:password@yourproxyaddress:port"
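For reference, a downloader middleware typically consumes PROXY_API along the following lines. This is only a sketch: the class name and the "proxy" JSON key are assumptions, and the project's actual middlewares.py may look different.

# sketch of a proxy-rotating downloader middleware (illustrative, not the repo's code)
import json
from urllib.request import urlopen


class RandomProxyMiddleware:
    def __init__(self, proxy_api):
        self.proxy_api = proxy_api

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("PROXY_API"))

    def process_request(self, request, spider):
        # the API is assumed to return JSON like {"proxy": "ip:port"};
        # change the key here to match your own proxy service
        data = json.loads(urlopen(self.proxy_api).read().decode())
        request.meta["proxy"] = "http://" + data["proxy"]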
If you are using the Luminati service, check the JSON file at douban/config/luminati.json. A Luminati middleware is provided; its default behavior is to change the IP on each request.
{
"username": "username",
"password": "password",
"country": "cn"
}
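Changing the IP on each request is usually done by embedding a fresh session id in the proxy username. The sketch below illustrates the idea; the superproxy endpoint and port are common Luminati defaults and not values taken from this repository, and the class name is assumed.

# sketch of per-request IP rotation via Luminati session ids (illustrative)
import json
import random


class LuminatiMiddleware:
    def __init__(self, cfg):
        self.cfg = cfg

    @classmethod
    def from_crawler(cls, crawler):
        with open("douban/config/luminati.json") as f:
            return cls(json.load(f))

    def process_request(self, request, spider):
        # a fresh session id on every request asks the superproxy for a new exit IP
        session = random.randrange(10 ** 8)
        user = "{}-country-{}-session-{}".format(
            self.cfg["username"], self.cfg["country"], session)
        request.meta["proxy"] = "http://{}:{}@zproxy.lum-superproxy.io:22225".format(
            user, self.cfg["password"])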
Edit douban/config/email.json so that when the spider terminates it sends an email to the receiver:
{
"sender" : "[email protected]",
"receiver" : "[email protected]",
"password" : "authcode",
"server" : "smtp.example.com",
"port" : 465
}
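The notification itself can be sent with the standard library. The helper below is only a sketch of how these fields would be used, assuming port 465 means implicit TLS; the project's actual implementation may differ.

# sketch of a close-notification helper using email.json (illustrative)
import json
import smtplib
from email.mime.text import MIMEText


def send_notification(subject, body, config_path="douban/config/email.json"):
    with open(config_path) as f:
        cfg = json.load(f)
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = cfg["sender"]
    msg["To"] = cfg["receiver"]
    # port 465 implies implicit TLS, hence SMTP_SSL rather than SMTP + starttls()
    with smtplib.SMTP_SSL(cfg["server"], cfg["port"]) as server:
        server.login(cfg["sender"], cfg["password"])
        server.send_message(msg)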
# with -s JOBDIR=/dir/ the spider state is saved when you pause the spider
$ nohup scrapy crawl book -L INFO -s JOBDIR=/dir/to/save/ > douban_spider.log 2>&1 &
To resume a paused crawl, run the same command again with the same JOBDIR.