How to scrape data
Samuel Depardieu edited this page Dec 13, 2017
Refer to the How to set up guide.
The project currently has two working spiders:
- who_iris: searches articles and fetches every document, following results through the last page
- who_iris_single_page: fetches every document from a single search results page
To run a spider, activate the virtual environment and use `scrapy crawl`:

```shell
source env/bin/activate
scrapy crawl [name]
# e.g. scrapy crawl who_iris
```
With the default settings, this outputs a JSON file in the results folder.
Settings can be overridden on the command line:

```shell
scrapy crawl [name] -s [SETTING_NAME]=[value]
```

Settings available to change:
```shell
# Change the number of results per page:
scrapy crawl [name] -s WHO_IRIS_RPP=1000

# Change the job folder (e.g. to restart a scrape from the beginning):
scrapy crawl [name] -s JOBDIR='crawl/[job_name]'

# Change the logging settings:
scrapy crawl [name] -s LOG_LEVEL=[INFO|WARNING|DEBUG|ERROR]
scrapy crawl [name] -s LOG_ENABLED=[True|False]
```

Note that `-s` overrides take the form `NAME=value` with no spaces around the `=`.
(Note: you should clone the settings file and modify the clone. The settings file Scrapy uses is set in the scrapy.cfg file.)
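As a rough sketch, a cloned settings file is just a Python module of constants. The file name and values below are hypothetical; adjust them to match the actual project:

```python
# my_settings.py - hypothetical clone of the project's settings module.
# All values below are illustrative placeholders.

WHO_IRIS_RPP = 1000          # results per page for the who_iris spider
JOBDIR = 'crawl/who_iris'    # job folder, makes the crawl resumable
LOG_LEVEL = 'INFO'           # one of INFO, WARNING, DEBUG, ERROR
LOG_ENABLED = True           # set to False to silence logging
```

To make Scrapy pick up the clone, point the `default` entry of the `[settings]` section in scrapy.cfg at this module (e.g. `default = my_settings`).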
To change the output method to an AWS S3 bucket:
- First, change the values of the fields AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and FEED_URI to match yours
- Then, change FEED_CONFIG to 'S3'
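Those fields live in the cloned settings file. A minimal sketch, assuming placeholder credentials and bucket path (the exact FEED_URI format depends on the project):

```python
# Hypothetical S3 output section of the cloned settings file.
# Keys and bucket path are placeholders - substitute your own.

AWS_ACCESS_KEY_ID = 'AKIA...'                        # your AWS access key
AWS_SECRET_ACCESS_KEY = 'your-secret-key'            # your AWS secret key
FEED_URI = 's3://your-bucket/results/%(name)s.json'  # destination for the scraped feed
FEED_CONFIG = 'S3'                                   # switch the output method to S3
```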