
How to scrape data

Samuel Depardieu edited this page Dec 13, 2017 · 5 revisions

If you don't have a functional dev environment

Refer to the How to set up guide.

If you have your dev environment ready

The project currently has two working spiders:

  • who_iris: Searches articles and gets all documents, up to the last results page
  • who_iris_single_page: Gets all documents from a single search results page

Activate the virtual env and run a spider:

source env/bin/activate
scrapy crawl [name]

(e.g. scrapy crawl who_iris)
If the settings are in their initial state, this will write a JSON file to the results folder.
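
To check what a run produced, the output file can be loaded from Python. The following is only a minimal sketch: the helper name load_latest_results is hypothetical, the folder is assumed to be called results and to sit in the current directory, and the file is assumed to hold a single JSON document (not one JSON object per line).

```python
import json
from pathlib import Path


def load_latest_results(results_dir="results"):
    """Return the parsed content of the most recently modified .json file.

    Hypothetical helper: assumes the spider wrote one JSON document per file.
    """
    json_files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not json_files:
        raise FileNotFoundError(f"no .json file found in {results_dir!r}")
    # The last entry is the newest file after sorting by modification time.
    with json_files[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Calling load_latest_results() after a crawl returns the parsed items, which can then be inspected or counted.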

Changing the settings inline:

scrapy crawl [name] -s [setting to change]

Settings available to change:

# Change the number of results per page:
scrapy crawl [name] -s WHO_IRIS_RPP=1000

# Change the job folder (e.g. to restart a scrape from the beginning):
scrapy crawl [name] -s JOBDIR='crawl/[job_name]'

# Change the logging settings:
scrapy crawl [name] -s LOG_LEVEL=[INFO/WARNING/DEBUG/ERROR]
scrapy crawl [name] -s LOG_ENABLED=[True/False]
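
When several overrides are needed at once, the command can also be assembled from a script. Scrapy's -s option expects each override as a single NAME=VALUE argument, with no spaces around the equals sign. The sketch below is one possible way to build such a command; the helper names build_crawl_command and run_crawl are hypothetical and not part of the project.

```python
import subprocess


def build_crawl_command(spider, **settings):
    """Return the argument list for `scrapy crawl` with -s overrides."""
    command = ["scrapy", "crawl", spider]
    for name, value in settings.items():
        # Each override is passed as one NAME=VALUE argument.
        command += ["-s", f"{name}={value}"]
    return command


def run_crawl(spider, **settings):
    """Run the spider; call this from inside the activated virtual env."""
    return subprocess.run(build_crawl_command(spider, **settings), check=True)
```

For example, build_crawl_command("who_iris", WHO_IRIS_RPP=1000, LOG_LEVEL="INFO") yields the same arguments as typing scrapy crawl who_iris -s WHO_IRIS_RPP=1000 -s LOG_LEVEL=INFO in the shell.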

Settings to change in the settings file:

(Note: You should copy the settings file and modify the copy. The settings file used by Scrapy can be changed in the scrapy.cfg file.)
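
In scrapy.cfg, the settings module is named by the default key of the [settings] section. As a rough sketch, the switch to a copied settings file could be scripted as follows; the module names myproject.settings and myproject.settings_local are placeholders, since the project's actual module name is not given here.

```python
import configparser
import io


def point_to_settings(cfg_text, settings_module):
    """Return scrapy.cfg content whose [settings] default is settings_module."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section("settings"):
        parser.add_section("settings")
    parser.set("settings", "default", settings_module)
    output = io.StringIO()
    parser.write(output)
    return output.getvalue()


# Placeholder module names, for illustration only.
original = "[settings]\ndefault = myproject.settings\n"
updated = point_to_settings(original, "myproject.settings_local")
```

Editing the default line by hand has the same effect; the script form is only useful when the switch has to be automated.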

# Change the output method to an AWS S3 bucket:
First, set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and FEED_URI to your own credentials and bucket location.
Then, set FEED_CONFIG to 'S3'.
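
As an illustration, the relevant part of a copied settings file might look like the sketch below. The bucket name and path are placeholders, and reading the credentials from environment variables is a suggestion, not something the project requires; the rest of the copied settings file is assumed to stay unchanged.

```python
import os

# Excerpt of a copied settings file (the file name is up to you).
# Credentials are read from the environment so they are not committed.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

# Placeholder bucket and path; %(name)s and %(time)s are filled in by
# Scrapy with the spider name and the crawl timestamp.
FEED_URI = "s3://your-bucket-name/results/%(name)s_%(time)s.json"

# Project-specific switch described above.
FEED_CONFIG = "S3"
```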