assessor-scraper

The goal of this project is to transform the data from the Orleans Parish Assessor's Office website into formats that are better suited for data analysis.

development environment setup

prerequisites

You must have Python 3 installed. You can download it from python.org.

first set up a Python virtual environment

python3 -m venv .venv
. .venv/bin/activate

install the dependencies with pip

pip install -r requirements.txt

Getting started

Set up the database

By default, the scraper is set up to load data into a PostgreSQL database. Docs on setting up and making changes to the database are here. You can quickly get the database running locally using Docker.

docker-compose up -d db

If you want to explore how to extract data using scrapy, use the scrapy shell to interactively work with the response.

For example,

scrapy shell 'http://qpublic9.qpublic.net/la_orleans_display.php?KEY=1500-SUGARBOWLDR'
owner = response.xpath('//td[@class="owner_value"]/text()').get()
total_value = response.xpath('//td[@class="tax_value"]/text()')[3].get().strip()
next_page = response.xpath('//td[@class="header_link"]/a/@href').get()
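To try the same selection logic without hitting the website, the sketch below runs equivalent XPath queries with the standard library's xml.etree.ElementTree instead of Scrapy. The HTML fragment and its values are made up for illustration; only the class names owner_value and tax_value come from the example above, and the real page markup may differ.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration; not real assessor data.
SAMPLE = """
<table>
  <tr><td class="owner_value">EXAMPLE OWNER</td></tr>
  <tr>
    <td class="tax_value"> 2019 </td>
    <td class="tax_value"> $10,000 </td>
    <td class="tax_value"> $90,000 </td>
    <td class="tax_value"> $100,000 </td>
  </tr>
</table>
"""


def extract(fragment: str) -> dict:
    """Pull the owner and the fourth tax_value cell out of a fragment."""
    root = ET.fromstring(fragment.strip())
    owner = root.find(".//td[@class='owner_value']").text
    # Index 3 selects the fourth tax_value cell, mirroring the shell example.
    cells = root.findall(".//td[@class='tax_value']")
    total_value = cells[3].text.strip()
    return {"owner": owner, "total_value": total_value}


if __name__ == "__main__":
    print(extract(SAMPLE))
```

ElementTree only supports a subset of XPath and requires well-formed markup, so this is a way to reason about the queries, not a replacement for the scrapy shell.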

Get all the parcel ids

Getting a list of parcel ids allows us to build urls for every property so we can scrape the data for that parcel. These parcel ids are used in the url like http://qpublic9.qpublic.net/la_orleans_display.php?KEY=701-POYDRASST, where 701-POYDRASST is the parcel id.
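As a minimal sketch of that URL scheme, the function below builds the page URL for a parcel id. The name parcel_url is hypothetical and is not taken from the project's code; the base URL and the KEY parameter come from the example above.

```python
from urllib.parse import urlencode

# Base URL taken from the example URL above.
BASE_URL = "http://qpublic9.qpublic.net/la_orleans_display.php"


def parcel_url(parcel_id: str) -> str:
    """Return the assessor page URL for one parcel id."""
    # urlencode escapes any characters that are not safe in a query string.
    return f"{BASE_URL}?{urlencode({'KEY': parcel_id})}"


if __name__ == "__main__":
    print(parcel_url("701-POYDRASST"))
```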

Running the parcel_id_extractor.py script uses the site's owner search to extract all available parcel ids, then saves them to the file parcel_ids.txt.

The file is checked in to the repo, but if you want to run it yourself to update it with the latest, run

python parcel_id_extractor.py
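If you want to inspect the resulting file yourself, a small helper like the one below may be useful. It assumes parcel_ids.txt holds one parcel id per line, which is not confirmed here; the name unique_parcel_ids is hypothetical.

```python
from typing import Iterable, Iterator


def unique_parcel_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty parcel ids in input order, skipping duplicates."""
    seen = set()
    for line in lines:
        parcel_id = line.strip()
        # Skip blank lines and ids that were already yielded.
        if parcel_id and parcel_id not in seen:
            seen.add(parcel_id)
            yield parcel_id


if __name__ == "__main__":
    # Assumes the file sits in the current directory, one id per line.
    with open("parcel_ids.txt", encoding="utf-8") as handle:
        ids = list(unique_parcel_ids(handle))
    print(f"{len(ids)} unique parcel ids")
```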

Running the spider

Running the spider from the command line will crawl the assessor's website and output the data to a destination of your choice.

By default, the spider will output data to a PostgreSQL database, which is configured in scraper/settings.py. You can use a hosted PostgreSQL instance or run one locally using Docker, as described in "Set up the database" above.

Important Note: Scraping should always be done responsibly, so check the site's robots.txt file to make sure it does not explicitly tell crawlers to stay away. When running the scraper, be careful not to put unexpected load on the assessor's website. Consider running during off-peak hours, or measure response latency while crawling to make sure you aren't overwhelming the servers.
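One way to check a robots.txt file programmatically is the standard library's urllib.robotparser. The sketch below takes the robots.txt text as a string so it can be tried offline; the function name is_allowed and the sample rules are made up for illustration and do not reflect the assessor site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt content, made up for this example.
    sample = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(sample, "http://example.com/private/page"))
    print(is_allowed(sample, "http://example.com/public/page"))
```

Scrapy also has built-in settings for polite crawling, such as ROBOTSTXT_OBEY, DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED. Whether scraper/settings.py already sets them is not stated here, so check that file before relying on them.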

To run the spider,

scrapy runspider scraper/spiders/assessment_spider.py

Warning: this will take a long time to run. You can stop the process with Ctrl+C.

To run the spider and output to a CSV file,

scrapy runspider scraper/spiders/assessment_spider.py -o output.csv
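Once output.csv exists, it can be loaded with the standard library's csv module. This is a minimal sketch; the helper name load_rows is hypothetical, and the column names in the commented sample are placeholders, since the spider's actual fields are not listed here.

```python
import csv
from typing import Dict, Iterable, List


def load_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse CSV lines into a list of dicts keyed by the header row."""
    return list(csv.DictReader(lines))


if __name__ == "__main__":
    # newline="" is what the csv module recommends when opening files.
    with open("output.csv", newline="", encoding="utf-8") as handle:
        rows = load_rows(handle)
    print(f"{len(rows)} rows")
    if rows:
        # Print the column names found in the header row.
        print(list(rows[0].keys()))
```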

Running on Heroku

Set required environment variables:

heroku config:set DATABASE_URL=postgres://user:pass@host:5432/assessordb
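To see which parts of the DATABASE_URL value carry which piece of connection information, the sketch below splits a URL of that shape with urllib.parse. It only illustrates the URL layout; how the scraper itself reads the variable is defined in the project's settings and is not shown here. The function name describe_database_url is hypothetical.

```python
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict:
    """Split a postgres:// URL into its connection components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path starts with "/", so drop it to get the database name.
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Placeholder credentials copied from the example command above.
    print(describe_database_url("postgres://user:pass@host:5432/assessordb"))
```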

You can run the scraper on Heroku by scaling up the worker dyno:

heroku ps:scale worker=1

See the Heroku docs for more info on how to deploy Python code.

Running in AWS with Terraform

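Install Terraform, then run the commands below, replacing {public_dns} with the public DNS name of the instance that Terraform creates.
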
cd terraform
terraform init
terraform plan
terraform apply
ssh ubuntu@{public_dns}

