Template project for downloading a site with Scrapy. It crawls a given website, follows the links that match the configured domain and URL filters, and saves the scraped pages as HTML files.
- Clone this repository and cd into it
- Install the dependencies using the following command:
    pip install -r requirements.txt
- Configure the crawler/spiders/site.py file for the site you want to crawl
- Start the downloader using the following command (be sure to run this from the repository root!):
    scrapy crawl site
- Refer to the Scrapy documentation for best practices and other configuration options
- When the crawler finishes, the HTML files will be located in the /html directory
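
To make the domain and URL filters mentioned above more concrete, here is a minimal sketch in plain Python (standard library only) of the two decisions such a crawler makes for each page: whether a URL should be crawled, and where its HTML would be saved. Everything named here (`ALLOWED_DOMAIN`, `ALLOW_PATTERNS`, `DENY_PATTERNS`, `OUTPUT_DIR`, `should_crawl`, `output_path`, and the `example.com` values) is hypothetical and is not taken from crawler/spiders/site.py, which may be organized quite differently. It also assumes that /html refers to an html folder under the repository root.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical settings for illustration only; the real names and values
# in crawler/spiders/site.py may differ.
ALLOWED_DOMAIN = "example.com"   # only this domain and its subdomains
ALLOW_PATTERNS = [r"^/docs/"]    # URL paths to include
DENY_PATTERNS = [r"\.pdf$"]      # URL paths to skip even if allowed
OUTPUT_DIR = "html"              # assumed output folder in the repo root


def should_crawl(url: str) -> bool:
    """Return True if the URL passes the domain and path filters."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Accept the domain itself or any of its subdomains.
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        return False
    path = parts.path or "/"
    # Deny patterns take precedence over allow patterns.
    if any(re.search(pattern, path) for pattern in DENY_PATTERNS):
        return False
    return any(re.search(pattern, path) for pattern in ALLOW_PATTERNS)


def output_path(url: str) -> str:
    """Map a page URL to a relative file path under OUTPUT_DIR."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        # Directory-style URLs are stored as index.html.
        path += "index.html"
    elif not path.endswith((".html", ".htm")):
        path += ".html"
    return str(PurePosixPath(OUTPUT_DIR) / path.lstrip("/"))


if __name__ == "__main__":
    for url in ("https://example.com/docs/", "https://example.com/docs/intro"):
        print(url, should_crawl(url), output_path(url))
```

In a Scrapy spider, filters like these are usually expressed through the spider's `allowed_domains` attribute and the `allow` and `deny` arguments of a `LinkExtractor`, so the sketch only mirrors the logic rather than replacing it. It does not sanitize unusual paths (for example `..` segments), which a real downloader should handle before writing files.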