This project is a web scraper built with FastAPI that collects product information (title, price, and image) from a target e-commerce website. It provides configurable settings, supports in-memory caching with Redis, and stores scraped data in local JSON files. The project follows a clean architecture pattern and leverages *object-oriented programming (OOP)* principles for easy extensibility and maintainability.
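The class names below (`BaseScraper`, `ProductScraper`) are hypothetical; the project's actual abstractions live in `app/core/interfaces.py` and `app/scraper.py`. This is a minimal sketch of how such a layered, OOP-based design typically looks:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Abstract scraping interface; concrete scrapers plug into the API layer."""

    @abstractmethod
    def scrape(self, pages_limit: int, proxy: str | None = None) -> list[dict]:
        """Return a list of scraped products (title, price, image)."""


class ProductScraper(BaseScraper):
    """Hypothetical concrete implementation targeting the configured BASE_URL."""

    def scrape(self, pages_limit: int, proxy: str | None = None) -> list[dict]:
        products: list[dict] = []
        for page in range(1, pages_limit + 1):
            # Fetch and parse each catalogue page here
            # (e.g., with requests + BeautifulSoup).
            ...
        return products
```

Because the API layer depends only on the abstract interface, a new scraper for a different site can be added without touching the endpoints.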
## Prerequisites

Before you begin, ensure that the following software is installed on your system:
- Docker
- Docker Compose
## Steps to Run the Docker Image

Clone the repository:

```bash
git clone <repository_url>
cd <repository_name>
```

Then build and start the containers:

```bash
docker-compose up -d --build
```

To view the logs, run `docker logs <container_id>`.
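The repository's own `docker-compose.yml` is authoritative; as a rough illustration, a Compose setup for this project would typically pair the FastAPI app with a Redis service, something like the following (service names and ports are assumptions):

```yaml
# Illustrative only — the repository's docker-compose.yml is the real source of truth.
services:
  app:
    build: .
    ports:
      - "8000:8000"   # FastAPI served on the default Uvicorn port (assumed)
    env_file:
      - .env
    depends_on:
      - redis
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"   # default Redis port, as noted in the caching section below
```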
## Features

- Product Information Scraping: Scrapes product title, price, and image from a target website.
- Configurable Settings: Easily configurable scraping parameters such as `pages_limit`, `proxy`, etc., with defaults and validation.
- Redis Caching: Uses Redis for in-memory caching to improve scraping performance by avoiding redundant requests.
- JSON Storage: Stores scraped product data in local JSON files for persistence (a sketch follows this list).
- Health Check Endpoint: A simple `/health` endpoint to check the service's status.
- Authorization: API authentication is supported to ensure secure access to the scraping functionality.
- Extensible Architecture: Designed using clean architecture principles to support future enhancements and easy integration of new features.
- Environment Configuration: Settings like API tokens and URLs are externalized in an `.env` file for easy configuration and security.
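As a rough sketch of the JSON storage step (the file name, function name, and sample product are assumptions for illustration):

```python
import json
from pathlib import Path


def save_products(products: list[dict], path: str = "scraped_products.json") -> None:
    """Persist scraped products (title, price, image) to a local JSON file."""
    Path(path).write_text(json.dumps(products, indent=2, ensure_ascii=False))


# Example usage with the fields the scraper collects:
save_products([{"title": "Dental Mirror", "price": 4.99, "image": "mirror.jpg"}])
```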
## Project Structure

- `app/`: The main directory containing all application code.
  - `api/`: Contains the API logic for routing and dependencies.
    - `__init__.py`: Initializes the `api` module.
    - `dependencies.py`: Defines dependencies for FastAPI endpoints (e.g., the scraper dependency).
    - `endpoints.py`: Contains API route definitions and the logic for scraping products.
  - `core/`: Contains the core functionality of the application.
    - `__init__.py`: Initializes the `core` module.
    - `auth.py`: Handles authentication and token validation logic.
    - `cache.py`: Manages caching mechanisms (e.g., Redis).
    - `config.py`: Holds configuration settings for the application.
    - `database.py`: Handles database connections and models.
    - `interfaces.py`: Defines interfaces or abstract classes for common services.
    - `notifications.py`: Contains logic for sending notifications.
  - `tests/`: Contains test cases for the application.
    - `__init__.py`: Initializes the `tests` module.
    - `test_scraper.py`: Contains test cases for the scraper logic.
  - `__init__.py`: Initializes the main application module.
  - `main.py`: Entry point for the FastAPI application, where the app is instantiated.
  - `models.py`: Contains data models used in the application (e.g., product, scraper).
  - `scraper.py`: Contains the logic for scraping product data from the target website.
  - `utils.py`: Contains utility functions for common tasks (e.g., parsing, data handling).
- `.gitignore`: Specifies files and directories to be ignored by Git.
- `README.md`: Documentation for setting up and testing the application.
## Environment Variables

This project requires a few environment variables for configuration. Create a `.env` file in the root of the project with the following variables:

- `AUTH_TOKEN`: A secret token used for API authentication.
- `BASE_URL`: The base URL of the website you want to scrape.

Example `.env` file:

```
AUTH_TOKEN=your-secret-token
BASE_URL=https://dentalstall.com/shop/
```
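FastAPI projects commonly load such variables with `pydantic-settings`; a minimal sketch, assuming `app/core/config.py` follows that pattern (the class and field names are assumptions):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Reads AUTH_TOKEN and BASE_URL from the environment or the .env file."""

    model_config = SettingsConfigDict(env_file=".env")

    auth_token: str
    base_url: str


settings = Settings()  # fails fast if a required variable is missing
```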
## Redis Caching

To improve performance, the project uses Redis for in-memory caching. You will need a Redis server running locally, or you can use a cloud Redis service. If you're running Redis locally, make sure the default port 6379 is accessible.
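As an illustration of the caching pattern described above — serve a page from Redis when possible, otherwise fetch and cache it (the key scheme and one-hour TTL are assumed values, not necessarily the project's own):

```python
import redis
import requests

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def fetch_page(url: str, ttl: int = 3600) -> str:
    """Return page HTML, serving from Redis when a cached copy exists."""
    cached = cache.get(url)
    if cached is not None:
        return cached  # cache hit: skip the redundant request
    html = requests.get(url, timeout=10).text
    cache.setex(url, ttl, html)  # cache for one hour (assumed TTL)
    return html
```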
## Health Check Endpoint

- Endpoint: `GET /health`
- Description: Returns a simple health status of the service.
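A route of this shape typically amounts to little more than the following (the exact response body is an assumption):

```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health() -> dict:
    """Liveness check used by monitoring or container orchestration."""
    return {"status": "ok"}
```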
## Scrape Endpoint

This endpoint initiates the scraping process and returns the scraped product data.

- Method: `POST`
- URL: `/scrape/`

Query parameters:

- `pages_limit`: The number of pages to scrape (default: 5, range: 1–100).
- `proxy`: (Optional) A proxy string to use for requests.
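Based on the defaults and validation rules above, the endpoint's signature in `app/api/endpoints.py` would look roughly like this (the handler body is omitted and the router name is an assumption):

```python
from fastapi import APIRouter, Query

router = APIRouter()


@router.post("/scrape/")
def scrape(
    pages_limit: int = Query(5, ge=1, le=100),  # default 5, validated range 1–100
    proxy: str | None = Query(None),            # optional proxy string
):
    # Delegate to the scraper and report how many products were saved.
    ...
```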
## Authentication

All scraping routes require authentication. Include a Bearer token in the `Authorization` header for secure access, e.g. `Authorization: Bearer your-secret-token`.
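One common way to enforce this in FastAPI is a reusable dependency built on `HTTPBearer`; a sketch, not necessarily the exact logic in `app/core/auth.py` (reading the token from the environment here is an assumption):

```python
import os

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

bearer = HTTPBearer()


def verify_token(credentials: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    """Reject requests whose Bearer token does not match AUTH_TOKEN."""
    if credentials.credentials != os.environ.get("AUTH_TOKEN"):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token"
        )
```

A route can then opt in with `dependencies=[Depends(verify_token)]`.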
To initiate the scraping process, you can use the following example request:
```
POST /scrape/?pages_limit=10&proxy=http://yourproxy.com
Authorization: Bearer your-secret-token
```
A successful response will look like this:
{ "status": "success", "message": "50 products scraped and saved successfully." }
## Testing the API

You can test the API using tools like Postman or curl. Below is an example of how to test the scrape endpoint with curl:

```bash
curl -X POST "http://127.0.0.1:8000/scrape/?pages_limit=10" \
  -H "Authorization: Bearer your-secret-token"
```
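For automated tests (the project ships `tests/test_scraper.py`), the same endpoint can be exercised in-process with FastAPI's `TestClient`; a sketch assuming the app object is importable from `app.main`:

```python
from fastapi.testclient import TestClient

from app.main import app  # assumed import path, based on the project layout

client = TestClient(app)


def test_scrape_requires_auth():
    """The scrape endpoint should reject requests without a Bearer token."""
    response = client.post("/scrape/?pages_limit=1")
    assert response.status_code in (401, 403)
```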