This Data Warehouse project stores and serves large volumes of financial and commodities data. It is written in Python, uses an ORM layer for database access, and exposes a RESTful API for reading the data.
The architecture is built around Python and Cassandra, with Docker managing the containers. The Python ORM layer lets database operations be written as Python code instead of raw CQL strings, which keeps the data access code easier to maintain.
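To make the idea of an ORM layer concrete, here is a toy sketch of mapping a Python object to a parameterized CQL statement. It is not the project's ORM: the `AssetPrice` class, the `asset_prices` table, the column names, and the `%s` placeholder style are all hypothetical and chosen only for illustration. The real models live in `data/models.py`.

```python
from dataclasses import astuple, dataclass, fields
from datetime import date


@dataclass(frozen=True)
class AssetPrice:
    """Hypothetical row model, used only for this illustration."""
    asset_id: str
    trade_date: date
    close: float


def insert_statement(row, table):
    """Return a parameterized INSERT statement and its values for a dataclass row."""
    columns = [f.name for f in fields(row)]
    placeholders = ", ".join(["%s"] * len(columns))
    cql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return cql, astuple(row)


# Example with made-up values: the statement text and the values stay separate,
# so the values can be passed to a database driver as parameters.
statement, values = insert_statement(
    AssetPrice("AAPL", date(2024, 1, 2), 100.0), "asset_prices"
)
print(statement)
print(values)
```

The point of the sketch is that calling code works with typed Python objects, while the statement text is generated in one place.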
- Python 3.10 or later
- Docker and Docker Compose
- Cassandra
- Virtualenv or any environment management tool
- Clone the repository:

      git clone https://yourrepository.com/data-warehouse.git
      cd data-warehouse

- Set up the virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install dependencies:

      pip install -r requirements.txt

- Launch Docker containers:

      docker-compose up -d

- Database Initialization: Execute scripts to configure the database schema and seed it with initial data.
- `src/`: Contains all source files.
  - `clients/`: API clients for data sources.
    - `commodities_api_client.py`: Retrieves commodities data.
    - `nasdaq_api_client.py`: Fetches NASDAQ data.
  - `config/`: Application configurations.
    - `settings.py`: Central config file.
  - `data/`: Handles database operations.
    - `database.py`: Manages database connections.
    - `models.py`: Defines ORM models.
  - `ingestion/`: Manages data loading and processing.
    - `load.py`: Ingests data into the database.
    - `transform.py`: Transforms data as needed.
  - `init_scripts/`: Database initialization scripts.
    - `populate_commodities_data.py`: Seeds commodities data.
    - `populate_sp500_data.py`: Seeds S&P 500 data.
  - `utils/`: Utility scripts.
    - `log_helper.py`: Provides logging functions.
This project uses Docker to containerize and manage the Cassandra database cluster, which keeps the development and deployment environments consistent. The setup is defined in the `docker-compose.yml` file, which configures a three-node Cassandra cluster along with Portainer for container management.

The `docker-compose.yml` file defines the services and their configurations as follows:

    version: '3'
    services:
      # Node 1 Configuration
      DC1N1:
        image: cassandra:3.10
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 0; fi && /docker-entrypoint.sh cassandra -f'
        networks:
          - dc1ring
        volumes:
          - ./n1data:/var/lib/cassandra
        environment:
          - CASSANDRA_CLUSTER_NAME=dev_cluster
          - CASSANDRA_SEEDS=DC1N1
        expose:
          - 7000 # Cluster communication
          - 7001 # SSL Cluster communication
          - 7199 # JMX
          - 9042 # CQL
          - 9160 # Thrift service
        ports:
          - "9042:9042"
        ulimits:
          memlock: -1
          nproc: 32768
          nofile: 100000
      # Node 2 Configuration
      DC1N2:
        image: cassandra:3.10
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 60; fi && /docker-entrypoint.sh cassandra -f'
        networks:
          - dc1ring
        volumes:
          - ./n2data:/var/lib/cassandra
        environment:
          - CASSANDRA_CLUSTER_NAME=dev_cluster
          - CASSANDRA_SEEDS=DC1N1
        depends_on:
          - DC1N1
        expose:
          - 7000
          - 7001
          - 7199
          - 9042
          - 9160
        ports:
          - "9043:9042"
        ulimits:
          memlock: -1
          nproc: 32768
          nofile: 100000
      # Node 3 Configuration
      DC1N3:
        image: cassandra:3.10
        command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 120; fi && /docker-entrypoint.sh cassandra -f'
        networks:
          - dc1ring
        volumes:
          - ./n3data:/var/lib/cassandra
        environment:
          - CASSANDRA_CLUSTER_NAME=dev_cluster
          - CASSANDRA_SEEDS=DC1N1
        depends_on:
          - DC1N1
        expose:
          - 7000
          - 7001
          - 7199
          - 9042
          - 9160
        ports:
          - "9044:9042"
        ulimits:
          memlock: -1
          nproc: 32768
          nofile: 100000
      # Portainer Configuration
      portainer:
        image: portainer/portainer
        networks:
          - dc1ring
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
          - ./portainer-data:/data
        ports:
          - "9000:9000"
    networks:
      dc1ring: { }

- Cassandra Nodes (`DC1N1`, `DC1N2`, `DC1N3`):
  - Each service represents a Cassandra node in the cluster.
  - `image` specifies the Docker image used.
  - `command` makes the node sleep before starting Cassandra if its data directory is empty (0, 60, and 120 seconds for nodes 1, 2, and 3), which staggers the nodes' first startup.
  - `networks` attaches the node to the internal network (`dc1ring`) for the cluster.
  - `volumes` maps a host directory to the container's data directory for persistent storage.
  - `environment` variables set cluster configuration such as `CASSANDRA_CLUSTER_NAME` and `CASSANDRA_SEEDS`; every node uses `DC1N1` as its seed.
  - `expose` lists the ports used inside the network, and `ports` publishes each node's CQL port 9042 on the host (as 9042, 9043, and 9044).
  - `ulimits` sets resource limits for the container.
- Portainer:
  - The Portainer service provides a web-based interface for managing Docker containers, published on port 9000.
  - It is configured to use the same `dc1ring` network and has access to the Docker socket for control.
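Because nodes 2 and 3 start with a delay on first launch, it can help to check that the published CQL ports are reachable before running anything against the cluster. The sketch below uses only the standard library and takes the host ports from the compose file above. The function names, timeouts, and polling interval are assumptions made for illustration. A successful TCP connection is only a coarse signal: Docker may accept connections on a published port before Cassandra itself is ready, so a stricter check would run an actual CQL query.

```python
import socket
import time

# Host ports published in docker-compose.yml for the three Cassandra nodes.
CQL_PORTS = {"DC1N1": 9042, "DC1N2": 9043, "DC1N3": 9044}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_nodes(ports, host="127.0.0.1", deadline_s=300.0, poll_s=5.0):
    """Poll until every port accepts connections or the deadline passes.

    Returns the sorted names of nodes that are still unreachable
    (an empty list when all of them are reachable).
    """
    pending = dict(ports)
    stop_at = time.monotonic() + deadline_s
    while pending:
        # Keep only the nodes whose port still refuses connections.
        pending = {name: port for name, port in pending.items()
                   if not port_is_open(host, port)}
        if not pending or time.monotonic() >= stop_at:
            break
        time.sleep(poll_s)
    return sorted(pending)
```

Calling `wait_for_nodes(CQL_PORTS)` after `docker-compose up -d` returns an empty list once all three ports accept connections, or the names of the nodes that did not come up within the deadline.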
The API is structured around resources representing financial data and commodities. It supports retrieving data by asset identifier and paginating results with the `limit` and `offset` parameters.

For easier use, a Postman collection is provided. You can download it here.
- `GET /api/v1/data/{asset_id}`
  - Retrieves financial data for a specified asset.
  - Parameters:
    - `asset_id`: Identifier of the asset (a ticker such as `AAPL` in the examples).
    - `limit`: Number of records to return.
    - `offset`: Pagination offset.
  - Example: http://127.0.0.1:8000/api/v1/data/AAPL?limit=20&offset=0
- `GET /api/v1/commodities/{commodity_id}`
  - Fetches commodity data.
  - Parameters:
    - `commodity_id`: Identifier for the commodity.
    - `limit`: Number of records to return.
    - `offset`: Pagination offset.
  - Example: http://127.0.0.1:8000/api/v1/commodities/brent?limit=20&offset=0
- `GET /api/v1/assets`
  - Retrieves a list of asset names.
  - Parameters:
    - `offset`: The number of records to skip from the beginning.
    - `limit`: The number of records to return.
  - Example: http://127.0.0.1:8000/api/v1/assets?offset=0&limit=20
- `GET /api/v1/data_sources`
  - Retrieves a list of all data sources.
  - Example: http://127.0.0.1:8000/api/v1/data_sources
- `GET /api/v1/data_sources/{source_id}`
  - Retrieves details of a specific data source.
  - Parameters:
    - `source_id`: UUID of the data source.
  - Example: http://127.0.0.1:8000/api/v1/data_sources/{source_id}

    # Fetch financial data for a specific asset
    curl -X GET "http://localhost:8000/api/v1/data/AAPL?limit=10&offset=0"

    # Retrieve commodity data
    curl -X GET "http://localhost:8000/api/v1/commodities/brent?limit=5&offset=0"

    # Get a list of assets
    curl -X GET "http://localhost:8000/api/v1/assets?offset=0&limit=20"

    # Get a list of data sources
    curl -X GET "http://localhost:8000/api/v1/data_sources"

    # Get details of a specific data source
    curl -X GET "http://localhost:8000/api/v1/data_sources/{source_id}"
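
The same requests can be made from Python. The following is a minimal sketch using only the standard library. The helper names (`build_url`, `fetch_json`, `iter_pages`) are hypothetical, and the paging loop assumes that each paginated endpoint returns a JSON array and that a page shorter than `limit` is the last one. Neither assumption is confirmed by this README, so adjust the loop to the actual response format.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000/api/v1"


def build_url(resource, item_id=None, limit=None, offset=None):
    """Build an endpoint URL such as .../data/AAPL?limit=20&offset=0."""
    url = f"{BASE_URL}/{resource}"
    if item_id is not None:
        url += "/" + quote(str(item_id), safe="")
    params = {key: value
              for key, value in (("limit", limit), ("offset", offset))
              if value is not None}
    return f"{url}?{urlencode(params)}" if params else url


def fetch_json(url):
    """GET a URL and decode the JSON response body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def iter_pages(resource, item_id=None, page_size=20, fetch=fetch_json):
    """Yield successive pages, advancing offset by page_size each time.

    ASSUMPTION: each response is a JSON array; an empty or short page
    is treated as the end of the data.
    """
    offset = 0
    while True:
        page = fetch(build_url(resource, item_id, limit=page_size, offset=offset))
        if not page:
            return
        yield page
        if len(page) < page_size:
            return
        offset += page_size
```

For example, `for page in iter_pages("data", "AAPL", page_size=20): ...` requests `/api/v1/data/AAPL` with `offset` 0, 20, 40, and so on. The `fetch` parameter can be replaced with a stub to test the loop without a running server.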