This project designs and implements an ETL (Extract, Transform, Load) pipeline. The pipeline imports CSV data into a MySQL database, calculates aggregate data and stores it in a PostgreSQL database, Dockerizes the ETL application, and uses Docker Compose to deploy both the MySQL and PostgreSQL databases and run the ETL code within containers. Optional components, such as Grafana for monitoring, are also included.
## Table of Contents

- Project Overview
- Project Structure
- Getting Started
- Grafana Integration
- Alternative Approaches and Suggestions
- Conclusion
## Project Overview

The main objectives of this project are:
- Import CSV data into a MySQL database.
- Calculate aggregate data and store it in a PostgreSQL database, including weekly average activity and total activity for each email domain, while filtering out invalid sessions (a query sketch follows this list).
- Containerize the ETL application using Docker.
- Use Docker Compose to deploy MySQL and PostgreSQL databases and run the ETL code within containers.
- Monitor selected Postgres metrics using Grafana.
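As a concrete illustration of the aggregation objective, the sketch below computes total activity and weekly average activity per email domain while filtering out invalid sessions. It is a minimal sketch, not the project's actual query: the table and column names (`space_session_info`, `user`, `duration`, `started_at`, `is_valid`, `domain_activity`) are assumptions; the real schema is documented in `DATADESCRIPTION.md` and the real queries live in `app/queries.py`.

```python
# Hypothetical sketch only: aggregate session data read from MySQL and
# load the results into PostgreSQL. Table/column names are assumptions.
AGGREGATE_QUERY = """
    SELECT
        SUBSTRING_INDEX(u.email, '@', -1) AS email_domain,
        SUM(s.duration) AS total_activity,
        SUM(s.duration) / COUNT(DISTINCT YEARWEEK(s.started_at)) AS weekly_avg_activity
    FROM space_session_info AS s
    JOIN `user` AS u ON u.id = s.user_id
    WHERE s.is_valid = 1                      -- drop invalid sessions
    GROUP BY email_domain
"""

INSERT_QUERY = (
    "INSERT INTO domain_activity (email_domain, total_activity, weekly_avg_activity) "
    "VALUES (%s, %s, %s)"
)


def run_aggregation(mysql_conn, pg_conn) -> None:
    """Read aggregates from MySQL and write them into PostgreSQL."""
    src = mysql_conn.cursor()
    dst = pg_conn.cursor()
    try:
        src.execute(AGGREGATE_QUERY)
        dst.executemany(INSERT_QUERY, src.fetchall())
        pg_conn.commit()
    finally:
        src.close()
        dst.close()
```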
## Project Structure

Here is a structured overview of the project directory, with a description for each file and directory:
- `InSpace/`: Project root directory.
  - `requirements.txt`: Python packages required for the project.
  - `requirements.dev.txt`: Python packages required for development.
  - `README.md`: Documentation file for the project.
  - `InSpace_ Data Engineer Task.pdf`: PDF document with the data engineer task instructions.
  - `Dockerfile`: Configuration file for building the application's Docker image.
  - `docker-compose.yml`: Docker Compose configuration defining the services and containers.
  - `docker_cleanup.sh`: Shell script for cleaning up Docker containers and resources.
  - `DATADESCRIPTION.md`: Documentation describing the data used in the project.
  - `.gitignore`: Specifies which files and directories Git should ignore.
  - `.dockerignore`: Specifies which files and directories Docker should ignore.
  - `.env`: Environment file for storing configuration variables.
  - `data/`: Directory for storing data files.
    - `user.csv`: CSV file containing user data.
    - `space_session_info.csv`: CSV file containing space session information.
    - `space_attendee.csv`: CSV file containing attendee data.
  - `configs/`: Directory for configuration files.
    - `grafana/`: Directory for Grafana-related configurations.
  - `app/`: Directory containing the application's source code.
    - `queries.py`: Python module defining the database queries.
    - `main.py`: Main application entry point.
    - `connections.py`: Python module for managing database connections.
    - `configs.py`: Application configuration module.
    - `__init__.py`: Python package initialization file.
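For orientation, here is a hypothetical sketch of how `app/configs.py` and `app/connections.py` might fit together; the environment variable names and defaults are assumptions, not the actual source.

```python
# Hypothetical sketch: read connection settings from the environment
# (populated from .env by docker-compose) and open both database connections.
# Environment variable names and defaults are assumptions.
import os

import mysql.connector  # assumed MySQL driver
import psycopg2         # assumed PostgreSQL driver


def mysql_connection():
    """Connect to the MySQL database that holds the imported CSV data."""
    return mysql.connector.connect(
        host=os.getenv("MYSQL_HOST", "mysql"),
        user=os.getenv("MYSQL_USER", "etl"),
        password=os.getenv("MYSQL_PASSWORD", ""),
        database=os.getenv("MYSQL_DATABASE", "inspace"),
    )


def postgres_connection():
    """Connect to the PostgreSQL database that stores the aggregates."""
    return psycopg2.connect(
        host=os.getenv("POSTGRES_HOST", "postgres"),
        user=os.getenv("POSTGRES_USER", "etl"),
        password=os.getenv("POSTGRES_PASSWORD", ""),
        dbname=os.getenv("POSTGRES_DB", "inspace"),
    )
```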
## Getting Started

To get started with this project, follow these steps:
- Install Git: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
- Clone the project repository.
- Change to the project directory (`InSpace`).
- Install Docker: https://docs.docker.com/engine/install/
- Install Docker Compose: https://docs.docker.com/compose/install/
- Run the ETL process: `docker-compose up -d --build`
- To check that the ETL completed successfully, inspect the application logs: `docker logs app_container`
## Grafana Integration

To see Postgres metrics in the Grafana dashboard, open http://localhost:3000/d/postgresmetrics or http://localhost:3000/d/postgresmetrics_v2.
## Alternative Approaches and Suggestions

When building an ETL pipeline as a custom Python application, there are several alternative approaches and tools worth considering to improve the development workflow and project management. Here are some recommendations:
Poetry is a modern Python packaging and dependency management tool that can simplify managing the project's dependencies. Consider using Poetry instead of a traditional `requirements.txt` file: it provides a more structured and Pythonic way to define and manage dependencies, making it easier to create a reproducible environment for the ETL project.
To get started with Poetry:
- Install Poetry by following the instructions on its website.
- Create a `pyproject.toml` file in the project's root directory and define the project dependencies using Poetry's `pyproject.toml` format.
- Use Poetry commands to manage the project's virtual environment, dependencies, and packaging.
By using Poetry, we can ensure a cleaner, more consistent, and more maintainable approach to managing our project's dependencies.
MyPy is a static type checker for Python that can help us catch type-related errors and improve code quality. By adding type hints to our custom Python ETL application and integrating MyPy into our workflow, we can enhance code readability and reduce potential bugs.
To integrate MyPy:
- Add type hints to our Python code using Python's type annotation syntax (see the sketch after this list).
- Install MyPy using Poetry or pip.
- Run MyPy against our codebase to check for type-related issues and enforce type checking in our development process.
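As an example of the kind of annotations MyPy checks, here is an illustrative helper; the function names and signatures are hypothetical, not taken from this repository.

```python
# Illustrative only: small, fully type-annotated helpers that MyPy can verify.
from datetime import datetime
from typing import Iterable, Tuple


def extract_domain(email: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def total_duration_seconds(sessions: Iterable[Tuple[datetime, datetime]]) -> float:
    """Sum the durations of (start, end) session pairs, in seconds."""
    return sum((end - start).total_seconds() for start, end in sessions)
```

Running `mypy app/` would then flag mistakes such as passing an `int` where a `str` email is expected.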
MyPy can provide valuable feedback and make the ETL codebase more robust and maintainable, especially as the project grows.
Pre-commit is a tool that helps us set up and manage pre-commit hooks for code formatting, linting, and other checks. By configuring pre-commit hooks for our ETL project, we can automate code quality checks, ensuring that all code commits meet specific standards before they are accepted into our repository.
To set up pre-commit:
- Install pre-commit using Poetry or pip.
- Create a `.pre-commit-config.yaml` file in the project's root directory to define the hooks we want to run.
- Install and configure the hooks specified in the configuration file.
By integrating pre-commit into our workflow, we can maintain code consistency, catch issues early, and improve collaboration among team members.
As our ETL process becomes more complex, consider transitioning from custom Python scripts to a workflow orchestration tool like Apache Airflow. Airflow provides a platform for managing, scheduling, and monitoring ETL workflows, making it easier to handle dependencies, retries, and error handling in a more robust manner.
To use Airflow:
- Install and set up Apache Airflow following its documentation.
- Define our ETL tasks as Airflow operators and create a directed acyclic graph (DAG) that represents the ETL workflow (a minimal sketch follows this list).
- Schedule and monitor our ETL processes using Airflow's web interface.
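A minimal DAG for this pipeline might look roughly like the sketch below, assuming Airflow 2.4+; `load_csv_to_mysql` and `aggregate_to_postgres` are placeholder callables, not functions that exist in this repository.

```python
# Hypothetical Airflow DAG sketching the two stages of this ETL pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_csv_to_mysql() -> None:
    """Placeholder: import the CSV files into MySQL."""


def aggregate_to_postgres() -> None:
    """Placeholder: compute the aggregates and store them in PostgreSQL."""


with DAG(
    dag_id="inspace_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # Airflow 2.4+; earlier versions use schedule_interval
    catchup=False,
) as dag:
    extract_load = PythonOperator(
        task_id="load_csv_to_mysql",
        python_callable=load_csv_to_mysql,
    )
    aggregate = PythonOperator(
        task_id="aggregate_to_postgres",
        python_callable=aggregate_to_postgres,
    )

    # Aggregation runs only after the raw data has been loaded.
    extract_load >> aggregate
```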
Airflow offers advantages such as task parallelization, scheduling flexibility, and better visibility into our ETL pipeline's status and performance. This can significantly improve the reliability and scalability of our ETL process.
Incorporating these alternative approaches and suggestions into our ETL project can lead to better code quality, dependency management, automation, and scalability, ultimately making our ETL process more efficient and maintainable.
## Conclusion

This project successfully implements an ETL pipeline for data processing, Dockerizes the application, and uses Docker Compose to deploy the MySQL and PostgreSQL databases and run the ETL code within containers. Optional components, such as the Grafana integration, were added for monitoring purposes.