This repository aims to demonstrate how to build data pipelines and systems, providing a better understanding of concepts such as ETL, data lakes, and their roles in a data system. The core technologies used are Mage and Docker, upon which we will build and integrate other services to enhance our exploration and understanding.
- Mage: This directory contains all the files and scripts necessary to execute the pipelines. For installation instructions, refer to the official Mage documentation or the first tutorial, which provides a detailed guide on installing Mage.
- Dockerfile: This file is used to run Mage. Note that it contains a few Spark-specific commands that are not necessary for projects without Spark interactions.
- Makefile: This is where the commonly used commands live (feel free to add your own).
- Docker-Compose: This file defines the services we want to run each time. At the moment it contains all the services I use, but you can adjust it to your needs.
To get a full understanding of how to build the repository from scratch, you can check the tutorial here, or you can simply clone the repo and start from there.
In the first tutorial/project, I guide you through building the repository with Mage as the main orchestrator. We leverage various technologies to create a local data lake with Iceberg and query the data using StarRocks; a minimal PySpark sketch of this setup follows the links below.
You can find the relevant article with a detailed guide here: Medium blog
The isolated code for that project is here: SparkDataLake
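The following is a minimal sketch of the idea, assuming the Iceberg Spark runtime is on the classpath and a MinIO bucket named `warehouse` backs the lake; the catalog name, endpoint, credentials, and table names are placeholders, not the exact values used in the tutorial.

```python
from pyspark.sql import SparkSession

# Sketch: a Spark session with an Iceberg catalog stored on MinIO (S3-compatible).
# The catalog name ("lake"), bucket, endpoint, and credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-data-lake-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Create a namespace and write a small DataFrame as an Iceberg table.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.demo")
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.writeTo("lake.demo.events").createOrReplace()

# The table is now regular Iceberg data + metadata in MinIO.
spark.sql("SELECT * FROM lake.demo.events").show()
```

Once the Iceberg table exists in the lake, StarRocks can reach the same data by registering an external Iceberg catalog that points at the warehouse, so no copies are needed for querying.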
In the second tutorial/project, we build on the structure implemented in the first tutorial and use the Nessie catalog to create an end-to-end pipeline, applying the medallion architecture to structure our data.
You can find the relevant article with a detailed guide here: Medium blog
The isolated code for that project is here: IcebergNessie
A small practical note for this project: you can either run the pipelines one by one, or just trigger the bronze one; if it succeeds, it will trigger the silver and gold pipelines. You can then see the results in MinIO and Nessie, and query them from any SQL engine you like. A sketch of the medallion flow against the Nessie catalog is shown below.
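As a rough illustration of the bronze/silver/gold flow against a Nessie-backed Iceberg catalog, here is a hedged PySpark sketch; the Nessie URI, MinIO settings, namespace names, and table names are placeholders rather than the tutorial's actual values.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch: Spark session using the Iceberg Nessie catalog; assumes an
# unauthenticated local Nessie server and a MinIO-backed warehouse.
spark = (
    SparkSession.builder
    .appName("nessie-medallion-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# One namespace per medallion layer.
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE NAMESPACE IF NOT EXISTS nessie.{layer}")

# Bronze: land the raw records as-is.
raw = spark.createDataFrame(
    [(1, "2024-01-01", 10.5), (2, "2024-01-01", None)],
    ["id", "event_date", "amount"],
)
raw.writeTo("nessie.bronze.orders").createOrReplace()

# Silver: clean the bronze data (drop rows with missing amounts).
silver = spark.table("nessie.bronze.orders").dropna(subset=["amount"])
silver.writeTo("nessie.silver.orders").createOrReplace()

# Gold: aggregate into a reporting-ready table.
gold = silver.groupBy("event_date").agg(F.sum("amount").alias("revenue"))
gold.writeTo("nessie.gold.daily_revenue").createOrReplace()
```

In the actual project each layer lives in its own Mage pipeline, with the bronze run triggering silver and gold on success as described above.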
3. DevOps for Data Engineers Part 1: Setting Up CI/CD Pipelines with Docker, Semantic Release, and Trunk-Based Development.
In the third tutorial we go a bit beyond the core Data Engineering projects and focus on how to set up a proper CI/CD pipeline. We use GitHub Actions for execution and leverage concepts such as semantic versioning and conventional commits to create a robust framework for managing code changes, versioning, and communication within our development workflow. A toy example of how conventional commits translate into version bumps follows the links below.
You can find the relevant article with explanations here: Substack
There is no isolated branch for this project since all the code can be found under the .github folder.
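To make the versioning idea concrete, here is a toy Python illustration (not the actual semantic-release implementation) of how conventional commit types typically map to semantic version bumps; the function name and commit messages are made up for the example.

```python
import re

def next_version(current: str, commit_messages: list[str]) -> str:
    """Toy mapping from conventional commits to a semantic version bump."""
    major, minor, patch = map(int, current.split("."))
    # Breaking changes (footer or "!" after the type) bump the major version.
    if any("BREAKING CHANGE" in m or re.match(r"^\w+(\(.+\))?!:", m) for m in commit_messages):
        return f"{major + 1}.0.0"
    # New features bump the minor version.
    if any(m.startswith("feat") for m in commit_messages):
        return f"{major}.{minor + 1}.0"
    # Bug fixes bump the patch version.
    if any(m.startswith("fix") for m in commit_messages):
        return f"{major}.{minor}.{patch + 1}"
    return current

print(next_version("1.4.2", ["fix: handle empty bronze partitions"]))     # 1.4.3
print(next_version("1.4.2", ["feat: add gold revenue pipeline"]))         # 1.5.0
print(next_version("1.4.2", ["feat!: drop legacy schema", "fix: typo"]))  # 2.0.0
```

In the actual workflow this decision is made for us by semantic-release inside the GitHub Actions pipeline; the snippet only illustrates why consistent commit messages matter.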