This project is a data platform for analyzing European soccer data. It is built with Python on a modern data stack, including Apache Airflow, Apache Spark, Confluent Kafka, and Apache Flink, all orchestrated with Docker and deployed on GCP, and aims to provide a scalable, efficient solution for processing and analyzing soccer data.
The platform ingests live soccer data from external APIs, processes it with Kafka and Flink, and stores it in a PostgreSQL database/warehouse. The data is then transformed and analyzed with Spark, and the results are visualized in Grafana dashboards.
The architecture consists of several key components:
- Kafka: Used for real-time data streaming and message brokering.
- Flink: Processes streaming data from Kafka.
- Spark: Performs batch processing and data transformation.
- Airflow: Orchestrates ETL workflows.
- PostgreSQL: Stores processed data.
- Grafana: Visualizes data through dashboards.
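These services are wired together via Docker Compose; the fragment below sketches the general shape such a `docker-compose.yml` might take (service names and image tags are assumptions for illustration, not the repository's actual file):

```yaml
# Illustrative sketch only; service names and image versions are assumptions.
services:
  kafka:
    image: confluentinc/cp-kafka:7.5.0
  flink-jobmanager:
    image: flink:1.17-java11
  spark:
    image: bitnami/spark:3.4
  airflow:
    image: apache/airflow:2.7.3
  postgres:
    image: postgres:15
  grafana:
    image: grafana/grafana:10.2.0
```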
Prerequisites:

- Docker
- Docker Compose
- Python 3.10+
- Java 11 (for Flink)
- **Clone the Repository**

  ```bash
  git clone https://github.com/yourusername/soccer-data-platform.git
  cd soccer-data-platform
  ```
- **Environment Configuration**

  Set up your environment variables in a `.env` file. Refer to `.env.example` for the required variables.
- **Build and Start Services**

  Use Docker Compose to build and start all services:

  ```bash
  docker-compose up --build
  ```
- **Initialize Database**

  Ensure the PostgreSQL database is initialized with the necessary tables using the init scripts in the repository.
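The repository's init scripts define the actual schema; purely as an illustration, a minimal table for processed match data might look like the following (the table and column names here are hypothetical, not the project's real schema):

```sql
-- Hypothetical example schema; defer to the repository's init scripts.
CREATE TABLE IF NOT EXISTS matches (
    match_id     BIGINT PRIMARY KEY,
    home_team    TEXT NOT NULL,
    away_team    TEXT NOT NULL,
    home_score   INT,
    away_score   INT,
    kickoff_utc  TIMESTAMPTZ,
    ingested_at  TIMESTAMPTZ DEFAULT now()
);
```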
- Airflow: Access the Airflow web UI at `http://localhost:<AIRFLOW_PORT>`.
- Grafana: Access Grafana dashboards at `http://localhost:<GRAFANA_PORT>`.
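The `<AIRFLOW_PORT>` and `<GRAFANA_PORT>` placeholders are filled from your `.env` file; as an illustration only (the values below are assumptions, not defaults from the repository):

```
AIRFLOW_PORT=8080
GRAFANA_PORT=3000
```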
- Kafka Producer: Produces live soccer data to Kafka topics.
- Flink Job: Processes data from Kafka and writes results back to Kafka.
- Spark Job: Transforms data and writes it to Google Cloud Storage.
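As a sketch of what the Kafka producer component might look like, assuming the `confluent-kafka` client (the topic name, broker address, and event fields below are illustrative assumptions, not taken from the repository):

```python
import json
import time


def make_event(match_id: int, minute: int, event_type: str) -> dict:
    """Build a minimal, hypothetical match-event payload."""
    return {
        "match_id": match_id,
        "minute": minute,
        "type": event_type,
        "produced_at": int(time.time()),
    }


def serialize_event(event: dict) -> bytes:
    """Serialize a match-event dict to UTF-8 JSON bytes for Kafka."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def main() -> None:
    # Requires a running broker; confluent-kafka is imported lazily so the
    # helpers above remain usable without it installed.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed address
    event = make_event(match_id=1, minute=42, event_type="goal")
    producer.produce("soccer-events", value=serialize_event(event))  # assumed topic
    producer.flush()
```

Call `main()` to actually produce; the serialization helpers can be exercised without a broker.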
- **Start Kafka Producer**

  ```bash
  docker-compose exec kafka-producer python producer.py
  ```

- **Run Spark Job**

  ```bash
  docker-compose exec spark spark-submit src/spark/jobs/main_job.py
  ```
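The actual transformations live in `src/spark/jobs/main_job.py`; purely as an illustration of the kind of aggregation such a job might perform (total goals per team), the logic is shown first as plain Python, with a hypothetical PySpark equivalent in `main()` (the column names, sample data, and GCS path are assumptions):

```python
def goals_per_team(events: list[dict]) -> dict[str, int]:
    """Count total goals per team from a list of event dicts.

    Mirrors, in plain Python, the kind of filter/groupBy/count
    a Spark batch job might run.
    """
    totals: dict[str, int] = {}
    for e in events:
        if e.get("type") == "goal":
            totals[e["team"]] = totals.get(e["team"], 0) + 1
    return totals


def main() -> None:
    # Hypothetical PySpark equivalent; requires pyspark and, for the write,
    # GCS credentials. Imported here so the helper above works without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("soccer-goals").getOrCreate()
    df = spark.createDataFrame(
        [("Arsenal", "goal"), ("Chelsea", "goal"), ("Arsenal", "goal")],
        ["team", "type"],
    )
    (df.filter(F.col("type") == "goal")
       .groupBy("team")
       .count()
       .write.mode("overwrite")
       .parquet("gs://your-bucket/goals_per_team"))  # assumed GCS path
```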
Contributions are welcome! Please read the contributing guidelines before submitting a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.