
European Soccer Data Platform

This project is in progress. Airflow modules still need to be added (there are unresolved dependency issues).

This project is a data platform for analyzing European soccer data. It leverages a modern data stack including Apache Airflow, Apache Spark, Confluent Kafka, and Apache Flink, containerized with Docker and deployable to GCP. The platform is built with Python and aims to provide a scalable, efficient solution for processing and analyzing soccer data.

Table of Contents

  • Project Overview
  • Architecture
  • Setup and Installation
  • Usage
  • Contributing
  • License

Project Overview

The platform ingests live soccer data from external APIs, processes it with Kafka and Flink, and stores it in a database/warehouse. The data is then transformed and analyzed with Spark, and the results are visualized in Grafana dashboards.

Architecture

The architecture consists of several key components:

  • Kafka: Used for real-time data streaming and message brokering (see the producer sketch after this list).
  • Flink: Processes streaming data from Kafka.
  • Spark: Performs batch processing and data transformation.
  • Airflow: Orchestrates ETL workflows.
  • PostgreSQL: Stores processed data.
  • Grafana: Visualizes data through dashboards.
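
As an illustration of how live data enters this pipeline, a minimal producer might look like the sketch below. It assumes the confluent-kafka Python client, a broker at localhost:9092, and a hypothetical live_matches topic; the actual producer in this repository may differ:

    import json
    from confluent_kafka import Producer

    # Broker address and topic name are illustrative assumptions.
    producer = Producer({"bootstrap.servers": "localhost:9092"})

    # A hypothetical live match event, serialized as JSON.
    event = {"match_id": 1234, "minute": 37, "home_score": 1, "away_score": 0}
    producer.produce("live_matches", value=json.dumps(event).encode("utf-8"))
    producer.flush()  # block until the message is delivered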

Setup and Installation

Prerequisites

  • Docker
  • Docker Compose
  • Python 3.10+
  • Java 11 (for Flink)

Installation Steps

  1. Clone the Repository

    git clone https://github.com/evanrosa/streaming-and-batch-project.git
    cd streaming-and-batch-project
  2. Environment Configuration

    Set up your environment variables in a .env file. Refer to .env.example for the required variables; an illustrative sketch appears after this list.

  3. Build and Start Services

    Use Docker Compose to build and start all services:

    docker-compose up --build
  4. Initialize Database

    Ensure the PostgreSQL database is initialized with the necessary tables; an example command appears after this list.

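For illustration only, a .env for this stack might look like the sketch below. Only AIRFLOW_PORT and GRAFANA_PORT are referenced elsewhere in this README; the remaining names are assumptions, and .env.example remains the source of truth:

    # Illustrative values; consult .env.example for the real variable names.
    AIRFLOW_PORT=8080
    GRAFANA_PORT=3000
    POSTGRES_USER=soccer
    POSTGRES_PASSWORD=change-me
    POSTGRES_DB=soccer

If the tables are not created automatically on first start, one way to apply a schema manually is with psql inside the container. The service name and script path below are assumptions:

    docker-compose exec postgres sh -c 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -f /docker-entrypoint-initdb.d/init.sql'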

Usage

Running the Platform

  • Airflow: Access the Airflow web UI at http://localhost:<AIRFLOW_PORT>.
  • Grafana: Access Grafana dashboards at http://localhost:<GRAFANA_PORT>.

Data Ingestion and Processing

  • Kafka Producer: Produces live soccer data to Kafka topics.
  • Flink Job: Processes data from Kafka and writes results back to Kafka (a sketch follows this list).
  • Spark Job: Transforms data and writes it to Google Cloud Storage.
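
A minimal sketch of such a Flink job, assuming PyFlink 1.16+ with the Kafka connector JAR on the Flink classpath; the topic names (live_matches, match_stats) and the transform are illustrative assumptions:

    from pyflink.common import Types, WatermarkStrategy
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import (
        KafkaOffsetsInitializer,
        KafkaRecordSerializationSchema,
        KafkaSink,
        KafkaSource,
    )

    env = StreamExecutionEnvironment.get_execution_environment()

    # Consume raw events from the (assumed) live_matches topic.
    source = (
        KafkaSource.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_topics("live_matches")
        .set_group_id("flink-soccer")
        .set_starting_offsets(KafkaOffsetsInitializer.latest())
        .set_value_only_deserializer(SimpleStringSchema())
        .build()
    )

    # Write processed results back to an (assumed) match_stats topic.
    sink = (
        KafkaSink.builder()
        .set_bootstrap_servers("kafka:9092")
        .set_record_serializer(
            KafkaRecordSerializationSchema.builder()
            .set_topic("match_stats")
            .set_value_serialization_schema(SimpleStringSchema())
            .build()
        )
        .build()
    )

    stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
    # Placeholder transform; the real job would parse and enrich events.
    stream.map(lambda raw: raw.upper(), output_type=Types.STRING()).sink_to(sink)

    env.execute("soccer-stream-job")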

Example Commands

  • Start Kafka Producer:

    docker-compose exec kafka-producer python producer.py
  • Run Spark Job:

    docker-compose exec spark spark-submit src/spark/jobs/main_job.py
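
For orientation, a Spark job along these lines could back main_job.py. The bucket paths and column names are assumptions, and the GCS connector must be configured separately for gs:// URIs:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("soccer-batch-transform").getOrCreate()

    # Illustrative input path; assumes raw match events landed as JSON.
    raw = spark.read.json("gs://example-bucket/raw/matches/")

    # Example transformation: total goals per team across all matches.
    summary = raw.groupBy("team").agg(F.sum("goals").alias("total_goals"))

    # Illustrative output path for the curated layer.
    summary.write.mode("overwrite").parquet("gs://example-bucket/curated/team_goals/")
    spark.stop()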

Contributing

Contributions are welcome! Please read the contributing guidelines before submitting a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
