E2E Data Engineering Project: High-Performance Financial Transaction Processing

🚧 Project Status: Under Active Development 🚧


A production-grade data engineering pipeline designed to handle massive-scale financial transaction processing with high throughput, fault tolerance, and real-time analytics capabilities.

🚀 System Capabilities

  • Processing Volume: 1.2 billion transactions per hour
  • Storage Capacity: Handles up to 1.852 PB of data (5-year projection with 20% YoY growth)
  • Compression Ratio: 5:1 using Snappy compression
  • High Availability: Triple replication for fault tolerance
  • Real-time Processing: Sub-second latency for transaction analytics
  • Scalable Architecture: Horizontally scalable Kafka-Spark infrastructure

📊 Data Specifications

Transaction Schema

case class Transaction (
    transactionId: String,      // 36 bytes
    amount: Float,              // 8 bytes
    userId: String,             // 12 bytes
    transactionTime: BigInt,    // 8 bytes
    merchantId: String,         // 12 bytes
    transactionType: String,    // 8 bytes
    location: String,           // 12 bytes
    paymentMethod: String,      // 15 bytes
    isInternational: Boolean,   // 5 bytes
    currency: String            // 5 bytes
)                              // Total: ~120 bytes per record
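
For the Python/Spark side, the same record can be described with a PySpark schema. The sketch below mirrors the field names above; the concrete Spark types (DoubleType for amount, LongType for transactionTime) and nullability flags are assumptions, not the project's actual spark_processor.py code.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType, BooleanType

# Assumed PySpark equivalent of the Transaction case class above.
transaction_schema = StructType([
    StructField("transactionId", StringType(), False),
    StructField("amount", DoubleType(), False),
    StructField("userId", StringType(), False),
    StructField("transactionTime", LongType(), False),
    StructField("merchantId", StringType(), False),
    StructField("transactionType", StringType(), False),
    StructField("location", StringType(), True),
    StructField("paymentMethod", StringType(), True),
    StructField("isInternational", BooleanType(), False),
    StructField("currency", StringType(), False),
])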

Storage Requirements

| Time Period | Raw Data  | With 3x Replication & 5:1 Compression |
| ----------- | --------- | ------------------------------------- |
| Per Hour    | 144 GB    | 86.4 GB                               |
| Per Day     | 3.456 TB  | 2.07 TB                               |
| Per Month   | 103.68 TB | 62.1 TB                               |
| Per Year    | 1.244 PB  | 745.2 TB                              |
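
These figures follow directly from the stated assumptions (1.2 billion transactions per hour at roughly 120 bytes each, triple replication, 5:1 Snappy compression); a quick back-of-the-envelope check in Python:

# Back-of-the-envelope sizing using the assumptions stated above.
TX_PER_HOUR = 1.2e9      # transactions per hour
BYTES_PER_TX = 120       # approximate serialized record size
REPLICATION = 3          # triple replication
COMPRESSION = 5          # 5:1 Snappy compression ratio

raw_gb_per_hour = TX_PER_HOUR * BYTES_PER_TX / 1e9                  # 144.0 GB
stored_gb_per_hour = raw_gb_per_hour * REPLICATION / COMPRESSION    # 86.4 GB
raw_tb_per_day = raw_gb_per_hour * 24 / 1e3                         # 3.456 TB
print(raw_gb_per_hour, stored_gb_per_hour, raw_tb_per_day)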

🛠 Technology Stack

  • Apache Kafka 3.8.1: Distributed streaming platform
    • 3 Controller nodes for cluster management
    • 3 Broker nodes for data distribution
    • Schema Registry for data governance
  • Apache Spark 3.5.0: Real-time data processing
    • 1 Master node
    • 3 Worker nodes
    • Structured Streaming for real-time analytics
  • Additional Components:
    • RedPanda Console for cluster monitoring
    • Docker & Docker Compose for containerization
    • Java 17 for transaction generation
    • Python for Spark processing

🚀 Getting Started

Prerequisites

  • Docker and Docker Compose
  • Java 17 or later
  • Maven
  • Python 3.8 or later

Quick Start

  1. Clone the repository:
git clone https://github.com/hassonor/kafka-spark-data-engineering.git
cd kafka-spark-data-engineering
  2. Start the infrastructure:
docker-compose up -d
  3. Build and run the transaction producer:
mvn clean package
java -jar target/transaction-producer.jar
  4. Start the Spark processor:
python spark_processor.py

Accessing Services

🏗 Architecture

graph LR
    subgraph "Producers"
        java[Java]
        python[Python]
    end

    subgraph "Kafka Ecosystem"
        kc[Kafka Controllers]
        kb[Kafka Brokers]
        sr[Schema Registry]
    end

    subgraph "Processing"
        master[Spark Master]
        w1[Worker]
        w2[Worker]
        w3[Worker]
    end

    subgraph "Analytics Stack"
        es[Elasticsearch]
        logstash[Logstash]
        kibana[Kibana]
    end

    %% Connections
    java -- Streaming --> kc
    python -- Streaming --> kc
    kc -- Streaming --> kb
    kb --> sr
    kb -- Streaming --> master
    master --> w1
    master --> w2
    master --> w3
    w1 --> es
    w2 --> es
    w3 --> es
    es --> logstash
    logstash --> kibana

🔧 Configuration

Kafka Configuration

  • Bootstrap Servers: localhost:29092,localhost:39092,localhost:49092
  • Topics:
    • financial_transactions: Raw transaction data
    • transaction_aggregates: Processed aggregations
    • transaction_anomalies: Detected anomalies
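
As a quick way to verify that the brokers and topics above are reachable, here is a minimal consumer sketch; it assumes the kafka-python client, which is not necessarily what the project itself uses.

from kafka import KafkaConsumer

# Hypothetical smoke test: read a few raw transactions from the topic.
consumer = KafkaConsumer(
    "financial_transactions",
    bootstrap_servers=["localhost:29092", "localhost:39092", "localhost:49092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.partition, message.offset, message.value[:80])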

Spark Configuration

  • Master URL: spark://spark-master:7077
  • Worker Resources:
    • Cores: 2 per worker
    • Memory: 2GB per worker
  • Checkpoint Directory: /mnt/spark-checkpoints
  • State Store: /mnt/spark-state
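
A minimal SparkSession sketch reflecting these settings; the application name and the exact builder options used by spark_processor.py are assumptions here.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("financial-transaction-processor")   # hypothetical app name
    .master("spark://spark-master:7077")
    .config("spark.executor.cores", "2")          # 2 cores per worker
    .config("spark.executor.memory", "2g")        # 2 GB per worker
    # Default checkpoint location for streaming queries; by default, stateful
    # operators keep their state-store files under the checkpoint path. How the
    # separate /mnt/spark-state mount is wired up is left to spark_processor.py.
    .config("spark.sql.streaming.checkpointLocation", "/mnt/spark-checkpoints")
    .getOrCreate()
)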

📈 Performance Optimization

Producer Optimizations

  • Batch Size: 64KB
  • Linger Time: 3ms
  • Compression: Snappy
  • Acknowledgments: 1 (leader acknowledgment)
  • Multi-threaded production: 3 concurrent threads
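
The project's producer is written in Java, but the same tuning maps one-to-one onto the Python clients; the sketch below uses kafka-python purely for illustration and is not the project's actual producer code.

from kafka import KafkaProducer

# Illustrative tuning only; mirrors the settings listed above.
producer = KafkaProducer(
    bootstrap_servers=["localhost:29092", "localhost:39092", "localhost:49092"],
    batch_size=64 * 1024,        # 64 KB batches
    linger_ms=3,                 # wait up to 3 ms to fill a batch
    compression_type="snappy",   # Snappy compression
    acks=1,                      # leader acknowledgment only
)
producer.send("financial_transactions", b'{"transactionId": "..."}')
producer.flush()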

Consumer Optimizations

  • Structured Streaming for efficient processing
  • Checkpointing for fault tolerance
  • State store for stateful operations
  • Partition optimization for parallel processing
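
Building on the SparkSession and transaction_schema sketches above, the Structured Streaming read path might look as follows; the windowed aggregation is illustrative rather than the project's actual query, and treating transactionTime as epoch milliseconds is an assumption.

from pyspark.sql.functions import col, from_json, window, sum as sum_

# Parse raw Kafka records with the assumed transaction_schema, then aggregate
# amounts per merchant in 1-minute windows (illustrative query only).
transactions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:29092,localhost:39092,localhost:49092")
    .option("subscribe", "financial_transactions")
    .load()
    .select(from_json(col("value").cast("string"), transaction_schema).alias("t"))
    .select("t.*")
)

aggregates = (
    transactions
    .withColumn("eventTime", (col("transactionTime") / 1000).cast("timestamp"))  # assumes epoch millis
    .withWatermark("eventTime", "1 minute")
    .groupBy(window(col("eventTime"), "1 minute"), col("merchantId"))
    .agg(sum_("amount").alias("totalAmount"))
)

query = (
    aggregates.writeStream
    .outputMode("update")
    .format("console")   # sink chosen for illustration
    .option("checkpointLocation", "/mnt/spark-checkpoints/aggregates")
    .start()
)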

🔍 Monitoring

  • Transaction throughput logging
  • Kafka cluster health monitoring via RedPanda Console
  • Spark UI for job tracking and performance metrics
  • Producer performance metrics with throughput calculation
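
For the producer-side throughput metric, the calculation amounts to records sent divided by elapsed wall-clock time; a trivial, hypothetical sketch (the real producer keeps its own counters):

import time

# Hypothetical throughput meter used to log transactions per second.
class ThroughputMeter:
    def __init__(self):
        self.start = time.monotonic()
        self.count = 0

    def record(self, n=1):
        self.count += n

    def rate(self):
        elapsed = time.monotonic() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0

meter = ThroughputMeter()
meter.record(100)
print(f"{meter.rate():,.0f} transactions/sec")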

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
