
End-to-End Data Engineering Pipeline designed to handle real-time stock market data! 📈💼

Introduction

In this Proof of Concept (PoC), we demonstrate an end-to-end data engineering pipeline for real-time stock market data using Apache Kafka. The project uses a simulated stock market app to produce data in real time, which is then consumed by an AWS-based ecosystem that stores, catalogs, and queries the data for further analysis.

This document walks through the technical implementation steps, from setting up Kafka on AWS EC2 instances to integrating with AWS Glue and Athena for querying the ingested data.


Architecture Overview

The architecture for this project consists of the following components:

  • Stock Market Data Simulator (Producer): A Python-based stock market simulation app generates real-time stock data and publishes it to Kafka.
  • Kafka (Message Broker): Kafka runs on an EC2 instance as the message broker, buffering and storing the real-time streams of data from the producer.
  • AWS (Consumers): The real-time stock data is consumed and stored in Amazon S3, cataloged with AWS Glue, and queried using Amazon Athena.

Architecture Diagram


Project Prerequisites

Environment Setup

  • EC2 Instance (Ubuntu/Amazon Linux): An AWS EC2 instance for running Kafka.

  • Java: Ensure that Java is installed (required by Kafka).

    # Amazon Linux
    sudo yum install java-1.8.0-openjdk
    # Ubuntu
    sudo apt install openjdk-8-jdk
    java -version
  • Apache Kafka: Download and install Kafka on your EC2 instance.

    wget https://downloads.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz
    tar -xvf kafka_2.13-3.8.0.tgz
    cd kafka_2.13-3.8.0

Kafka Setup

Step 1: Starting ZooKeeper

This setup runs Kafka in ZooKeeper mode, so ZooKeeper must be running before the broker starts. Start it with the following command (it stays in the foreground, so keep this terminal open):

bin/zookeeper-server-start.sh config/zookeeper.properties

Step 2: Starting Kafka Broker

In a second terminal session, start the Kafka broker:

bin/kafka-server-start.sh config/server.properties

Make sure to modify server.properties so the broker advertises the public IP of the EC2 instance; otherwise external clients cannot reach it.

sudo vi config/server.properties
# Set 'advertised.listeners' to the public IP of the EC2 instance, e.g.:
# advertised.listeners=PLAINTEXT://<EC2_Public_IP>:9092

Kafka Topic Creation

To create a topic for our stock market data, use the following command:

bin/kafka-topics.sh --create --topic stock_market_data --bootstrap-server <EC2_Public_IP>:9092 --replication-factor 1 --partitions 1
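
To confirm the topic was created from code rather than the CLI, a quick check with the kafka-python admin client (a library choice assumed here, not prescribed by the project) might look like this:

# Sketch: list topics via kafka-python (pip install kafka-python).
# The broker address placeholder matches the commands above.
from kafka import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="<EC2_Public_IP>:9092")
print(admin.list_topics())  # expect 'stock_market_data' in the output
admin.close()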

Producer Setup

# Start the console producer to send test messages to the topic
bin/kafka-console-producer.sh --topic stock_market_data --bootstrap-server <EC2_Public_IP>:9092
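
The console producer is useful for manual testing. For the simulated stock feed itself, a minimal Python producer sketch is shown below; it assumes the kafka-python package, and the ticker symbols and random prices are illustrative placeholders rather than the project's actual simulator.

# Producer sketch using kafka-python (pip install kafka-python).
# Symbols and prices are illustrative placeholders, not the real simulator.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2_Public_IP>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

symbols = ["AAPL", "MSFT", "GOOG"]
while True:
    tick = {
        "symbol": random.choice(symbols),
        "price": round(random.uniform(100, 500), 2),
        "timestamp": time.time(),
    }
    producer.send("stock_market_data", value=tick)
    time.sleep(1)  # one simulated tick per second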

Consumer Setup

# Start the console consumer to verify that messages arrive
bin/kafka-console-consumer.sh --topic stock_market_data --bootstrap-server <EC2_Public_IP>:9092

AWS Integration for Data Processing

Amazon S3 (Storage)

The consumer will push the stock market data into an S3 bucket for further analysis.
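
A minimal sketch of such a consumer, assuming kafka-python and boto3 with a placeholder bucket name, could look like this:

# Consumer sketch: read ticks from Kafka and write them to S3 as JSON.
# Assumes kafka-python and boto3; bucket and key prefix are placeholders.
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stock_market_data",
    bootstrap_servers="<EC2_Public_IP>:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for count, message in enumerate(consumer):
    # One object per record keeps the sketch simple; batch writes in practice.
    s3.put_object(
        Bucket="stock-market-data-bucket",  # placeholder bucket name
        Key=f"stock_market/tick_{count}.json",
        Body=json.dumps(message.value),
    )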

AWS Glue (Data Cataloging)

Set up an AWS Glue crawler to scan the S3 bucket and catalog the stock market data. This enables us to query the data using Amazon Athena.
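
The crawler can be created in the AWS console, or scripted; a boto3 sketch (the role ARN, database name, and S3 path below are placeholders) is:

# Sketch: create and start a Glue crawler over the S3 prefix via boto3.
# The role ARN, database name, and S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="stock-market-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="stock_market_db",
    Targets={"S3Targets": [{"Path": "s3://stock-market-data-bucket/stock_market/"}]},
)
glue.start_crawler(Name="stock-market-crawler")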

Amazon Athena (Query Engine)

Use Amazon Athena to query the stock market data stored in S3. With Glue providing the schema, Athena allows us to run SQL queries on the ingested real-time data.
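
Queries can be run from the Athena console or programmatically; a boto3 sketch (database, table, and output location are placeholders matching the Glue example above) is:

# Sketch: run a SQL query against the Glue-cataloged table via boto3.
# Database name, table name, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT symbol, AVG(price) AS avg_price "
        "FROM stock_market_data GROUP BY symbol"
    ),
    QueryExecutionContext={"Database": "stock_market_db"},
    ResultConfiguration={
        "OutputLocation": "s3://stock-market-data-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])  # poll get_query_execution for status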

Conclusion

This PoC demonstrates a real-time data pipeline for stock market data, using Kafka as the backbone for streaming and AWS services for storage, cataloging, and querying. Kafka provides high-throughput, fault-tolerant data streams, while AWS Glue and Athena enable scalable, serverless data analytics.
