diff --git a/README.md b/README.md index 5d0fcb9..60a57d8 100644 --- a/README.md +++ b/README.md @@ -9,19 +9,23 @@ The Coinbase Real-time Data Pipeline is designed to retrieve real-time cryptocur ## Architecture -![architecture](https://i.imgur.com/Be7RcI2.jpeg) +![architecture](https://i.imgur.com/w4dNGpx.png) The real-time data pipeline project facilitates the collection, processing, storage, and visualization of cryptocurrency market data from Coinbase. It comprises key components: - **Coinbase WebSocket API**: This serves as the initial data source, providing real-time cryptocurrency market data streams, including trades and price changes. - **Java Kafka Producer Microservice**: To efficiently manage data, a Java-based microservice functions as a Kafka producer. It collects data from the Coinbase WebSocket API and sends it to the Kafka broker. - **Kafka Broker**: Kafka, an open-source distributed event streaming platform, forms the core of the data pipeline. It efficiently handles high-throughput, fault-tolerant, real-time data streams, receiving data from the producer and making it available for further processing. +- **Go Kafka Consumer**: Implemented in Go, the Kafka consumer pulls raw data from Kafka topics and stores them directly into HDFS (Hadoop Distributed File System). This step ensures robust and scalable storage of raw data. +- **HDFS for Raw Data Storage**: HDFS provides reliable storage for large volumes of raw data. It is ideal for streaming data applications due to its fault tolerance, high throughput, and scalability. The Go Kafka consumer writes raw data directly to HDFS, enabling seamless integration into the pipeline. - **Spark Structured Streaming for Data Processing**: Apache Spark, a powerful in-memory data processing framework, is chosen for real-time data processing. With Spark Structured Streaming, real-time transformations and computations are applied to incoming data streams, ensuring data is ready for storage. - **Cassandra Database**: For long-term data storage, Apache Cassandra, a highly scalable NoSQL database known for its exceptional write and read performance, is employed. Cassandra serves as the solution for storing historical cryptocurrency market data. - **Grafana for Data Visualization**: To make data easily understandable, Grafana, an open-source platform for monitoring and observability, is utilized. Grafana queries data from Cassandra to create compelling real-time visualizations, providing insights into cryptocurrency market trends. ## Deployment + To ensure the smooth orchestration and management of all project components on a single machine, Kubernetes comes into play. Leveraging Minikube, a lightweight Kubernetes distribution tailored for local development and testing, we can simulate a production-like environment right on our local system. This deployment approach simplifies the process of setting up and experimenting with our real-time data pipeline. Moreover, with minor adjustments, this configuration can be smoothly migrated to actual Kubernetes clusters, like GKE or EKE, offering scalability and making it production-ready.