An end-to-end real-time data streaming pipeline that leverages Kafka and Spark Streaming to analyze social media sentiment trends.
Project still in progress...
Data Sources:
- Twitter and Reddit are the current data sources; additional sources can be added.
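The repository does not show how the sources are polled; the sketch below is one possible approach for the Reddit side using PRAW, with placeholder credentials and an illustrative subreddit. Everything here is an assumption, not the project's actual ingestion code.

```python
import praw  # assumed client library for the Reddit source; not confirmed by the repo

# Placeholder credentials; supply real ones via environment variables or config in practice.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sentiment-pipeline",
)

# Stream new comments as they arrive; each one becomes a record to publish to Kafka.
for comment in reddit.subreddit("all").stream.comments(skip_existing=True):
    print({"source": "reddit", "text": comment.body, "created_utc": comment.created_utc})
```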
Stream Data Ingestion:
- Apache Kafka handles the incoming data streams. Each source is published to its own topic (`twitter-topic` and `reddit-topic`); see the producer sketch below.
- Apache ZooKeeper manages the Kafka brokers.
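A minimal producer sketch using the kafka-python client, assuming a broker at `localhost:9092` and JSON-encoded messages; the record fields are illustrative, not the project's actual schema.

```python
import json

from kafka import KafkaProducer  # kafka-python client (an assumption, not confirmed by the repo)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Illustrative record; the project's actual message format may differ.
record = {"source": "twitter", "text": "Example post", "created_at": "2024-01-01T00:00:00Z"}

# Route the record to the topic that matches its source.
topic = "twitter-topic" if record["source"] == "twitter" else "reddit-topic"
producer.send(topic, value=record)
producer.flush()
```

Keeping one topic per source lets downstream consumers subscribe to a single platform or to both, without filtering inside the stream.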
Stream Processing:
- Apache Spark processes the streaming data using Spark Structured Streaming (a minimal reader sketch follows this list).
- Processed data is stored in:
  - Object Storage for long-term persistence.
  - Redis for short-term persistence.
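A minimal Structured Streaming reader, assuming a broker at `localhost:9092`, JSON messages, and an illustrative schema; the real job would score sentiment and persist results to object storage and Redis rather than printing to the console. Running it also requires the `spark-sql-kafka-0-10` package on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("sentiment-stream").getOrCreate()

# Illustrative message schema; the project's actual fields may differ.
schema = StructType([
    StructField("source", StringType()),
    StructField("text", StringType()),
    StructField("created_at", StringType()),
])

# Subscribe to both topics and parse the JSON payloads.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "twitter-topic,reddit-topic")
    .load()
)
posts = raw.select(from_json(col("value").cast("string"), schema).alias("post")).select("post.*")

# Write parsed records to the console; the real job would score sentiment
# and write to object storage and Redis instead.
query = posts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```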
Real-Time Dashboard:
- Redis serves as a short-term cache for fast data retrieval.
- Flask handles backend operations for the dashboard (a minimal endpoint sketch follows this list).
- The frontend is built with HTML, CSS, and JavaScript.
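One way the Flask backend could serve cached results, assuming Redis runs on `localhost:6379` and the Spark job writes the latest score under a key such as `sentiment:twitter`; the route and key layout are illustrative, not taken from the repository.

```python
import redis
from flask import Flask, jsonify

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed Redis address

@app.route("/api/sentiment/<source>")
def latest_sentiment(source):
    # Assumes the Spark job stores the latest score under a key like "sentiment:twitter".
    score = cache.get(f"sentiment:{source}")
    return jsonify({"source": source, "score": score})

if __name__ == "__main__":
    app.run(port=5000)
```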
End Users:
- Users access the dashboard via a web interface served by Nginx.
Running the Project:
- Ensure Docker is installed and running.
- Run `docker-compose up --build` to build and start all services.