A real-time text analysis dashboard for subreddits.
The project implements an event-driven microservice architecture for streaming subreddit data: the data is ingested by Kafka, processed by Spark, and then stored in Cassandra (the batch storage) and in Redis (the speed layer), from which it is analyzed in Dash. Each component is its own microservice. To keep up with trending hashtags and topics, the dashboard shows the keywords, entities, sentiment, emotions, and frequent words for a given hashtag or topic.
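The data flow described above can be sketched with in-memory stand-ins. This is a toy illustration, not the project's code: a `queue.Queue` plays the Kafka topic, a list plays the Cassandra batch store, and a dict plays the Redis speed layer. All function names (`ingest`, `process`, `run_once`) are hypothetical, and the word-count step is only a placeholder for the real Spark processing.

```python
import json
import queue
from collections import Counter

# In-memory stand-ins (assumptions for illustration only):
# a Queue plays the Kafka topic, a list plays the Cassandra table,
# and a dict plays the Redis speed layer.
topic = queue.Queue()
batch_store = []
speed_layer = {}


def ingest(subreddit, text):
    """Producer side: publish one raw event to the topic as JSON."""
    topic.put(json.dumps({"subreddit": subreddit, "text": text}))


def process(event):
    """Processing step: attach lowercase word frequencies to the event."""
    words = [w.strip(".,!?").lower() for w in event["text"].split()]
    event["word_counts"] = dict(Counter(w for w in words if w))
    return event


def run_once():
    """Consumer side: drain the topic, process each event, write to both stores."""
    while not topic.empty():
        event = process(json.loads(topic.get()))
        batch_store.append(event)  # batch storage keeps every processed event
        running = speed_layer.setdefault(event["subreddit"], Counter())
        running.update(event["word_counts"])  # running totals for the dashboard


ingest("python", "Kafka feeds Spark. Spark feeds Redis!")
run_once()
print(speed_layer["python"]["spark"])  # prints 2
```

The point of the two writes in `run_once` is the batch/speed split: the batch store keeps the full history, while the speed layer holds only the small aggregate the dashboard needs to read quickly.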
- SparkStream is a Python package (SparkStream-pypi): a simple Spark streaming handler that listens to a Kafka topic, processes the data, and stores it in Cassandra and Redis. It is accessible via an API and deployed in a Docker container. SparkStream-github
- Named-Entity-Recognition is a service that extracts named entities from text with spaCy. It is accessible via an API and deployed in a Docker container. NER-github
- Keyword-Extraction is a service that extracts keywords from text with YAKE. It is accessible via an API and deployed in a Docker container. Keyword-github
- Sentiment-Model is a service that predicts a tweet's sentiment. It is developed with TensorFlow Extended and deployed with TensorFlow Serving. Sentiment-github
- Emotion-Model is a service that predicts a tweet's emotions. It is developed with TensorFlow Extended and deployed with TensorFlow Serving. Emotion-github
- Dashboard is a GUI for graphs and text analysis, built with Dash. Dashboard-github
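As an illustration of how a client might query one of the TensorFlow Serving models, here is a minimal sketch that uses TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict`). The host, the model name `sentiment`, and the assumption that the model accepts raw strings as `instances` are all hypothetical; the actual input signature depends on how the model was exported. Port 8501 is TensorFlow Serving's default REST port.

```python
import json
import urllib.request


def build_predict_request(texts, host="localhost", port=8501, model="sentiment"):
    """Build a POST request for TensorFlow Serving's REST predict endpoint.

    The model name and the raw-string input format are assumptions.
    """
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(texts, **kwargs):
    """Send the request and return the 'predictions' list from the response."""
    request = build_predict_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["predictions"]


# Building the request needs no running server; calling predict() does.
req = build_predict_request(["I love this subreddit"])
print(req.full_url)  # prints http://localhost:8501/v1/models/sentiment:predict
```

With a model server running, `predict(["some text"])` would return one prediction per input string.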
Technologies:
- Asyncpraw
- Apache Kafka
- Apache Spark
- Apache Cassandra
- Redis
- Dash
- TensorFlow Extended
- FastAPI
- spaCy
- NLTK
- YAKE
- Docker
Data:
- Trending subreddits come from the trend places endpoint of the PRAW API.
- Subreddit streaming data comes from the stream endpoint of the Asyncpraw API.
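A minimal sketch of reading a subreddit stream with Asyncpraw might look like the following. It assumes the stream of interest is the comment stream (`subreddit.stream.comments`); the source does not say whether comments or submissions are used. The helper names `comment_to_record` and `stream_comments` and the record fields are hypothetical, and the credentials are placeholders to be supplied by the caller.

```python
from types import SimpleNamespace


def comment_to_record(subreddit_name, comment):
    """Flatten a comment object into a plain dict ready to publish downstream."""
    return {
        "id": comment.id,
        "subreddit": subreddit_name,
        "text": comment.body,
        "created_utc": comment.created_utc,
    }


async def stream_comments(subreddit_name, handle, **credentials):
    """Pass every new comment in a subreddit to `handle` as a dict."""
    import asyncpraw  # imported lazily so the helper above works without it

    # credentials: client_id, client_secret, user_agent
    reddit = asyncpraw.Reddit(**credentials)
    try:
        subreddit = await reddit.subreddit(subreddit_name)
        # skip_existing=True yields only comments posted after the stream starts
        async for comment in subreddit.stream.comments(skip_existing=True):
            handle(comment_to_record(subreddit_name, comment))
    finally:
        await reddit.close()


# Offline check of the record shape with a stand-in comment object.
fake = SimpleNamespace(id="abc123", body="hello", created_utc=1700000000.0)
print(comment_to_record("python", fake))
```

To run the stream for real, call `asyncio.run(stream_comments("python", print, client_id=..., client_secret=..., user_agent=...))` with valid Reddit API credentials; in the pipeline, `handle` would be replaced by a function that publishes the record to Kafka.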