# Starbucks Twitter Sentiment Analysis

**Technologies used:** Apache Kafka, Spark Structured Streaming, Confluent Cloud, Databricks, Delta Lake, Spark NLP

All details of the project are described HERE.

## 1. Aim

The aim of the Starbucks Twitter Sentiment Analysis project is to build an end-to-end Twitter data streaming pipeline to analyze brand sentiment.

## 2. Environment Setup

Set up the virtual environment:

```bash
pip install virtualenv
virtualenv --version  # test your installation
virtualenv ccloud-venv
```
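Then activate it so the dependencies install into the environment (on Linux/macOS, as assumed by the run step in Step 5):

```bash
source ccloud-venv/bin/activate
```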

### Step 1. Twitter API Credentials

As in the previous post, we need Twitter API credentials. After obtaining them, save the credentials in a `.env` file. Make sure to add `.env` to `.gitignore` so it is never committed.

```
# .env
CONSUMER_KEY = "<api key>"
CONSUMER_SECRET = "<api secret>"
ACCESS_TOKEN_KEY = "<access key>"
ACCESS_TOKEN_SECRET = "<access secret>"
```
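A minimal sketch of loading these credentials in Python, assuming the `python-dotenv` package (`pip install python-dotenv`); the variable names match the `.env` file above:

```python
# Load the Twitter credentials from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

CONSUMER_KEY = os.environ["CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["CONSUMER_SECRET"]
ACCESS_TOKEN_KEY = os.environ["ACCESS_TOKEN_KEY"]
ACCESS_TOKEN_SECRET = os.environ["ACCESS_TOKEN_SECRET"]
```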

### Step 2. Confluent Cloud

Confluent Cloud is a resilient, scalable streaming data service based on Apache Kafka®, delivered as a fully managed service. It lets users manage cluster resources easily.

#### 2-1. Create a Confluent Cloud account and Kafka cluster

First, create a free Confluent Cloud account and a Kafka cluster in Confluent Cloud. I created a Basic cluster, which supports single-zone availability, with AWS as the cloud provider.

#### 2-2. Create a Kafka topic named tweet_data with 2 partitions

From the navigation menu, click Topics, and on the Topics page, click Create topic. I set the topic name to `tweet_data` with 2 partitions. Once created, the topic on the Kafka cluster is available for use by producers and consumers.
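If you prefer the command line, the same topic can be created with the Confluent CLI (assuming it is installed, you are logged in, and the cluster is selected):

```bash
confluent kafka topic create tweet_data --partitions 2
```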

### Step 3. Confluent Cloud API Credentials

**API keys**

From the navigation menu, click API keys under Data Integration. If no API key is available, click Add key to generate a new one (API_KEY, API_SECRET), and make sure to save it somewhere safe.

**HOST: Bootstrap server**

From the navigation menu, click Cluster settings under Cluster Overview. The Identification block contains the bootstrap server information; save it somewhere safe. It should look similar to `pkc-w12qj.ap-southeast-1.aws.confluent.cloud:9092`.

```
HOST = pkc-w12qj.ap-southeast-1.aws.confluent.cloud
```

Save these values at `$HOME/.confluent/python.config`:

```bash
vi $HOME/.confluent/python.config
```

Press `i` and paste the template below:

```
# Kafka
bootstrap.servers={HOST}:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username={API_KEY}
sasl.password={API_SECRET}
```

Then replace `{HOST}`, `{API_KEY}`, and `{API_SECRET}` with the values you saved earlier in this step. Press `Esc`, then type `:wq` to save the file and quit.
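Once the file is saved, here is a minimal sketch of loading it and smoke-testing the connection with the `confluent-kafka` client (`pip install confluent-kafka`); the parsing helper is an illustrative stand-in, not part of the library:

```python
# Parse the librdkafka-style config file and send a test message.
import os
from confluent_kafka import Producer

def read_config(path):
    """Read key=value lines, skipping blanks and comments."""
    conf = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                conf[key.strip()] = value.strip()
    return conf

producer = Producer(read_config(os.path.expanduser("~/.confluent/python.config")))
producer.produce("tweet_data", value=b'{"test": "hello"}')
producer.flush()  # block until delivery is confirmed
```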

### Step 4. Create a Databricks Cluster

Check HERE for the procedure for creating a Databricks cluster.
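Once the cluster is up, a notebook can subscribe to the topic with Spark Structured Streaming. A minimal sketch, with `{HOST}`, `{API_KEY}`, and `{API_SECRET}` standing in for the Step 3 values (`spark` is the session Databricks provides):

```python
# Read the tweet stream from Confluent Cloud. Note the Java Kafka client
# uses sasl.mechanism (singular), unlike librdkafka's sasl.mechanisms.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "{HOST}:9092")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config",
              'org.apache.kafka.common.security.plain.PlainLoginModule '
              'required username="{API_KEY}" password="{API_SECRET}";')
      .option("subscribe", "tweet_data")
      .load())

# Kafka values arrive as binary; cast to string to recover the tweet JSON.
tweets = df.selectExpr("CAST(value AS STRING) AS tweet_json")
```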

### Step 5. Modifications for Twitter Data Ingestion

```dockerfile
# Dockerfile

FROM python:3.7-slim

COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -U -r /tmp/requirements.txt

COPY producer/ /producer

# Exec-form CMD must be one logical line; use backslash continuations.
CMD [ "python3", "producer/producer.py", \
      "-f", "/root/.confluent/librdkafka.config", \
      "-t", "<your-kafka-topic-name>" ]
```

**Build and run the Docker container**

```bash
# cd <your-project-folder>
# source ./ccloud-venv/bin/activate

bash run.sh
```
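The repo's `run.sh` does the build-and-run; a plausible shape, assuming the Dockerfile above and an illustrative image tag:

```bash
# Illustrative run.sh: build the image, then run it with the Twitter
# credentials and the Confluent config mounted where the CMD expects them.
docker build -t tweet-producer .
docker run --rm \
  --env-file .env \
  -v "$HOME/.confluent:/root/.confluent" \
  tweet-producer
```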

## Final Sentiment Analysis

Click here to check the presentation file
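For context, the sentiment scoring itself can be done with Spark NLP; a minimal sketch using one of its published pretrained pipelines (assuming `spark-nlp` is installed on the cluster):

```python
# Score a sample text with Spark NLP's pretrained sentiment pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # on Databricks the session usually already exists
pipeline = PretrainedPipeline("analyze_sentiment", lang="en")

result = pipeline.annotate("The new Starbucks oat latte is fantastic!")
print(result["sentiment"])  # e.g. ['positive']
```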