
Real-time Twitter Data Analysis using Flume, Kafka and Spark


Authors

  • 👨‍💻 Vipul Tiwari
  • 👩‍💻 Roline Stapny Saldanha
  • 👨‍💻 Devi Sandeep Endluri
  • 👨‍💻 Kartik Venkataraman
  • 👩‍💻 Manseerat Batra

Architecture

Flume: Flume connects to Twitter, ingests the streaming tweets, cleans them, and sends them to Kafka.

Kafka: Buffers the messages until Spark consumes them.

Spark Streaming: Consumes the messages from Kafka, processes them, and sends the results to the Flask server.

Flask: A Python web framework; the Flask app receives the processed data from Spark and renders the dashboards.
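As a rough illustration of the Spark → Flask hand-off, one common pattern is to POST each micro-batch's aggregates to a Flask endpoint. The endpoint URL and payload shape below are assumptions for illustration; the project's actual wiring lives in SparkStreaming/spark-kafka.py and may differ.

```python
# Illustrative only: push per-batch aggregates from Spark Streaming to Flask over HTTP.
# The endpoint URL and payload shape are assumptions, not taken from this repository.
import requests

def send_to_flask(rdd):
    counts = rdd.collect()  # e.g. [(hashtag, count), ...] from an upstream reduceByKey
    payload = {"labels": [k for k, _ in counts], "values": [v for _, v in counts]}
    requests.post("http://localhost:5000/update_dashboard", json=payload)

# hashtag_counts.foreachRDD(send_to_flask)  # wire it up on a DStream produced by the job
```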

Dashboard

Instructions

Required packages:

Install all the packages listed in requirements.txt, for example with `pip install -r requirements.txt`.

Kafka:

  1. Go to the Kafka directory.
  2. Run ZooKeeper using the command: `nohup bin/zookeeper-server-start.sh config/zookeeper.properties > ~/zookeeper-logs &`
  3. Run Kafka using the command: `nohup bin/kafka-server-start.sh config/server.properties > ~/kafka-logs &`
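
Optionally, create the topic before wiring up Flume. Below is a minimal sketch using the kafka-python package (an assumption; it may or may not be in requirements.txt), with the topic name taken from the spark-submit example further down.

```python
# One-off setup: create the Kafka topic the pipeline publishes to.
# Assumes the kafka-python package and a broker listening on localhost:9092.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="twitter_stream_new", num_partitions=1, replication_factor=1)])
admin.close()
```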

Flume:

  1. Go to the Flume directory (for example, `cd apache-flume-1.9.0-bin/`).
  2. Run the Flume agent using the command: `bin/flume-ng agent --conf conf --conf-file "/home/ubuntu/flume_twitter_to_kafka.conf" --name agent1`
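
To confirm that the Flume agent is actually delivering tweets into Kafka, a quick check with kafka-python (again an assumption, not part of the documented setup):

```python
# Print a few raw messages from the topic to confirm the Flume -> Kafka leg is working.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "twitter_stream_new",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # give up if nothing arrives for 10 seconds
)
for message in consumer:
    print(message.value[:200])  # first 200 bytes of each tweet payload
consumer.close()
```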

Spark Streaming:

  1. Download Spark 2.4.5 (Spark Streaming ships with the Spark distribution).
  2. Extract the tar file into your local workspace.
  3. Set this directory path as SPARK_HOME in the environment variables.
  4. Set the same path as HADOOP_HOME in the environment variables.
  5. Add SPARK_HOME/bin to the PATH variable.
  6. Make sure JAVA_HOME points to a JDK 1.8 installation.
  7. Place the spark-streaming-kafka-assembly_2.11-1.6.0.jar file from this project in your local workspace.
  8. Run the job with a command like: `bin\spark-submit --jars spark-streaming-kafka-assembly_2.11-1.6.0.jar D:\Spring2020\csce678\project\code\cloudproject\SparkStreaming\spark-kafka.py 3.22.26.9:9092 twitter_stream_new D:\Spring2020\csce678\project\code\cloudproject\geo_tweets.txt`
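
In the command above, the three positional arguments are the Kafka broker list, the topic name, and the path to geo_tweets.txt. Below is a minimal skeleton of how a job of this shape wires up the Kafka direct stream; the real processing lives in SparkStreaming/spark-kafka.py, so this is only a sketch.

```python
# Illustrative skeleton only: parse the positional arguments and open a Kafka direct stream.
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka assembly jar

brokers, topic, geo_file = sys.argv[1], sys.argv[2], sys.argv[3]

sc = SparkContext(appName="TwitterStreamAnalysis")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
tweets = stream.map(lambda kv: kv[1])  # each Kafka record's value is the tweet payload
tweets.count().pprint()                # placeholder for the project's real processing

ssc.start()
ssc.awaitTermination()
```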

Flask:

  1. Source the flask.rc file in this project: `source flask.rc`
  2. Run `flask run`; by default this starts the application at localhost:5000.