PB_Twitter Analysis_Spring-2017

Phase1:

Main Requirements:

• Collect tweets in JavaScript Object Notation (JSON) format (at least 100K records).

• Find the list of top ten used hashtags in your collection.

• Create a directory in HDFS for each hashtag from the top ten hashtag list.

• Create two additional directories, “Others” and “None”, and store the tweets in files in HDFS. If a tweet contains a hashtag from the top ten list, store the tweet in that hashtag’s directory. If a tweet contains one or more hashtags, but none of them are in the top ten list, store the tweet in the “Others” directory. If a tweet does not contain a hashtag, store it in the “None” directory.
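The hashtag ranking and the directory rule can be sketched in plain Python. This is only an illustration, not the project's code: it assumes each tweet is already parsed into a dict with the standard `entities.hashtags[].text` layout, and the helper names (`hashtags_of`, `top_hashtags`, `target_dirs`) are hypothetical. Lowercasing hashtags and returning every matching top-ten directory for a tweet are also assumptions; writing the files to HDFS is left out.

```python
from collections import Counter


def hashtags_of(tweet):
    """Return the lowercased hashtag texts of one parsed tweet (assumed layout)."""
    tags = (tweet.get("entities") or {}).get("hashtags", [])
    return [t["text"].lower() for t in tags]


def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags in the collection."""
    counts = Counter(tag for tw in tweets for tag in hashtags_of(tw))
    return [tag for tag, _ in counts.most_common(n)]


def target_dirs(tweet, top):
    """Return the directory names a tweet belongs to under the rules above."""
    tags = hashtags_of(tweet)
    if not tags:
        return ["None"]            # tweet has no hashtag at all
    hits = sorted(set(tags) & set(top))
    return hits or ["Others"]      # hashtags present, but none in the top list
```

With `top = top_hashtags(tweets)`, calling `target_dirs(tweet, top)` for each tweet yields the directory (or directories) the tweet should be written to.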

Extra Requirement:

• Implement a function that counts the number of times a keyword appears in one of two tweet JSON attributes (text and hashtags) across all 12 directories that were created on HDFS: int countWord(String keyword, String attr)
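A rough Python counterpart of `countWord` is sketched below. It is an assumption-laden illustration rather than the project's implementation: the extra `tweets` parameter stands in for the tweets read from the 12 HDFS directories, matching is case-insensitive and on whole words, and hashtags are read from `entities.hashtags[].text`.

```python
import re


def count_word(keyword, attr, tweets):
    """Count whole-word, case-insensitive matches of keyword in one attribute.

    attr is "text" or "hashtags"; tweets is an iterable of parsed tweet dicts.
    """
    kw = keyword.lower()
    total = 0
    for tw in tweets:
        if attr == "text":
            # Split the tweet text into word tokens and count exact matches.
            words = re.findall(r"\w+", tw.get("text", "").lower())
            total += words.count(kw)
        elif attr == "hashtags":
            tags = (tw.get("entities") or {}).get("hashtags", [])
            total += sum(1 for t in tags if t["text"].lower() == kw)
        else:
            raise ValueError("attr must be 'text' or 'hashtags'")
    return total
```

Because the text is tokenized on word characters, a hashtag written inside the tweet text (for example `#spark`) is also counted as the word `spark` when `attr` is `"text"`.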

PPT: https://github.com/saijyothi9/PB_Twitter-Analysis_Spring-2017/blob/master/Documentation/Principles%20of%20Big%20Data%20Project-Team14.pdf

Phase2:

Main Requirements:

Using the collection of tweets from Project 1 (or collect a new set), implement MapReduce programs to determine the vocabulary uniqueness of your dataset:

• M/R: Find the list of words that have duplicates in the tweets’ text.

• M/R: Find the list of words that are unique in the tweets’ text.

• Store the lists in two text files: dups.txt and uniqs.txt

• Print the ratio of the number of unique words to the number of words with duplicates.
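The map and reduce steps behind these requirements can be imitated in plain Python as follows. This is a sketch under assumptions, not the Hadoop job itself: a word "has duplicates" if it occurs more than once across all tweet texts, words are lowercased runs of letters, digits, and apostrophes, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def map_phase(text):
    """Mapper: emit (word, 1) for every word in one tweet's text."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        yield word, 1


def reduce_phase(pairs):
    """Shuffle and reduce: sum the emitted counts per word."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return totals


def split_vocabulary(texts):
    """Return (dups, uniqs): words seen more than once, and words seen once."""
    totals = reduce_phase(p for t in texts for p in map_phase(t))
    dups = sorted(w for w, n in totals.items() if n > 1)
    uniqs = sorted(w for w, n in totals.items() if n == 1)
    return dups, uniqs
```

The two lists would then be written to dups.txt and uniqs.txt, and the requested ratio is `len(uniqs) / len(dups)` (guarding against an empty `dups` list).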

Extra Requirement:

Implement a MapReduce program to determine the best time to post a tweet.

• Propose the metric/criterion of your choice based on the tweet JSON format.

• Run your program and return the top ten best times to post a tweet on Twitter.
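One possible metric, shown here purely as an assumption, is the average engagement (`retweet_count` plus `favorite_count`) of tweets grouped by the UTC hour in `created_at`. The sketch below mimics the map and reduce steps in plain Python; the function names are hypothetical and the project's actual metric may differ.

```python
from collections import defaultdict
from datetime import datetime

# Format of the created_at attribute, e.g. "Mon Apr 03 14:05:00 +0000 2017".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def map_tweet(tweet):
    """Mapper: emit (hour of day, engagement) for one parsed tweet."""
    hour = datetime.strptime(tweet["created_at"], TWITTER_TIME).hour
    engagement = tweet.get("retweet_count", 0) + tweet.get("favorite_count", 0)
    return hour, engagement


def best_hours(tweets, n=10):
    """Reducer: average engagement per hour, then return the top n hours."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for tw in tweets:
        hour, engagement = map_tweet(tw)
        sums[hour] += engagement
        counts[hour] += 1
    avg = {h: sums[h] / counts[h] for h in sums}
    # Highest average first; ties are broken by the earlier hour.
    return sorted(avg, key=lambda h: (-avg[h], h))[:n]
```

Averaging rather than summing keeps hours with many low-engagement tweets from dominating the ranking, which is a design choice of this sketch.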

PPT: https://github.com/saijyothi9/PB_Twitter-Analysis_Spring-2017/blob/master/Phase2/Documentation/PB_phase2.pdf

Phase3:

Main Requirements:

Using a collection of tweets, implement three analytical tasks. A single task could consist of multiple analytical queries.

• One task must be implemented via RDD transformations and actions only.

• Must NOT be a simple word count (e.g. most used language).

• The other two tasks must be implemented via Spark SQL and DataFrames.

• One of your analytical queries must use the input file trends.txt.

Extra Requirement:

• Implement a graphical user interface that enables the user to dynamically execute your analytical tasks and provides a visual representation of the results.

• Flow: the user selects an analytical task, the task is executed in the backend, the results are returned and displayed to the user in a visual representation (e.g. pie chart).
