TwitterCleaner

Solution for the InsightDataEngineering coding-challenge

Main.py contains the main program, while tweetprocessor.py contains the subroutines that process the tweets (feature 1) and compute the rolling average over the graph nodes (feature 2).

When run.sh is called, the program asks whether the user wants to use the example file [ex] (tweets.txt in the tweet_input folder) or live stream data from Twitter [st], fetched through the Twitter API.

A .twitter-example file containing the credentials needs to exist in the src folder for the Twitter API to work.

Use streaming data [st], or example file [ex]? [st/ex] :

If live streaming data is chosen, the program then asks whether the user wants to append the streamed tweets to the tweets.txt file:

Store tweet streams to tweets.txt? [y/n]:

Modules Imported:

Except for tweepy, these modules are part of the standard Python distribution.

re
datetime
time
json
collections
copy
os
tweepy [please refer to the tweepy website for installation instructions]

Solution

Feature 1

The solution for cleaning the tweets [feature 1] employs two strategies:

Searching for the tweet text using built-in string operations
Searching for Unicode characters, hashtags and escape sequences using regular expressions (re module)

While one could use the json module to extract the text, I believe that plain string searching provides better performance.
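
As a rough illustration of the two strategies above, here is a minimal sketch. It is not the code from tweetprocessor.py: the function names (extract_text, clean_text, find_hashtags) are hypothetical, and it assumes each input line is a compact JSON tweet in which the field appears exactly as "text":" with no space after the colon. Dropping \uXXXX escapes and replacing whitespace escapes with a space are also assumptions about what "cleaning" means here.

```python
import re

# Assumed layout of the field inside a raw tweet line (compact JSON).
TEXT_KEY = '"text":"'

# One pass over every backslash escape: \uXXXX or backslash + one character.
ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|.)')
HASHTAG = re.compile(r'#(\w+)')


def extract_text(raw_line):
    """Find the "text" value with plain string searching (no json module).

    Returns the still-escaped value, or None if the field is missing.
    """
    start = raw_line.find(TEXT_KEY)
    if start == -1:
        return None
    start += len(TEXT_KEY)
    i = start
    while i < len(raw_line):
        if raw_line[i] == '\\':
            i += 2              # skip the escaped character
            continue
        if raw_line[i] == '"':  # first unescaped quote closes the value
            return raw_line[start:i]
        i += 1
    return None


def _replace_escape(match):
    body = match.group(1)
    if len(body) == 5:          # \uXXXX: drop the Unicode escape
        return ''
    if body in 'bfnrt':         # whitespace-like escapes become a space
        return ' '
    return body                 # \/ \" \\ become / " \


def clean_text(escaped):
    """Remove Unicode escapes and resolve escape sequences with re."""
    return ESCAPE.sub(_replace_escape, escaped)


def find_hashtags(text):
    """Return the hashtags in the cleaned text, without the leading '#'."""
    return HASHTAG.findall(text)
```

Handling all escapes in a single regular-expression pass avoids misreading an escaped backslash followed by a letter (for example \\n) as a newline escape.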

Feature 2

The graph is stored as a Python dictionary. In addition, for each connected node I keep a count of the hashtags associated with it in the dictionary. Adding or removing hashtags increments or decrements that counter, and once it reaches zero, the connected node is removed.

Another counter keeps track of the number of graph nodes and the total sum of the connected vertices.

This way, the subroutine avoids recalculating the graph for every tweet.
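
The bookkeeping described above might look roughly like the sketch below. The class and method names (HashtagGraph, add_tweet, remove_tweet, average_degree) are hypothetical and not taken from tweetprocessor.py, and the sketch assumes that the caller decides which tweets enter and leave the rolling window and only removes tweets that were added earlier.

```python
from collections import defaultdict
from itertools import combinations


class HashtagGraph(object):
    """Hashtag graph stored as a dictionary of dictionaries.

    graph[a][b] counts how many tweets currently link hashtags a and b.
    """

    def __init__(self):
        self.graph = defaultdict(dict)
        self.node_count = 0   # nodes that have at least one neighbour
        self.degree_sum = 0   # total number of neighbours over all nodes

    def _link(self, a, b):
        neighbours = self.graph[a]
        if not neighbours:
            self.node_count += 1        # a becomes a node of the graph
        if b not in neighbours:
            neighbours[b] = 0
            self.degree_sum += 1        # new connection a -> b
        neighbours[b] += 1

    def _unlink(self, a, b):
        neighbours = self.graph[a]
        neighbours[b] -= 1
        if neighbours[b] == 0:          # no tweet links a and b any more
            del neighbours[b]
            self.degree_sum -= 1
            if not neighbours:          # a has no neighbours left
                del self.graph[a]
                self.node_count -= 1

    def add_tweet(self, hashtags):
        for a, b in combinations(sorted(set(hashtags)), 2):
            self._link(a, b)
            self._link(b, a)

    def remove_tweet(self, hashtags):
        for a, b in combinations(sorted(set(hashtags)), 2):
            self._unlink(a, b)
            self._unlink(b, a)

    def average_degree(self):
        # Uses the running counters only; the graph is never re-scanned.
        if self.node_count == 0:
            return 0.0
        return float(self.degree_sum) / self.node_count
```

Each update touches only the pairs of hashtags in the tweet being added or removed, so the cost per tweet does not depend on the size of the graph.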
