I analyse the Twitter API dataset made available to me through a Data Science class at UChicago. My goal is to identify whether Twitter is a credible source of information on key emerging trends or topics in education. Below, I have conducted preliminary Exploratory Data Analysis to identify relationships and patterns among variables, and understand what types of features drive popularity of a "Twitterer" or Topic in Education.
In future repositories, I extend this analysis to conduct Graph Network Analysis on tweets and user descriptions.
Final results are documented in these Google Slides
My approach has been to fail fast with code errors on small subsamples of the data, and then gradually scale up my code. These tweets were collected on the topic of education, but I have filtered for hashtags related to primary, secondary, or higher education: ['k12', 'highered', 'bookban', 'teachersunion', 'remoteteaching', 'remotelearning', 'schoolreopen']. We will analyse the full set of education-related tweets in the following assignments.
When scaling up, I will:
- Work from parquet files: After conducting basic ETL and subsetting the data, I wrote the files to parquet in my cloud storage bucket, and proceeded to read from there in subsequent sessions. For this graph-like Twitter data, Parquet offers the following benefits:
  - Columnar storage: Parquet's column-oriented layout enables more efficient compression and better query performance as I include more tweet records in the analysis.
  - Schema evolution: Parquet files have built-in support for schema evolution, allowing me to add, remove, or modify columns without rewriting the entire dataset. This flexibility is particularly useful for Twitter data, which has a vast, rich and nested schema that must necessarily be loaded piecemeal for fast analysis.
  - Predicate pushdown: when executing queries, only relevant columns and rows are read from storage, minimizing data movement and improving query performance.
- Consider domain partitioning: For this assignment, I have restricted myself to education tweets and filtered on specific hashtags to reduce the size of the problem. Random sampling, which works with other big datasets, would not have worked here because of the interconnectedness of the data (tweets, retweets, follows, mentions, likes).
Going forward, I will use MLlib's topic modeling to partition the data into Twitter account types (e.g. non-profits, govts, influencers and businesses) and topic types (politics & policy, edtech, other) in order to understand if and where credible information on emerging education topics exists on Twitter.
Many problems of interest involve looking at the top decile of retweets or activity vs the rest - another potential partitioning and parallelization option. Lastly, aggressively grouping and aggregating the data before plotting has been a time-saver.
- Minimize conversions between APIs: Wherever possible, I have refrained from moving between RDDs, PySpark DataFrames, and especially pandas.
- Make strategic use of caching: Caching objects that are about to be used in plotting has been a time-saver. Deliberate caching prior to PySpark SQL statement execution might reduce time as well. (A consolidated sketch of these scale-up practices follows this list.)
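Below is a minimal sketch of how these practices fit together, assuming a hypothetical storage bucket path, a raw JSON dump location, and the column names from the schema further down; it is illustrative rather than my exact code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edu-tweets-eda").getOrCreate()

# Hypothetical bucket path -- substitute your own storage location.
BUCKET = "gs://my-bucket/twitter_education"

# 1) Basic ETL once: filter to the education hashtags, then persist as Parquet.
raw = spark.read.json(f"{BUCKET}/raw/*.json")          # assumed raw dump location
EDU_TAGS = ["k12", "highered", "bookban", "teachersunion",
            "remoteteaching", "remotelearning", "schoolreopen"]
edu = raw.filter(F.lower(F.col("hashtag")).isin(EDU_TAGS))
edu.write.mode("overwrite").parquet(f"{BUCKET}/parquet/edu_tweets")

# 2) Subsequent sessions read the columnar files; selecting only the needed
#    columns benefits from Parquet's column pruning and predicate pushdown.
tweets = (spark.read.parquet(f"{BUCKET}/parquet/edu_tweets")
               .select("user_name", "followers", "hashtag",
                       "tweet_text", "retweeted_from", "tweet_count"))

# 3) Cache only what is about to be reused (e.g. right before plotting).
tweets.cache()

# 4) Aggregate aggressively in Spark before handing a small result to pandas.
activity_by_tag = (tweets.groupBy("hashtag")
                         .agg(F.countDistinct("user_name").alias("users"),
                              F.sum("tweet_count").alias("tweets")))
activity_by_tag.toPandas().plot(kind="bar", x="hashtag", y="tweets")
```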
The time period: 5th April, 2022 - 25th February, 2023
root
|-- date_created: timestamp (nullable = true)
|-- user_name: string (nullable = true)
|-- user_id: long (nullable = true)
|-- user_description: string (nullable = true)
|-- followers: long (nullable = true)
|-- user_location: string (nullable = true)
|-- hashtag: string (nullable = true)
|-- tweet_id: long (nullable = true)
|-- tweet_text: string (nullable = true)
|-- retweeted_from: string (nullable = true)
|-- mentions_g: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- indices: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- screen_name: string (nullable = true)
|-- retweeted_encoded: integer (nullable = true)
|-- tweet_count: long (nullable = true)
In our sample of education-related tweets, the most active Twitterers posting original content include individual influencers like Adam Sanford and Dr. Karen Connaghan, who post about teaching techniques and edtech tools from unverified accounts. Interestingly, the Clark College Job Board is the second most active producer of original content in our sample, indicating that some original content can be noise.
Non-verified accounts create ~20x as much original content as blue-verified accounts, and ~3x as many retweets. One would have expected news journalism accounts, non-profits and government accounts to create original content, posting news and generating press. This might still be the case, but from the chart below, it seems 'verified_status' is not a good identifier of accounts like these. I will try PySpark topic modeling on the user_descriptions to categorize org_type going forward, as sketched below.
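Since that topic-modeling step is still to come, this is only a rough sketch of how it might look with MLlib's LDA on user_description, assuming a tweets DataFrame that still carries user_id and user_description; k=4 (non-profit, govt, influencer, business), the vocabulary size, and the intermediate column names are assumptions, not settled choices:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

bios = tweets.select("user_id", "user_description").dropna().distinct()

# Tokenize bios, drop stop words, build a bag-of-words, then fit LDA.
org_type_pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="user_description", outputCol="tokens", pattern="\\W+"),
    StopWordsRemover(inputCol="tokens", outputCol="clean_tokens"),
    CountVectorizer(inputCol="clean_tokens", outputCol="features",
                    vocabSize=5000, minDF=5.0),
    LDA(k=4, maxIter=20),   # assumed k: non-profit / govt / influencer / business
])

model = org_type_pipeline.fit(bios)
scored = model.transform(bios)
# 'topicDistribution' can then be argmax-ed into a provisional org_type label.
```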
While individual influencers are the most active tweeters, non-profits, edtech magazines and edtech-focused authors are the most followed. This also indicates that edtech might be the most followed topic. In the future, we will see if this trend holds up when segmenting the data by influencers, verified govt accounts, and other verified accounts (people and businesses).
The data here is in line with a power law: the majority of Twitter accounts have fewer than 12.5K followers, while a small fraction of accounts are "influential" in attracting followers. This provides potential thresholds we could use to tag influential accounts. Excluding verified accounts, which I am assuming our specific sample of tweets has missed, non-verified accounts with more than 25K followers could be analysed as "influential". In further analyses, I would want to know: what has more reach - the many users with few followers (perhaps serving niche audiences), or the right side of this graph, the few influential users with millions of followers (but at least 12.5K)?
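One way to derive such thresholds directly from the data, rather than eyeballing the histogram, is approxQuantile; the 25K cut-off echoes the text above, and the DataFrame name is carried over from the earlier sketch:

```python
from pyspark.sql import functions as F

# One row per account, then cheap approximate percentiles of follower counts.
followers_df = tweets.select("user_id", "followers").dropDuplicates(["user_id"])
p50, p90, p99 = followers_df.approxQuantile("followers", [0.5, 0.9, 0.99], 0.01)

# Tag accounts above the assumed 25K cut-off as "influential".
influential = followers_df.filter(F.col("followers") > 25_000)
print(p50, p90, p99, influential.count())
```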
The top one-third most mentioned accounts are also, by far, the most active Twitterers. This group averages 100K tweets during this approximately one-year time period, while the rest of the sample averages between 15K-20K annual tweets. Surprisingly, the bottom third of mentioned Twitterers are slightly more active than the middle group. In the future, analysing the types of organizations or people that are most frequently mentioned would be insightful: Are news events, for example, characterized by mentions of political personas? Does mentioning a known influencer drive the reach of a tweet? This would provide insight into whether a tweet reflects a news event or a long-trending topic.
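Given the nested mentions_g field in the schema above, ranking the most-mentioned accounts is a straightforward explode-and-count; a small sketch (column names from the schema, DataFrame name assumed from the earlier sketch):

```python
from pyspark.sql import functions as F

# Flatten the array of mention structs, then count mentions per screen name.
mention_counts = (tweets
                  .select(F.explode("mentions_g").alias("m"))
                  .groupBy(F.col("m.screen_name").alias("mentioned"))
                  .count()
                  .orderBy(F.desc("count")))
mention_counts.show(20, truncate=False)
```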
In the following assignments, we will use GraphFrames, PySpark MLlib's topic modeling and Jaccard similarity to answer the questions:
- Is the content mostly unique, or is it usually people copy-pasting the same text? Does the pattern of text similarity vary between official verified accounts and non-official accounts?
- Who is retweeting whom? Who originates the original content? We will use PySpark GraphFrames to find the nodes (original tweeters) and edges (retweeters) for a politicized topic (Black History courses & critical race theory), an influencer-driven topic (edtech solutions), and a neutral education topic (pandemic-driven test score declines). A minimal sketch follows this list.
- What topics are dominating the conversation among verified govt accounts? Among other verified accounts? Among influencers (non-verified accounts with more than 25K followers)?
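As a preview of the retweet-graph question, here is a minimal GraphFrames sketch; it assumes the graphframes package is available on the cluster, that user_name and retweeted_from identify the two endpoints, and it filters to a single hashtag purely for illustration:

```python
# Requires the graphframes package, e.g.
# spark-submit --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 ...
from graphframes import GraphFrame
from pyspark.sql import functions as F

# Vertices: every account that either tweets or is retweeted from.
vertices = (tweets.select(F.col("user_name").alias("id"))
                  .union(tweets.select(F.col("retweeted_from").alias("id")))
                  .dropna().distinct())

# Edges: retweeter -> original tweeter, restricted to one topic's hashtag.
edges = (tweets.filter(F.col("hashtag") == "bookban")        # assumed topic filter
               .filter(F.col("retweeted_from").isNotNull())
               .select(F.col("user_name").alias("src"),
                       F.col("retweeted_from").alias("dst")))

g = GraphFrame(vertices, edges)
g.inDegrees.orderBy(F.desc("inDegree")).show(10)   # most-retweeted origins
g.pageRank(resetProbability=0.15, maxIter=10).vertices \
 .orderBy(F.desc("pagerank")).show(10)
```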
The data set represents 100 million tweets pulled from the Twitter API and analyzed previously in Assignment 7 of this class. The data has been cleaned, with a focus on removing bots and fake accounts, keeping English-language tweets and removing troublesome null values. This brought us to a subset of ~600K tweets, from which I sought to create a balanced dataset of tweets that were retweeted and those that were not. I then sampled 10% of the dataset, so we are working with ~60K tweets, or 18 MB of data, which can comfortably fit in cache.
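A sketch of how that balanced ~60K working set could be drawn with stratified sampling; 'clean' stands in for the ~600K cleaned DataFrame, and the fractions are illustrative:

```python
# Class sizes in the cleaned data (~600K tweets), keyed by the 0/1 label.
counts = {r["retweeted_encoded"]: r["count"]
          for r in clean.groupBy("retweeted_encoded").count().collect()}
minority = min(counts.values())

# Downsample each class to roughly the minority size, then take a 10% sample.
fractions = {label: minority / n for label, n in counts.items()}
balanced = clean.sampleBy("retweeted_encoded", fractions=fractions, seed=42)
working = balanced.sample(fraction=0.10, seed=42)  # ~60K tweets, fits in cache
working.cache()
```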
The goal of this exercise is to predict whether a tweet will be retweeted using logistic regression.
The three features engineered for this analysis are: 1) 'org_type', the type of account; 2) 'statuses_count', the number of tweets + retweets made by the user; and 3) the Jaccard distance or 'distCol', a measure of tweet originality. I explain each below:
As seen in the chart below, overall activity differs greatly across the different user groups within the education domain. Educational leaders, for example, tend to tweet heavily when we look at the overall data, but when looking at a subset of political tweets related to “critical race theory”, politicians and institutions become more vocal.
Statuses count represents the total number of tweets and retweets associated with a user, indicating user activity. According to the chart above, the most active users are also the most retweeted. I did not use the top followed users for this purpose since the top followed in the education domain were celebrities from outside education. For example, Michael Bublé has an education foundation, and is in the top 20 followed, but is barely ever retweeted by those within this niche.
The hypothesis here is that more original tweets are more likely to be retweeted. We used the Jaccard distance on hashed tweet text to measure how far apart two tweets are - in other words, how unique a tweet is relative to the rest of the data set. Hashing is used here to shrink the big data problem by comparing only the subset of tweet pairs that are most likely to be duplicates. We used a Jaccard distance threshold of 0.6; the higher the threshold, the more tweets the algorithm classifies as duplicates. Overall it found 4,332 duplicates among 30,922 tweets.
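The 'distCol' name suggests Spark ML's approxSimilarityJoin; below is a hedged MinHash LSH sketch of that duplicate-detection step, with the tokenization, the working DataFrame name (carried over from the sampling sketch), and the self-join shape all as assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, MinHashLSH
from pyspark.sql import functions as F

# Turn tweet text into binary term vectors; MinHash needs non-empty vectors.
prep = Pipeline(stages=[
    RegexTokenizer(inputCol="tweet_text", outputCol="tokens", pattern="\\W+"),
    CountVectorizer(inputCol="tokens", outputCol="features", binary=True),
]).fit(working)
vectors = prep.transform(working).filter(F.size("tokens") > 0)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(vectors)

# Self-join: pairs of distinct tweets within Jaccard distance 0.6 of each other.
pairs = (model.approxSimilarityJoin(vectors, vectors, 0.6, distCol="distCol")
              .filter(F.col("datasetA.tweet_id") < F.col("datasetB.tweet_id")))
duplicates = pairs.select("datasetA.tweet_id").distinct()
print(duplicates.count())
```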
Our predictive accuracy is .673 on the training data and .670 on the testing data, indicating good generalizability of the model to new data, and an accuracy that is better than random chance. There is some predictive power in our variables, but perhaps this could be improved by considering topic- and content-related attributes of tweets, as opposed to just the user-level attributes we have used in this analysis.
The FPR of 0.643 for 'original tweets' means that approximately 64.3% of the tweets that were not retweeted are incorrectly classified as retweeted. This indicates a relatively high rate of false positives for the negative class. The FPR of 0.173 for retweeted tweets means that around 17.3% of the retweeted tweets are incorrectly classified as not retweeted, a much lower rate of false positives for the positive class. Overall, the model is more discerning of retweeted tweets than it is of non-retweeted tweets.
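For reference, a sketch of the logistic-regression pipeline and a per-class FPR computation of the kind reported above; 'features_df' (a DataFrame already joined with org_type, statuses_count and distCol), the split, and the label-to-class mapping are assumptions, and FPR(c) is defined here as the share of non-c examples predicted as c:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import functions as F

lr_pipeline = Pipeline(stages=[
    StringIndexer(inputCol="org_type", outputCol="org_idx", handleInvalid="keep"),
    OneHotEncoder(inputCols=["org_idx"], outputCols=["org_vec"]),
    VectorAssembler(inputCols=["org_vec", "statuses_count", "distCol"],
                    outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="retweeted_encoded"),
])

train, test = features_df.randomSplit([0.8, 0.2], seed=42)  # assumed DataFrame
model = lr_pipeline.fit(train)
pred = model.transform(test)

# Per-class false positive rate from the confusion counts:
# FPR(class c) = (predicted c but actually not c) / (all actually not c)
for c in (0.0, 1.0):
    fp = pred.filter((F.col("prediction") == c) &
                     (F.col("retweeted_encoded") != c)).count()
    actual_not_c = pred.filter(F.col("retweeted_encoded") != c).count()
    print(f"FPR for class {c}: {fp / actual_not_c:.3f}")
```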