- Problem and Data
- Predict Rumour Tweets
- What Are Rumour Like? Rumour Analysis
- Conclusion
- Potential Issues
This project was also entered in a Kaggle competition involving 400+ students.
Our model achieved 3rd position on both the public and private leaderboards.
As the Covid-19 pandemic unfolded, citizens and organizations of many countries took to social networking platforms to spread their knowledge surrounding the phenomenon.
Some of this knowledge was based on substantiated facts, such as data from official sources like the World Health Organization. However, there was also a large number of tweets making statements based on hearsay or on personal opinion presented as fact.
These statements, commonly referred to as “rumours”, are an important class of social media objects to understand, as they can range from harmless assertions to potentially deadly recommendations.
This project details the processing of, and classifier training on, a set of tweets from the popular online platform Twitter. The project is divided into two parts:
- identifying rumour tweets, and
- analysing rumours.
- A text file containing the IDs of COVID-19 related tweets and the IDs of their replies.
- A ground truth file containing the IDs of rumour and non-rumour tweets.
- About 30K tweets in total.
Two approaches have been used in this project: one based on other attribute features (metadata), and one based purely on linguistic features.
- Metadata approach: use traditional data attributes, in this case author information, to make the prediction with classical machine learning models.
- NLP Transformer approach: use words and sentences as information, making use of a Transformer model to make the prediction.
- Get full tweets of provided IDs using Twitter API V1.1 and V2.
- Group the IDs of replies and their source IDs.
- Get data for their authors.
- Gather their label (0: non-rumour, 1: rumour)
- Keep attributes and transform text data into machine readable structure:
- Logistic Regression (LR)
- Support Vector Classifier (SVC)
- Multilayer Perceptron (MLP)
- Long Short Term Memory Recurrent Neural Network (LSTM)
However, classification based on either bag of words or metadata (author information) does not perform well (around 0.7 accuracy).
- Get full tweets of provided IDs using Twitter API V1.1 and V2.
- Group the IDs of replies and their source IDs.
- Sort the replies by publication time.
- Gather their label (0: non-rumour, 1: rumour)
- Keep text data only.
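
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names `id`, `source_id`, `created_at` and `text`, and the function `build_threads`, are hypothetical, and timestamps are assumed to be ISO 8601 strings without a timezone suffix.

```python
from collections import defaultdict
from datetime import datetime


def build_threads(tweets, labels):
    """Group replies under their source tweet, oldest reply first.

    tweets: list of dicts with hypothetical keys id, source_id, created_at, text
    labels: dict mapping source tweet id -> 0 (non-rumour) or 1 (rumour)
    """
    by_id = {t["id"]: t for t in tweets}

    # Collect the replies of every source tweet.
    replies = defaultdict(list)
    for t in tweets:
        parent = t.get("source_id")
        if parent is not None:
            replies[parent].append(t)

    threads = []
    for source_id, label in labels.items():
        if source_id not in by_id:
            # The tweet could not be fetched (for example, it was deleted).
            continue
        ordered = sorted(
            replies.get(source_id, []),
            key=lambda t: datetime.fromisoformat(t["created_at"]),
        )
        threads.append({
            "source_text": by_id[source_id]["text"],
            "reply_texts": [t["text"] for t in ordered],
            "label": label,
        })
    return threads
```

Each resulting record keeps only text and the label, which is the shape a text-only model needs.
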
- Bidirectional Encoder Representations from Transformers (BERT)
- Multilayer Perceptron Classifier (after BERT)
- We use the bertweet-base model from Hugging Face. The model is pre-trained on a huge tweet corpus (16B word tokens, about 80GB of data).
BERT is a transformer model that is used here not for classification directly, but for extracting information (features) from a sentence or document. "Bidirectional" means that the model captures the context on both sides of each token, while "Encoder" means that the transformer extracts latent information from the text and outputs a contextual embedding representing the meaning of the whole sentence (loosely, like archiving file.txt into file.zip).
Empirical experiments conducted by other researchers suggest that the more layers of abstraction (transformation) we have, the better the embedding captures the whole contextual information.
Additionally, the output of the encoder gives us an embedding representing the meaning of the whole sentence, ready to use.
By adding a classification layer right after the embedding layer, we can output an actual label.
After running a number of tests, the performance comparison among models is shown in the figure below.
BERT is currently the best-performing model in this project.
Intuitively, to measure the performance of the model, we might have used the vanilla "accuracy" metric,
which simply divides the number of correct predictions by the total number of data rows.
However, this would cause problems given that we have an extremely unbalanced dataset, with 4 times more non-rumour tweets than rumour tweets.
Therefore, we used a better evaluation metric, the F-beta score, with beta set to 2.
The F-beta score is `F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)`, so a higher beta gives "Recall" more weight in the metric. The F-1 score is the special case beta = 1, where precision and recall are weighted equally.
Briefly, "Recall" measures how many of the underlying rumours have been identified, while "Precision" measures how many of the tweets predicted to be rumours really are rumours.
Since most tweets are non-rumours, a machine learning model that simply outputs the label "non-rumour" can still get a high accuracy, but such a model has learnt nothing and would be useless.
Therefore, we adopt the F-beta metric with beta = 2 to make sure our model predicts rumours as expected.
Additionally, we use macro averaging rather than weighted averaging to emphasise the importance of our minority label. More information about macro vs weighted averaging can be found here
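
To make the argument concrete, here is a small pure-Python sketch of the macro-averaged F-beta score. The function names `fbeta` and `macro_fbeta` are illustrative and are not taken from the project code; a library implementation would normally be used in practice.

```python
def fbeta(precision, recall, beta=2.0):
    """F-beta score from precision and recall (0.0 when both are zero)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_fbeta(y_true, y_pred, beta=2.0, labels=(0, 1)):
    """Unweighted mean of the per-class F-beta scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(fbeta(precision, recall, beta))
    return sum(scores) / len(scores)


# A classifier that always answers "non-rumour" on a 4:1 dataset.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.8
print(macro_fbeta(y_true, y_pred))  # about 0.476
```

The always-"non-rumour" model reaches 0.8 accuracy but scores below 0.5 on macro F-2, because the rumour class contributes a score of zero and macro averaging gives that class the same weight as the majority class.
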
The traditional models do not perform well on the validation set or the test set.
To investigate the issue further, we analyse how the model works and generate an "unseen word" problem plot.
- The model is trained on bag-of-words vectors.
- If a word is not seen in the training set, the model cannot capture the information it carries.
- If too many words in the data are unseen by the model, performance is bound to drop.
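
The "unseen word" effect described above can be measured with a few lines of Python. This is only a sketch: it assumes simple lowercased whitespace tokenisation, which may differ from the tokenisation used in the project, and `unseen_word_rate` is a hypothetical helper name.

```python
def unseen_word_rate(train_texts, eval_texts):
    """Fraction of evaluation tokens that never appear in the training texts."""
    vocab = {w for text in train_texts for w in text.lower().split()}
    tokens = [w for text in eval_texts for w in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for w in tokens if w not in vocab)
    return unseen / len(tokens)


rate = unseen_word_rate(["covid cases rise"], ["covid vaccine works"])
print(rate)  # two of the three evaluation tokens are unseen
```

A bag-of-words vector has no dimension for an unseen word, so a high rate means much of the evaluation text is invisible to the classifier.
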
NonRumour Hashtags | Rumour Hashtags |
---|---|
“What kind of hashtags are used by rumour and nonrumour tweets, and what is the overlap/difference between them?”
As hashtags are used to spread tweets that share a common theme, examining them gives us insight into what sort of things are being expressed by rumour and nonrumour tweeters. The hashtags are extracted using regex, lowercased, and counted.
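
A minimal sketch of this extract, lowercase and count step is shown below. The regular expression `#(\w+)` is an assumption, since the exact pattern used in the project is not stated, and `count_hashtags` is an illustrative name.

```python
import re
from collections import Counter

# Assumed pattern: a '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def count_hashtags(texts):
    """Count lowercased hashtags over a collection of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update("#" + tag.lower() for tag in HASHTAG.findall(text))
    return counts


rumour = count_hashtags(["Breaking #COVID19 #Trump", "#covid19 again"])
nonrumour = count_hashtags(["Please #StayHome #covid19"])
shared = set(rumour) & set(nonrumour)  # hashtags used by both classes
print(rumour.most_common(2), shared)
```

Counting each class separately and intersecting the key sets gives both the per-class rankings and the overlap asked about in the question.
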
As is visible from the figure, the hashtags tweeted in rumours and nonrumours are quite similar, but do have some differences in content. The rumour hashtags contain more references to former American president Donald Trump, whereas the nonrumours have #stayhome and #stayhomesavelives. This suggests that rumour tweets' topics are more often related to Donald Trump, and nonrumour tweets' topics are more often related to preventing the spread of COVID-19.
A similar result is observed later in the topic analysis section.
One other point of note is that while the nonrumour class has only 7 times as many tweets as the rumour class, the nonrumour tweets contain over 25 times as many #covid19 instances compared to the rumour tweets. A similar situation occurs with #coronavirus.
This suggests that nonrumour tweets contain more references to the actual virus than rumours, although the reason for this is unclear.
NonRumour Topics | Rumour Topics |
---|---|
As the above figure suggested, rumours are more often concerned with former American president Donald Trump, and nonrumours contain topics like the vaccine and masks. There is also some overlap, as both Topic 3s seem to be about the number of new cases.
The trending topics for rumours vs non-rumours were extracted using Latent Dirichlet Allocation (LDA).
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, documents are represented as a mixture of topics, and a topic is a collection of words. Those topics reside within a hidden (also known as latent) layer, and the algorithm is meant to extract this information.
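
To illustrate the idea of recovering latent topics, here is a toy collapsed Gibbs sampler for LDA in pure Python. It is a teaching sketch only: the project most likely used a library implementation, and the function name `lda_top_words` and the hyperparameters `alpha`, `beta`, `n_iter` and `seed` are assumptions chosen for the example.

```python
import random
from collections import Counter


def lda_top_words(docs, n_topics=2, top_n=3, n_iter=100,
                  alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignment = []                                    # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignment.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                z = assignment[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignment[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # The most frequent words of each topic describe what it is about.
    return [
        [w for w, c in topic_word[k].most_common(top_n) if c > 0]
        for k in range(n_topics)
    ]


docs = [["trump", "president", "trump"], ["mask", "vaccine", "mask"],
        ["trump", "president"], ["vaccine", "mask"]]
print(lda_top_words(docs, n_topics=2, top_n=2))
```

Each topic is summarised by its most frequent words, which is how topic lists like the ones in the table above are usually read.
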
The general sentiment of COVID-19 tweets is negative, and rumours are on average more negative than nonrumours.
We assign -1 to a negative sentiment and +1 to a positive sentiment.
Applying the central limit theorem, we can treat the mean sentiment score as approximately normally distributed; after some calculations, the results are shown below.
- Model used: a transformer model pre-trained on tweets from January 2018 to December 2021, which covers the COVID-19 pandemic period (cardiffnlp, 2022).
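
The calculation behind the mean sentiment scores can be sketched as follows. This is an illustration under assumptions: the label strings are hypothetical, a neutral prediction is assumed to score 0 (the text above only fixes -1 and +1), and the 95% interval uses the normal approximation justified by the central limit theorem.

```python
import math
from statistics import mean, stdev

# Assumed mapping; only -1 and +1 are stated in the text above.
SCORE = {"negative": -1, "neutral": 0, "positive": 1}


def mean_sentiment(labels, z=1.96):
    """Mean sentiment score with a normal-approximation confidence interval.

    Needs at least two labels, because the sample standard deviation is used.
    """
    scores = [SCORE[label] for label in labels]
    m = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, (m - half_width, m + half_width)


labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
m, (low, high) = mean_sentiment(labels)
print(m, low, high)  # mean is -0.4 for this toy sample
```

Comparing the intervals of two groups (for example, rumour replies and nonrumour replies) shows whether the difference in mean sentiment is larger than the sampling noise.
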
From the second table, the source rumour tweets tend to convey negative sentiment (-0.480), and they receive even more negative public replies (-0.594). Unfortunately, our analysis cannot determine the reason for the more negative replies to rumour tweets. For example, it could be that the rumour tweets incite hatred in the general public, or it could be the general public criticising the rumours.
The general negative sentiment of both classes is understandable. Most of the topics discussed in the tweets concern things such as the number of deaths and various lockdowns, so it is reasonable that the general sentiment leans negative.
- Machine learning is largely about preprocessing and feature engineering. Using the powerful BERT transformer to extract features completely beats the classical bag-of-words features.
- The BERT transformer can capture the underlying meaning of whole sentences, while frequency-counting methods like bag of words can only capture count patterns.
- Rumours generally convey more negative sentiment to the general public than non-rumours.
- Rumour identification may not be possible purely from the author's information, as that classifier does not perform well.
- No geographic information has been analysed, because we only keep tweets in English.
- The non-rumour tweets might deliver useful information, such as on vaccinations and masks, to the general public.
- Some tweets have been taken down or deleted, so we cannot get all the data to work on.
- Our prediction model is around 1.5GB in size, so it cannot be saved and has to be retrained.
Link: https://www.kaggle.com/competitions/rumour-detection-and-analysis-on-twitter/leaderboard?
Link: https://www.overleaf.com/read/bbxchdzbpvtv
You will need to extract several compressed files in place, for example: extract /full_data/data_storage/full_dev_train.zip to /full_data/data_storage/full_dev_train.json in the same location.
- ./task1_BERT/*: the BERT model, run on the Google Colab platform. With the batch size tuned to 20, the number of epochs set to 15, and the hidden dropout probability set to 0.1, training on the train+dev set with truncated tweets (which outperforms using full-text tweets), we get our best submission results.
- ./task1_BERT/bert_data/*: data used to train the BERT model.
- ./task2_Analysis/Predict.ipynb: predicts the labels using the model (the prediction itself is in fact done on Google Colab; this notebook just concatenates the results).
- Other files: topic, hashtag and sentiment analysis.
- ./deprecated/tweetgetter/*: Twitter APIs and crawlers used throughout the project to get Twitter data by ID, etc.
- ./deprecated/Deprecated_Crawler/*: a functional crawler that has not been used in the project.
- TwitterGetter.ipynb: scrapes tweet data and tweet author data.
- TwitterGetter_V2.ipynb: scrapes tweet data, including author data, for the whole train + dev set. However, this method WILL return text that may be truncated!
- Analysis.ipynb: analyses the Twitter data.
- ./deprecated/models: previously tried models, including ClassicModel (MLP, SVC, LR) and preprocessing steps such as PCA, TF-IDF and bag of words. It also includes a deprecated BERT model that was trained on my GPU; the GPU memory restricted the training batch size, so we later moved to the Google Colab cloud platform.
- ./deprecated/preprocessing_evalluations: assorted code for preprocessing and evaluation, which tries to turn the original tweet data into "Source_Text" + a list of "Reply_Text".
- ./data/*: data needed to run the code. Note that the paths to this data need to be modified before running or testing.
- ./id_data/: stores Twitter IDs.
- ./full_data/storage_data: stores the scraped data, i.e. the full data set for tweets and authors.
- ./full_data/*.json: these files are generated by the program at run time and can change, so this is a temporary location.