This repository presents a comprehensive guide to performing sentiment analysis on a large dataset of tweets. The project involves classifying the sentiment of tweets as positive or negative, providing insights into public opinion on various topics.
- Introduction
- Dataset
- Goals
- Methods
- TF-IDF Vectorization with Unigram
- TF-IDF Vectorization with N-grams
- Word2Vec Trained from Scratch
- Doc2Vec Trained from Scratch
- Google News Word2Vec
- GloVe Vectorization
- Gensim FastText Trained from Scratch
- BERT
- RoBERTa
- Latent Dirichlet Allocation (LDA)
- Universal Sentence Encoder
- Sentence Transformers
- ELMo
- CLIP
- Results
- Future Work
- Installation
- Usage
- Contributors
- License
Sentiment analysis involves classifying text data to determine whether the sentiment is positive or negative. In this project, we analyze a large dataset of tweets to uncover public opinion on various topics.
The dataset used for this project is the Sentiment140 dataset, which contains 1.6 million tweets labeled with sentiment: 0 for negative and 4 for positive. The dataset includes the following columns:
- `target`: Sentiment label (0 = negative, 4 = positive)
- `ids`: Tweet ID
- `date`: Date of the tweet
- `flag`: Query flag
- `user`: Username
- `text`: Tweet content
- To clean and preprocess the tweet data.
- To explore the data through various visualizations and descriptive statistics.
- To build and evaluate machine learning and deep learning models for sentiment classification.
We employed various methods to extract feature embeddings from the tweet text, which were then used as input for classification models. Below are the feature extraction methods:
TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a word in a document relative to the corpus. It converts text data into numerical features for machine learning algorithms by capturing the significance of words based on their frequency and rarity in the corpus.
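For illustration, a unigram TF-IDF matrix can be built with scikit-learn as sketched below; the sample tweets and the `max_features` value are assumptions, not necessarily the notebook's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["i love this phone", "worst service ever", "great day today"]

# Unigram TF-IDF: each feature corresponds to a single word in the vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=5000, lowercase=True)
X = vectorizer.fit_transform(tweets)  # sparse matrix of shape (n_tweets, n_features)
print(X.shape)
```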
Extending TF-IDF to N-grams captures more contextual information by considering combinations of words. This approach enhances the model's understanding of word dependencies and phrases, providing richer features for text data.
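A minimal sketch of the same idea with uni- and bigrams (the `ngram_range` and `max_features` values are assumed choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["not good at all", "not bad actually", "really good movie"]

# Including bigrams lets phrases such as "not good" become features of their own.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)
X = vectorizer.fit_transform(tweets)
print(vectorizer.get_feature_names_out()[:10])
```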
Word2Vec converts text data into numerical vectors by capturing semantic relationships between words. It uses either the Continuous Bag of Words (CBOW) or Skip-gram models to predict the context of words in a sentence, creating dense word vectors.
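A gensim sketch of training Word2Vec from scratch and averaging word vectors into a tweet embedding; the hyperparameters and the averaging step are illustrative assumptions rather than the notebook's exact setup.

```python
import numpy as np
from gensim.models import Word2Vec

tokenized_tweets = [["i", "love", "this", "phone"], ["worst", "service", "ever"]]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5,
               min_count=1, sg=1, epochs=10)

def tweet_vector(tokens, model):
    """Average the vectors of a tweet's in-vocabulary words."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(tweet_vector(["love", "phone"], w2v).shape)  # (100,)
```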
Doc2Vec extends Word2Vec to generate vector representations of entire documents. It uses Distributed Memory (DM) and Distributed Bag of Words (DBOW) models to capture semantic relationships between sentences and paragraphs.
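A Doc2Vec sketch with gensim, assuming the DM variant and illustrative hyperparameters:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized_tweets = [["i", "love", "this", "phone"], ["worst", "service", "ever"]]
documents = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(tokenized_tweets)]

# dm=1 is the Distributed Memory model; dm=0 would be DBOW.
d2v = Doc2Vec(documents, vector_size=100, window=5, min_count=1, dm=1, epochs=20)

# Infer a document vector for an unseen tweet.
vec = d2v.infer_vector(["great", "day"])
print(vec.shape)  # (100,)
```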
The Google News Word2Vec model is a pre-trained Word2Vec model trained on a large corpus of news articles. It provides high-quality embeddings that capture a wide range of semantic relationships and contextual understandings.
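One way to load these embeddings is through gensim's downloader, as sketched below; the averaging into tweet vectors mirrors the Word2Vec sketch above, and the download is roughly 1.6 GB on first use.

```python
import numpy as np
import gensim.downloader as api

# Pre-trained 300-dimensional Google News vectors.
kv = api.load("word2vec-google-news-300")

def tweet_vector(tokens, keyed_vectors):
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(keyed_vectors.vector_size)

print(tweet_vector(["great", "phone"], kv).shape)  # (300,)
```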
GloVe (Global Vectors for Word Representation) is a pre-trained word embedding model that uses aggregated global word-word co-occurrence statistics. It captures semantic relationships between words by leveraging the overall statistical information of a corpus.
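A similar sketch for GloVe, assuming the Twitter-trained 100-dimensional vectors available through gensim's downloader (the project may use a different GloVe variant):

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-twitter-100")  # pre-trained on Twitter data, 100 dimensions

def tweet_vector(tokens, keyed_vectors):
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(keyed_vectors.vector_size)

print(tweet_vector(["great", "phone"], glove).shape)  # (100,)
```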
FastText extends Word2Vec by representing words as a bag of character n-grams, capturing sub-word information. This approach addresses out-of-vocabulary (OOV) words and generates more accurate word vectors.
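A minimal gensim FastText sketch with assumed hyperparameters; the character n-gram range is what lets it embed words it has never seen.

```python
from gensim.models import FastText

tokenized_tweets = [["i", "love", "this", "phone"], ["worst", "service", "ever"]]

# Character n-grams between min_n and max_n provide sub-word information.
ft = FastText(sentences=tokenized_tweets, vector_size=100, window=5, min_count=1,
              min_n=3, max_n=6, epochs=10)

# Even a misspelled, out-of-vocabulary word receives a vector from its sub-words.
print(ft.wv["phonee"].shape)  # (100,)
```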
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that reads entire sequences of words simultaneously. This bidirectional approach captures the context of words based on their surroundings, improving the understanding of meaning.
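As a sketch, sentence-level BERT features can be extracted with Hugging Face Transformers; using the [CLS] hidden state as the tweet embedding is one common convention, not necessarily the notebook's exact choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tweets = ["i love this phone", "worst service ever"]
inputs = tokenizer(tweets, padding=True, truncation=True, max_length=64, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's hidden state serves as a fixed-size 768-dim tweet embedding.
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # (2, 768)
```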
RoBERTa (Robustly Optimized BERT Pretraining Approach) is an improved version of BERT with larger training data, longer training periods, and dynamic masking. It removes the Next Sentence Prediction (NSP) task and focuses on masked language modeling.
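The same pattern applies to RoBERTa; the sketch below mean-pools the token embeddings instead of taking a single token, which is another common (assumed) pooling choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer(["worst service ever"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, 768)
```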
LDA is a generative probabilistic model used for topic modeling. It discovers underlying topics in a collection of documents by representing each document as a mixture of topics and each topic as a distribution over words.
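A gensim LDA sketch in which each tweet's topic mixture becomes its feature vector; the number of topics and passes are assumed values.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_tweets = [["phone", "battery", "great"], ["service", "delay", "bad"],
                    ["battery", "charger", "phone"]]

dictionary = Dictionary(tokenized_tweets)
corpus = [dictionary.doc2bow(toks) for toks in tokenized_tweets]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10, random_state=42)

# The per-document topic distribution can be used as a dense feature vector.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))
```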
The Universal Sentence Encoder (USE) encodes text into fixed-length vectors that capture semantic meaning. It uses a transformer-based architecture to generate embeddings that excel at capturing contextual information of sentences.
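A sketch of embedding tweets with USE, assuming the transformer-based "large" variant from TensorFlow Hub (downloaded on first use):

```python
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

embeddings = use(["i love this phone", "worst service ever"])
print(embeddings.shape)  # (2, 512)
```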
Sentence Transformers generate dense vector representations for sentences, capturing their semantic meaning. They extend traditional transformers by fine-tuning them for generating meaningful sentence embeddings.
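A minimal sentence-transformers sketch; `all-MiniLM-L6-v2` is one commonly used checkpoint and is an assumption here, not necessarily the one used in the notebook.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["i love this phone", "worst service ever"])
print(embeddings.shape)  # (2, 384)
```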
ELMo (Embeddings from Language Models) generates context-dependent word embeddings using a bi-directional LSTM. It captures the meaning of words based on their context within a sentence, providing more accurate word representations.
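One way to obtain sentence-level ELMo embeddings, assuming the TF Hub v3 module under TensorFlow 2; the `"default"` output is a mean-pooled 1024-dim vector per sentence.

```python
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.load("https://tfhub.dev/google/elmo/3")

tweets = tf.constant(["i love this phone", "worst service ever"])
# The signature returns several outputs; "default" mean-pools the contextual word vectors.
embeddings = elmo.signatures["default"](tweets)["default"]
print(embeddings.shape)  # (2, 1024)
```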
CLIP (Contrastive Language–Image Pre-training) bridges the gap between vision and language by learning visual concepts from text. It uses large-scale natural language supervision to learn joint embeddings for images and their descriptions.
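For sentiment features, only CLIP's text tower is needed; below is a sketch using the Hugging Face checkpoint `openai/clip-vit-base-patch32` (an assumed model choice).

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["i love this phone", "worst service ever"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**inputs)  # only the text encoder is used

print(text_features.shape)  # (2, 512)
```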
For sentiment classification, we used a variety of classical machine learning models as well as a single hidden layer neural network. Below are the classifiers employed in this project:
- CatBoostClassifier
- DecisionTreeClassifier
- RandomForestClassifier
- ExtraTreesClassifier
- BaggingClassifier
- AdaBoostClassifier
- GradientBoostingClassifier
- KNeighborsClassifier
- LogisticRegression
- SGDClassifier
- XGBClassifier
- LinearSVC
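A sketch of how a pool of such classifiers can be trained and compared on any of the embedding matrices above; the random stand-in data, the subset of classifiers, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data; in the project, X is an embedding matrix and y the 0/1 sentiment labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 100))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(),
    "LinearSVC": LinearSVC(),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```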
In addition to the classical machine learning models, we also implemented a single hidden layer neural network using PyTorch. The neural network architecture consists of:
- Input Layer: Takes in the TF-IDF features.
- Hidden Layer: Contains 128 neurons with ReLU activation.
- Output Layer: A single neuron with Sigmoid activation for binary classification.
The neural network is trained using Binary Cross-Entropy Loss and the Adam optimizer over 300 epochs.
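A minimal PyTorch sketch of this architecture and training loop; the stand-in tensors, learning rate, and full-batch training are simplifications, not the notebook's exact implementation.

```python
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),  # hidden layer with 128 neurons
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),          # single output neuron
            nn.Sigmoid(),                      # probability of positive sentiment
        )

    def forward(self, x):
        return self.net(x)

# Stand-in data; in the project, X holds the TF-IDF features and y the 0/1 labels.
X = torch.randn(256, 5000)
y = torch.randint(0, 2, (256, 1)).float()

model = SentimentNet(input_dim=X.shape[1])
criterion = nn.BCELoss()  # binary cross-entropy on sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(300):  # 300 epochs, as described above
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```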
The performance of our sentiment analysis models was evaluated using various metrics. The results demonstrate the effectiveness of different methods in accurately predicting the sentiment of tweets.
This project has covered a wide range of methods for sentiment analysis on Twitter data. However, there are numerous avenues for further exploration and enhancement. Below are some potential directions for future work:
- Transformers-based Models
  - DistilBERT: Incorporate DistilBERT, a smaller, faster, cheaper, and lighter version of BERT, for efficient sentiment analysis.
  - GPT-3/GPT-4: Leverage advanced models like GPT-3 or GPT-4 for zero-shot, few-shot, or fine-tuned sentiment analysis to capture more complex patterns in the data.
- Hybrid Embeddings
  - Combine different embedding methods (e.g., BERT embeddings with TF-IDF features) to capture both contextual and statistical information, enhancing model performance.
- Contextualized Embeddings
  - Consider using newer models like T5 (Text-to-Text Transfer Transformer) or XLNet, which have shown improvements over BERT in some tasks.
- Ensemble Methods
  - Stacking: Combine multiple classifiers using a meta-classifier to improve performance and leverage the strengths of different models.
  - Blending: Combine predictions from multiple models to make the final prediction, enhancing overall accuracy.
- Deep Learning Models
  - Recurrent Neural Networks (RNNs) with Attention Mechanism: Implement RNNs with attention mechanisms to focus on important words or phrases in the tweets.
  - Convolutional Neural Networks (CNNs): Use CNNs to capture local features in text data, which can improve the model's ability to identify key patterns.
  - Transformers: Utilize full transformer-based models directly for classification tasks to take advantage of their powerful context awareness.
- Synthetic Data Generation
  - Apply methods like SMOTE for text data to generate synthetic examples for balancing the dataset. Although our dataset is balanced, these techniques can be useful for datasets with class imbalance.
- Back Translation
  - Translate tweets to another language and then back to English to create more training data with slight variations, enhancing model robustness.
- Additional Data Cleaning and Augmentation Techniques
  - Ensure comprehensive preprocessing steps, such as handling emojis, URLs, mentions, and special characters in tweets.
  - Besides back translation, consider techniques like Easy Data Augmentation (EDA), which includes operations such as synonym replacement and random insertion.
- Handling Class Imbalance
  - Techniques like undersampling, oversampling, or using class weights in the loss function can manage class imbalance effectively. Although our dataset is balanced, these methods would be necessary if it were not.
- Hyperparameter Tuning
  - Use techniques like Grid Search or Random Search to optimize model performance, both for classification and for text-embedding extraction.
- Model Interpretability
  - LIME (Local Interpretable Model-agnostic Explanations): Use LIME to understand the model's predictions on individual tweets, providing insights into model behavior.
  - SHAP (SHapley Additive exPlanations): Utilize SHAP to provide both global and local interpretations of model outputs, ensuring transparency and trust in the model.
- Robustness Testing
  - Test models against adversarial examples to check robustness and ensure reliability under various conditions.
  - Evaluate the model on different datasets to assess generalization and adaptability to new data.
- Python 3.8 or higher
- pip (Python package installer)
```bash
python -m venv env
source env/bin/activate  # On Windows use `env\Scripts\activate`
```
```bash
git clone https://github.com/ardaxz99/Twitter-Sentiment-Analysis-A-Complete-Guide-to-Text-Classification.git
cd Twitter-Sentiment-Analysis-A-Complete-Guide-to-Text-Classification
pip install -r requirements.txt
```
- Download the Sentiment140 dataset as a CSV file named `training.1600000.processed.noemoticon.csv`.
- Copy the downloaded CSV file to the working directory of the project.
Execute the Jupyter notebook to reproduce the results:
```bash
jupyter nbconvert --to notebook --execute main.ipynb
```
- Arda Baris Basaran
This project is licensed under the MIT License - see the LICENSE.md file for details.