Twitter Sentiment Analysis

This project performs sentiment analysis on tweets using the Sentiment140 dataset, which contains 1.6 million tweets labeled as positive or negative. The main objective is to train a machine learning model to predict whether a given tweet expresses positive or negative sentiment.

Note: This project was created as part of my learning journey in natural language processing and machine learning.

Project Overview

Sentiment analysis is a Natural Language Processing (NLP) technique used to identify and extract subjective information from text. This project aims to classify the sentiment of tweets as positive or negative, which can help businesses, researchers, or social media managers analyze public opinion on various topics or products.

Features

Dataset: 1.6 million tweets from the Sentiment140 dataset.
Preprocessing: Data cleaning, tokenization, stopword removal, stemming, and text vectorization.
Model Training: A logistic regression model was trained to classify tweet sentiment.
Prediction: The model predicts the sentiment of unseen tweets as either positive or negative.
Evaluation: Model performance was evaluated using accuracy.

Dataset

The Sentiment140 dataset includes the following fields:

target: Sentiment of the tweet (0 = negative, 4 = positive)
id: Unique ID of the tweet
date: Date when the tweet was created
user: Username of the person who tweeted
text: The content of the tweet
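
As a rough sketch, the fields above can be loaded with pandas as shown below. To my knowledge the Kaggle CSV has no header row, is Latin-1 encoded, and carries one extra column (flag, the query string) that is not listed above. The function name load_sentiment140, the added label column, and the two sample rows are illustrative assumptions, not code taken from this repository.

```python
import io

import pandas as pd

# Column order of the Kaggle CSV (assumed; the file itself has no header row).
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def load_sentiment140(source):
    """Read the CSV from a path or file-like object and add a 0/1 label column."""
    df = pd.read_csv(source, names=COLUMNS, encoding="latin-1")
    # Remap the original encoding (0 = negative, 4 = positive) to 0/1.
    df["label"] = df["target"].astype(int).map({0: 0, 4: 1})
    return df


# Two made-up rows standing in for the real 1.6 million-row file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling sad today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","what a great day"\n'
)
df = load_sentiment140(sample)
print(df[["label", "text"]])
```

For the real dataset, pass the path of the downloaded CSV file instead of the in-memory sample.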

Project Workflow

Data Collection:
- The Sentiment140 dataset was downloaded from Kaggle.
Data Preprocessing:
- Removed unwanted characters (e.g., URLs, mentions, hashtags).
- Tokenized the text (splitting it into individual words).
- Removed stopwords (common words like "the", "is", etc.).
- Applied techniques like stemming/lemmatization to reduce words to their base forms.
- Vectorized the text using TF-IDF or Word Embeddings (depending on the model).
Important Note: The stemming process is time-consuming. In Google Colab, it takes around 50 minutes, while locally it can take up to 150 minutes.
Model Training:
- Trained a logistic regression model to classify tweet sentiment.
Model Evaluation:
- Split the dataset into training and testing sets (80% training, 20% testing).
- Evaluated model performance using accuracy.
Prediction:
- The trained model was used to predict the sentiment of new, unseen tweets.
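
The workflow above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the exact notebook code: it uses TF-IDF (one of the two vectorization options mentioned), NLTK's PorterStemmer, and scikit-learn's built-in English stopword list as a stand-in for NLTK's list so that no corpus download is needed. The helper clean_tweet, the regular expressions, the ten sample tweets, and the hyperparameters are all made up for the example.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()


def clean_tweet(text):
    """Remove URLs, mentions and hashtags, then lowercase, drop stopwords and stem."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()      # keep letters only
    tokens = [stemmer.stem(w) for w in text.split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


# Made-up sample data (1 = positive, 0 = negative); the real project uses 1.6 million tweets.
tweets = [
    "I love this phone, best purchase ever! http://example.com",
    "@sam what a wonderful morning #blessed",
    "So happy with the great service today",
    "This movie was amazing and fun",
    "Feeling fantastic, thanks everyone!",
    "I hate waiting in line, worst day ever",
    "@airline lost my luggage again, terrible",
    "So sad and tired of this awful weather",
    "This update is horrible and slow http://example.com/bug",
    "Feeling sick, everything is bad #fail",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

cleaned = [clean_tweet(t) for t in tweets]

# 80% training / 20% testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test_vec))
print(f"Test accuracy: {accuracy:.2f}")
```

With only ten tweets the printed accuracy is meaningless; the sample is there only to show the order of the steps. Fitting the vectorizer on the training split alone keeps information from the test tweets out of the features.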

Results

The model achieved the following performance on the test set:

Accuracy: 80%

In other words, the model labels roughly four out of five held-out tweets correctly using the tweet text alone.

Installation and Usage

Requirements

Python 3.x
Jupyter Notebook (optional, for running the project interactively)
Libraries:
- pandas
- numpy
- scikit-learn
- nltk (for text preprocessing)

Steps to Run the Project

Clone the repository:

git clone https://github.com/iflal/twitter_sentiment_analysis.git
cd twitter_sentiment_analysis
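
Once the repository is cloned, the saved model can in principle be reused without retraining. The sketch below assumes that sentiment_analysis_model.sav and vectorizer.sav (the two .sav files in the repository) were written with Python's pickle module and hold a fitted scikit-learn classifier and vectorizer; that is a guess, and if joblib was used instead, joblib.load would be the counterpart. The function names load_artifacts and predict_sentiment are hypothetical.

```python
import pickle


def load_artifacts(model_path="sentiment_analysis_model.sav",
                   vectorizer_path="vectorizer.sav"):
    """Load a pickled classifier and vectorizer (assumed format, see note above)."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_sentiment(tweets, model, vectorizer):
    """Return 'positive' or 'negative' for each already-preprocessed tweet."""
    features = vectorizer.transform(tweets)
    # Treat both 1 and the dataset's original 4 as the positive class.
    return ["positive" if label in (1, 4) else "negative"
            for label in model.predict(features)]


# Example usage, run from the repository root:
# model, vectorizer = load_artifacts()
# print(predict_sentiment(["love thi phone"], model, vectorizer))
```

New tweets should go through the same cleaning and stemming as the training data before being passed in, otherwise the vectorizer will not recognise many of the words. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.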
