
Amazon_Sentiment_Analysis

Aim of the project:

  1. Enhance the ability to predict customer sentiment, which can be useful for businesses in improving their products and services.
  2. Identify key factors that contribute to positive and negative experiences.
  3. Develop practical skills in natural language processing (NLP) and machine learning, which are valuable in various data-driven fields.

About the dataset:

  1. The Amazon reviews polarity dataset is taken from Kaggle. It is constructed by labelling reviews with scores 1 and 2 as negative and scores 4 and 5 as positive; samples with score 3 are ignored.
  2. Each class has 1,800,000 training samples and 200,000 testing samples. Further sampling is done to reduce the size of the training and testing data.
  3. The CSVs contain three columns: polarity, title, and text, corresponding to the class index (1 or 2), the review heading, and the review body.
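
A minimal sketch of how the CSVs could be loaded and down-sampled with pandas; the file names, sample sizes, and random seed are illustrative assumptions, not taken from the notebook:

```python
import pandas as pd

# Assumes the Kaggle CSVs have no header row.
cols = ["polarity", "title", "text"]
train = pd.read_csv("train.csv", names=cols)
test = pd.read_csv("test.csv", names=cols)

# Down-sample each polarity class equally so the smaller dataset stays balanced.
train_small = train.groupby("polarity", group_keys=False).apply(
    lambda g: g.sample(n=50_000, random_state=42)
)
test_small = test.groupby("polarity", group_keys=False).apply(
    lambda g: g.sample(n=10_000, random_state=42)
)
```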

Data Cleaning Steps:

  1. A review_cleaning function is applied to lowercase the text and remove links, punctuation, and numbers (a sketch of such a function follows this list).
  2. Negation stop words (such as "not" and "no") were kept while removing the other stop words, as they carry meaning towards negative sentiment.
  3. Words were lemmatized instead of stemmed to retain the semantic meaning of the words.
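
A minimal sketch of what such a cleaning step could look like, assuming NLTK for stop words and lemmatization; the exact negation list and regular expressions used in the notebook may differ:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
# Keep negation words out of the stop-word list so they survive cleaning.
negations = {"no", "not", "nor", "never"}
stop_words = set(stopwords.words("english")) - negations

def review_cleaning(text):
    """Lowercase, strip links/punctuation/numbers, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                # remove links
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)                                  # remove numbers
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)
```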

Visualizations from the text

  1. The polarity of the sentiments was approximately normally distributed, but it was not standard normal (see the polarity distribution plot).

  2. Histograms of the total length of the reviews and of the total word count were plotted.

  3. Frequent 1-, 2-, and 3-grams in both positive and negative reviews were plotted (see the n-gram frequency plots; a sketch of how these could be computed follows this list).

  4. Word clouds were generated for the negative and positive reviews (see the negative and positive word cloud images).
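
A sketch of how the n-gram frequencies and word clouds could be produced, assuming scikit-learn's CountVectorizer and the wordcloud package; the sample reviews and output file name are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

def top_ngrams(texts, ngram_range=(1, 1), top_n=20):
    """Return the top_n most frequent n-grams across the given texts."""
    vec = CountVectorizer(ngram_range=ngram_range)
    counts = vec.fit_transform(texts).sum(axis=0).A1
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, counts), key=lambda x: -x[1])[:top_n]

# Toy usage; in the project this would be the cleaned negative (or positive) reviews.
sample_reviews = ["not worth the money", "battery died after one week", "would not buy again"]
print(top_ngrams(sample_reviews, ngram_range=(2, 2), top_n=5))

# Word cloud from the same text.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(sample_reviews)).to_file("negative_wordcloud.png")
```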

Use of Bag of Words and TF-IDF Vectorizers

Both Bag of Words (BOW) and TF-IDF were used to convert the texts into vectors with max_features = 10,000 and ngram_range = (1, 3). These vectors were then used to train machine learning models. The performance of various ML models on these vectors is shown below.
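
A sketch of the vectorization step with scikit-learn, matching the parameters described above; the cleaned review strings shown here are toy placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy placeholders; in the project these would be the cleaned review strings.
X_train_text = ["great product love it", "not worth the money"]
X_test_text = ["would not buy again"]

bow_vectorizer = CountVectorizer(max_features=10_000, ngram_range=(1, 3))
tfidf_vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 3))

X_train_bow = bow_vectorizer.fit_transform(X_train_text)
X_test_bow = bow_vectorizer.transform(X_test_text)

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)
```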

Logistic Regression

Trained on BOW

Trained on TF-IDF

Multinomial Naive Bayes

Trained on BOW

Trained on TF-IDF

Random Forest

Trained on BOW

Trained on TF-IDF

Logistic Regression was the best-performing model, achieving an ROC-AUC score of 0.9016 on the TF-IDF vectors.
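
Continuing from the vectorization sketch above, a rough sketch of how the logistic regression run could look; the hyperparameters and the label variables y_train / y_test are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# y_train / y_test are assumed binary labels (0 = negative, 1 = positive).
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

probs = clf.predict_proba(X_test_tfidf)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, probs))
print(classification_report(y_test, clf.predict(X_test_tfidf)))
```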

Training LSTM model

Tokenization and Padding
Max features were kept at 10,000 to maintain uniformity. Sentences were first tokenized using Keras' Tokenizer, and the distribution of the tokenized sequence lengths was checked:

Tokenized Sequences Distribution

The maximum length for padding was set to 128.
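
A sketch of this tokenization and padding step, assuming tensorflow.keras and the variable names from the earlier sketches:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FEATURES = 10_000
MAX_LEN = 128

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(X_train_text)

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=MAX_LEN)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=MAX_LEN)
```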

As for the architecture of the model, there is an embedding layer at the start, followed by two Bidirectional LSTM layers with Batch Normalization, then a Dropout layer, and finally a few Dense layers (a sketch of a matching architecture is shown below).

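A minimal sketch of an architecture matching that description; the layer sizes, dropout rate, and optimizer are assumptions, not the exact values from the notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Input, Embedding, Bidirectional, LSTM, BatchNormalization, Dropout, Dense
)

MAX_FEATURES = 10_000
MAX_LEN = 128

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(input_dim=MAX_FEATURES, output_dim=128),
    Bidirectional(LSTM(64, return_sequences=True)),
    BatchNormalization(),
    Bidirectional(LSTM(32)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),   # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```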

Performance on test data

Classification report and confusion matrix for the LSTM model (figures).

LSTM model performance comparison on training and test data

Accuracy comparison and loss comparison (figures).
