- Enhance the ability to predict customer sentiment, which helps businesses improve their products and services.
- Identify key factors that contribute to positive and negative experiences.
- Develop practical skills in natural language processing (NLP) and machine learning, which are valuable in various data-driven fields.
- The Amazon reviews polarity dataset is taken from Kaggle. It was constructed by labelling reviews with scores 1 and 2 as negative and scores 4 and 5 as positive; samples with a score of 3 are ignored.
- Each class has 1,800,000 training samples and 200,000 testing samples. Further sampling is done to reduce the size of the training and testing data.
- The CSVs contain three columns, polarity, title, and text, corresponding to the class index (1 or 2), the review heading, and the review body (see the loading sketch below).
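A minimal loading and sampling sketch using pandas; the file paths and sample sizes here are illustrative assumptions, not values from the source:

```python
import pandas as pd

# The Kaggle CSVs have no header row; assign the three columns described above.
cols = ["polarity", "title", "text"]
train = pd.read_csv("train.csv", names=cols)
test = pd.read_csv("test.csv", names=cols)

# Down-sample to keep the experiments manageable (sizes are illustrative).
train = train.sample(n=100_000, random_state=42)
test = test.sample(n=20_000, random_state=42)
```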
- A review_cleaning function is applied to lowercase the text and remove links, punctuation, and numbers.
- Negation stop words were given special care during stop-word removal, since they carry meaning for negative sentiment.
- Words were lemmatized rather than stemmed to preserve their semantic meaning; a sketch of these cleaning steps follows this list.
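Below is a minimal sketch of what these cleaning steps might look like, assuming NLTK for stop words and lemmatization; the exact negation list and regexes are assumptions:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Keep negation words out of the stop-word list so negative cues survive.
NEGATIONS = {"no", "not", "nor", "never"}
STOP_WORDS = set(stopwords.words("english")) - NEGATIONS
lemmatizer = WordNetLemmatizer()

def review_cleaning(text: str) -> str:
    """Lowercase, strip links/punctuation/numbers, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)              # links
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation
    text = re.sub(r"\d+", " ", text)                                # numbers
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)
```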
- The sentiment polarity scores were approximately normally distributed, though not standard normal.
- Histograms of total review length and total word count were plotted.
- Frequent 1-, 2-, and 3-grams in both positive and negative reviews were examined (see the n-gram sketch after this list):
- Word Clouds
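A sketch of the n-gram frequency analysis using scikit-learn's CountVectorizer, assuming the train DataFrame from the loading sketch above:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range=(1, 1), top_k=20):
    """Return the top_k most frequent n-grams in a collection of texts."""
    vec = CountVectorizer(ngram_range=ngram_range)
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1          # total occurrences per n-gram
    vocab = vec.get_feature_names_out()
    order = totals.argsort()[::-1][:top_k]
    return [(vocab[i], int(totals[i])) for i in order]

# Example: most frequent bigrams in the positive reviews (polarity label 2).
positive = train.loc[train["polarity"] == 2, "text"]
print(top_ngrams(positive, ngram_range=(2, 2)))
```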
Both Bag-of-Words (BoW) and TF-IDF were used to convert the texts into vectors, with max_features set to 10,000 and an ngram_range of 1 to 3. These vectors were then used to train machine learning models. Here are the performances of the various ML models on these vectors:
Logistic Regression was the best-performing model, achieving a ROC-AUC score of 0.9016 on the TF-IDF vectors.
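A minimal sketch of this TF-IDF plus Logistic Regression pipeline; the vectorizer settings follow the description above, while the classifier hyperparameters are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# TF-IDF with the settings described above: 10,000 features, 1- to 3-grams.
tfidf = TfidfVectorizer(max_features=10_000, ngram_range=(1, 3))
X_train = tfidf.fit_transform(train["text"])
X_test = tfidf.transform(test["text"])
y_train, y_test = train["polarity"], test["polarity"]

clf = LogisticRegression(max_iter=1000)   # max_iter is an assumption
clf.fit(X_train, y_train)

# Score with the predicted probability of the positive class (label 2).
probs = clf.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, probs))
```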
Tokenization and Padding
Max features were kept at 10,000 to maintain uniformity with the earlier experiments. Sentences were first tokenized using Keras' tokenizer, and the distribution of the tokenized sequence lengths was checked:
The max length for padding was set to 128, as in the sketch below.
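A sketch of the tokenization and padding steps with Keras, using the 10,000-word vocabulary and the max length of 128 described above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FEATURES = 10_000  # vocabulary size, matching the earlier experiments
MAX_LEN = 128          # padding length chosen from the sequence-length distribution

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train["text"])

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train["text"]), maxlen=MAX_LEN)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test["text"]), maxlen=MAX_LEN)
```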
The model architecture starts with an embedding layer, followed by two Bidirectional LSTM layers with Batch Normalization, then a Dropout layer and a few Dense layers:
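A sketch of such an architecture in Keras; the layer types and order follow the description above, while the layer sizes and dropout rate are illustrative assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, Bidirectional, LSTM, BatchNormalization, Dropout, Dense,
)

model = Sequential([
    Embedding(input_dim=10_000, output_dim=128),
    Bidirectional(LSTM(64, return_sequences=True)),
    BatchNormalization(),
    Bidirectional(LSTM(32)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary output; labels mapped to 0/1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```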
Performance on Test Data
LSTM model performance comparison on training and test data