- Enhance the ability to predict customer sentiment, which helps businesses improve their products and services.
- Identify key factors that contribute to positive and negative experiences.
- Develop practical skills in natural language processing (NLP) and machine learning, which are valuable in various data-driven fields.
- The Amazon reviews polarity dataset is taken from Kaggle. It was constructed by labelling reviews with scores 1 and 2 as negative and scores 4 and 5 as positive; samples with a score of 3 are ignored.
- Each class has 1,800,000 training samples and 200,000 testing samples. Further sampling is done to reduce the size of the training and testing data.
- The CSVs contain three columns, polarity, title, and text, corresponding to the class index (1 or 2), the review heading, and the review body (see the loading sketch below).
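A minimal loading and sampling sketch using pandas; the file paths and sample sizes here are illustrative assumptions, not values from the source:

```python
import pandas as pd

# The Kaggle CSVs have no header row; assign the three columns described above.
cols = ["polarity", "title", "text"]
train = pd.read_csv("train.csv", names=cols)
test = pd.read_csv("test.csv", names=cols)

# Down-sample to keep the experiments manageable (sizes are illustrative).
train = train.sample(n=100_000, random_state=42)
test = test.sample(n=20_000, random_state=42)
```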
- A review_cleaning function is applied to lowercase the text and remove links, punctuation, and numbers.
- Negation stop words were given special care during stop-word removal, since they carry meaning for negative sentiment.
- Words were lemmatized rather than stemmed to preserve their semantic meaning; a sketch of these cleaning steps follows this list.
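Below is a minimal sketch of what these cleaning steps might look like, assuming NLTK for stop words and lemmatization; the exact negation list and regexes are assumptions:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Keep negation words out of the stop-word list so negative cues survive.
NEGATIONS = {"no", "not", "nor", "never"}
STOP_WORDS = set(stopwords.words("english")) - NEGATIONS
lemmatizer = WordNetLemmatizer()

def review_cleaning(text: str) -> str:
    """Lowercase, strip links/punctuation/numbers, drop stop words, lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)              # links
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation
    text = re.sub(r"\d+", " ", text)                                # numbers
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)
```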
- The sentiment polarity scores were approximately normally distributed, though not standard normal.
- Histograms of total review length and total word count were plotted.
- Frequent 1-, 2-, and 3-grams in both positive and negative reviews were examined (see the n-gram sketch after this list):
- Word Clouds
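A sketch of the n-gram frequency analysis using scikit-learn's CountVectorizer, assuming the train DataFrame from the loading sketch above:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range=(1, 1), top_k=20):
    """Return the top_k most frequent n-grams in a collection of texts."""
    vec = CountVectorizer(ngram_range=ngram_range)
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1          # total occurrences per n-gram
    vocab = vec.get_feature_names_out()
    order = totals.argsort()[::-1][:top_k]
    return [(vocab[i], int(totals[i])) for i in order]

# Example: most frequent bigrams in the positive reviews (polarity label 2).
positive = train.loc[train["polarity"] == 2, "text"]
print(top_ngrams(positive, ngram_range=(2, 2)))
```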
Both Bag-of-Words (BoW) and TF-IDF were used to convert the texts into vectors, with max_features set to 10,000 and an ngram_range of 1 to 3. These vectors were then used to train machine learning models. Here are the performances of the various ML models on these vectors:
Logistic Regression was the best-performing model, achieving a ROC-AUC score of 0.9016 on the TF-IDF vectors.
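A minimal sketch of this TF-IDF plus Logistic Regression pipeline; the vectorizer settings follow the description above, while the classifier hyperparameters are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# TF-IDF with the settings described above: 10,000 features, 1- to 3-grams.
tfidf = TfidfVectorizer(max_features=10_000, ngram_range=(1, 3))
X_train = tfidf.fit_transform(train["text"])
X_test = tfidf.transform(test["text"])
y_train, y_test = train["polarity"], test["polarity"]

clf = LogisticRegression(max_iter=1000)   # max_iter is an assumption
clf.fit(X_train, y_train)

# Score with the predicted probability of the positive class (label 2).
probs = clf.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, probs))
```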
Tokenization and Padding
Max features were kept at 10,000 to maintain uniformity with the earlier experiments. Sentences were first tokenized using Keras' tokenizer, and the distribution of the tokenized sequence lengths was checked:
The max length for padding was set to 128, as in the sketch below.
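A sketch of the tokenization and padding steps with Keras, using the 10,000-word vocabulary and the max length of 128 described above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_FEATURES = 10_000  # vocabulary size, matching the earlier experiments
MAX_LEN = 128          # padding length chosen from the sequence-length distribution

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(train["text"])

X_train_seq = pad_sequences(tokenizer.texts_to_sequences(train["text"]), maxlen=MAX_LEN)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test["text"]), maxlen=MAX_LEN)
```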
The model architecture starts with an embedding layer, followed by two Bidirectional LSTM layers with Batch Normalization, then a Dropout layer and a few Dense layers:
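A sketch of such an architecture in Keras; the layer types and order follow the description above, while the layer sizes and dropout rate are illustrative assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, Bidirectional, LSTM, BatchNormalization, Dropout, Dense,
)

model = Sequential([
    Embedding(input_dim=10_000, output_dim=128),
    Bidirectional(LSTM(64, return_sequences=True)),
    BatchNormalization(),
    Bidirectional(LSTM(32)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary output; labels mapped to 0/1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```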
Performance on Test Data
LSTM model performance comparison on training and test data