Supervised-learning-using-Word-Vectors

Data Set and Problem Statement

I analyzed a large Kaggle dataset of IMDB movie reviews to predict whether a review is positive or negative. The data is split into training and testing sets. I fit a Random Forest classifier on "Bag of Words" features, and the fitted model is then used to predict the sentiment label of the reviews in the test set. Through this project I came to understand how learning word vectors can play an important role in supervised-learning predictions on a text corpus.

Methodology

  • Pre-processing : Cleaning, Lowercase, Tokenization, Stopwords removal
  • Feature Extraction : Bag of Words - CountVectorizer
  • Classification : Random Forest
  • Dimension Reduction : Feature importances
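
The first two steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: the stop-word list, the helper names (`preprocess`, `build_vocabulary`, `vectorize`) and the two sample reviews are all assumptions made for the example, and the real pipeline relies on CountVectorizer rather than hand-written counting.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "in"}

def preprocess(review):
    """Clean, lowercase, tokenize, and remove stop words from one review."""
    text = re.sub(r"<[^>]+>", " ", review)   # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lowercase and split on whitespace
    return [t for t in tokens if t not in STOPWORDS]

def build_vocabulary(token_lists, max_features):
    """Keep the max_features most frequent terms (ties broken alphabetically)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]

def vectorize(tokens, vocabulary):
    """Turn one token list into a bag-of-words count vector."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Made-up sample reviews, for illustration only.
reviews = [
    "This movie was great.<br />Great cast!",
    "The plot was dull and the cast was dull.",
]
tokenized = [preprocess(r) for r in reviews]
vocab = build_vocabulary(tokenized, max_features=5)
vectors = [vectorize(t, vocab) for t in tokenized]
```

Each review becomes a fixed-length vector of term counts, which is the form a Random Forest classifier expects as input. The `max_features` cap plays the same role as the 25000-feature vocabulary limit mentioned in the results.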

Result

Testing the model on the dataset with 25000 features yields about 85.3% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
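
For readers unfamiliar with these metrics, the sketch below shows how accuracy, precision, recall, and F1-score follow from the four cells of a binary confusion matrix. The labels and predictions are made-up toy values, not the project's results, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only (1 = positive review, 0 = negative review).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

Precision and recall here are computed for the positive class; a full classification report would repeat the calculation with the negative class treated as the target.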

After reducing the dimensionality from 25000 to 20000 features, the model yields about 86% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
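
One plausible way to carry out the feature reduction is to rank features by the importance scores of the fitted Random Forest and keep only the top k columns. The sketch below assumes such a list of scores is already available; the scores, the helper names, and the tiny matrix are invented for illustration and do not reproduce the project's code.

```python
def top_k_features(importances, k):
    """Return the indices of the k most important features, in column order."""
    ranked = sorted(range(len(importances)), key=lambda i: (-importances[i], i))
    return sorted(ranked[:k])

def project(rows, keep):
    """Keep only the selected columns of every feature vector."""
    return [[row[i] for i in keep] for row in rows]

# Made-up importance scores, one per bag-of-words feature (illustrative only).
importances = [0.05, 0.30, 0.00, 0.40, 0.25]
rows = [
    [1, 0, 2, 1, 0],
    [1, 2, 0, 0, 1],
]
keep = top_k_features(importances, k=3)
reduced = project(rows, keep)
```

The classifier would then be retrained on `reduced`. In the project the same idea would apply with k = 20000 out of 25000 features.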

How to run

Run the project by submitting the script to Spark, passing the labeled training data file as an argument:

spark-submit ClassificationRandomForest.py labeledTrainData.tsv