Supervised-learning-using-Word-Vectors

Data Set and Problem Statement

I analyzed a large Kaggle dataset of IMDB movie reviews to predict whether a review is positive or negative. The data is split into training and testing sets. I fit a Random Forest classifier on "Bag of Words" features and then used the fitted model to predict the sentiment label of the reviews in the test set. Through this project I learned how word vectors can play an important role in supervised prediction on a text corpus.

Methodology

Pre-processing : Cleaning, Lowercase, Tokenization, Stopwords removal
Feature Extraction : Bag of Words - CountVectorizer
Classification : Random Forest
Dimension Reduction : Feature importances
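
The first two steps above can be sketched in plain Python. This is only an illustration of the idea, not the project's code: the stop-word list, the sample reviews, and the helper names `preprocess`, `build_vocabulary`, and `vectorize` are all assumptions made for the example, and the project itself uses CountVectorizer rather than a hand-rolled counter.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the project's actual list is not shown).
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "this", "it", "to"}

def preprocess(review):
    """Clean, lowercase, tokenize, and drop stop words from one review."""
    text = re.sub(r"<[^>]+>", " ", review)   # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lowercase, split on whitespace
    return [t for t in tokens if t not in STOPWORDS]

def build_vocabulary(token_lists, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:max_features])}

def vectorize(tokens, vocabulary):
    """Turn one token list into a bag-of-words count vector."""
    vec = [0] * len(vocabulary)
    for t in tokens:
        idx = vocabulary.get(t)
        if idx is not None:                  # tokens outside the vocabulary are ignored
            vec[idx] += 1
    return vec

# Made-up sample reviews, for illustration only.
reviews = [
    "This movie was great.<br />Great acting!",
    "The plot was dull and the acting was bad.",
]
token_lists = [preprocess(r) for r in reviews]
vocabulary = build_vocabulary(token_lists, max_features=5)
matrix = [vectorize(tokens, vocabulary) for tokens in token_lists]
```

Each row of `matrix` is one review and each column is one vocabulary word. A count matrix of this shape is what the classifier is trained on.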

Result

Testing the model on the dataset with 25000 features yields about 85.3% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
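
For reference, accuracy, precision, recall, and F1-score can all be derived from the four cells of a binary confusion matrix. The sketch below shows that arithmetic. The function name `binary_metrics` is hypothetical and the counts are made up for illustration; they are not the project's results.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute standard metrics for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many are found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, for illustration only.
metrics = binary_metrics(tn=30, fp=20, fn=10, tp=40)
```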

After using feature importances to reduce the dimensionality from 25000 to 20000 features, the model yields about 86% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
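
One way to carry out this kind of reduction is to keep only the columns with the highest feature importances and drop the rest. The sketch below assumes a top-k selection rule. The README does not say whether the project uses top-k or an importance threshold, and the helper names and importance values here are invented for the example.

```python
def top_k_features(importances, k):
    """Return the indices of the k most important features, in column order."""
    ranked = sorted(range(len(importances)), key=lambda i: (-importances[i], i))
    return sorted(ranked[:k])

def reduce_matrix(matrix, keep):
    """Keep only the selected columns of a row-major feature matrix."""
    return [[row[i] for i in keep] for row in matrix]

# Made-up importances and counts, for illustration only.
importances = [0.05, 0.40, 0.00, 0.30, 0.25]
matrix = [[1, 2, 0, 0, 1], [1, 0, 1, 1, 0]]

keep = top_k_features(importances, k=3)
reduced = reduce_matrix(matrix, keep)
```

The classifier is then retrained on the reduced matrix, which here has 3 columns instead of 5.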

How to run

Run the project by submitting the script to Spark:

spark-submit ClassificationRandomForest.py labeledTrainData.tsv
