CS 205 Final Project - Sentiment Analysis

###Authors

Tunde Agboola
Omar Shammas
Shashank Sunkavalli

###Description

A sentiment analysis tool that outputs a rating between 1 (negative) and 5 (positive) for a given text. A random forest classifier is built using MPI and trained on the Yelp academic dataset, which was pre-processed using MapReduce. The dataset consists of ~300,000 annotated reviews and was split into a training corpus and a testing corpus in a 4:1 ratio. All the data is in the folder 'data'.
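As a rough illustration of the 4:1 split, here is a minimal sketch. The function name `split_corpus`, the shuffling step, and the fixed seed are assumptions made for this example; the README does not say how the split was actually performed.

```python
import random

def split_corpus(reviews, train_parts=4, test_parts=1, seed=0):
    """Shuffle the reviews and split them train:test in the given ratio."""
    shuffled = list(reviews)
    # Assumption: a seeded shuffle, so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder reviews give 8 for training and 2 for testing.
train, test = split_corpus(range(10))
```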

Phase 1: Preprocessing Using MapReduce

The code can be found in the mapreduce folder. MapReduce was used to extract the desired features from the training and testing data. The feature set consists of:

Unigram score
Unigram count
Bigram score
Bigram count
Positive word count
Negative word count
Character count
Count of uppercase letters
Count of lowercase letters
Count of punctuation characters
Count of alphabetic characters
Count of digits
Ratio of uppercase to lowercase letters
Ratio of alphabetic characters to total character count
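The character-level features in the list above could be computed along these lines. This is a sketch only: the function name `character_features`, the dictionary keys, and the choice to return 0.0 when a ratio's denominator is zero are assumptions, not the project's actual mapper code.

```python
import string

def character_features(text):
    """Compute the character-level features for one review text."""
    upper = sum(1 for c in text if c.isupper())
    lower = sum(1 for c in text if c.islower())
    alpha = sum(1 for c in text if c.isalpha())
    digits = sum(1 for c in text if c.isdigit())
    # Assumption: "punctuation" means the ASCII set in string.punctuation.
    punct = sum(1 for c in text if c in string.punctuation)
    total = len(text)
    return {
        "char_count": total,
        "upper_count": upper,
        "lower_count": lower,
        "punct_count": punct,
        "alpha_count": alpha,
        "digit_count": digits,
        # Assumption: fall back to 0.0 when a denominator is zero.
        "upper_lower_ratio": upper / lower if lower else 0.0,
        "alpha_total_ratio": alpha / total if total else 0.0,
    }

features = character_features("Great food!!! 10/10")
```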

The unigram and bigram scores for the text are calculated based on an index of one-gram and bi-gram scores created using the entire training corpus.
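The README does not state how an n-gram's score is defined. The sketch below assumes one plausible scheme: an n-gram's score is the mean rating of the training reviews it appears in, and a text's score is the mean score of its n-grams that are present in the index. The map and reduce steps are written as plain Python functions to show the shape of the computation; all names here are hypothetical.

```python
from collections import defaultdict

def ngrams(text, n):
    """Split on whitespace and return the n-grams as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def map_review(text, rating, n):
    """Map step: emit one (ngram, rating) pair per n-gram occurrence."""
    for gram in ngrams(text, n):
        yield gram, rating

def reduce_scores(pairs):
    """Reduce step: average the ratings emitted for each n-gram."""
    totals = defaultdict(lambda: [0, 0])  # ngram -> [rating sum, count]
    for gram, rating in pairs:
        totals[gram][0] += rating
        totals[gram][1] += 1
    return {gram: s / c for gram, (s, c) in totals.items()}

def build_index(corpus, n):
    """Build the n-gram score index from (text, rating) training pairs."""
    pairs = (p for text, rating in corpus for p in map_review(text, rating, n))
    return reduce_scores(pairs)

def score_text(text, index, n):
    """Return (mean score of indexed n-grams, number of n-grams in the text)."""
    grams = ngrams(text, n)
    known = [index[g] for g in grams if g in index]
    score = sum(known) / len(known) if known else 0.0
    return score, len(grams)

corpus = [("good food", 5), ("bad food", 1)]
unigram_index = build_index(corpus, 1)
bigram_index = build_index(corpus, 2)
```

With this toy corpus, `unigram_index["food"]` is 3.0 because "food" appears in one 5-star and one 1-star review.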

Phase 2: Training Random Forest Classifier

Code for this section can be found in train.py. Using the training data, a random forest classifier (an ensemble of decision trees) is created. For each decision tree, the training data given as input is first sampled with replacement to create the random sample that tree is trained on. The number of decision trees, the threshold for purity of the feature, and the number of features used for splitting at each node of each decision tree are configurable. Once the random forest classifier is trained, it is saved in a serialized format for making predictions.
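The training procedure described above can be sketched in a few dozen lines. This is a single-process illustration, not the MPI implementation in train.py. It assumes Gini impurity as the purity measure, a majority vote for prediction, and `pickle` as the serialized format; none of these are confirmed by the README, and every function name is hypothetical.

```python
import pickle
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_features, purity_threshold, rng):
    """Return a leaf label, or a (feature, threshold, left, right) tuple."""
    if gini(labels) <= purity_threshold:  # node is pure enough: stop
        return majority(labels)
    best = None
    # Consider only a random subset of the features at this node.
    for f in rng.sample(range(len(rows[0])), n_features):
        for t in sorted(set(r[f] for r in rows)):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no sampled feature can split this node
        return majority(labels)
    _, f, t, left, right = best
    def subtree(idx):
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          n_features, purity_threshold, rng)
    return (f, t, subtree(left), subtree(right))

def train_forest(rows, labels, n_trees=10, n_features=1,
                 purity_threshold=0.0, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample the training set with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                 n_features, purity_threshold, rng))
    return forest

def predict(forest, row):
    """Walk each tree to a leaf and return the majority vote."""
    votes = []
    for node in forest:
        while isinstance(node, tuple):
            f, t, left, right = node
            node = left if row[f] <= t else right
        votes.append(node)
    return majority(votes)

rows = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.9, 0.0]] * 5
labels = [1, 1, 5, 5] * 5
forest = train_forest(rows, labels, n_trees=15, n_features=2)
saved = pickle.dumps(forest)  # assumption: pickle as the serialized format
```

Because each tree sees a different bootstrap sample and a random feature subset at every node, the trees differ from one another, and they can be built independently, which is what makes the work easy to spread across MPI processes.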

Phase 3: Prediction

Code for this section can be found in accuracy.py. The test data is given as input to the classifier to determine its accuracy. The output contains the overall accuracy of the predictions and, for each difference of 1, 2, 3 and 4 between the predicted rating and the actual rating, the number of instances with that difference.
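The reported metrics can be computed as in the following sketch. The function name `evaluate` and the return shape are assumptions for illustration; accuracy.py may format its output differently.

```python
from collections import Counter

def evaluate(predicted, actual):
    """Return overall accuracy and the count of predictions off by 1 to 4."""
    diffs = Counter(abs(p - a) for p, a in zip(predicted, actual))
    accuracy = diffs[0] / len(actual) if actual else 0.0
    off_by = {d: diffs[d] for d in (1, 2, 3, 4)}
    return accuracy, off_by

accuracy, off_by = evaluate([5, 4, 1, 3, 5], [5, 5, 5, 3, 1])
```

Counting how far off each prediction is shows more than accuracy alone: predicting 4 for a 5-star review is a much smaller error than predicting 1.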
