This repository contains the code and results of a sentiment analysis project on movie reviews. The project explores two approaches to sentiment detection: a supervised one using Bag of Words (BoW), and an unsupervised, knowledge-based one using SentiWordNet.
The files and folders are organized the following way:
├── data
│ ├── original_data
│ ├── results
│ └── synsets
├── media
├── compare_sup_unsup.ipynb
├── frequencies.py
├── split_dataset.py
├── supervised.ipynb
├── textserver.py
├── ukb_graph.gexf
├── ukb.py
└── unsupervised.ipynb
- data: Contains the dataset, results, and synset data used in the project.
- media: Stores images and visualizations used in the README and reports.
- supervised.ipynb: Main notebook for the supervised learning approach.
- unsupervised.ipynb: Main notebook for the unsupervised learning approach.
- compare_sup_unsup.ipynb: Notebook comparing the results of the supervised and unsupervised approaches.
- ukb.py: Contains the pseudo-implementation of the UKB disambiguation algorithm.
- Other scripts: Assist with data processing and model evaluation.
The data used is the Movie Reviews Corpus, downloaded via NLTK, consisting of 1000 positive and 1000 negative movie reviews, which we split into 56.25% train, 18.75% validation, and 25% test.
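A minimal sketch of how such a split can be reproduced (the actual logic lives in split_dataset.py; the `random_state` here is illustrative):

```python
import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split

nltk.download("movie_reviews")

# 1000 positive + 1000 negative reviews, labelled by category
docs = [(movie_reviews.raw(fid), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
texts, labels = zip(*docs)

# 25% test, then 25% of the remaining 75% as validation:
# 56.25% train / 18.75% val / 25% test
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)
```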
The supervised models are built with scikit-learn. For the unsupervised part, SentiWordNet is used, together with spaCy and our pseudo-implementation of UKB (in the file ukb.py).
The supervised learning approach starts by preprocessing the text. We compared three vectorization methods: a plain CountVectorizer, lemmatization followed by CountVectorizer, and a binary CountVectorizer, as sketched below.
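A sketch of the three variants (the choice of NLTK's WordNetLemmatizer here is illustrative, not necessarily the one used in the notebook):

```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Tokenize and lemmatize; requires nltk's "punkt" and "wordnet" data
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text.lower())]

vectorizers = {
    "plain":      CountVectorizer(),                           # raw counts
    "lemmatized": CountVectorizer(tokenizer=lemma_tokenizer),  # counts over lemmas
    "binary":     CountVectorizer(binary=True),                # presence/absence
}
```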
The models tried are:
- GradientBoosting
- AdaBoost
- RandomForest
- LogisticRegression
- SVC
- MLPClassifier
A grid search with cross-validation was run to find the best model and hyperparameters (by accuracy). This took over 100 minutes.
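The outline of such a search, shown here only for RandomForest with an illustrative grid (only the winning values, 1500 estimators and max depth 14, come from the actual run):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vec", CountVectorizer(binary=True)),
    ("clf", RandomForestClassifier(random_state=42)),
])
param_grid = {
    "clf__n_estimators": [500, 1000, 1500],
    "clf__max_depth": [10, 14, 18],
}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```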
The best model was a RandomForest with 1500 estimators and max depth 14, using the binary CountVectorizer (0.8613 validation accuracy). On the test partition, the obtained accuracy was 0.854.
The first step of the unsupervised approach was synset disambiguation, so that we could later use SentiWordNet. We tried three different ways of performing this task (see the sketch after the list):
- Using POS tagging and Lesk algorithm
- Using a custom implementation of UKB
- Using the most frequent synset (based on SemCor)
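A sketch of the first and third options using NLTK (the `penn_to_wn` helper is ours; the tagger, tokenizer, and WordNet data must be downloaded first):

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def penn_to_wn(tag):
    # Map Penn Treebank tags to WordNet POS tags
    for prefix, wn_pos in (("J", wn.ADJ), ("V", wn.VERB),
                           ("N", wn.NOUN), ("R", wn.ADV)):
        if tag.startswith(prefix):
            return wn_pos
    return None

tokens = word_tokenize("The movie was surprisingly good")
for word, tag in pos_tag(tokens):
    wn_pos = penn_to_wn(tag)
    if wn_pos is None:
        continue
    # Option 1: Lesk, constrained by the POS tag
    lesk_syn = lesk(tokens, word, pos=wn_pos)
    # Option 3: most frequent sense (WordNet orders synsets by SemCor frequency)
    mfs_syn = (wn.synsets(word, pos=wn_pos) or [None])[0]
    print(word, lesk_syn, mfs_syn)
```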
UKB is a state-of-the-art word sense disambiguator developed by the IXA group at the University of the Basque Country and described in several of their papers.
In our pseudo-implementation, we reproduced the different methods described in their papers using the networkx library (UKB is based on PageRank). However, in the end, we couldn't apply this method to the complete dataset, since it was too slow, and we only tested it on a subset of the data.
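A heavily simplified sketch of the idea behind ukb.py (the real UKB graph uses many more WordNet relations; here we only use hypernymy, and even building this graph is slow):

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def build_wordnet_graph():
    # Synsets as nodes, hypernym links as undirected edges
    G = nx.Graph()
    for syn in wn.all_synsets():
        for hyper in syn.hypernyms():
            G.add_edge(syn.name(), hyper.name())
    return G

def disambiguate(G, context_words):
    # Personalized PageRank seeded on the candidate synsets of the context
    candidates = {w: [s.name() for s in wn.synsets(w)] for w in context_words}
    seeds = {name: 1.0 for names in candidates.values()
             for name in names if name in G}
    rank = nx.pagerank(G, personalization=seeds or None)
    # Pick, for each word, its highest-ranked candidate synset
    return {w: max((n for n in names if n in G), key=rank.get, default=None)
            for w, names in candidates.items()}
```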
With the disambiguated synsets, we can use SentiWordNet to obtain a polarity score for each word (positive, negative, and objective). From these scores we can derive further metrics, such as the maximum score, the difference between the positive and negative scores, or the difference against a threshold. Words can also be filtered by their POS tag, e.g. keeping only nouns or adjectives.
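A sketch of how SentiWordNet is queried through NLTK:

```python
from nltk.corpus import sentiwordnet as swn
# requires nltk.download("sentiwordnet") and nltk.download("wordnet")

s = swn.senti_synset("good.a.01")
pos, neg, obj = s.pos_score(), s.neg_score(), s.obj_score()

# Derived word-level metrics
score_dif = pos - neg        # difference between pos and neg
score_max = max(pos, neg)    # strongest polarity
print(pos, neg, obj, score_dif, score_max)
```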
Different methods can also be used to aggregate word scores into a sentence score: sum, mean, max, min, L2-normalized mean... To combine the sentence scores of a review, we decided to simply use the mean.
Finally, given the score of a review, we still need a threshold above which the score is considered positive; a sketch of the full scoring pipeline follows.
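Putting the pieces together (the merge functions and the threshold are exactly what the search below sweeps over; the numbers are made up):

```python
import numpy as np

def sentence_score(word_scores, merge=np.sum):
    # Aggregate word-level scores (sum, mean, max, min, ...)
    return merge(word_scores) if len(word_scores) else 0.0

def review_score(sentence_scores):
    # Sentences are combined with the mean
    return np.mean(sentence_scores) if len(sentence_scores) else 0.0

def classify(score, threshold=0.0):
    # The final configuration uses threshold 0
    return "pos" if score > threshold else "neg"

# Example: two sentences with word-level "score dif" values
print(classify(review_score([sentence_score([0.5, -0.125]),
                             sentence_score([0.25])])))
```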
We used the "validation" partition to evaluate all the different ways of computing the scores; the whole search took 170 minutes.
The final configuration chosen uses all POS tags except verbs, the score difference (pos minus neg), merging with the sum, and threshold 0. The accuracy obtained was 0.644.
Other alternatives were also tried to improve the results (VADER, negation detection...), but they ended up performing worse.
It is pretty clear that, even though the unsupervised techniques produce predictions well above chance, the supervised approach obtains much better results.
However, with the use of bigger models, embeddings, and architectures like Transformers, these results could be further improved.