Abdullah Aljebreen, Austin Spadaro, Matt Usifer
This application aims to classify the sentiment of documents in a corpus of 50,000 IMDB reviews. This corpus was originally assembled by Maas et al. It was created as our final project for CIS 5538 (Text Mining and Language Processing) with Prof Yuhung Guo at Temple University.
It achieves sentiment classification accuracy of over 85% using several different feature selection methods.
- Clone the repo
- Download the corpus, extract into src/main/resources/aclImdb
- Run these:
$ mvn compile
$ mvn exec:java -Dexec.mainClass=com.classifier.App
Running the main application will:
- Process all IMDB files and store the processed files in the
src/main/resources/processed
directory. During this step, all documents are tokenized, lemmatized, stemmed, and stripped of stopwords using different stopwords lists. - Build a vocabulary based on the processed files.
- Insert all processed files into a term document matrix.
- Iterate through all feature selection methods, selecting varying counts of features and creating copies of the pre-processed documents based on the features selected
In order to assess the effectiveness of the features that were selected for each method, we need to classify the documents. In order to perform sentiment classification, we used the LibSVM libary. Download the current package into ~/libsvm and make
to compile it. Run the following commands to in ~/libsvm to create the necessary directories for experimentation:
$ mkdir experiments experiments/train experiments/test experiments/output experiments/models
Finally, execute the following within the imdb-sentiment-classifier
base directory to run all of our experiments:
$ ./experiments.sh 2000
As shown above, experiments.sh
accepts a single argument which will be passed into LibSVM to control how many MBs of memory in it is allotted when training the SVM. If no arguments are supplied, it will take 1000 MBs of memory by default.