Classifies IMDB documents using different feature selection methods

mattusifer/imdb-sentiment-classifier

imdb-sentiment-classifier

Creators

Abdullah Aljebreen, Austin Spadaro, Matt Usifer

This application classifies the sentiment of documents in a corpus of 50,000 IMDB reviews, originally assembled by Maas et al. We built it as our final project for CIS 5538 (Text Mining and Language Processing) with Prof Yuhung Guo at Temple University.

It achieves sentiment classification accuracy of over 85% using several different feature selection methods.

Build & Run

  1. Clone the repo
  2. Download the corpus, extract into src/main/resources/aclImdb
  3. Run these:
$ mvn compile
$ mvn exec:java -Dexec.mainClass=com.classifier.App

Running the main application will:

  1. Process all IMDB files and store the processed files in the src/main/resources/processed directory. During this step, all documents are tokenized, lemmatized, stemmed, and stripped of stopwords using different stopwords lists.
  2. Build a vocabulary based on the processed files.
  3. Insert all processed files into a term document matrix.
  4. Iterate through all feature selection methods, selecting varying numbers of features and creating copies of the pre-processed documents that keep only the selected features.
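
As a rough illustration of steps 1 through 4, here is a minimal sketch, not the project's actual code. The class and method names are hypothetical, stemming and lemmatization are left out, and ranking terms by document frequency stands in for whichever feature selection methods the application really uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and the selection criterion are assumptions.
public class TermDocumentSketch {

    /** Lowercases the text, splits on non-letters, and drops stopwords. */
    public static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Maps each distinct term to a column index, in order of first appearance. */
    public static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    /** Rows are documents, columns are vocabulary terms, cells are raw counts. */
    public static int[][] termDocumentMatrix(List<List<String>> docs,
                                             Map<String, Integer> vocab) {
        int[][] matrix = new int[docs.size()][vocab.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                Integer col = vocab.get(term);
                if (col != null) {
                    matrix[d][col]++;
                }
            }
        }
        return matrix;
    }

    /** Returns the column indices of the k terms that occur in the most documents. */
    public static List<Integer> topKByDocumentFrequency(int[][] matrix, int k) {
        int cols = matrix.length == 0 ? 0 : matrix[0].length;
        int[] df = new int[cols];
        for (int[] row : matrix) {
            for (int c = 0; c < cols; c++) {
                if (row[c] > 0) {
                    df[c]++;
                }
            }
        }
        List<Integer> indices = new ArrayList<>();
        for (int c = 0; c < cols; c++) {
            indices.add(c);
        }
        // List.sort is stable, so ties keep vocabulary order.
        indices.sort((a, b) -> Integer.compare(df[b], df[a]));
        return new ArrayList<>(indices.subList(0, Math.min(k, cols)));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "a", "was", "this"));
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("This movie was great, great fun", stopwords));
        docs.add(tokenize("The movie was a boring mess", stopwords));

        Map<String, Integer> vocab = buildVocabulary(docs);
        int[][] matrix = termDocumentMatrix(docs, vocab);

        System.out.println(vocab);
        System.out.println(Arrays.deepToString(matrix));
        System.out.println(topKByDocumentFrequency(matrix, 2));
    }
}
```

A dense `int[][]` keeps the sketch short. For the full 50,000-review corpus a sparse representation would likely be needed.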

To assess how effective the features selected by each method are, we need to classify the documents. For sentiment classification, we used the LibSVM library. Download the current package into ~/libsvm and run make to compile it. Then run the following command in ~/libsvm to create the directories needed for experimentation:

$ mkdir experiments experiments/train experiments/test experiments/output experiments/models

Finally, execute the following within the imdb-sentiment-classifier base directory to run all of our experiments:

$ ./experiments.sh 2000

As shown above, experiments.sh accepts a single argument, which is passed to LibSVM to control how many MB of memory it is allotted when training the SVM. If no argument is supplied, it defaults to 1000 MB.
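
LibSVM reads training and test data as sparse text lines of the form `<label> <index>:<value> ...`, with feature indices in ascending order. The sketch below shows one way a row of term counts could be written in that format. The class name, the method name, and the mapping of positive and negative reviews to +1 and -1 are assumptions for illustration, not taken from this project's code.

```java
// Hypothetical sketch: class, method, and label mapping are assumptions.
public class LibSvmLineSketch {

    /**
     * Formats one document as "<label> <index>:<value> ...".
     * Indices are 1-based and ascending; zero-valued features are omitted.
     */
    public static String toLibSvmLine(int label, int[] featureCounts) {
        StringBuilder line = new StringBuilder();
        line.append(label > 0 ? "+1" : "-1");
        for (int i = 0; i < featureCounts.length; i++) {
            if (featureCounts[i] != 0) {
                line.append(' ').append(i + 1).append(':').append(featureCounts[i]);
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A positive review with counts for features 1 and 3.
        System.out.println(toLibSvmLine(1, new int[] {1, 0, 2}));
        // A negative review with a count for feature 2 only.
        System.out.println(toLibSvmLine(-1, new int[] {0, 3, 0}));
    }
}
```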
