GitHub - mattusifer/imdb-sentiment-classifier: Classifies IMDB documents using different feature selection methods

imdb-sentiment-classifier

Creators

Abdullah Aljebreen, Austin Spadaro, Matt Usifer

This application aims to classify the sentiment of documents in a corpus of 50,000 IMDB reviews. This corpus was originally assembled by Maas et al. It was created as our final project for CIS 5538 (Text Mining and Language Processing) with Prof Yuhung Guo at Temple University.

It achieves sentiment classification accuracy of over 85% using several different feature selection methods.

Build & Run

Clone the repo
Download the corpus, extract into src/main/resources/aclImdb
Run these:

$ mvn compile
$ mvn exec:java -Dexec.mainClass=com.classifier.App

Running the main application will:

Process all IMDB files and store the processed files in the src/main/resources/processed directory. During this step, all documents are tokenized, lemmatized, stemmed, and stripped of stopwords using different stopwords lists.
Build a vocabulary based on the processed files.
Insert all processed files into a term document matrix.
Iterate through all feature selection methods, selecting varying counts of features and creating copies of the pre-processed documents based on the features selected

In order to assess the effectiveness of the features that were selected for each method, we need to classify the documents. In order to perform sentiment classification, we used the LibSVM libary. Download the current package into ~/libsvm and make to compile it. Run the following commands to in ~/libsvm to create the necessary directories for experimentation:

$ mkdir experiments experiments/train experiments/test experiments/output experiments/models

Finally, execute the following within the imdb-sentiment-classifier base directory to run all of our experiments:

$ ./experiments.sh 2000

As shown above, experiments.sh accepts a single argument which will be passed into LibSVM to control how many MBs of memory in it is allotted when training the SVM. If no arguments are supplied, it will take 1000 MBs of memory by default.

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
src		src
.gitignore		.gitignore
README.md		README.md
experiments.sh		experiments.sh
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

imdb-sentiment-classifier

Creators

Build & Run

About

Releases

Packages

Contributors 2

Languages

mattusifer/imdb-sentiment-classifier

Folders and files

Latest commit

History

Repository files navigation

imdb-sentiment-classifier

Creators

Build & Run

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages