Simple spam classifier using Support Vector Machine

This is a Python implementation of the spam classification exercise taught in Andrew Ng's well-known Machine Learning course.

The training and test sets are provided in spamTrain.mat and spamTest.mat.

The full dataset is available from the SpamAssassin Public Corpus.

Email preprocessing pipeline

Convert the email to lower case
Remove HTML tags
Replace URLs with 'httpaddr', email addresses with 'emailaddr', numbers with 'number', and $ with 'dollar'
Remove non-alphanumeric characters
Tokenize
Apply Porter stemming
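
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in processEmail.py: the function name `preprocess_email`, the regular expressions, and the `stem` parameter are assumptions made for the example. Stemming is left as a pluggable callable that defaults to the identity function, so the sketch runs without extra dependencies.

```python
import re


def preprocess_email(text, stem=lambda word: word):
    """Return the list of normalized tokens for one email body.

    `stem` is a hypothetical hook: pass a Porter stemmer's stem function
    to complete the last step of the pipeline. The default does nothing.
    """
    text = text.lower()                                # lower case
    text = re.sub(r"<[^<>]+>", " ", text)              # remove HTML tags
    text = re.sub(r"https?://\S+", "httpaddr", text)   # URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)       # email addresses
    text = re.sub(r"\d+", "number", text)              # numbers
    text = re.sub(r"[$]+", "dollar", text)             # dollar signs
    text = re.sub(r"[^a-z0-9]+", " ", text)            # non-alphanumeric characters
    return [stem(token) for token in text.split()]     # tokenize, then stem


if __name__ == "__main__":
    sample = "<b>Visit</b> http://example.com NOW for $100! Mail me@x.org"
    print(preprocess_email(sample))
```

If NLTK is installed, `nltk.stem.PorterStemmer().stem` could be passed as `stem`. Whether its output matches the stems in vocab.txt exactly is not something this sketch verifies.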

Email features

Each email is represented by an N-vector, where N is the number of words in the dictionary (provided in vocab.txt). The dictionary is built by collecting the most frequent words that appear in the dataset. The i-th entry of the feature vector indicates whether the i-th word in the dictionary appears in the preprocessed email.
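
As a rough sketch of this mapping, the following builds a binary feature vector from a list of preprocessed tokens. The helper names `load_vocab` and `email_features` are hypothetical, and the layout of vocab.txt is an assumption: the loader takes the last whitespace-separated field of each line as the word, so it accepts both a bare word and an "index word" pair per line.

```python
def load_vocab(lines):
    """Map each dictionary word to its 0-based position.

    Assumes one word per line, optionally preceded by an index column.
    """
    vocab = {}
    for line in lines:
        fields = line.split()
        if fields:
            vocab[fields[-1]] = len(vocab)
    return vocab


def email_features(tokens, vocab):
    """Return a list of 0/1 flags, one per dictionary word."""
    features = [0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:      # words outside the dictionary are ignored
            features[index] = 1    # presence only, repeated words do not count twice
    return features


if __name__ == "__main__":
    vocab = load_vocab(["1\tclick", "2\tdollar", "3\tprice"])
    print(email_features(["click", "here", "click", "price"], vocab))
```

To use it with the real dictionary, the lines could come from `open("vocab.txt")`, assuming the file follows the layout described above.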

Training

The email features are fed to a linear SVM classifier with the C parameter set to 0.1. The classifier then has to distinguish between two classes: 0 (non-spam) and 1 (spam).
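
A minimal sketch of this training step, assuming scikit-learn and NumPy are available. It is not necessarily how simple_spam_classifier.py does it: the function names `train_spam_svm` and `accuracy` are hypothetical, and `SVC(kernel="linear")` is one of several ways to fit a linear SVM.

```python
import numpy as np
from sklearn.svm import SVC


def train_spam_svm(X, y, C=0.1):
    """Fit a linear SVM on binary email features (labels: 0 non-spam, 1 spam)."""
    classifier = SVC(kernel="linear", C=C)
    classifier.fit(X, np.ravel(y))   # flatten column-vector labels to 1-D
    return classifier


def accuracy(classifier, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return float(np.mean(classifier.predict(X) == np.ravel(y)))


if __name__ == "__main__":
    # Tiny synthetic data set (not the real one): the first feature decides the class.
    X = np.tile(np.array([[1, 0], [1, 1], [0, 1], [0, 0]]), (20, 1))
    y = np.tile(np.array([[1], [1], [0], [0]]), (20, 1))
    model = train_spam_svm(X, y)
    print("train accuracy:", accuracy(model, X, y))
```

The matrices in spamTrain.mat and spamTest.mat could be read with `scipy.io.loadmat`, which returns a dictionary of arrays. The variable names stored inside the files are not stated here, so inspect the returned keys before indexing.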

Result

Train accuracy: 0.998250

Test accuracy: 0.989000

Top 15 predictors for the spam classifier

our click remov guarante visit basenumb dollar will price pleas most nbsp lo ga hour
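
One plausible way to obtain such a list from a linear model is to rank the dictionary words by their learned weights, since a larger positive weight pushes the decision toward class 1 (spam). The sketch below assumes that ranking rule. The function name `top_predictors` and the sample weights are made up for illustration.

```python
def top_predictors(weights, vocab_words, k=15):
    """Return the k words with the largest weights, highest first.

    `weights` and `vocab_words` are parallel sequences: weights[i] is the
    linear model's coefficient for vocab_words[i].
    """
    ranked = sorted(zip(weights, vocab_words), key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:k]]


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    print(top_predictors([0.1, 0.5, -0.2, 0.3], ["will", "our", "hour", "click"], k=2))
```

With a fitted scikit-learn linear SVM, the weights would typically come from the first row of the model's `coef_` attribute.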

