spam-classifier

A machine-learning-based spam classifier.

The dataset is composed of:

Training set: 900 ham, 2700 spam
Test set: 300 ham, 900 spam
Cross-validation set: 300 ham, 900 spam

First, a list of the words present in the training set is built. We then count how often each of these words occurs in every set. We take the 1000 most frequently occurring words and create a feature vector from them, which is stored in a CSV file.
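The vocabulary and feature-vector step can be sketched roughly as follows. This is a minimal illustration, not the code in `featureExtract.py`: the function names are hypothetical, and it assumes that each feature is a raw word count and that the label is written as the last CSV column, neither of which is stated above.

```python
import csv
from collections import Counter

def build_vocabulary(train_docs, size=1000):
    """Return the `size` most frequent words across the training documents."""
    counts = Counter()
    for words in train_docs:
        counts.update(words)
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(words, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

def write_features(path, docs, labels, vocabulary):
    """Write one row per message: word counts, then the label (assumed layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for words, label in zip(docs, labels):
            writer.writerow(to_feature_vector(words, vocabulary) + [label])
```

Each document here is a list of already-tokenized words. The vocabulary is built from the training documents only and then reused to vectorize the other sets.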

Note that, for the words:

Stop words: the most commonly used words (a, the, etc.) have been omitted.
Stemming: words have been reduced to their stems.
Non-alphabet characters have been removed.
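The three preprocessing notes above could look something like this in Python. It is only a sketch under stated assumptions: the stop-word list is a tiny hypothetical one, `crude_stem` is a naive suffix stripper standing in for whatever stemmer the project actually uses, and lowercasing is an added assumption.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the project's real list is not shown.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

# Crude suffix stripping, used only as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop non-alphabet characters, remove stop words, then stem."""
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return [crude_stem(w) for w in letters_only.split() if w not in STOP_WORDS]
```

For example, `preprocess("The offers are WINNING 100% free!!!")` yields `["offer", "winn", "free"]`. A proper stemming algorithm would produce cleaner stems than this suffix stripper.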

Logistic regression is then applied, using the 1000 most frequently used words as features.
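As an illustration of the model, here is a small regularized logistic regression in plain Python. It is a sketch, not a port of the repository's `.m` files: the optimizer (batch gradient descent), the learning rate `alpha`, the iteration count and the 0.5 decision threshold are all assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """theta[0] is the bias; theta[1:] pairs with the features in x."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def train(X, y, lam=1.0, alpha=0.1, iterations=500):
    """Batch gradient descent on the L2-regularized logistic cost."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [predict_proba(theta, x) - label for x, label in zip(X, y)]
        grad = [sum(errors) / m]  # the bias term is not regularized
        for j in range(n):
            g = sum(e * x[j] for e, x in zip(errors, X)) / m
            grad.append(g + lam / m * theta[j + 1])
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x, threshold=0.5):
    """Return 1 (spam) when the predicted probability reaches the threshold."""
    return 1 if predict_proba(theta, x) >= threshold else 0
```

With 1000 features and thousands of messages, a vectorized implementation would be much faster. The loops are kept explicit here so the gradient and the regularization term are easy to read.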

Next, we calculate the prediction accuracy on the cross-validation set: the number of cross-validation examples that were predicted correctly, divided by the total number of examples in that set (m). We also calculate the precision, recall and F1 score for the cross-validation set.
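The four metrics can be computed from predicted and actual labels as sketched below. The function name is hypothetical, and treating spam (label 1) as the positive class is an assumption.

```python
def evaluate(predicted, actual, positive=1):
    """Return accuracy, precision, recall and F1 for binary predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    correct = sum(1 for p, a in pairs if p == a)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because the sets contain three times as many spam messages as ham messages, accuracy alone can look good for a classifier that labels everything as spam. Precision, recall and F1 give a more balanced picture.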

We then plot the learning curves (error function vs. number of training examples) for the training set and the cross-validation set. Finally, we calculate the prediction accuracy, precision, recall and F1 score on the test set for different values of lambda (0.01, 0.1, 1, 10), using 100 and 1000 features, and pick the best value of lambda for the classifier.
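The learning-curve and lambda-selection loops can be outlined generically, as below. This is a sketch with hypothetical names: `train_fn`, `error_fn`, `train_with_lambda` and `score_fn` are placeholders for the project's own training and evaluation routines, and the rule for picking the best lambda (highest score) is an assumption, since the text does not say which metric decided it.

```python
def learning_curve(train_fn, error_fn, X_train, y_train, X_cv, y_cv, sizes):
    """For each training-set size, fit a model and record train and CV error."""
    curve = []
    for m in sizes:
        model = train_fn(X_train[:m], y_train[:m])
        train_error = error_fn(model, X_train[:m], y_train[:m])
        cv_error = error_fn(model, X_cv, y_cv)
        curve.append((m, train_error, cv_error))
    return curve

def sweep_lambda(train_with_lambda, score_fn, lambdas=(0.01, 0.1, 1, 10)):
    """Train once per lambda and return (best_lambda, scores_by_lambda)."""
    scores = {lam: score_fn(train_with_lambda(lam)) for lam in lambdas}
    best = max(scores, key=scores.get)  # assumes a higher score is better
    return best, scores
```

The `(m, train_error, cv_error)` tuples can be passed to any plotting tool. `score_fn` would evaluate the trained model on whichever set is used for the comparison (the text above uses the test set), for example by returning its F1 score.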

Repository files:

Dataset
Learning Curves Images (100 features)
.gitignore
1000words.txt
README.md
Summary .pages
costfunction_test.m
costfunction_test2.m
costprediciton_train.m
countWords.py
cv_testPrediction.m
featureExtract.py
learningCurves.m
main.m
mycsvcv.csv
mycsvtest.csv
mycsvtrain.csv
precisionandrecall.m
process-data.py
regularizedCostFunction.m
sigmoid.m
train.m
trainClassifier.m
wordList.py

dhruvmullick/spam-classifier
