Here I build a simple (but effective) spam filter for emails using a naive Bayes approach on the Enron-Spam dataset. So far it only scans the subject line of each email. Even without further fine-tuning, this approach detects 100% of the spam in the test dataset and misclassifies only 4% of the "ham" messages.
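For context, the essence of the classification step is: multiply the prior probability of each category with the smoothed per-word probabilities learned from the training data, then pick the category with the higher score. Below is a minimal sketch of that idea; the names (`spam_counts`, `ham_counts`, `p_spam`, etc.) are illustrative assumptions, not the notebook's actual variables:

```python
import re

def classify_subject(subject, spam_counts, ham_counts, p_spam, p_ham,
                     n_spam_words, n_ham_words, vocab_size, alpha=1):
    """Classify a raw subject line as 'spam' or 'ham' with naive Bayes.

    spam_counts/ham_counts map each vocabulary word to its count in that
    category; p_spam/p_ham are the category priors; alpha is the Laplace
    smoothing constant. All of these names are illustrative.
    """
    # Same normalization as the training data: lowercase,
    # strip punctuation, split into words.
    words = re.sub(r"[^\w\s]", " ", subject.lower()).split()
    score_spam, score_ham = p_spam, p_ham
    for word in words:
        # P(word | category) with Laplace smoothing, so words unseen in one
        # category do not zero out the whole product.
        score_spam *= (spam_counts.get(word, 0) + alpha) / (n_spam_words + alpha * vocab_size)
        score_ham *= (ham_counts.get(word, 0) + alpha) / (n_ham_words + alpha * vocab_size)
    return "spam" if score_spam > score_ham else "ham"
```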
## Contents of this repo
File | Details |
---|---|
/data | Data files; see below for details |
Building a Spam Filter with Naive Bayes.ipynb | Jupyter Notebook with the Python code to build the spam filter. Open the notebook here to see the analysis. |
process_data.py | Python script to fetch the data from the repo and preprocess it (kept as a separate script so it can run in the background with low priority if needed). A sketch of this preprocessing follows after this table. |
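The following is a minimal sketch of the kind of preprocessing `process_data.py` performs, based on the description in the data table below (lowercase, punctuation removal, word splitting, 80/20 split); the actual script may differ in its details, and the file path is assumed:

```python
import re
import pandas as pd

def preprocess(text):
    """Lowercase, strip punctuation, and split into a list of words."""
    if not isinstance(text, str):  # guard against missing subjects/bodies
        return []
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

# pandas reads a zipped CSV directly
data = pd.read_csv("data/enron_spam_data.zip")

# 80/20 train/test split; only the training part is preprocessed
train = data.sample(frac=0.8, random_state=1).copy()
test = data.drop(train.index)
for col in ["Subject", "Message"]:
    train[col] = train[col].apply(preprocess)
```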
The /data directory contains the following files:

File | Content |
---|---|
enron_spam_data.zip | The raw Enron-Spam dataset from my repo here. A zipped CSV file with the columns Subject (subject line), Message (email body), Spam/Ham (email category, encoded as "ham" or "spam") and Date (date of the email in the format YYYY-MM-DD). |
train.zip | 80% of the original dataset, used for training the model. The structure matches the Enron-Spam data (above), but Subject and Message have been preprocessed: each string has been converted to lowercase, punctuation has been removed, and the result has been split into a list with one value per word. |
test.zip | 20% of the original dataset, used for testing the model. No further preprocessing has been done on this data. |
subject_voc.zip | Zipped CSV file with the word counts of all unique words from the subject lines (based on the training data). Each unique word is represented by a column and each subject line of the training set is represented by a row, so each row contains the per-word counts for that subject line. |
message_voc.zip | Zipped CSV file with the vocabulary for the email message bodies. The structure is identical to subject_voc.zip (above), so each row contains the per-word counts for a single email message body. A sketch of how such a word-count table can be built follows after this table. |
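As a minimal sketch, a word-count table like subject_voc.zip (one column per unique word, one row per subject line) can be built from the preprocessed training subjects roughly like this; the toy data and names here are illustrative:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the preprocessed training subjects (lists of words)
train_subjects = [["cheap", "meds", "cheap"], ["meeting", "tomorrow"]]

vocabulary = sorted({word for subject in train_subjects for word in subject})
rows = [Counter(subject) for subject in train_subjects]  # per-row word counts
subject_voc = pd.DataFrame(rows, columns=vocabulary).fillna(0).astype(int)

print(subject_voc)
#    cheap  meds  meeting  tomorrow
# 0      2     1        0         0
# 1      0     0        1         1
```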