Email Spam Detection

Project Overview

This project involves the creation of an email spam classifier using the SpamAssassin public dataset. The classifier leverages the Random Forest algorithm to distinguish between spam and non-spam (ham) emails. The project includes a data pipeline to preprocess and extract features from raw emails, which are then used to train the model.

Project Structure

The project directory structure is as follows:

email-spam-detection/
- data-pipeline/
  - ham/
  - spam/
  - data_final.csv
  - process_emails.py
  - run-pipeline.py
- ml-model/
  - EmailSpamDetection.ipynb
- README.md
- requirements.txt

Installation

To run this project locally, follow these steps:

Clone the repository:

git clone https://github.com/RushabhMehta2005/email-spam-detection.git
cd email-spam-detection

Install the required packages:
```
pip install -r requirements.txt
```

Dataset

The SpamAssassin public dataset was used for training and evaluating the spam classifier. The dataset consists of both spam and ham emails in raw text format and can be downloaded from the Apache SpamAssassin public corpus.

Data Pipeline

The data pipeline involves the following steps:

Loading Raw Emails: Emails are loaded from the downloaded dataset.
Preprocessing: Raw emails are cleaned to remove unnecessary metadata and whitespace. Word stemming is then performed to reduce each word to its stem.
Feature Extraction: Features such as word frequencies, frequencies of special characters, detection of HTML tags, number of URLs present and other text-based features are extracted from the emails.
Vectorization: The text features of each email are converted into a feature vector. All the vectors are then combined into a pd.DataFrame object, which is saved as a .csv file.
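
The feature extraction step above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in process_emails.py: the function names, the regular expressions, the choice of special characters, and the vocabulary argument are all assumptions made for the example, and stemming is left out to keep it dependency-free.

```python
import re
from collections import Counter
from email import message_from_string

# Hypothetical patterns; the real pipeline may detect these differently.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<[a-zA-Z/][^>]*>")
WORD_RE = re.compile(r"[a-z']+")
SPECIAL_CHARS = "!$#%"  # assumed set of special characters to count


def extract_body(raw_email: str) -> str:
    """Drop the headers (metadata) and return the text payload of a raw email."""
    msg = message_from_string(raw_email)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return "\n".join(parts)


def extract_features(raw_email: str, vocabulary: list[str]) -> dict:
    """Turn one raw email into a flat dict of numeric features."""
    body = extract_body(raw_email)
    features = {
        "num_urls": len(URL_RE.findall(body)),
        "has_html": int(bool(HTML_TAG_RE.search(body))),
    }
    for ch in SPECIAL_CHARS:
        features[f"count_{ch}"] = body.count(ch)

    # Strip URLs and HTML tags before counting words.
    text = HTML_TAG_RE.sub(" ", URL_RE.sub(" ", body)).lower()
    word_counts = Counter(WORD_RE.findall(text))
    for word in vocabulary:
        features[f"word_{word}"] = word_counts[word]
    return features
```

Calling extract_features on every email yields a list of dicts with identical keys; passing that list to pd.DataFrame and calling to_csv on the result would produce a file shaped like data_final.csv.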

Model Training

Three different machine learning models are trained on this dataset: Logistic Regression, Extreme Gradient Boosting (XGBoost), and the Random Forest Classifier. The training of each model involves:

Splitting the Data: The dataset is split into training and testing sets with a 75:25 ratio.
Training: The model is trained on the training set. Scikit-learn pipelines are used for convenient feature scaling and training.
Hyperparameter Tuning: Selected hyperparameters of the model are tuned with Grid Search using 10-fold cross-validation. The decision threshold is then adjusted across many iterations of the model to achieve the best F1-score; the final decision threshold is 0.35.
Evaluation: The model is evaluated on the testing set using metrics such as accuracy, precision, recall, and F1-score.
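
The training steps above might look like the following with scikit-learn. This is a sketch under stated assumptions rather than the notebook's actual code: the data is synthetic (standing in for data_final.csv), the scaler step and the small parameter grid are illustrative, and the 0.35 threshold is simply applied at prediction time instead of being searched for.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted email features (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# 75:25 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pipeline with feature scaling followed by the classifier.
# The step name "clf" matches the clf__ prefix of the tuned parameters.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid only; the grid actually searched is not given here.
param_grid = {
    "clf__max_depth": [3, 5],
    "clf__n_estimators": [50, 100],
}

# Grid search with 10-fold cross-validation, scored on F1.
search = GridSearchCV(pipe, param_grid, cv=10, scoring="f1")
search.fit(X_train, y_train)

# Apply a custom decision threshold instead of the default 0.5.
THRESHOLD = 0.35
spam_proba = search.predict_proba(X_test)[:, 1]
y_pred = (spam_proba >= THRESHOLD).astype(int)

print("best params:", search.best_params_)
print("F1 at threshold:", f1_score(y_test, y_pred))
```

Lowering the threshold below 0.5 flags more emails as spam, which generally trades some precision for recall.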

Evaluation

The evaluation metrics of all three models are listed below.

Logistic Regression

Precision: 0.89
Recall: 0.88
F1-score: 0.87

Extreme Gradient Boosted Trees (XGBoost)

Precision: 0.91
Recall: 0.91
F1-score: 0.91

Random Forest Classifier

Precision: 0.91
Recall: 0.91
F1-score: 0.91

The random forest classifier was chosen as the final model for this project because its accuracy, training time, and memory consumption were better than those of the other models.
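
For reference, precision, recall, and F1-score follow the standard definitions, which the short sketch below computes for the spam class from true and predicted labels. The function name is hypothetical, and whether the figures reported here are per-class or averaged across classes is not stated, so this shows the binary case only.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```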

Best Random Forest Model Parameters

clf__max_depth: 3
clf__max_features: 'sqrt'
clf__min_samples_leaf: 4
clf__min_samples_split: 4
clf__n_estimators: 50
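
The clf__ prefix indicates that these parameters belong to a pipeline step named clf. As a hedged sketch, they could be applied to a scikit-learn pipeline as shown below; the scaler step and its name are assumptions, since only the classifier step name can be inferred from the parameter keys.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Best parameters as listed above.
best_params = {
    "clf__max_depth": 3,
    "clf__max_features": "sqrt",
    "clf__min_samples_leaf": 4,
    "clf__min_samples_split": 4,
    "clf__n_estimators": 50,
}

# Assumed pipeline layout: a scaler followed by the classifier step "clf".
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The step__param syntax routes each value to the named step.
pipe.set_params(**best_params)
```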
