This project involves the creation of an email spam classifier using the SpamAssassin public dataset. The classifier leverages the Random Forest algorithm to distinguish between spam and non-spam (ham) emails. The project includes a data pipeline to preprocess and extract features from raw emails, which are then used to train the model.
The project directory structure is as follows:
- email-spam-detection/
  - data-pipeline/
    - ham/
    - spam/
    - data_final.csv
    - process_emails.py
    - run-pipeline.py
  - ml-model/
    - EmailSpamDetection.ipynb
  - README.md
  - requirements.txt
To run this project locally, follow these steps:
- Clone the repository:
  git clone https://github.com/RushabhMehta2005/email-spam-detection.git
  cd email-spam-detection
- Install the required packages:
  pip install -r requirements.txt
The SpamAssassin public dataset was used for training and evaluating the spam classifier. The dataset consists of both spam and ham emails in raw text format. Download the dataset here
The data pipeline involves the following steps:
- Loading Raw Emails: Emails are loaded from the downloaded dataset.
- Preprocessing: Raw emails are cleaned to remove unnecessary metadata and whitespace, and word stemming is applied to reduce every word to its stem.
- Feature Extraction: Features such as word frequencies, frequencies of special characters, detection of HTML tags, number of URLs present and other text-based features are extracted from the emails.
- Vectorization: The text features of each email are converted into a feature vector. All vectors are then collected into a pd.DataFrame object, which is saved as a .csv file.
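The pipeline steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in process_emails.py: the feature names, the `crude_stem` helper, and the `extract_features` function are hypothetical, the suffix stripping is a stand-in for a real stemmer, and the `csv` module is used in place of pandas so the sketch has no dependencies.

```python
import csv
import io
import re
from email import message_from_string

# Hypothetical patterns and feature names, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+")


def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(raw_email: str, vocabulary: list[str]) -> dict:
    """Turn one raw email into a flat dict of numeric features."""
    msg = message_from_string(raw_email)  # separates headers from the body
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart skipped for brevity

    # Strip URLs and HTML tags before tokenizing, then stem each word.
    text = TAG_RE.sub(" ", URL_RE.sub(" ", body)).lower()
    stems = [crude_stem(w) for w in WORD_RE.findall(text)]

    features = {
        "n_urls": len(URL_RE.findall(body)),
        "has_html": int(bool(TAG_RE.search(body))),
        "n_exclaim": body.count("!"),
        "n_dollar": body.count("$"),
    }
    total = max(len(stems), 1)  # avoid division by zero on empty bodies
    for term in vocabulary:
        features[f"freq_{term}"] = stems.count(term) / total
    return features


sample = (
    "From: promo@example.com\n"
    "Subject: Special deal\n"
    "\n"
    "<b>Limited offers</b>!!! Click http://example.com/win to claim $100\n"
)
row = extract_features(sample, vocabulary=["offer", "click", "limit"])
row["label"] = 1  # 1 = spam, 0 = ham

# Write the feature row as CSV text (in memory here, a file in practice).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

With pandas installed, the rows could instead be collected in a list and saved with `pd.DataFrame(rows).to_csv("data_final.csv", index=False)`, which matches the vectorization step described above more closely.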
Three machine learning models are trained on this dataset: Logistic Regression, Extreme Gradient Boosting (XGBoost), and the Random Forest Classifier. The training of each model involves:
- Splitting the Data: The dataset is split into training and testing sets with a 75:25 ratio.
- Training: The model is trained on the training set. Scikit-learn pipelines are used for convenient feature scaling and training.
- Hyperparameter Tuning: Selected hyperparameters of the model are tuned for optimal performance. Grid Search with 10-fold cross-validation is used to find the best combination. The decision threshold is then adjusted over many iterations of the model to maximize the F1-score; the final decision threshold is 0.35.
- Evaluation: The model is evaluated on the testing set using metrics such as accuracy, precision, recall, and F1-score.
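The training steps above might look roughly like this with scikit-learn. This is a sketch under assumptions rather than the notebook's code: the data is synthetic (a stand-in for data_final.csv), the parameter grid is deliberately tiny, and the stratified split, the `"scaler"` step name, and the random seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features in data_final.csv (assumption).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# 75:25 split; stratifying on y is an assumption, not stated in the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and the classifier live in one pipeline. The step name "clf"
# is what makes grid keys look like "clf__max_depth".
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# A small illustrative grid, not the project's actual search space.
param_grid = {
    "clf__n_estimators": [25, 50],
    "clf__max_depth": [3, 5],
}
search = GridSearchCV(pipe, param_grid, cv=10, scoring="f1")
search.fit(X_train, y_train)

# Apply a custom decision threshold to the predicted spam probability
# instead of the default 0.5 used by predict().
threshold = 0.35
spam_proba = search.predict_proba(X_test)[:, 1]
y_pred = (spam_proba >= threshold).astype(int)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(search.best_params_, precision, recall, f1)
```

Lowering the threshold below 0.5 flags more emails as spam, which generally raises recall at the cost of precision; the F1-score balances the two.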
The evaluation metrics of all three models are listed below.
- Logistic Regression
  - Precision: 0.89
  - Recall: 0.88
  - F1-score: 0.87
- Extreme Gradient Boosting (XGBoost)
  - Precision: 0.91
  - Recall: 0.91
  - F1-score: 0.91
- Random Forest Classifier
  - Precision: 0.91
  - Recall: 0.91
  - F1-score: 0.91
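For reference, the sketch below shows how precision, recall, and F1-score are computed from predictions, and how they behave when averaged over both classes. The README does not say which averaging was used, so macro averaging here is an assumption; it is shown because a class-averaged F1-score can fall below both the averaged precision and the averaged recall, as in the Logistic Regression figures. The labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 1 = spam, 0 = ham (made up for illustration).
y_true = [1, 1, 0]
y_pred = [1, 0, 0]

# Compute the metrics once per class, then take the unweighted mean
# of each metric across classes (macro averaging).
per_class = [precision_recall_f1(y_true, y_pred, c) for c in (0, 1)]
macro_p, macro_r, macro_f1 = (sum(col) / len(col) for col in zip(*per_class))
```

In this toy case the macro precision and macro recall are both 0.75, while the macro F1-score is about 0.67, because each class's F1 is the harmonic mean of its own unbalanced precision and recall.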
Because the Random Forest classifier had better accuracy, training time, and memory consumption than the other models, it was chosen as the final model for this project.
The hyperparameters selected for the final model are:
- clf__max_depth: 3
- clf__max_features: 'sqrt'
- clf__min_samples_leaf: 4
- clf__min_samples_split: 4
- clf__n_estimators: 50
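The `clf__` prefix in these names follows the scikit-learn convention for addressing a parameter of the pipeline step named `clf`. A minimal sketch of applying the values to a pipeline is shown below; the `"scaler"` step and its type are assumptions, since the README only mentions that pipelines are used for feature scaling and training.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The "clf__" prefix addresses the pipeline step named "clf".
# The "scaler" step and its type are assumptions for illustration.
final_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# Hyperparameter values as listed in this README.
best_params = {
    "clf__max_depth": 3,
    "clf__max_features": "sqrt",
    "clf__min_samples_leaf": 4,
    "clf__min_samples_split": 4,
    "clf__n_estimators": 50,
}
final_model.set_params(**best_params)
```

After this call, `final_model.fit(X_train, y_train)` would train a forest of 50 shallow trees on whatever training data is supplied.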