This repository contains a notebook and dataset used to build an SMS spam filter using the multinomial Naive Bayes algorithm.
In machine learning, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (but naive) independence assumptions between the features. In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
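To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is not the notebook's code: the function names `train` and `classify`, the smoothing parameter `alpha` (Laplace smoothing), the whitespace tokenization, and the four toy messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def train(messages, alpha=1.0):
    """Count labels and words from (label, text) pairs (hypothetical helper)."""
    label_counts = Counter()
    word_counts = {}  # label -> Counter of word frequencies within that label
    vocabulary = set()
    for label, text in messages:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return {
        "label_counts": label_counts,
        "word_counts": word_counts,
        "vocabulary": vocabulary,
        "alpha": alpha,
    }


def classify(model, text):
    """Return the label with the highest log-posterior score."""
    total_messages = sum(model["label_counts"].values())
    n_vocab = len(model["vocabulary"])
    alpha = model["alpha"]
    scores = {}
    for label, count in model["label_counts"].items():
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        # Start from the log prior, log P(label).
        score = math.log(count / total_messages)
        for word in text.lower().split():
            if word not in model["vocabulary"]:
                continue  # words never seen in training are ignored
            # Naive assumption: each word contributes independently,
            # so the smoothed log P(word | label) terms are simply added.
            score += math.log((counts[word] + alpha) / (n_words + alpha * n_vocab))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy messages, not rows from the real dataset.
toy_messages = [
    ("spam", "win money now"),
    ("spam", "free prize win"),
    ("ham", "see you at lunch"),
    ("ham", "are you free now"),
]
model = train(toy_messages)
prediction = classify(model, "win free money")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that appears in only one class from forcing the other class's probability to zero.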
To train the algorithm, I'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.
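As a rough sketch of how the labeled messages might be read, the snippet below assumes each line holds a label (`ham` or `spam`), a tab, and then the message text. That layout is my understanding of the UCI download, so check it against the file you actually have. The helper name `parse_sms_lines` and the three sample lines are made up for illustration.

```python
def parse_sms_lines(lines):
    """Split 'label<TAB>message' lines into (label, text) pairs (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message are preserved.
        label, _, text = line.partition("\t")
        rows.append((label, text))
    return rows


# Made-up sample lines in the assumed format, not rows from the real dataset.
sample_lines = [
    "ham\tSee you at lunch?\n",
    "spam\tWIN a FREE prize now!\n",
    "ham\tRunning late, sorry\n",
]
rows = parse_sms_lines(sample_lines)

# Share of spam messages, which can serve as an estimate of the prior P(spam).
spam_share = sum(1 for label, _ in rows if label == "spam") / len(rows)
```

Because the function accepts any iterable of lines, an open file handle for the downloaded dataset can be passed to it directly in place of `sample_lines`.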
The notebook is based on a guided project from Dataquest, an online data science bootcamp. The learning goal of the project was to test understanding of probability, conditional probability, and Bayes' theorem.