diff --git a/ml/algorithms.qmd b/ml/algorithms.qmd index 18f6f40..32d5988 100644 --- a/ml/algorithms.qmd +++ b/ml/algorithms.qmd @@ -1,6 +1,6 @@ # Examples of algorithms {#sec-example-alogirhms} -There are hundreds of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter we will be using the two predictor digits data introduced in @sec-two-or-seven to demonstrate how the algorithms work. We focus on the concepts and ideas behind the algorithms using illustrative datasets from the **dslabs** package. +There are hundreds of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter, we will be using the two predictor digits data introduced in @sec-two-or-seven to demonstrate how the algorithms work. We focus on the concepts and ideas behind the algorithms using illustrative datasets from the **dslabs** package. ```{r warning = FALSE, message = FALSE, cache=FALSE} library(tidyverse) @@ -8,11 +8,11 @@ library(caret) library(dslabs) ``` -Then in @sec-ml-in-practice we show an efficient way to implement these ideas using the **caret** package. +Then, in @sec-ml-in-practice, we show an efficient way to implement these ideas using the **caret** package. ## Logistic regression -In @sec-two-or-seven we used linear regression to predict classes by fitting the model +In @sec-two-or-seven, we used linear regression to predict classes by fitting the model: $$ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) = @@ -25,7 +25,7 @@ fit_lm <- lm(y ~ x_1 + x_2, data = mutate(mnist_27$train,y = ifelse(y == 7, 1, 0 range(fit_lm$fitted) ``` -To avoid this we can apply the approach described in @sec-glm that is more appropriate for binary data. We write the model like this: +To avoid this, we can apply the approach described in @sec-glm that is more appropriate for binary data. We write the model like this: $$ @@ -50,7 +50,7 @@ mnist_27$true_p |> mutate(p_hat = p_hat) |> stat_contour(breaks = c(0.5), color = "black") ``` -Just like regression, the decision rule is a line, a fact that can be corroborated mathematically, definint $g(x) = \log \{x/(1-x)\}$, we have: +Just like regression, the decision rule is a line, a fact that can be corroborated mathematically, defining $g(x) = \log \{x/(1-x)\}$, we have: $$ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies @@ -58,7 +58,7 @@ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies x_2 = -\hat{\beta}_0/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1 $$ -Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. We now described some techniques that estimate the conditional probability in a way that is more flexible. +Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. FIX We now described some techniques that estimate the conditional probability in a way that is more flexible. :::{.callout-note} You are ready to do exercises 1 - 11. @@ -91,7 +91,7 @@ plot_cond_prob <- function(p_hat = NULL){ train_knn <- knn3(y ~ ., k = 31, data = mnist_27$train) ``` -We introduced the kNN algorithm in @sec-knn-cv-intro. In @sec-mse-estimates we noted that $k=31$ provided the highest accuracy in the test set. Using $k=31$ we obtain an accuracy `r confusionMatrix(predict(train_knn, mnist_27$test, type = "class"),mnist_27$test$y)$overall["Accuracy"]`, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability. +We introduced the kNN algorithm in @sec-knn-cv-intro. In @sec-mse-estimates, we noted that $k=31$ provided the highest accuracy in the test set. Using $k=31$, we obtain an accuracy `r confusionMatrix(predict(train_knn, mnist_27$test, type = "class"),mnist_27$test$y)$overall["Accuracy"]`, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability. ```{r best-knn-fit, echo = FALSE, out.width = "100%"} p1 <- plot_cond_prob() + ggtitle("True conditional probability") @@ -210,7 +210,7 @@ Again, this is because the algorithm gives more weight to specificity to account specificity(data = factor(y_hat_bayes), reference = factor(test_set$sex)) ``` -This is due mainly to the fact that $\hat{\pi}$ is substantially less than 0.5, so we tend to predict `Male` more often. It makes sense for a machine learning algorithm to do this in our sample because we do have a higher percentage of males. But if we were to extrapolate this to a general population, our overall accuracy would be affected by the low sensitivity. +This is mainly due to the fact that $\hat{\pi}$ is substantially less than 0.5, so we tend to predict `Male` more often. It makes sense for a machine learning algorithm to do this in our sample because we do have a higher percentage of males. But if we were to extrapolate this to a general population, our overall accuracy would be affected by the low sensitivity. The Naive Bayes approach gives us a direct way to correct this since we can simply force $\hat{\pi}$ to be whatever value we want it to be. So to balance specificity and sensitivity, instead of changing the cutoff in the decision rule, we could simply change $\hat{\pi}$ to 0.5 like this: diff --git a/ml/conditionals.qmd b/ml/conditionals.qmd index 962c9bd..eace4fc 100644 --- a/ml/conditionals.qmd +++ b/ml/conditionals.qmd @@ -1,10 +1,10 @@ # Conditional probabilities and expectations -In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and your bank at times thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes. +In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and sometimes your bank thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, consider that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes. Because our prediction rules are functions, equal inputs (the predictors) implies equal outputs (the predictions). Therefore, for a challenge in which the same predictors are associated with different outcomes across different individual observations, it is impossible to predict correctly for all these cases. We saw a simple example of this in the previous section: for any given height $x$, you will have both males and females that are $x$ inches tall. -However, none of this means that we can't build useful algorithms that are much better than guessing, and in some cases better than expert opinions. To achieve this in an optimal way, we make use of probabilistic representations of the problem based on the ideas presented in Section @sec-conditional-expectation. Observations with the same observed values for the predictors may not all be the same, but we can assume that they all have the same probability of this class or that class. We will write this idea out mathematically for the case of categorical data. +However, none of this means that we can't build useful algorithms that are much better than guessing, and in some cases better than expert opinions. To achieve this in an optimal way, we make use of probabilistic representations of the problem based on the ideas presented FIX in Section @sec-conditional-expectation. Observations with the same observed values for the predictors may not all be the same, but we can assume that they all have the same probability of this class or that class. We will write this idea out mathematically for the case of categorical data. ## Conditional probabilities @@ -31,12 +31,11 @@ $$\hat{Y} = \max_k p_k(\mathbf{x})$$ In machine learning, we refer to this as _Bayes' Rule_. But this is a theoretical rule since, in practice, we don't know $p_k(\mathbf{x}), k=1,\dots,K$. In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning. The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor $\hat{Y}$. -So how well we predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty) -and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities. +So how well we predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty) and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities. -The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others, our success is restricted by the randomness of the process, with movie recommendations for example. +The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others, our success is restricted by the randomness of the process, such as with movie recommendations. -It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed in @sec-evaluation-metrics, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired. +Keep in mind that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed in @sec-evaluation-metrics, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired. ## Conditional expectations diff --git a/ml/cross-validation.qmd b/ml/cross-validation.qmd index e3bff2a..a52fdcc 100644 --- a/ml/cross-validation.qmd +++ b/ml/cross-validation.qmd @@ -1,10 +1,10 @@ # Cross validation {#sec-cross-validation} -In this chapter we introduce cross validation, one of the most important ideas in machine learning. Here we focus on the conceptual and mathematical aspects. We will describe how to implement cross validation in practice with the **caret** package later in @sec-caret-cv. To motivate the concept, we will use the two predictor digits data presented in \@ref-two-or-seven and introduce an actual machine learning algorithm: k-nearest neighbors (kNN), to demonstrate the ideas. +In this chapter, we introduce cross validation, one of the most important ideas in machine learning. Here we focus on the conceptual and mathematical aspects. We will describe how to implement cross validation in practice with the **caret** package later in @sec-caret-cv. To motivate the concept, we will use the two predictor digits data presented in \@ref-two-or-seven and introduce an actual machine learning algorithm, k-nearest neighbors (kNN), to demonstrate the ideas. ## Motivation with k-nearest neighbors {#sec-knn-cv-intro} -We are interested in estimating the conditional probability function +We are interested in estimating the conditional probability function: $$ p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid X_1 = x_1 , X_2 = x_2). @@ -12,11 +12,11 @@ $$ as defined in @sec-smoothing-ml-connection. -With k-nearest neighbors (kNN) we estimate $p(\mathbf{x})$ in a similar way to bin smoothing. First we define the distance between all observations based on the features. Then, for any point $\mathbf{x}_0$ we estimate $p(\mathbf{x})$ by identifying the $k$ nearest points to $mathbf{x}_0$ and then taking an average of the $y$s associated with these points. We refer to the set of points used to compute the average as the *neighborhood*. +With k-nearest neighbors (kNN) we estimate $p(\mathbf{x})$ in a similar way to bin smoothing. First, we define the distance between all observations based on the features. Then, for any point $\mathbf{x}_0$, we estimate $p(\mathbf{x})$ by identifying the $k$ nearest points to $mathbf{x}_0$ and afterwards taking an average of the $y$s associated with these points. We refer to the set of points used to compute the average as the *neighborhood*. Due to the connection we described earlier between conditional expectations and conditional probabilities, this gives us $\hat{p}(\mathbf{x}_0)$, just like the bin smoother gave us an estimate of a trend. As with bin smoothers, we can control the flexibility of our estimate through the $k$ parameter: larger $k$s result in smoother estimates, while smaller $k$s result in more flexible and wiggly estimates. -To implement the algorithm, we can use the `knn3` function from the **caret** package. Looking at the help file for this package, we see that we can call it in one of two ways. We will use the first in which we specify a *formula* and a data frame. The data frame contains all the data to be used. The formula has the form `outcome ~ predictor_1 + predictor_2 + predictor_3` and so on. Therefore, we would type `y ~ x_1 + x_2`. If we are going to use variables in the data frame, we can use the `.` like this `y ~ .`. We also need to pick $k$, which is set to `k = 5` by default. The final call looks like this: +To implement the algorithm, we can use the `knn3` function from the **caret** package. Looking at the help file for this package, we see that we can call it in one of two ways. We will use the first way in which we specify a *formula* and a data frame. The data frame contains all the data to be used. The formula has the form `outcome ~ predictor_1 + predictor_2 + predictor_3` and so on. Therefore, we type `y ~ x_1 + x_2`. If we are going to use variables in the data frame, we can use the `.` like this `y ~ .`. We also need to pick $k$, which is set to `k = 5` by default. The final call looks like this: ```{r, message = FALSE, warning = FALSE} library(dslabs) @@ -26,7 +26,7 @@ knn_fit <- knn3(y ~ ., data = mnist_27$train, k = 5) In this case, since our dataset is balanced and we care just as much about sensitivity as we do about specificity, we will use accuracy to quantify performance. -The `predict` function for `knn` produces a probability for each class. We keep the probability of being a 7 as the estimate $\hat{p}(\mathbf{x})$ +The `predict` function for `knn` produces a probability for each class. We keep the probability of being a 7 as the estimate $\hat{p}(\mathbf{x})$: ```{r} y_hat_knn <- predict(knn_fit, mnist_27$test, type = "class") @@ -76,7 +76,7 @@ This is due to what we call *over-training*. With kNN, over-training is at its worst when we set $k = 1$. With $k = 1$, the estimate for each $\mathbf{x}$ in the training set is obtained with just the $y$ corresponding to that point. In this case, if the $x_1$ and $x_2$ are unique, we will obtain perfect accuracy in the training set because each point is used to predict itself. Remember that if the predictors are not unique and have different outcomes for at least one set of predictors, then it is impossible to predict perfectly. -Here we fit a kNN model with $k = 1$ and confirm we get close to perfect accuracy in the training set: +Here we fit a kNN model with $k = 1$ and confirm we get nearer to perfect accuracy in the training set: ```{r} knn_fit_1 <- knn3(y ~ ., data = mnist_27$train, k = 1) @@ -91,7 +91,7 @@ y_hat_knn_1 <- predict(knn_fit_1, mnist_27$test, type = "class") confusionMatrix(y_hat_knn_1, mnist_27$test$y)$overall["Accuracy"] ``` -We can see the over-fitting problem in this figure. +We can see the over-fitting problem in this figure: ```{r knn-1-overfit, echo = FALSE, out.width="100%"} tmp <- mnist_27$true_p @@ -117,7 +117,7 @@ grid.arrange(p1, p2, nrow = 1) The black curves denote the decision rule boundaries. -The estimate $\hat{p}(\mathbf{x})$ follows the training data too closely (left). You can see that in the training set, boundaries have been drawn to perfectly surround a single red point in a sea of blue. Because most points $\mathbf{x}$ are unique, the prediction is either 1 or 0 and the prediction for that point is the associated label. However, once we introduce the training set (right), we see that many of these small islands now have the opposite color and we end up making several incorrect predictions. +The estimate $\hat{p}(\mathbf{x})$ follows the training data too closely (left). You can see that, in the training set, boundaries have been drawn to perfectly surround a single red point in a sea of blue. Because most points $\mathbf{x}$ are unique, the prediction is either 1 or 0 and the prediction for that point is the associated label. However, once we introduce the training set (right), we see that many of these small islands now have the opposite color and we end up making several incorrect predictions. ### Over-smoothing @@ -148,7 +148,7 @@ This size of $k$ is so large that it does not permit enough flexibility. We call ### Parameter tuning -So how do we pick $k$? In principle we want to pick the $k$ that maximizes accuracy, or minimizes the expected MSE as defined in @sec-mse. The goal of cross validation is to estimate these quantities for any given algorithm and set of tuning parameters such as $k$. To understand why we need a special method to do this let's repeat what we did above, comparing the training set and test set accuracy, but for different values of $k$. We can plot the accuracy estimates for each value of $k$: +So how do we pick $k$? In principle, we want to pick the $k$ that maximizes accuracy or minimizes the expected MSE as defined in @sec-mse. The goal of cross validation is to estimate these quantities for any given algorithm and set of tuning parameters such as $k$. To understand why we need a special method to do this, let's repeat what we did above, comparing the training set and test set accuracy, but for different values of $k$. We can plot the accuracy estimates for each value of $k$: ```{r, echo=FALSE,warning = FALSE, message = FALSE} ks <- seq(3, 251, 2) @@ -178,7 +178,7 @@ data.frame(k = ks, train = accuracy["train",], test = accuracy["test",]) |> First, note that the estimate obtained on the training set is generally lower than the estimate obtained with the test set, with the difference larger for smaller values of $k$. This is due to over-training. -So do we simply pick the $k$ that maximizes accuracy and report this accuracy? There are to problems with this approach: +So do we simply pick the $k$ that maximizes accuracy and report this accuracy? There are two problems with this approach: 1. The accuracy versus $k$ plot is quite jagged. We do not expect this because small changes in $k$ should not affect the algorithm's performance so much. The jaggedness is explained by the fact that the accuracy is computed on a sample and therefore is a random variable. @@ -188,7 +188,7 @@ Cross validation provides a solution to both these problems. ## Mathematical description of cross validation -In the previous section we introduced kNN as an example to motivate the topic of this chapter. In this case particular case there is just one parameter, $k$, that affects the performance of the algorithm. However, in general, machine algorithms may have multiple parameter so we use the notation $\lambda$ to represent any set of parameters needed to define a machine learning algorithm. We also introduce notation to distinguish the predictions you get with each set of parameters with $\hat{y}(\lambda)$ and the MSE for this choice with $\text{MSE}(\lambda)$. Our goal is to find the $\lambda$ that minimizes $\text{MSE}(\lambda)$. Cross validation is a method the helps us estimate $\text{MSE}(\lambda)$. +In the previous section, we introduced kNN as an example to motivate the topic of this chapter. In this particular case, there is just one parameter, $k$, that affects the performance of the algorithm. However, in general, machine algorithms may have multiple parameters so we use the notation $\lambda$ to represent any set of parameters needed to define a machine learning algorithm. We also introduce notation to distinguish the predictions you get with each set of parameters with $\hat{y}(\lambda)$ and the MSE for this choice with $\text{MSE}(\lambda)$. Our goal is to find the $\lambda$ that minimizes $\text{MSE}(\lambda)$. Cross validation is a method the helps us estimate $\text{MSE}(\lambda)$. A intuitive first attempt is the apparent error defined in @sec-mse and used in the previous section: @@ -196,18 +196,17 @@ $$ \hat{\mbox{MSE}}(\lambda) = \frac{1}{N}\sum_{i = 1}^N \left\{\hat{y}_i(\lambda) - y_i\right\}^2 $$ -As noted in the previous section this estimate is a random error, based on just one test set, with enough variability to affect the choice of the best $\lambda$ substantially. +As noted in the previous section, this estimate is a random error, based on just one test set, with enough variability to affect the choice of the best $\lambda$ substantially. -Now imagine a world in which we could obtain data over and over again, say from new random samples. We could take a very large number of new samples $B$, split them into training and test sets, and define +Now imagine a world in which we could obtain data repeatedly, say from new random samples. We could take a very large number of new samples $B$, split them into training and test sets, and define: $$ \frac{1}{B} \sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left\{\hat{y}_i^b(\lambda) - y_i^b\right\}^2 $$ -with $y_i^b$ the $i$th observation in sample $b$ and $\hat{y}_{i}^b(\lambda)$ the prediction obtained with the algorithm defined by parameter $\lambda$ and trained indipendently of $y_i^b$. The law of large numbers tells us that as $B$ becomes larger, this quantity gets closer and closer to $MSE(\lambda)$. This is of course a theoretical consideration as we rarely get access to more than one dataset for algorithms development, but the concept inspires the idea behind cross-validation. +with $y_i^b$ the $i$th observation in sample $b$ and $\hat{y}_{i}^b(\lambda)$ the prediction obtained with the algorithm defined by parameter $\lambda$ and trained indipendently of $y_i^b$. The law of large numbers tells us that as $B$ becomes larger, this quantity gets closer and closer to $MSE(\lambda)$. This is of course a theoretical consideration as we rarely get access to more than one dataset for algorithm development, but the concept inspires the idea behind cross-validation. -The general idea is to generate a series of different random samples from the data at hand. There are several approaches to doing this, but the general idea for all to randomly generate -several smaller datasets that are not used for training, and instead use to estimate MSE. +FIX [REPHSRASE ORACION 2. REPITES 'THE GENERAL IDEA'] The general idea is to generate a series of different random samples from the data at hand. There are several approaches to doing this, but the general idea for all to randomly generate several smaller datasets that are not used for training, and instead are used to estimate MSE. ## Cross validation in practice @@ -220,13 +219,13 @@ if (knitr::is_html_output()) { } ``` -Generally speaking, we are provided a dataset (blue) and we need to build an algorithm, using this dataset, that will eventually be used in completely independent datasets (yellow) that we might not even see. +Overall, we are provided a dataset (blue) and we need to build an algorithm, using this dataset, that will eventually be used in completely independent datasets (yellow) that we might not even see. ```{r, echo = FALSE} knitr::include_graphics("img/cv-1.png") ``` -So to imitate this situation, we start by carving out a piece of our dataset and pretend it is an independent dataset: we divide the dataset into a *training set* (blue) and a *test set* (red). We will train the entirity of our algorithm, including the choice of parameter $\lambda$, exclusively on the training set and use the test set only for evaluation purposes. +So to imitate this situation, we start by carving out a piece of our dataset and pretend it is an independent dataset: we divide the dataset into a *training set* (blue) and a *test set* (red). We will train the entirety of our algorithm, including the choice of parameter $\lambda$, exclusively on the training set and use the test set only for evaluation purposes. We usually try to select a small piece of the dataset so that we have as much data as possible to train. However, we also want the test set to be large so that we obtain a stable estimate of the MSE without fitting an impractical number of models. Typical choices are to use 10%-20% of the data for testing. @@ -234,9 +233,9 @@ We usually try to select a small piece of the dataset so that we have as much da knitr::include_graphics("img/cv-3.png") ``` -Let's reiterate that it is indispensable that we not use the test set at all: not for filtering out rows, not for selecting features, nothing! +Let's reiterate that it is indispensable that we not use the test set at all: not for filtering out rows, not for selecting features, not for anything! -But then how do we optimize $\lambda$? In cross validation we achieve this by splitting the training set into two,the training and validation sets. +But then how do we optimize $\lambda$? In cross validation, we achieve this by splitting the training set into two: the training and validation sets. ```{r, echo = FALSE} @@ -247,7 +246,7 @@ We will do this many times assuring that the estimates of MSE obtained in each d ### K-fold cross validation -As a reminder we are going to imitate the concept used when introducing this version of the MSE: +As a reminder, we are going to imitate the concept used when introducing this version of the MSE: $$ \mbox{MSE}(\lambda) = \frac{1}{B} \sum_{b = 1}^B \frac{1}{N}\sum_{i = 1}^N \left(\hat{y}_i^b(\lambda) - y_i^b\right)^2 @@ -255,7 +254,7 @@ $$ We want to generate datasets that can be thought of as an independent random sample and we want to do this $B$ times. With K-fold cross validation, we do it $K$ times. In the illustrations, we are showing an example that uses $K = 5$. -We will eventually end up with $K$ samples, but let's start by describing how to construct the first: we simply pick $M = N/K$ observations at random (we round if $M$ is not a round number) and think of these as a random sample $y_1^b, \dots, y_M^b$, with $b = 1$. We call this the validation set: +We will eventually end up with $K$ samples, but let's start by describing how to construct the first: we simply pick $M = N/K$ observations at random (we round if $M$ is not a round number) and think of these as a random sample $y_1^b, \dots, y_M^b$, with $b = 1$. We call this the validation set. Now we can fit the model in the training set, then compute the apparent error on the independent set: @@ -285,7 +284,7 @@ One way we can improve the variance of our final estimate is to take more sample ### Estimate MSE of our optimized algorithm -We have described how to use cross validation to optimize parameters. However, we now have to take into account the fact that the optimization occurred on the training data and therefore we need an estimate of our final algorithm based on data that was not used to optimize the choice. Here is where we use the test set we separated early on: +We have described how to use cross validation to optimize parameters. However, we now have to take into account the fact that the optimization occurred on the training data and we therefore need an estimate of our final algorithm based on data that was not used to optimize the choice. Here is where we use the test set we separated early on: ```{r, echo = FALSE} knitr::include_graphics("img/cv-6.png") @@ -312,18 +311,17 @@ knitr::opts_chunk$set(out.width = "70%", out.extra = NULL) ### Boostrap in cross-validation -Typically, cross-validation involves partitioning the original dataset into a training set to train the model and a testing set to evaluate it. With the bootstrap approach, based on the ideas described in #sec-bootstrap, you can create multiple different training datasets via bootstrapping. This method is sometimes called bootstrap aggregating or bagging. We provide a step-by-step description. +Typically, cross-validation involves partitioning the original dataset into a training set to train the model and a testing set to evaluate it. With the bootstrap approach, based on the ideas described in @sec-bootstrap, you can create multiple different training datasets via bootstrapping. This method is sometimes called bootstrap aggregating or bagging. FIX [WHERE DO YOU PROVIDE THIS? BELOW? or "we will now"?] We provide a step-by-step description. -From the original dataset, we create a large number of bootstrap samples. Each bootstrap sample is created by randomly selecting observations with replacement, usually the same size as the original dataset. For each bootstrap sample, fit the model and compute the MSE estimate on the observations not selected in the random sampling, referred to as the _out-of-bag observations_. -These out-of-bag observations serve a similar role to a test set in standard cross-validation. +From the original dataset, we create a large number of bootstrap samples. Each bootstrap sample is created by randomly selecting observations with replacement, usually the same size as the original dataset. For each bootstrap sample, fit the model and compute the MSE estimate on the observations not selected in the random sampling, referred to as the _out-of-bag observations_. These out-of-bag observations serve a similar role to a test set in standard cross-validation. -We then average the MSEs obtaind in the out-of-bag observations from each bootstrap sample to estimate the model's performance. +We then average the MSEs obtained in the out-of-bag observations from each bootstrap sample to estimate the model's performance. This approach is actually the default approach in the **caret** package. We describe how to implement cross validation with the **caret** package in the next chapter. In the next section, we include an explanation of how the bootstrap works in general. ### Comparison of MSE estimates {#sec-mse-estimates} -In Section @sec-knn-cv-intro we computed an estimate of MSE based just on the provided test set (shown in red in the plot below). Here we show how the cross-validation techniques described above help reduce variability. The green curve below shows the results of apply K-fold cross validation with 10 folds, leaving out 10% of the data for validation. We can see that the variance is reduced substantially. The blue curve is the result of using 100 bootrap samples to estimate MSE. The variability is reduced even further, but at the cost of increasing computation time by 10 fold. +In @sec-knn-cv-intro, we computed an estimate of MSE based just on the provided test set (shown in red in the plot below). Here we show how the cross-validation techniques described above help reduce variability. The green curve below shows the results of apply K-fold cross validation with 10 folds, leaving out 10% of the data for validation. We can see that the variance is reduced substantially. The blue curve is the result of using 100 bootrap samples to estimate MSE. The variability is reduced even further, but at the cost of increasing computation time by 10 fold. ```{r} set.seed(2023-11-30) @@ -365,7 +363,7 @@ x_subset <- x[ ,sample(p, 100)] 1\. Because `x` and `y` are completely independent, you should not be able to predict `y` using `x` with accuracy larger than 0.5. Confirm this by running cross validation using logistic regression to fit the model. Because we have so many predictors, we selected a random sample `x_subset`. Use the subset when training the model. Hint: use the caret `train` function. The `results` component of the output of `train` shows you the accuracy. Ignore the warnings. -2\. Now, instead of a random selection of predictors, we are going to search for those that are most predictive of the outcome. We can do this by comparing the values for the $y = 1$ group to those in the $y = 0$ group, for each predictor, using a t-test. You can perform this step like this: +2\. Now instead of a random selection of predictors, we are going to search for those that are most predictive of the outcome. We can do this by comparing the values for the $y = 1$ group to those in the $y = 0$ group, for each predictor, using a t-test. You can perform this step as follows: ```{r, eval = FALSE} devtools::install_bioc("genefilter") @@ -402,5 +400,5 @@ indexes <- createResample(mnist_27$train$y, 10) How many times do `3`, `4`, and `7` appear in the first re-sampled index? -10\. We see that some numbers appear more than once and others appear no times. This has to be this way for each dataset to be independent. Repeat the exercise for all the re-sampled indexes. +10\. We see that some numbers appear more than once and others appear no times. This must be so for each dataset to be independent. Repeat the exercise for all the re-sampled indexes. diff --git a/ml/evaluation-metrics.qmd b/ml/evaluation-metrics.qmd index 5c0262f..ffb9848 100644 --- a/ml/evaluation-metrics.qmd +++ b/ml/evaluation-metrics.qmd @@ -2,7 +2,7 @@ Before we start describing approaches to optimize the way we build algorithms, we first need to define what we mean when we say one approach is better than another. In this section, we focus on describing ways in which machine learning algorithms are evaluated. Specifically, we need to quantify what we mean by "better". -For our first introduction to machine learning concepts, we will start with a boring and simple example: how to predict sex using height. As we explain machine learning step by step, this example will let us set down the first building block. Soon enough, we will be attacking more interesting challenges. We use the __caret__ package, which has several useful functions for building and assessing machine learning methods and we introduce in more detail in @sec-caret, and for a first example, we use the height data in dslabs. +For our first introduction to machine learning concepts, we will start with a boring and simple example: how to predict sex using height. As we explain machine learning step by step, this example will let us set down the first building block. Soon enough, we will be undertaking more interesting challenges. FIX We use the __caret__ package, which has several useful functions for building and assessing machine learning methods and we introduce in more detail in @sec-caret, and for a first example, we use the height data in dslabs. ```{r, message=FALSE, warning=FALSE, cache=FALSE} library(tidyverse) @@ -17,14 +17,13 @@ y <- heights$sex x <- heights$height ``` -In this case, we have only one predictor, height, and `y` is clearly a categorical outcome since observed values are either `Male` or `Female`. -We know that we will not be able to predict $Y$ very accurately based on $X$ because male and female average heights are not that different relative to within group variability. But can we do better than guessing? To answer this question, we need a quantitative definition of better. +In this case, we have only one predictor, height, and `y` is clearly a categorical outcome since observed values are either `Male` or `Female`. We know that we will not be able to predict $Y$ very accurately based on $X$ because male and female average heights are not that different relative to within group variability. But can we do better than guessing? To answer this question, we need a quantitative definition of better. ## Training and test sets {#sec-training-test} -Ultimately, a machine learning algorithm is evaluated on how it performs in the real world with completely new datasets. However, when developing an algorithm, we usually have a dataset for which we know the outcomes, as we do with the heights: we know the sex of every student in our dataset. Therefore, to mimic the ultimate evaluation process, we typically split the data into two parts and act as if we don't know the outcome for one of these. We stop pretending we don't know the outcome to evaluate the algorithm, but only *after* we are done constructing it. We refer to the group for which we know the outcome, and use to develop the algorithm, as the _training_ set. We refer to the group for which we pretend we don't know the outcome as the _test_ set. +Ultimately, a machine learning algorithm is evaluated on how it performs in the real world with completely new datasets. However, when developing an algorithm, we usually have a dataset for which we know the outcomes, as we do with the heights: we know the sex of every student in our dataset. Therefore, to mimic the ultimate evaluation process, we typically split the data into two parts and act as if we don't know the outcome for one of these. We stop pretending we don't know the outcome to evaluate the algorithm, but only *after* we are done constructing it. We refer to the group for which we know the outcome, and that we use to develop the algorithm, as the _training_ set. We refer to the group for which we pretend we don't know the outcome as the _test_ set. -A standard way of generating the training and test sets is by randomly splitting the data. The __caret__ package includes the function `createDataPartition` that helps us generates indexes for randomly splitting the data into training and test sets: +A standard way of generating the training and test sets is by randomly splitting the data. The __caret__ package includes the function `createDataPartition` that helps us generate indexes for randomly splitting the data into training and test sets: ```{r} @@ -33,7 +32,7 @@ test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE) ``` The argument `times` is used to define how many random samples of indexes to return, the argument `p` is used to define what proportion of the data is represented by the index, and the argument `list` is used to decide if we want the indexes returned as a list or not. -We can use the result of the `createDataPartition` function call to define the training and test sets like this: +We can use the result of the `createDataPartition` function call to define the training and test sets as follows: ```{r} test_set <- heights[test_index, ] @@ -75,7 +74,7 @@ Can we do better? Exploratory data analysis suggests we can because, on average, heights |> group_by(sex) |> summarize(mean(height), sd(height)) ``` -But how do we make use of this insight? Let's try another simple approach: predict `Male` if height is within two standard deviations from the average male: +But how do we make use of this insight? Let's try another simple approach: predict `Male` if height is within two standard deviations from the average male. ```{r} y_hat <- factor(ifelse(x > 62, "Male", "Female"), levels(test_set$sex)) @@ -136,7 +135,7 @@ We see that it is a bit lower than the accuracy observed for the training set, b The prediction rule we developed in the previous section predicts `Male` if the student is taller than `r best_cutoff` inches. Given that the average female is about `r best_cutoff` inches, this prediction rule seems wrong. What happened? If a student is the height of the average female, shouldn't we predict `Female`? -Generally speaking, overall accuracy can be a deceptive measure. To see this, we will start by constructing what is referred to as the _confusion matrix_, which basically tabulates each combination of prediction and actual value. We can do this in R simply using `table(predicted = y_hat, actual = test_set$sex)` +Generally speaking, overall accuracy can be a deceptive measure. To see this, we will start by constructing what is referred to as the _confusion matrix_, which basically tabulates each combination of prediction and actual value. We can do this in R simply using `table(predicted = y_hat, actual = test_set$sex)`, but the `confusionMatrix` **caret** package computes the confusion matrix and much more: @@ -152,9 +151,9 @@ If we study this table closely, it reveals a problem. If we compute the accuracy cm$byClass[c("Sensitivity", "Specificity")] ``` -In the next section we explain that these two are equivalnet to accuracy with femeles and males, respectively. +In the next section, we explain that these two are equivalent to accuracy with females and males, respectively. -We notice an imbalance: too many females are predicted to be male. We are calling almost half of the females male! How can our overall accuracy be so high then? This is because the _prevalence_ of males in this dataset is high. These heights were collected from three data sciences courses, two of which had more males enrolled: +We notice an imbalance: too many females are predicted to be male. We are calling almost half of the females male! How can our overall accuracy be so high then? This is because the _prevalence_ of males in this dataset is high. These heights were collected from three data sciences courses, two of which had higher male enrollment: ```{r} @@ -197,7 +196,7 @@ if (knitr::is_html_output()){ Sensitivity is typically quantified by $TP/(TP+FN)$, the proportion of actual positives (the first column = $TP+FN$) that are called positives ($TP$). This quantity is referred to as the _true positive rate_ (TPR) or _recall_. -Specificity is defined as $TN/(TN+FP)$ or the proportion of negatives (the second column = $FP+TN$) that are called negatives ($TN$). This quantity is also called the true negative rate (TNR). There is another way of quantifying specificity which is $TP/(TP+FP)$ or the proportion of outcomes called positives (the first row or $TP+FP$) that are actually positives ($TP$). This quantity is referred to as _positive predictive value (PPV)_ and also as _precision_. Note that, unlike TPR and TNR, precision depends on prevalence since higher prevalence implies you can get higher precision even when guessing. +Specificity is defined as $TN/(TN+FP)$ or the proportion of negatives (the second column = $FP+TN$) that are called negatives ($TN$). This quantity is also called the true negative rate (TNR). There is another way of quantifying specificity which is $TP/(TP+FP)$ or the proportion of outcomes called positives (the first row or $TP+FP$) that are actually positives ($TP$). This quantity is referred to as _positive predictive value (PPV)_ and also as _precision_. Note that, unlike TPR and TNR, precision depends on prevalence since higher prevalence implies you can get higher precision even when guessing. The multiple names can be confusing, so we include a table to help us remember the terms. The table includes a column that shows the definition if we think of the proportions as probabilities. @@ -208,7 +207,7 @@ sensitivity | TPR | Recall | $\frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}}$ | $\mbox{ specificity | TNR | 1-FPR | $\frac{\mbox{TN}}{\mbox{TN}+\mbox{FP}}$ | $\mbox{Pr}(\hat{Y}=0 \mid Y=0)$ | specificity | PPV | Precision | $\frac{\mbox{TP}}{\mbox{TP}+\mbox{FP}}$ | $\mbox{Pr}(Y=1 \mid \hat{Y}=1)$| -The __caret__ function `confusionMatrix` computes all these metrics for us once we define which category is the "positive" (Y=1). The function expects factors as input, and the first level is considered the positive outcome or $Y=1$. In our example, `Female` is the first level because it comes before `Male` alphabetically. If you type this into R you will see several metrics including accuracy, sensitivity, specificity, and PPV. +The __caret__ function `confusionMatrix` computes all these metrics for us once we define which category is the "positive" (Y=1). The function expects factors as input, and the first level is considered the positive outcome or $Y=1$. In our example, `Female` is the first level because it comes before `Male` alphabetically. If you type this into R, you will see several metrics including accuracy, sensitivity, specificity, and PPV. You can acceess these directly, for example, like this: @@ -222,7 +221,7 @@ We can see that the high overall accuracy is possible despite relatively low sen ## Balanced accuracy and $F_1$ score -Although we usually recommend studying both specificity and sensitivity, very often it is useful to have a one-number summary, for example, for optimization purposes. One metric that is preferred over overall accuracy is the average of specificity and sensitivity, referred to as _balanced accuracy_. Because specificity and sensitivity are rates, it is more appropriate to compute the _harmonic_ average. In fact, the _$F_1$-score_, a widely used one-number summary, is the harmonic average of precision and recall: +Although we usually recommend studying both specificity and sensitivity, often it is useful to have a one-number summary, for example, for optimization purposes. One metric that is preferred over overall accuracy is the average of specificity and sensitivity, referred to as _balanced accuracy_. Because specificity and sensitivity are rates, it is more appropriate to compute the _harmonic_ average. In fact, the _$F_1$-score_, a widely used one-number summary, is the harmonic average of precision and recall: $$ \frac{1}{\frac{1}{2}\left(\frac{1}{\mbox{recall}} + @@ -238,7 +237,7 @@ $$ when defining $F_1$. -Remember that, depending on the context, some types of errors are more costly than others. For example, in the case of plane safety, it is much more important to maximize sensitivity over specificity: failing to predict a plane will malfunction before it crashes is a much more costly error than grounding a plane when, in fact, the plane is in perfect condition. In a capital murder criminal case, the opposite is true since a false positive can lead to executing an innocent person. The $F_1$-score can be adapted to weigh specificity and sensitivity differently. To do this, we define $\beta$ to represent how much more important sensitivity is compared to specificity and consider a weighted harmonic average: +Remember that, depending on the context, some types of errors are more costly than others. For instance, in the case of plane safety, it is much more important to maximize sensitivity over specificity: failing to predict a plane will malfunction before it crashes is a much more costly error than grounding a plane when, in fact, the plane is in perfect condition. In a capital murder criminal case, the opposite is true since a false positive can lead to executing an innocent person. The $F_1$-score can be adapted to weigh specificity and sensitivity differently. To do this, we define $\beta$ to represent how much more important sensitivity is compared to specificity and consider a weighted harmonic average: $$ \frac{1}{\frac{\beta^2}{1+\beta^2}\frac{1}{\mbox{recall}} + @@ -295,7 +294,7 @@ We now see that we do much better than guessing, that both sensitivity and speci A machine learning algorithm with very high TPR and TNR may not be useful in practice when prevalence is close to either 0 or 1. To see this, consider the case of a doctor that specializes in a rare disease and is interested in developing an algorithm for predicting who has the disease. -The doctor shares data with about 1/2 cases and 1/2 controls and some predictors. You then develop an algorithm with TPR=0.99 and TNR = 0.99. You are excited to explain to the doctor that this means that if a patient has the disease, the algorithm is very likely to predict correctly. The doctor is not impressed and explains that your TNR is way to low for this algorithm to be used in practice. This is because this is a rare disease with a prevelance in the general population of 0.5%. The doctor reminds you of Bayes formula: +The doctor shares data with about 1/2 cases and 1/2 controls and some predictors. You then develop an algorithm with TPR=0.99 and TNR = 0.99. You are excited to explain to the doctor that this means that if a patient has the disease, the algorithm is very likely to predict correctly. The doctor is not impressed and explains that your TNR is too low for this algorithm to be used in practice. This is because this is a rare disease with a prevelance in the general population of 0.5%. The doctor reminds you of Bayes formula: $$ \mbox{Pr}(Y = 1\mid \hat{Y}=1) = \mbox{Pr}(\hat{Y}=1 \mid Y=1) \frac{\mbox{Pr}(Y=1)}{\mbox{Pr}(\hat{Y}=1)} \implies \text{Precision} = \text{TPR} \times \frac{\text{Prevalence}}{\text{TPR}\times \text{Prevalence} + \text{FPR}\times(1-\text{Prevalence})} \approx 0.33 $$ @@ -310,11 +309,11 @@ data.frame(Prevalence = prevalence, ggtitle("Precision when TPR = 0.95 and TNR = 0.95 as a function of prevalence") ``` -Although your algorithm has a precision of about 95% on the data you train on, with prevalence of 50%, if applied to the general population, the algorithm's precision would be just 33%. The doctor can't use an algorithm with 33% of people receiving a positive test actually not having the disease. Note that if even if your algorithm had perfect sensitivity, the precision would still be around 33%. So you need to greatly decrease your FPR for the algorithm to be useful in practice. +Although your algorithm has a precision of about 95% on the data you train on, with prevalence of 50%, if applied to the general population, the algorithm's precision would be just 33%. The doctor can't use an algorithm with 33% of people receiving a positive test actually not having the disease. Note that even if your algorithm had perfect sensitivity, the precision would still be around 33%. So you need to greatly decrease your FPR for the algorithm to be useful in practice. ## ROC and precision-recall curves -When comparing the two methods (guessing versus using a height cutoff), we looked at accuracy and $F_1$. The second method clearly outperformed the first. However, while we considered several cutoffs for the second method, for the first we only considered one approach: guessing with equal probability. Note that guessing `Male` with higher probability would give us higher accuracy due to the bias in the sample: +When comparing the two methods (guessing versus using a height cutoff), we looked at accuracy and $F_1$. The second method clearly outperformed the first. However, while we considered several cutoffs for the second method, for the first we only considered one approach: guessing with equal probability. Be aware that guessing `Male` with higher probability would give us higher accuracy due to the bias in the sample: ```{r} @@ -329,8 +328,7 @@ But, as described above, this would come at the cost of lower sensitivity. The c Remember that for each of these parameters, we can get a different sensitivity and specificity. For this reason, a very common approach to evaluating methods is to compare them graphically by plotting both. -A widely used plot that does this is the _receiver operating characteristic_ (ROC) curve. If you are wondering where this name comes from, you can consult the -ROC Wikipedia page^[https://en.wikipedia.org/wiki/Receiver_operating_characteristic]. +A widely used plot that does this is the _receiver operating characteristic_ (ROC) curve. If you are wondering where this name comes from, you can consult the ROC Wikipedia page^[https://en.wikipedia.org/wiki/Receiver_operating_characteristic]. The ROC curve plots sensitivity, represented as the TPR, versus 1 - specificity represented as the false positive rate (FPR). Here we compute the TPR and FPR needed for different probabilities of guessing male: @@ -387,7 +385,7 @@ bind_rows(tmp_1, tmp_2) |> geom_text_repel(nudge_x = 0.01, nudge_y = -0.01, show.legend = FALSE) ``` -We can see that we obtain higher sensitivity with this approach for all values of specificity, which implies it is in fact a better method. Note that ROC curves for guessing always fall on the identity line. Also note that when making ROC curves, it is often nice to add the cutoff associated with each point. +We see that we obtain higher sensitivity with this approach for all values of specificity, which implies it is in fact a better method. Keep in mind that ROC curves for guessing always fall on the identity line. Also, note that when making ROC curves, it is often nice to add the cutoff associated with each point. The packages __pROC__ and __plotROC__ are useful for generating these plots. @@ -438,7 +436,7 @@ bind_rows(tmp_1, tmp_2) |> facet_wrap(~ Positive) ``` -From this plot we immediately see that the precision of guessing is not high. This is because the prevalence is low. We also see that if we change positives to mean male instead of female, the ROC curve remains the same, but the precision recall plot changes. +From this plot, we immediately see that the precision of guessing is not high. This is because the prevalence is low. FIX [SHOULD MALE AND FEMALE BE IN CAPS?] We also see that if we change positives to mean male instead of female, the ROC curve remains the same, but the precision recall plot changes. ## Mean Squared Error {#sec-mse} @@ -450,14 +448,14 @@ In this section, we describe how the general approach to defining "best" in mach The most commonly used loss function is the squared loss function. If $\hat{y}$ is our predictor and $y$ is the observed outcome, the squared loss function is simply: $(\hat{y} - y)^2$. -Because we often model $y$ as the outcome of a random process, theoretically, it does not make sense to compare algorithms based on $(\hat{y} - y)^2$ as the minimum can change from sample to sample. For this reason we minimize mean squared error (MSE): +Because we often model $y$ as the outcome of a random process, theoretically, it does not make sense to compare algorithms based on $(\hat{y} - y)^2$ as the minimum can change from sample to sample. For this reason, we minimize mean squared error (MSE): $$ \text{MSE} \equiv \mbox{E}\{(\hat{Y} - Y)^2 \} $$ -Note that if the outcomes are binary, the MSE is equivalent to one minus expected accuracy, since $(\hat{y} - y)^2$ is 0 if the prediction was correct and 1 otherwise. +Consider that if the outcomes are binary, the MSE is equivalent to one minus expected accuracy, since $(\hat{y} - y)^2$ is 0 if the prediction was correct and 1 otherwise. Different algorithms will result in different predictions $\hat{Y}$, and therefore different MSE. In general, our goal is to build an algorithm that minimizes the loss so it is as close to 0 as possible. @@ -474,14 +472,14 @@ In practice, we often report the root mean squared error (RMSE), which is simply ::: However, the estimate $\hat{\text{MSE}}$ is a random variable. In fact, $\text{MSE}$ and $\hat{\text{MSE}}$ are often referred to as the true error and apparent error, respectively. -Due to the complexity of some machine learning it is difficult to derive its statistical properties of how well the apparent error estimates the true error. In @sec-cross-validation we introduce cross-validation and an approach to estimating the MSE that get closer the true MSE. +Due to the complexity of some machine learning, it is difficult to derive its statistical properties of how well the apparent error estimates the true error. In @sec-cross-validation, we introduce cross-validation and FIX an approach to estimating the MSE that get closer the true MSE. -We end the chapter by pointing out that there are loss functions other than the squared loss. For example, the _Mean Absolute Error_ uses absolute values, $|\hat{Y}_i - Y_i|$ instead of squaring the errors +We end the chapter by pointing out that there are loss functions other than the squared loss. For example, the _Mean Absolute Error_ uses absolute values, $|\hat{Y}_i - Y_i|$ instead of squaring the errors $(\hat{Y}_i - Y_i)^2$. However, in this book we focus on minimizing square loss since it is the most widely used. ## Exercises -The `reported_height` and `height` datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire that populated the `reported_height` dataset. The online students filled the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable, call it `type`, to denote the type of student: `inclass` or `online`: +The `reported_height` and `height` datasets were collected from three classes taught in the Departments of Computer Science and Biostatistics, as well as remotely through the Extension School. The Biostatistics class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at 8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire that populated the `reported_height` dataset. The online students filled the survey during the next few days, after the lecture was posted online. We can use this insight to define a variable, call it `type`, to denote the type of student: `inclass` or `online`: ```{r, eval=FALSE} library(lubridate) diff --git a/ml/notation-and-terminology.qmd b/ml/notation-and-terminology.qmd index ea53619..ad76a74 100644 --- a/ml/notation-and-terminology.qmd +++ b/ml/notation-and-terminology.qmd @@ -1,14 +1,13 @@ -# Notation and Terminology +# Notation and terminology -In Section \ref{@mnist}, we introduced the MNIST handwritten digits dataset. Here we describe how the task of automatically reading these digits can be framed as a machine learning challenge and, in doing so, we introduce machine learning mathematical notation and terminology used throughout this part of the book. +In Section \ref{@mnist}, we introduced the MNIST handwritten digits dataset. Here we describe how the task of automatically reading these digits can be framed as a machine learning challenge. In doing so, we introduce machine learning mathematical notation and terminology used throughout this part of the book. -Originally, sorting mail received in the post office involved humans reading zip codes, written on the envelops. Today, thanks to machine learning algorithms, a computer can read zip codes and then a robot sorts the letters. We will learn how to build algorithms that can read a digitized handwritten digit. +Originally, mail sorting in the post office involved humans reading zip codes written on the envelopes. Today, thanks to machine learning algorithms, a computer can read zip codes and then a robot sorts the letters. We will learn how to build algorithms that can read a digitized handwritten digit. ## Terminology -In machine learning, data comes in the form of the *outcome* we want to predict and the *features* that we will use to predict the outcome. -We build algorithms that take feature values as input and returns a prediction for the outcome when we don't know the outcome. The machine learning approach is to *train* an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don't know the outcome. +In machine learning, data comes in the form of the *outcome* we want to predict and the *features* that we will use to predict the outcome. We build algorithms that take feature values as input and returns a prediction for the outcome when we don't know the outcome. The machine learning approach is to *train* an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don't know the outcome. Prediction problems can be divided into categorical and continuous outcomes. For categorical outcomes, $Y$ can be any one of $K$ classes. The number of classes can vary greatly across applications. For example, in the digit reader data, $K=10$ with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words or phrases we are trying to detect. Spam detection has two outcomes: spam or not spam. In this book, we denote the $K$ categories with indexes $k=1,\dots,K$. However, for binary data we will use $k=0,1$ for mathematical conveniences that we demonstrate later. @@ -17,9 +16,9 @@ Prediction problems can be divided into categorical and continuous outcomes. For Here we will use $Y$ to denote the outcome and $X_1, \dots, X_p$ to denote features. Note that features are sometimes referred to as predictors or covariates. We consider all these to be synonyms. -The first step in building an algorithm is to understand what are the outcomes and features. In Section @sec-mnist we showed that associated with each digitized image $i$, there is a categorical outcome $Y_i$ and features $X_{i,1}, \dots, X_{i,p}$, with $p=784$. We use bold face $\mathbf{X}_i = (X_{i,1}, \dots, X_{i,p})$ to denote the vector of predictors. Note that we are using the matrix notation described in @sec-matrix-notation. When referring to an arbitrary set of features rather than a specific image, we drop the index $i$ and use $Y$ and $\mathbf{X} = (X_{1}, \dots, X_{p})$. We use upper case variables because, in general, we think of the outcome and predictors as random variables. We use lower case, for example $\mathbf{X} = \mathbf{x}$, to denote observed values. Although, when we code, we stick to lower case. +The first step in building an algorithm is to understand what are the outcomes and features. In Section @sec-mnist, we showed that associated with each digitized image $i$, there is a categorical outcome $Y_i$ and features $X_{i,1}, \dots, X_{i,p}$, with $p=784$. We use bold face $\mathbf{X}_i = (X_{i,1}, \dots, X_{i,p})$ to denote the vector of predictors. Notice that we are using the matrix notation described in @sec-matrix-notation. When referring to an arbitrary set of features rather than a specific image, we drop the index $i$ and use $Y$ and $\mathbf{X} = (X_{1}, \dots, X_{p})$. We use upper case variables because, in general, we think of the outcome and predictors as random variables. We use lower case, for example $\mathbf{X} = \mathbf{x}$, to denote observed values. Although, when we code, we adhere to lower case. -The machine learning task is to build an algorithm that returns a prediction for any of the possible values of the features. Here, we will learn several approaches to building these algorithms. Although at this point it might seem impossible to achieve this, we will start with simple examples and build up our knowledge until we can attack more complex ones. In fact, we start with an artificially simple example with just one predictor and then move on to a slightly more realistic example with two predictors. Once we understand these, we will attack real-world machine learning challenges involving many predictors. +The machine learning task is to build an algorithm that returns a prediction for any of the possible values of the features. Here, we will learn several approaches to building these algorithms. Although at this point it might seem impossible to achieve this, we will start with basic examples and build up our knowledge until we can tackle more complex ones. In fact, we start with an artificially simple example with just one predictor and then move on to a slightly more realistic example with two predictors. Once we understand these, we will address real-world machine learning challenges involving many predictors. ## The machine learning challenge @@ -65,8 +64,8 @@ if(knitr::is_html_output()){ } ``` -When the output is continuous we refer to the machine learning task as *prediction*, and the main output of the model is a function $f$ that automatically produces a prediction, denoted with $\hat{y}$, for any set of predictors: $\hat{y} = f(x_1, x_2, \dots, x_p)$. We use the term *actual outcome* to denote what we ended up observing. So we want the prediction $\hat{y}$ to match the actual outcome $y$ as well as possible. Because our outcome is continuous, our predictions $\hat{y}$ will not be either exactly right or wrong, but instead we will determine an *error* defined as the difference between the prediction and the actual outcome $y - \hat{y}$. +When the output is continuous, we refer to the machine learning task as *prediction*, and the main output of the model is a function $f$ that automatically produces a prediction, denoted with $\hat{y}$, for any set of predictors: $\hat{y} = f(x_1, x_2, \dots, x_p)$. We use the term *actual outcome* to denote what we end up observing. So we want the prediction $\hat{y}$ to match the actual outcome $y$ as best as possible. Because our outcome is continuous, our predictions $\hat{y}$ will not be either exactly right or wrong, but instead we will determine an *error* defined as the difference between the prediction and the actual outcome $y - \hat{y}$. -When the outcome is categorical, we refer to the machine learning task as *classification*, and the main output of the model will be a *decision rule* which prescribes which of the $K$ classes we should predict. In this scenario, most models provide functions of the predictors for each class $k$, $f_k(x_1, x_2, \dots, x_p)$, that are used to make this decision. When the data is binary a typical decision rules looks like this: if $f_1(x_1, x_2, \dots, x_p) > C$, predict category 1, if not the other category, with $C$ a predetermined cutoff. Because the outcomes are categorical, our predictions will be either right or wrong. +When the outcome is categorical, we refer to the machine learning task as *classification*, and the main output of the model will be a *decision rule* which prescribes which of the $K$ classes we should predict. In this scenario, most models provide functions of the predictors for each class $k$, $f_k(x_1, x_2, \dots, x_p)$, that are used to make this decision. When the data is binary, a typical decision rules looks like this: if $f_1(x_1, x_2, \dots, x_p) > C$, predict category 1, if not the other category, with $C$ a predetermined cutoff. Because the outcomes are categorical, our predictions will be either right or wrong. -Notice that these terms vary among courses, text books, and other publications. Often *prediction* is used for both categorical and continuous outcomes, and the term *regression* can be used for the continuous case. Here we avoid using *regression* to avoid confusion with our previous use of the term *linear regression*. In most cases it will be clear if our outcomes are categorical or continuous, so we will avoid using these terms when possible. +Notice that these terms vary among courses, textbooks, and other publications. Often *prediction* is used for both categorical and continuous outcomes, and the term *regression* can be used for the continuous case. Here we avoid using *regression* to avoid confusion with our previous use of the term *linear regression*. In most cases, it will be clear if our outcomes are categorical or continuous, so we will avoid using these terms when possible. diff --git a/ml/smoothing.qmd b/ml/smoothing.qmd index 6ab4943..c7a264f 100644 --- a/ml/smoothing.qmd +++ b/ml/smoothing.qmd @@ -22,9 +22,9 @@ Part of what we explain in this section are the assumptions that permit us to ex ## Example: Is it a 2 or a 7? {#sec-two-or-seven} -To motivate the need for smoothing and make the connection with machine learning, we will construct a simplified version of the MNIST dataset with just two classes for the outcome and two predictors. Specifically, we define the challenge as building an algorithm that can determine if a digit is a 2 or 7 from the proportion of dark pixels in the upper left quadrant ($X_1$) and the lower right quadrant ($X_2$). We also selected a random sample of 1,000 digits, 500 in the training set and 500 in the test set. We provide this dataset in the `dslabs` package: +To motivate the need for smoothing and make the connection with machine learning, we will construct a simplified version of the MNIST dataset with just two classes for the outcome and two predictors. Specifically, we define the challenge as building an algorithm that can determine if a digit is a 2 or 7 from the proportion of dark pixels in the upper left quadrant ($X_1$) and the lower right quadrant ($X_2$). We also selected a random sample of 1,000 digits, 500 in the training set and 500 in the test set. FIX We provide this dataset in the `dslabs` package: -So for the training data we have $n=500$ observed outcomes $y_1,\dots,y_n$, with $Y$ defined as $1$ if the digit is 7 and 0 if it's 2, and $n=500$ features $\matbh{x}_1, \dots, \matbh{x}_n$, with each feature a two-dimensional point $\mathbf{x}_i = (x_{i,1}, x_{i,2})^\top$. Here is a plot of the $x_2$s versus the $x_1$s with color determinining if $y$ is 1 (blue) or 0 (red). +So for the training data, we have $n=500$ observed outcomes $y_1,\dots,y_n$, with $Y$ defined as $1$ if the digit is 7 and 0 if it's 2, and $n=500$ features $\matbh{x}_1, \dots, \matbh{x}_n$, with each feature a two-dimensional point $\mathbf{x}_i = (x_{i,1}, x_{i,2})^\top$. Here is a plot of the $x_2$s versus the $x_1$s with color determinining if $y$ is 1 (blue) or 0 (red): ```{r two-or-seven-scatter, warning=FALSE, message=FALSE, cache=FALSE} library(caret) @@ -85,7 +85,7 @@ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}) = \mbox{Pr}(Y=1 \mid X $$ We fit can fit this model using least squares and obtain an estimate $\hat{p}(\mathbf{x})$ by using the least square estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$. -We define a decision rule by predictin $\hat{y}=1$ if $\hat{p}(\mathbf{x})>0.5$ and 0 otherwise. +We define a decision rule by predicting $\hat{y}=1$ if $\hat{p}(\mathbf{x})>0.5$ and 0 otherwise. ```{r, echo=FALSE} @@ -116,7 +116,7 @@ $$ x_2 = (0.5-\hat{\beta}_0)/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1 $$ -This implies that for the boundary $x_2$ is a linear function of $x_1$. This implies that our regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. Below is a visual representation of $\hat{p}(\mathbf{x})$ which clearly shows how it fails to capture the shape of $p(\mathbf{x})$. +This implies that for the boundary, $x_2$ is a linear function of $x_1$, which suggests that our regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. Below is a visual representation of $\hat{p}(\mathbf{x})$ which clearly shows how it fails to capture the shape of $p(\mathbf{x})$: ```{r regression-p-hat, echo=FALSE, out.width="100%", fig.height=3, fig.width=7} p_hat <- predict(fit, newdata = mnist_27$true_p) @@ -140,7 +140,7 @@ We need something more flexible: a method that permits estimates with shapes oth ## Signal plus noise model -To explain these concepts, we will focus first on a problem with just one predictor. Specifically, we try to estimate the time trend in the 2008 US popular vote poll margin (difference between Obama and McCain). Later we will learn about methods such as k-nearest neighbors that can be used to smooth with higher dimensions. +To explain these concepts, we will focus first on a problem with just one predictor. Specifically, we try to estimate the time trend in the 2008 US popular vote poll margin (the difference between Obama and McCain). Later we will learn about methods, such as k-nearest neighbors, that can be used to smooth with higher dimensions. ```{r polls-2008-data, warning=FALSE, message=FALSE, cache=FALSE} @@ -305,7 +305,7 @@ There are several functions in R that implement bin smoothers. One example is `k ## Local weighted regression (loess) -A limitation of the bin smoother approach just described is that we need small windows for the approximately constant assumptions to hold. As a result, we end up with a small number of data points to average and obtain imprecise estimates $\hat{f}(x)$. Here we describe how *local weighted regression* (loess) permits us to consider larger window sizes. To do this, we will use a mathematical result, referred to as Taylor's theorem, which tells us that if you look closely enough at any smooth function $f(x)$, it will look like a line. To see why this makes sense, consider the curved edges gardeners make using straight-edged spades: +A limitation of the bin smoother approach just described is that we need small windows for the approximately constant assumptions to hold. As a result, we end up with a small number of data points to average and obtain imprecise estimates $\hat{f}(x)$. Here we describe how *local weighted regression* (loess) permits us to consider larger window sizes. To do this, we will use a mathematical result, referred to as Taylor's theorem, which tells us that if you look closely enough at any smooth function $f(x)$, it will look like a line. To see why this makes sense, consider the curved edges gardeners make using straight-edged spades: ![](img/garden.png) @@ -345,7 +345,7 @@ tmp |> facet_wrap(~center) ``` -The fitted value at $x_0$ becomes our estimate $\hat{f}(x_0)$. Below we show the procedure happening as we move from the -155 up to 0. +The fitted value at $x_0$ becomes our estimate $\hat{f}(x_0)$. Below we show the procedure happening as we move from the -155 up to 0: ```{r loess-animation, echo=FALSE, warning=FALSE} library(broom) @@ -545,11 +545,11 @@ $$ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2). $$ -with $x_1$ and $x_2$ the two predictors defined in Section @sec-two-or-seven. In this example, the 0s and 1s we observe are "noisy" because for some regions the probabilities $p(\mathbf{x})$ are not that close to 0 or 1. So we need to estimate $p(\mathbf{x})$. Smoothing is an alternative to accomplishing this. In @sec-two-or-seven we saw that linear regression was not flexible enough to capture the non-linear nature of $p(\mathbf{x})$, thus smoothing approaches provide an improvement. In @sec-knn-cv-intro we describe a popular machine learning algorithm, k-nearest neighbors, which is based on the concept of smoothing. +with $x_1$ and $x_2$ the two predictors defined in FIX Section @sec-two-or-seven. In this example, the 0s and 1s we observe are "noisy" because for some regions the probabilities $p(\mathbf{x})$ are not that close to 0 or 1. We therefore need to estimate $p(\mathbf{x})$. Smoothing is an alternative to accomplishing this. In @sec-two-or-seven, we saw that linear regression was not flexible enough to capture the non-linear nature of $p(\mathbf{x})$, thus smoothing approaches provide an improvement. In @sec-knn-cv-intro, we describe a popular machine learning algorithm, k-nearest neighbors, which is based on the concept of smoothing. ## Exercises -1\. In the wrangling part of this book, we used the code below to obtain mortality counts for Puerto Rico for 2015-2018. +1\. FIX In the wrangling part of this book, we used the code below to obtain mortality counts for Puerto Rico for 2015-2018. ```{r, eval=FALSE} library(dslabs)