From 46acc8ec6025a13f9cb13bf891a4b0ef3e0adb56 Mon Sep 17 00:00:00 2001 From: Alexandra Nones Date: Fri, 22 Dec 2023 11:40:24 -0500 Subject: [PATCH] typos --- ml/algorithms.qmd | 85 +++++++++++++++++++++---------------------- ml/clustering.qmd | 26 ++++++------- ml/ml-in-practice.qmd | 65 ++++++++++++++++----------------- 3 files changed, 87 insertions(+), 89 deletions(-) diff --git a/ml/algorithms.qmd b/ml/algorithms.qmd index 32d5988..52c2c04 100644 --- a/ml/algorithms.qmd +++ b/ml/algorithms.qmd @@ -8,7 +8,7 @@ library(caret) library(dslabs) ``` -Then, in @sec-ml-in-practice, we show an efficient way to implement these ideas using the **caret** package. +Later, in @sec-ml-in-practice, we show an efficient way to implement these ideas using the **caret** package. ## Logistic regression @@ -38,7 +38,7 @@ p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response") y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2)) ``` -We can then find the the _maximum likelihood estimates_ (MLE) of the model parameters and predict using the estimate $p(\mathbf{x})$ to obtain an accuracy of `r confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]`. We see that logistic regression performs similarly to regression. This is not surprising, given that the estimate of $\hat{p}(\mathbf{x})$ looks similar as well: +We can then find the _maximum likelihood estimates_ (MLE) of the model parameters and predict using the estimate $p(\mathbf{x})$ to obtain an accuracy of `r confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]`. We see that logistic regression performs similarly to regression. This is not surprising given that the estimate of $\hat{p}(\mathbf{x})$ looks similar as well: ```{r logistic-p-hat, echo = FALSE} @@ -50,7 +50,7 @@ mnist_27$true_p |> mutate(p_hat = p_hat) |> stat_contour(breaks = c(0.5), color = "black") ``` -Just like regression, the decision rule is a line, a fact that can be corroborated mathematically, defining $g(x) = \log \{x/(1-x)\}$, we have: +Just like regression, the decision rule is a line, a fact that can be corroborated mathematically. Defining $g(x) = \log \{x/(1-x)\}$, we have: $$ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies @@ -58,7 +58,7 @@ g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies x_2 = -\hat{\beta}_0/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1 $$ -Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. FIX We now described some techniques that estimate the conditional probability in a way that is more flexible. +Thus, much like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. FIX We now described some techniques that estimate the conditional probability in a way that is more flexible. :::{.callout-note} You are ready to do exercises 1 - 11. @@ -109,7 +109,7 @@ You are ready to do exercises 12 - 13. ## Generative models -We have described how, when using squared loss, the conditional expectation provide the best approach to developing a decision rule. 
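A quick simulation helps illustrate this point for a single value of $\mathbf{x}$; the conditional probability used here, 0.7, is an arbitrary number chosen only for illustration:

```{r, eval = FALSE}
# Minimal sketch: approximate the expected squared loss of a constant prediction c
# using draws of Y from its conditional distribution given x, then find the minimizer.
set.seed(1995)
y <- rbinom(10^5, 1, 0.7)
c_grid <- seq(0, 1, 0.01)
expected_loss <- sapply(c_grid, function(c) mean((y - c)^2))
c_grid[which.min(expected_loss)]
```

Up to simulation error, the minimizer is the conditional expectation, which for a binary outcome is $\mbox{Pr}(Y = 1 \mid \mathbf{X}=\mathbf{x}) = 0.7$.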
In a binary case, the smallest true error we can achieve is determined by Bayes' rule, which is a decision rule based on the true conditional probability: +We have described how, when using squared loss, the conditional expectation provides the best approach to developing a decision rule. In a binary case, the smallest true error we can achieve is determined by Bayes' rule, which is a decision rule based on the true conditional probability: $$ p(\mathbf{x}) = \mbox{Pr}(Y = 1 \mid \mathbf{X}=\mathbf{x}) @@ -121,7 +121,7 @@ However, Bayes' theorem tells us that knowing the distribution of the predictors ### Naive Bayes -Recall that Bayes rule tells us that we can rewrite $p(\mathbf{x})$ like this: +Recall that Bayes rule tells us that we can rewrite $p(\mathbf{x})$ as follows: $$ p(\mathbf{x}) = \mbox{Pr}(Y = 1|\mathbf{X}=\mathbf{x}) = \frac{f_{\mathbf{X}|Y = 1}(\mathbf{x}) \mbox{Pr}(Y = 1)} @@ -238,7 +238,7 @@ abline(v = 67, lty = 2) Quadratic Discriminant Analysis (QDA) is a version of Naive Bayes in which we assume that the distributions $p_{\mathbf{X}|Y = 1}(x)$ and $p_{\mathbf{X}|Y = 0}(\mathbf{x})$ are multivariate normal. The simple example we described in the previous section is actually QDA. Let's now look at a slightly more complicated case: the 2 or 7 example. -In this case, we have two predictors so we assume each one is bivariate normal. This implies that we need to estimate two averages, two standard deviations, and a correlation for each case $Y = 1$ and $Y = 0$. Once we have these, we can approximate the distributions $f_{X_1,X_2|Y = 1}$ and $f_{X_1, X_2|Y = 0}$. We can easily estimate parameters from the data: +In this example, we have two predictors so we assume each one is bivariate normal. This implies that we need to estimate two averages, two standard deviations, and a correlation for each case $Y = 1$ and $Y = 0$. Once we have these, we can approximate the distributions $f_{X_1,X_2|Y = 1}$ and $f_{X_1, X_2|Y = 0}$. We can easily estimate parameters from the data: ```{r} params <- mnist_27$train |> @@ -249,14 +249,14 @@ params <- mnist_27$train |> params ``` -With these estimates in place, all we need are the prevalence $\pi$ to to compute +With these estimates in place, all we need are the prevalence $\pi$ to compute: $$ \hat{p}(\mathbf{x})= \frac{\hat{f}_{\mathbf{X}|Y = 1}(\mathbf{x}) \hat{\pi}} { \hat{f}_{\mathbf{X}|Y = 0}(x)(1-\hat{\pi}) + \hat{f}_{\mathbf{X}|Y = 1}(\mathbf{x})\hat{\pi} } $$ -Note that the densities $f$ are bivariate normal distributions. Here we provide a visual way of showing the approach. We plot the data and use contour plots to give an idea of what the two estimated normal densities look like (we show the curve representing a region that includes 95% of the points): +Note that the densities $f$ are bivariate normal distributions. Here we provide a visual way of showing the approach. 
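But first, as a numerical aside, note that the formula above can be evaluated directly at any point $\mathbf{x}$. Here is a minimal sketch that assumes the **mvtnorm** package is available, that `params` holds the averages (`avg_1`, `avg_2`), standard deviations (`sd_1`, `sd_2`), and correlation (`r`) estimated above, and that the prevalence is estimated with the proportion of 7s in the training set:

```{r, eval = FALSE}
# Sketch only: plug the two estimated bivariate normal densities and the
# estimated prevalence into the formula for p-hat(x) at one arbitrary point x0.
library(mvtnorm)
f_hat <- function(x0, p) {
  Sigma <- matrix(c(p$sd_1^2, p$r*p$sd_1*p$sd_2,
                    p$r*p$sd_1*p$sd_2, p$sd_2^2), 2, 2)
  dmvnorm(x0, mean = c(p$avg_1, p$avg_2), sigma = Sigma)
}
pi_hat <- mean(mnist_27$train$y == "7")
x0 <- c(0.25, 0.3)
f_1 <- f_hat(x0, filter(params, y == "7"))
f_0 <- f_hat(x0, filter(params, y == "2"))
f_1 * pi_hat / (f_0 * (1 - pi_hat) + f_1 * pi_hat)
```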
We plot the data and use contour plots to give an idea of what the two estimated normal densities look like (we show the curve representing a region that includes 95% of the points): ```{r qda-explained, echo=FALSE} mnist_27$train |> mutate(y = factor(y)) |> @@ -272,7 +272,7 @@ train_qda <- MASS::qda(y ~ ., data = mnist_27$train) y_hat <- predict(train_qda, mnist_27$test)$class ``` -We see that we obtain relatively good accuracy +We see that we obtain relatively good accuracy: ```{r} confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"] @@ -291,7 +291,7 @@ p2 <- plot_cond_prob(predict(train_qda, newdata = mnist_27$true_p, type = "prob" grid.arrange(p2, p1, nrow = 1) ``` -One reason QDA does not work as well as the kernel methods is because the assumption of normality do not quite hold. Although for the 2s it seems reasonable, for the 7s it does seem to be off. Notice the slight curvature in the points for the 7s: +One reason QDA does not work as well as the kernel methods is because the assumption of normality does not quite hold. Although for the 2s it seems reasonable, for the 7s it does seem to be off. Notice the slight curvature in the points for the 7s: ```{r qda-does-not-fit, out.width = "100%"} mnist_27$train |> mutate(y = factor(y)) |> @@ -306,7 +306,7 @@ QDA can work well here, but it becomes harder to use as the number of predictors ### Linear discriminant analysis -A relatively simple solution to QDA's problem of having too many parameters is to assume that the correlation structure is the same for all classes, which reduces the number of parameters we need to estimate. In this case the the distributions looks like this: +A relatively simple solution to QDA's problem of having too many parameters is to assume that the correlation structure is the same for all classes, which reduces the number of parameters we need to estimate. In this case, the the distributions looks like this: ```{r lda-explained, echo = FALSE} params <- mnist_27$train |> @@ -337,7 +337,7 @@ train_lda <- MASS::lda(y ~ ., data = mnist_27$train) y_hat <- predict(train_lda, mnist_27$test)$class ``` -Now the size of the ellipses as well as the angles are the same. This is because they are assumed to have the same standard deviations and correlations. Although this added constrain lowers the number of parameters, the rigidity lowers our accuracy to: +Now the size of the ellipses as well as the angles are the same. This is because they are assumed to have the same standard deviations and correlations. Although this added constraint lowers the number of parameters, the rigidity lowers our accuracy to: ```{r} confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"] @@ -388,9 +388,9 @@ img_path <- "img" ### The curse of dimensionality -We described how methods such as LDA and QDA are not meant to be used with many predictors $p$ because the number of parameters that we need to estimate becomes too large. For example, with the digits example $p = 784$, we would have over 600,000 parameters with LDA, and we would multiply that by the number of classes for QDA. Kernel methods such as kNN or local regression do not have model parameters to estimate. However, they also face a challenge when multiple predictors are used due to what is referred to as the *curse of dimensionality*. The *dimension* here refers to the fact that when we have $p$ predictors, the distance between two observations is computed in $p$-dimensional space. 
+We described how methods such as LDA and QDA are not meant to be used with many predictors $p$ because the number of parameters that we need to estimate becomes too large. For example, with the digits example $p = 784$, we would have over 600,000 parameters with LDA, and we would multiply that by the number of classes for QDA. Kernel methods, such as kNN or local regression, do not have model parameters to estimate. However, they also face a challenge when multiple predictors are used due to what is referred to as the *curse of dimensionality*. The *dimension* here refers to the fact that when we have $p$ predictors, the distance between two observations is computed in $p$-dimensional space. -A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data. Remember that with larger neighborhoods, our methods lose flexibility, and to be flexible we need to keep the neighborhoods small. +A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data. Remember that with larger neighborhoods, our methods lose flexibility, and to be flexible we need to keep the neighborhoods small. To see how this becomes an issue for higher dimensions, suppose we have one continuous predictor with equally spaced points in the [0,1] interval and we want to create windows that include 1/10th of data. Then it's easy to see that our windows have to be of size 0.1: @@ -405,7 +405,7 @@ points(x[25],y[25],col = "blue", cex = 0.5, pch = 4) text(x[c(15,35)], y[c(15,35)], c("[","]")) ``` -Now, for two predictors, if we decide to keep the neighborhood just as small, 10% for each dimension, we include only 1 point. If we want to include 10% of the data, then we need to increase the size of each side of the square to $\sqrt{.10} \approx .316$: +Now for two predictors, if we decide to keep the neighborhood just as small, 10% for each dimension, we include only 1 point. If we want to include 10% of the data, then we need to increase the size of each side of the square to $\sqrt{.10} \approx .316$: ```{r curse-of-dim-2, echo = FALSE, fig.width = 7, fig.height = 3.5, out.width = "50%"} rafalib::mypar(1,2) @@ -465,7 +465,7 @@ fit <- train(region ~ ., method = "knn", data = olive) ``` -Using kNN we can achieve a test set accuracy of `r fit$results[1,2]`. However, a bit of data exploration reveals that we should be able to do even better. For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia. +Using kNN, we can achieve a test set accuracy of `r fit$results[1,2]`. However, a bit of data exploration reveals that we should be able to do even better. For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia. ```{r olive-eda, fig.height = 3, fig.width = 6, echo = FALSE} olive |> gather(fatty_acid, percentage, -region) |> @@ -486,9 +486,9 @@ olive |> color = "black", lty = 2) ``` -In Section @sec-predictor-space we defined _predictor spaces_. The predictor space here consists of eight-dimensional points with values between 0 and 100. 
In the plot above, we show the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a one category. +In @sec-predictor-space, we defined _predictor spaces_, which in this case consists of eight-dimensional points with values between 0 and 100. In the plot above, we show the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a one category. -This in turn can be used to define an algorithm with perfect accuracy. Specifically, we define the following decision rule: if eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than $10.535$, predict Sardinia, and if lower, predict Northern Italy. We can draw this decision tree like this: +This in turn can be used to define an algorithm with perfect accuracy. Specifically, we define the following decision rule: if eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than $10.535$, predict Sardinia, and if lower, predict Northern Italy. We can draw this decision tree as follows: ```{r olive-tree, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 4.5, out.width = "50%"} library(rpart) @@ -521,7 +521,7 @@ The general idea here is to build a decision tree and, at the end of each *node* But how do we decide on the partition $R_1, R_2, \ldots, R_J$ and how do we choose $J$? Here is where the algorithm gets a bit complicated. -Regression trees create partitions recursively. We start the algorithm with one partition, the entire predictor space. In our simple first example, this space is the interval \[-155, 1\]. But after the first step we will have two partitions. After the second step we will split one of these partitions into two and will have three partitions, then four, and so on. We describe how we pick the partition to further partition, and when to stop, later. +Regression trees create partitions recursively. We start the algorithm with one partition, the entire predictor space. In our simple first example, this space is the interval \[-155, 1\]. But after the first step, we will have two partitions. After the second step, we will split one of these partitions into two and will have three partitions, then four, and so on. Later we describe how we pick the partition to further partition, and when to stop. For each existing partition, we find a predictor $j$ and value $s$ that define two new partitions, which we will call $R_1(j,s)$ and $R_2(j,s)$, that split our observations in the current partition by asking if $x_j$ is bigger than $s$: @@ -529,7 +529,7 @@ $$ R_1(j,s) = \{\mathbf{x} \mid x_j < s\} \mbox{ and } R_2(j,s) = \{\mathbf{x} \mid x_j \geq s\} $$ -In our current example we only have one predictor, so we will always choose $j = 1$, but in general this will not be the case. Now, after we define the new partitions $R_1$ and $R_2$, and we decide to stop the partitioning, we compute predictors by taking the average of all the observations $y$ for which the associated $\mathbf{x}$ is in $R_1$ and $R_2$. We refer to these two as $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ respectively. +In our current example, we only have one predictor, so we will always choose $j = 1$, but in general this will not be the case. 
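To make the notation concrete, here is a minimal sketch of the two partitions for the 2008 poll data used in this example (the `polls_2008` dataset from **dslabs**), with `day` as the only predictor and a split point $s$ chosen arbitrarily for illustration:

```{r, eval = FALSE}
# Illustration only: for the predictor day and an arbitrary split point s,
# R1 contains the observations with day < s and R2 the rest.
s <- -62
R1 <- filter(polls_2008, day < s)
R2 <- filter(polls_2008, day >= s)
c(nrow(R1), nrow(R2))
```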
Now after we define the new partitions $R_1$ and $R_2$, and we decide to stop the partitioning, we compute predictors by taking the average of all the observations $y$ for which the associated $\mathbf{x}$ is in $R_1$ and $R_2$. We refer to these two as $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ respectively. But how do we pick $j$ and $s$? Basically we find the pair that minimizes the residual sum of squares (RSS): @@ -547,7 +547,7 @@ library(rpart) fit <- rpart(margin ~ ., data = polls_2008) ``` -Here, there is only one predictor. Thus we do not have to decide which predictor $j$ to split by, we simply have to decide what value $s$ we use to split. We can visually see where the splits were made: +In this case, there is only one predictor. Thus we do not have to decide which predictor $j$ to split by, we simply have to decide what value $s$ we use to split. We can visually see where the splits were made: ```{r, eval = FALSE} plot(fit, margin = 0.1) @@ -570,7 +570,7 @@ polls_2008 |> geom_step(aes(day, y_hat), col = "red") ``` -Note that the algorithm stopped partitioning at 8. The decision is made based on a measure referred to as *complexity parameter* (cp). Every time we split and define two new partitions, our training set RSS decreases. This is because with more partitions, our model has more flexibility to adapt to the training data. In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value. To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added. This parameter is referred to as the *complexity parameter* (cp). The RSS must improve by a factor of cp for the new partition to be added. Large values of cp will therefore force the algorithm to stop earlier which results in fewer nodes. +Note that the algorithm stopped partitioning at 8. The decision is made based on a measure referred to as *complexity parameter* (cp). Every time we split and define two new partitions, our training set RSS decreases. This is because with more partitions, our model has more flexibility to adapt to the training data. In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value. To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added. This parameter is referred to as the *complexity parameter* (cp). The RSS must improve by a factor of cp for the new partition to be added. Large values of cp will therefore force the algorithm to stop earlier, which results in fewer nodes. However, cp is not the only parameter used to decide if we should partition a current partition or not. Another common parameter is the minimum number of observations required in a partition before partitioning it further. The argument used in the `rpart` function is `minsplit` and the default is 20. The `rpart` implementation of regression trees also permits users to determine a minimum number of observations in each node. The argument is `minbucket` and defaults to `round(minsplit/3)`. @@ -623,15 +623,15 @@ The first difference is that we form predictions by calculating which class is t The second is that we can no longer use RSS to choose the partition. While we could use the naive approach of looking for partitions that minimize training error, better performing approaches use more sophisticated metrics. 
Two of the more popular ones are the *Gini Index* and *Entropy*. -In a perfect scenario, the outcomes in each of our partitions are all of the same category since this will permit perfect accuracy. The *Gini Index* is going to be 0 in this scenario, and become larger the more we deviate from this scenario. To define the Gini Index, we define $\hat{p}_{j,k}$ as the proportion of observations in partition $j$ that are of class $k$. The Gini Index is defined as +In a perfect scenario, the outcomes in each of our partitions are all of the same category since this will permit perfect accuracy. The *Gini Index* is going to be 0 in this scenario, and become larger the more we deviate from this scenario. To define the Gini Index, we define $\hat{p}_{j,k}$ as the proportion of observations in partition $j$ that are of class $k$. The Gini Index is defined as: $$ \mbox{Gini}(j) = \sum_{k = 1}^K \hat{p}_{j,k}(1-\hat{p}_{j,k}) $$ -If you study the formula carefully you will see that it is in fact 0 in the perfect scenario described above. +If you study the formula carefully, you will see that it is in fact 0 in the perfect scenario described above. -*Entropy* is a very similar quantity, defined as +*Entropy* is a very similar quantity, defined as: $$ \mbox{entropy}(j) = -\sum_{k = 1}^K \hat{p}_{j,k}\log(\hat{p}_{j,k}), \mbox{ with } 0 \times \log(0) \mbox{ defined as }0 @@ -646,7 +646,7 @@ train_rpart <- train(y ~ ., y_hat <- predict(train_rpart, mnist_27$test) ``` -If we use classification tree performs on the 2 or 7 example we achieve an accuracy of `r confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]` which is better than regression, but is not as good as what we achieved with kernel methods. +FIX If we use classification tree performs on the 2 or 7 example, we achieve an accuracy of `r confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"]` which is better than regression, but is not as good as what we achieved with kernel methods. The plot of the estimated conditional probability shows us the limitations of classification trees: @@ -682,14 +682,14 @@ So how do we get different decision trees from a single training set? For this, 2\. A large number of features is typical in machine learning challenges. Often, many features can be informative but including them all in the model may result in overfitting. The second way random forests induce randomness is by randomly selecting features to be included in the building of each tree. A different random subset is selected for each tree. This reduces correlation between trees in the forest, thereby improving prediction accuracy. -To illustrate how the first steps can result in smoother estimates we will demonstrate by fitting a random forest to the 2008 polls data. We will use the `randomForest` function in the **randomForest** package: +To illustrate how the first steps can result in smoother estimates, we will fit a random forest to the 2008 polls data. We will use the `randomForest` function in the **randomForest** package: ```{r polls-2008-rf, message = FALSE, warning = FALSE} library(randomForest) fit <- randomForest(margin ~ ., data = polls_2008) ``` -Note that if we apply the function `plot` to the resulting object, stored in `fit`, we see how the error rate of our algorithm changes as we add trees. 
+Note that if we apply the function `plot` to the resulting object, stored in `fit`, we see how the error rate of our algorithm changes as we add trees: ```{r, eval = FALSE} rafalib::mypar() @@ -701,7 +701,7 @@ rafalib::mypar() plot(fit) ``` -We can see that in this case, the accuracy improves as we add more trees until about 30 trees where accuracy stabilizes. +FIX [until WE HAVE about] In this case, the accuracy improves as we add more trees until about 30 trees where accuracy stabilizes. The resulting estimate for this random forest can be seen like this: @@ -713,7 +713,7 @@ polls_2008 |> geom_line(aes(day, y_hat), col = "red") ``` -Notice that the random forest estimate is much smoother than what we achieved with the regression tree in the previous section. This is possible because the average of many step functions can be smooth. We can see this by visually examining how the estimate changes as we add more trees. In the following figure you see each of the bootstrap samples for several values of $b$ and for each one we see the tree that is fitted in grey, the previous trees that were fitted in lighter grey, and the result of averaging all the trees estimated up to that point. +Notice that the random forest estimate is much smoother than what we achieved with the regression tree in the previous section. This is possible because the average of many step functions can be smooth. We can see this by visually examining how the estimate changes as we add more trees. In the following figure, you see each of the bootstrap samples for several values of $b$ and for each one we see the tree that is fitted in grey, the previous trees that were fitted in lighter grey, and the result of averaging all the trees estimated up to that point. ```{r rf-animation, echo = FALSE, out.width = "100%"} library(rafalib) @@ -824,7 +824,7 @@ grid.arrange(p2, p1, nrow = 1) train_rf_2 <- randomForest(y ~ ., data = mnist_27$train, nodesize = 31) ``` -Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother. This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree. The larger this minimum, the smoother the final estimate will be. If we use a nodeize of 31, the number of neighbors we used with knn, we get an accuracy of `confusionMatrix(predict(train_rf_2, mnist_27$test), mnist_27$test$y)$overall["Accuracy"]`. The selected model improves accuracy and provides a smoother estimate: +Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother. This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree. The larger this minimum, the smoother the final estimate will be. If we use a node size of 31, the number of neighbors we used with knn, we get an accuracy of `confusionMatrix(predict(train_rf_2, mnist_27$test), mnist_27$test$y)$overall["Accuracy"]`. The selected model improves accuracy and provides a smoother estimate: ```{r cond-prob-final-rf, echo = FALSE, out.width = "100%"} p1 <- plot_cond_prob() + ggtitle("True conditional probability") @@ -836,13 +836,13 @@ grid.arrange(p2, p1, nrow = 1) ``` -Random forest performs better than trees in all the examples we have considered. However, a disadvantage of random forests is that we lose interpretability. 
An approach that helps with interpretability is to examine *variable importance*. To define *variable importance* we count how often a predictor is used in the individual trees. You can learn more about *variable importance* in an advanced machine learning book[^trees-3]. The **caret** package includes the function `varImp` that extracts variable importance from any model in which the calculation is implemented. We give an example on how we use variable importance in the next section. +Random forest performs better than trees in all the examples we have considered. However, a disadvantage of random forests is that we lose interpretability. An approach that helps with interpretability is to examine *variable importance*. To define *variable importance*, we count how often a predictor is used in the individual trees. You can learn more about *variable importance* in an advanced machine learning book[^trees-3]. The **caret** package includes the function `varImp` that extracts variable importance from any model in which the calculation is implemented. We give an example on how we use variable importance in the next section. [^trees-3]: https://web.stanford.edu/\~hastie/Papers/ESLII.pdf ## Exercises -1\. Create a dataset using the following code. +1\. Create a dataset using the following code: ```{r, eval = FALSE} n <- 100 @@ -851,7 +851,7 @@ dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) |> data.frame() |> setNames(c("x", "y")) ``` -Use the __caret__ package to partition into a test and training set of equal size. Train a linear model and report the RMSE. Repeat this exercise 100 times and make a histogram of the RMSEs and report the average and standard deviation. Hint: adapt the code shown earlier like this +Use the __caret__ package to partition into a test and training set of equal size. Train a linear model and report the RMSE. Repeat this exercise 100 times and make a histogram of the RMSEs and report the average and standard deviation. Hint: adapt the code shown earlier like this: ```{r, eval = FALSE} library(caret) @@ -891,7 +891,7 @@ dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) |> Repeat the exercise and note what happens to the RMSE now. -5\. Which of the following best explains why the RMSE in exercise 4 is so much lower than exercise 1. +5\. Which of the following best explains why the RMSE in exercise 4 is so much lower than exercise 1: a. It is just luck. If we do it again, it will be larger. b. The Central Limit Theorem tells us the RMSE is normal. @@ -978,18 +978,18 @@ Fit QDA using the `qda` function in the **MASS** package the create a confusion a\. It is a two-by-two table. b\. Because we have three classes, it is a two-by-three table. c\. Because we have three classes, it is a three-by-three table. -d\. Confusion matrices only make sense when the outcomes are binary. +d\. Confusion matrices only make sense when the outcomes are binary. -13\. The `byClass` component returned by the `confusionMatrix` oject provides sensitivity and specificity for each class. Because these terms only make sense when data is binary, each row reprents sensitivity and specificity when a particular class is 1 (positives) and the other two are considered 0s (negatives). Based on the values returned by `confusionMatrix` which of the following is the most common mistake: +13\. The `byClass` component returned by the `confusionMatrix` object provides sensitivity and specificity for each class. 
Because these terms only make sense when data is binary, each row represents sensitivity and specificity when a particular class is 1 (positives) and the other two are considered 0s (negatives). Based on the values returned by `confusionMatrix`, which of the following is the most common mistake: a\. Calling 1s either a 2 or 7. b\. Calling 2s either a 1 or 7. c\. Calling 7s either a 1 or 2. d\. All mistakes are equally common. -14. Create a grid of `x_1` and `x_2` using +14. Create a grid of `x_1` and `x_2` using: ```{r, eval = FALSE} GS <- 150 @@ -998,8 +998,7 @@ new_x <- with(mnist_127$train, x_2 = seq(min(x_2), max(x_2), len = GS))) ``` -then visualize the decision rule by coloring -the regions of the Cartesian plan to represent the label that would be called in that region. +then visualize the decision rule by coloring the regions of the Cartesian plan to represent the label that would be called in that region. 14\. Repeat exercise 13 but for LDA. Which of the following explains why LDA has worse accuracy: @@ -1012,7 +1011,7 @@ d. LDA can't be used with more than one class. 15\. Now repear exercise 13 for kNN with $k=31$ and compute and compare the overall accuracy for all three methods. -16. To understand how a simple method like kNN can outperform a model that explicitly tries to emulate Bayes' rule, explore the conditional conditional distributions of `x_1` and `x_2` to see if the normal approximation hold. Generative models can be very powerful, but only when we are able to successfully approximate the joint distribution of predictors conditioned on each class. +16. To understand how a simple method like kNN can outperform a model that explicitly tries to emulate Bayes' rule, explore the conditional distributions of `x_1` and `x_2` to see if the normal approximation holds. Generative models can be very powerful, but only when we are able to successfully approximate the joint distribution of predictors conditioned on each class. 17\. Earlier we used logistic regression to predict sex from height. Use kNN to do the same. Use the code described in this chapter to select the $F_1$ measure and plot it against $k$. Compare to the $F_1$ of about 0.6 we obtained with regression. @@ -1055,7 +1054,7 @@ table(tissue_gene_expression$y) ``` -Fit a random forest using the `randomForest` function the package **randomForest** then use the `varImp` function to see which are the top 10 most predictive genes. Make a histogram of the reported importance to get an idea of the distribution of the importance values. +Fit a random forest using the `randomForest` function in the package **randomForest**. Then use the `varImp` function to see which are the top 10 most predictive genes. Make a histogram of the reported importance to get an idea of the distribution of the importance values. ```{r} library(randomForest) diff --git a/ml/clustering.qmd b/ml/clustering.qmd index 3674426..cf6f8d9 100644 --- a/ml/clustering.qmd +++ b/ml/clustering.qmd @@ -1,8 +1,8 @@ # Clustering {#sec-clustering} -The algorithms we have described up to now are examples of a general approach referred to as _supervised_ machine learning. The name comes from the fact that we use the outcomes in a training set to _supervise_ the creation of our prediction algorithm. There is another subset of machine learning referred to as _unsupervised_. In this subset we do not necessarily know the outcomes and instead are interested in discovering groups. 
These algorithms are also referred to as _clustering_ algorithms since predictors are used to define _clusters_. +The algorithms we have described up to now are examples of a general approach referred to as _supervised_ machine learning. The name comes from the fact that we use the outcomes in a training set to _supervise_ the creation of our prediction algorithm. There is another subset of machine learning referred to as _unsupervised_. In this subset, we do not necessarily know the outcomes and instead are interested in discovering groups. These algorithms are also referred to as _clustering_ algorithms since predictors are used to define _clusters_. -In the two examples we have shown here, clustering would not be very useful. In the first example, if we are simply given the heights we will not be able to discover two groups, males and females, because the intersection is large. In the second example, we can see from plotting the predictors that discovering the two digits, two and seven, will be challenging. +In the two examples we have shown here, clustering would not be very useful. In the first example, if we are simply given the heights, we will not be able to discover two groups, males and females, because the intersection is large. In the second example, we can see from plotting the predictors that discovering the two digits, two and seven, will be challenging. However, there are applications in which unsupervised learning can be a powerful technique, in particular as an exploratory tool. @@ -50,9 +50,9 @@ rafalib::mypar() plot(h, cex = 0.65, main = "", xlab = "") ``` -This graph gives us an approximation between the distance between any two movies. To find this distance we find the first location, from top to bottom, where these movies split into two different groups. The height of this location is the distance between these two groups. So, for example, the distance between the three _Star Wars_ movies is 8 or less, while the distance between _Raiders of the Lost of Ark_ and _Silence of the Lambs_ is about 17. +This graph gives us an approximation between the distance between any two movies. To find this distance, we find the first location, from top to bottom, where these movies split into two different groups. The height of this location is the distance between these two groups. So, for example, the distance between the three _Star Wars_ movies is 8 or less, while the distance between _Raiders of the Lost of Ark_ and _Silence of the Lambs_ is about 17. -To generate actual groups we can do one of two things: 1) decide on a minimum distance needed for observations to be in the same group or 2) decide on the number of groups you want and then find the minimum distance that achieves this. The function `cutree` can be applied to the output of `hclust` to perform either of these two operations and generate groups. +To generate actual groups, we can do one of two things: 1) decide on a minimum distance needed for observations to be in the same group or 2) decide on the number of groups you want and then find the minimum distance that achieves this. The function `cutree` can be applied to the output of `hclust` to perform either of these two operations and generate groups. 
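For example, option 1) corresponds to cutting the tree at a given height; here is a minimal sketch, with the height chosen arbitrarily for illustration:

```{r, eval = FALSE}
# Illustration: movies merged below a height of 14 end up in the same group;
# the number of groups is then determined by the cut rather than set in advance.
groups_by_height <- cutree(h, h = 14)
length(unique(groups_by_height))
```

The code below uses option 2) instead and asks directly for 10 groups.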
```{r} groups <- cutree(h, k = 10) @@ -65,7 +65,7 @@ blockbusters: names(groups)[groups == 4] ``` -And group 9 appears to be nerd movies: +And Group 9 appears to be nerd movies: ```{r} names(groups)[groups == 6] @@ -73,7 +73,7 @@ names(groups)[groups == 6] We can change the size of the group by either making `k` larger or `h` smaller. -Note that we can also explore the data to see if there are clusters of movie raters. +We can also explore the data to see if there are clusters of movie raters: ```{r} h_2 <- hclust(dist(t(x))) @@ -83,9 +83,10 @@ h_2 <- hclust(dist(t(x))) ## k-means To use the k-means clustering algorithm we have to pre-define $k$, the number of clusters we want to define. The k-means algorithm is iterative. -The first step is to define $k$ centers. Then each observation is assigned to the cluster with the closest center to that observation. In a second step the centers are redefined using the observation in each cluster: the column means are used to define a _centroid_. We repeat these two steps until the centers converge. -The `kmeans` function included in R-base does not handle NAs. For illustrative purposes we will fill out the NAs with 0s. In general, the choice of how to fill in missing data, or if one should do it at all, should be made with care. +The first step is to define $k$ centers. Then each observation is assigned to the cluster with the closest center to that observation. In a second step, the centers are redefined using the observation in each cluster: the column means are used to define a _centroid_. We repeat these two steps until the centers converge. + +The `kmeans` function included in R-base does not handle NAs. For illustrative purposes, we will fill out the NAs with 0s. In general, the choice of how to fill in missing data, or if one should do it at all, should be made with care. ```{r} x_0 <- x @@ -109,7 +110,7 @@ k <- kmeans(x_0, centers = 10, nstart = 25) A powerful visualization tool for discovering clusters or patterns in your data is the heatmap. The idea is simple: plot an image of your data matrix with colors used as the visual cue and both the columns and rows ordered according to the results of a clustering algorithm. We will demonstrate this with the `tissue_gene_expression` dataset. We will scale the rows of the gene expression matrix. -The first step is compute: +FIX The first step is compute: ```{r} x <- sweep(tissue_gene_expression$x, 2, colMeans(tissue_gene_expression$x)) @@ -118,7 +119,7 @@ h_2 <- hclust(dist(t(x))) ``` -Now we can use the results of this clustering to order the rows and columns. +Now we can use the results of this clustering to order the rows and columns: ```{r heatmap, out.width="100%", fig.height=7, eval=FALSE} image(x[h_1$order, h_2$order]) @@ -134,8 +135,7 @@ We do not show the results of the heatmap function because there are too many fe ### Filtering features -If the information about clusters is included in just a few features, including all the features can add enough noise that detecting clusters -becomes challenging. One simple approach to try to remove features with no information is to only include those with high variance. In the movie example, a user with low variance in their ratings is not really informative: all the movies seem about the same to them. Here is an example of how we can include only the features with high variance. +FIX If the information about clusters is included in just a few features, including all the features can add enough noise that detecting clusters becomes challenging. 
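A small simulation with made-up data illustrates the problem: one feature separates two groups cleanly, but after adding 99 pure-noise features the between-group distances are barely larger than the within-group distances.

```{r, eval = FALSE}
# Simulated illustration: compare the average between-group distance to the
# average within-group (group 1) distance, with and without 99 noise features.
set.seed(1)
group <- rep(1:2, each = 50)
informative <- matrix(rnorm(100, mean = ifelse(group == 1, 0, 5)), ncol = 1)
noise <- matrix(rnorm(100*99), ncol = 99)
dist_ratio <- function(x) {
  d <- as.matrix(dist(x))
  between <- mean(d[group == 1, group == 2])
  within <- mean(d[group == 1, group == 1][lower.tri(matrix(0, 50, 50))])
  between/within
}
c(informative_only = dist_ratio(informative),
  with_noise = dist_ratio(cbind(informative, noise)))
```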
One simple approach to try to remove features with no information is to only include those with high variance. In the movie example, a user with low variance in their ratings is not really informative: all the movies seem about the same to them. Here is an example of how we can include only the features with high variance. ```{r heatmap-3, out.width="100%", fig.height=5, fig.width=6, message=FALSE, warning=FALSE} library(matrixStats) @@ -144,7 +144,7 @@ o <- order(sds, decreasing = TRUE)[1:25] heatmap(x[,o], col = RColorBrewer::brewer.pal(11, "Spectral")) ``` -Note there are several other heatmap function in R. A popular examples is the `heatmap.2` in the **gplots** package. +Note there are several other heatmap functions in R. A popular example is the `heatmap.2` in the **gplots** package. ## Exercises diff --git a/ml/ml-in-practice.qmd b/ml/ml-in-practice.qmd index 798d926..76f66ac 100644 --- a/ml/ml-in-practice.qmd +++ b/ml/ml-in-practice.qmd @@ -15,7 +15,7 @@ library(dslabs) mnist <- read_mnist() ``` -The dataset includes two components, a training set and test set: +The dataset includes two components, a training set and a test set: ```{r} names(mnist) @@ -50,11 +50,11 @@ y_test <- factor(mnist$test$labels[index]) We have already learned about several machine learning algorithms. Many of these algorithms are implemented in R. However, they are distributed via different packages, developed by different authors, and often use different syntax. The __caret__ package tries to consolidate these differences and provide consistency. It currently includes over 200 different methods which are summarized in the __caret__ package manual^[https://topepo.github.io/caret/available-models.html]. Keep in mind that __caret__ does not include the packages needed to run each possible algorithm. To apply a machine learning method through __caret__ you still need to install the library that implmenent the method. The required packages for each method are described in the package manual. -The __caret__ package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this helpful package. We will first use the 2 or 7 example to illustrate and in later sections we use use the package to run algorithms on the larger MNIST datset. +The __caret__ package also provides a function that performs cross validation for us. Here we provide some examples showing how we use this helpful package. We will first use the 2 or 7 example to illustrate and, in later sections, we use the package to run algorithms on the larger MNIST dataset. -### The `train` functon +### The `train` function -The R functions that fit machine algorithms are all slighltly different. Functions such `lm`, `glm`, `qda`, `lda`, `knn3`, `rpart` and `randomForrest` use different syntax, have different argument names and produce objects of different types. +The R functions that fit machine algorithms are all slightly different. Functions such as `lm`, `glm`, `qda`, `lda`, `knn3`, `rpart` and `randomForrest` use different syntax, have different argument names and produce objects of different types. The __caret__ `train` function lets us train different algorithms using similar syntax. So, for example, we can type: @@ -66,33 +66,32 @@ train_qda <- train(y ~ ., method = "qda", data = mnist_27$train) train_knn <- train(y ~ ., method = "knn", data = mnist_27$train) ``` -As we explain in more detail in #sec-caret-cv, the train function selects parameters for you using cross validation. 
+FIX As we explain in more detail in #sec-caret-cv, the train function selects parameters for you using cross validation. ### The `predict` function The `predict` function is very useful for machine learning applications. This function takes an object from a fitting function and a data frame with features \mathbf{x} for which to predict, and returns predictions for these features. -Here is an example with logistic regression +Here is an example with logistic regression: ```{r} fit <- glm(y ~ ., data = mnist_27$train, family = "binomial") p_hat <- predict(fit, mnist_27$test) ``` -In this case, the function is simply computing +In this case, the function is simply computing: $$ \hat{p}(\mathbf{x}) = g^{-1}\left(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 \right) \text{ with } g(p) = \log\frac{p}{1-p} \implies g^{-1}(\mu) = \frac{1}{1-e^{-\mu}} $$ -for the `x_1` and `x_2` in the test set `mnist_27$test`. With these estimates in place we can make our predictions and compute our accuracy: +for the `x_1` and `x_2` in the test set `mnist_27$test`. With these estimates in place, we can make our predictions and compute our accuracy: ```{r} y_hat <- factor(ifelse(p_hat > 0.5, 7, 2)) ``` -However, note that `predict` does not always return objects of the same types; it depends on what type of object it is applied to. -To learn about the specifics, you need to look at the help file specific for the type of fit object that is being used. The `predict` is actually a special type of function in R (called a _generic function_) that calls other functions depending on what kind of object it receives. So if `predict` receives an object coming out of the `lm` function, it will call `predict.glm`. If it receives an object coming out of `glm`, it calls `predict.qda`, if the fit is from `knn3`, it calls `predict.knn3`, and so on. These functions are similar but not exaclty. You can learn more about the differences by reading the help files: +However, note that `predict` does not always return objects of the same types; it depends on what type of object it is applied to. To learn about the specifics, you need to look at the help file specific for the type of fit object that is being used. The `predict` is actually a special type of function in R (called a _generic function_) that calls other functions depending on what kind of object it receives. So if `predict` receives an object coming out of the `lm` function, it will call `predict.glm`. If it receives an object coming out of `glm`, it calls `predict.qda`. If the fit is from `knn3`, it calls `predict.knn3`, and so on. These functions are similar but not exactly. You can learn more about the differences by reading the help files: ```{r, eval = FALSE} ?predict.glm @@ -148,7 +147,7 @@ ggplot(train_knn, highlight = TRUE) By default, the cross validation is performed by taking 25 bootstrap samples comprised of 25% of the observations. For the `kNN` method, the default is to try $k=5,7,9$. We change this using the `tuneGrid` argument. The grid of values must be supplied by a data frame with the parameter names as specified in the `modelLookup` output. -Here, we present an example where we try out 30 values between 9 and 67. To do this with __caret__, we need to define a column named `k`, so we use this: `data.frame(k = seq(9, 67, 2))`. Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting $30 \times 25 = 750$ kNN models, running this code will take several seconds. 
We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible. +Here we present an example where we try out 30 values between 9 and 67. To do this with __caret__, we need to define a column named `k`, so we use this: `data.frame(k = seq(9, 67, 2))`. Note that when running this code, we are fitting 30 versions of kNN to 25 bootstrapped samples. Since we are fitting $30 \times 25 = 750$ kNN models, running this code will take several seconds. We set the seed because cross validation is a random procedure and we want to make sure the result here is reproducible. ```{r train-knn-plot} set.seed(2008) @@ -170,7 +169,7 @@ and the best performing model like this: train_knn$finalModel ``` -The function `predict` will use this best performing model. Here is the accuracy of the best model when applied to the test set, which we have not used at all yet because the cross validation was done on the training set: +The function `predict` will use this best performing model. Here is the accuracy of the best model when applied to the test set, which we have not yet used because the cross validation was done on the training set: ```{r} confusionMatrix(predict(train_knn, mnist_27$test, type = "raw"), @@ -188,7 +187,7 @@ train_knn_cv <- train(y ~ ., method = "knn", ggplot(train_knn_cv, highlight = TRUE) ``` -We notice that the accuracy estimates are more variable, which is expected since we changed the number of samples used to estimate accuracy. +We observe that the accuracy estimates are more variable, which is expected since we changed the number of samples used to estimate accuracy. Note that `results` component of the `train` output includes several summary statistics related to the variability of the cross validation estimates: @@ -202,7 +201,7 @@ You can learn many more details about the **caret** package, from the manual ^[h We often transform predictors before running the machine algorithm. We also remove predictors that are clearly not useful. We call these steps *preprocessing*. -Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. We show an example below. +Examples of preprocessing include standardizing the predictors, taking the log transform of some predictors, removing predictors that are highly correlated with others, and removing predictors with very few non-unique values or close to zero variation. FIX DO YOU NEED THIS SENT: We show an example below. For example, we can run the `nearZero` function from the **caret** package to see that several features do not vary much from observation to observation. We can see that there is a large number of features with 0 variability: @@ -248,7 +247,7 @@ colnames(x_test) <- colnames(x) ## k-nearest neighbors :::{.callout-warning} -Before starting this section, note that the first two calls to the `train` function in the code below can take several hours to run. This is common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameter being considered. In the next section, we will provide some suggestion on how to predict the duration of the process and ways to reduce. +Before starting this section, note that the first two calls to the `train` function in the code below can take several hours to run. 
This is a common challenge when training machine learning algorithms since we have to run the algorithm for each cross validation split and each set of tuning parameters being considered. In the next section, we will provide some suggestions on how to predict the duration of the process and ways to reduce. ::: The first step is to optimize for $k$. @@ -278,14 +277,14 @@ confusionMatrix(y_hat_knn, factor(y_test))$overall["Accuracy"] ### Dimension reduction with PCA -An alternative to removing low variance columns directly, is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs nor any summary of the data, as this could result in overtraining. So we start by applying `prcomp` to the training data: +An alternative to removing low variance columns directly is to use dimension reduction on the feature matrix before applying the algorithms. It is important that we not use the test set when finding the PCs nor any summary of the data, as this could result in overtraining. So we start by applying `prcomp` to the training data: ```{r} col_means <- colMeans(x) pca <- prcomp(sweep(x, 2, col_means), center = FALSE) ``` -Next, we run knn on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data. +Next, we run kNN on just a small number of dimensions. We try 36 dimensions since this explains about 80% of the data. ```{r} #| eval: false @@ -313,10 +312,10 @@ y_hat_knn_pca <- predict(fit_knn_pca, newdata, type = "class") confusionMatrix(y_hat_knn_pca, factor(y_test))$overall["Accuracy"] ``` -With obtain an improvement in accuracy, while using only 36 dimensions. +FIX With obtain an improvement in accuracy, while using only 36 dimensions. :::{.callout-note} -Remember the entire algorithm needs to be developed on the training data. This is why the column means subtracted from test set columns where those estimated in the training set and the rotation applied to the test data to obtained PCs we the one obtained with the train data. +Remember the entire algorithm needs to be developed on the training data. FIX This is why the column means subtracted from test set columns were those estimated in the training set and the rotation applied to the test data to obtained PCs we the one obtained with the train data. ::: @@ -358,11 +357,11 @@ y_hat_rf <- predict(fit_rf, x_test[ ,col_index]) confusionMatrix(y_hat_rf, y_test)$overall["Accuracy"] ``` -By optimizing some of the other algorithm parameters we can achieve even higher accuracy. +By optimizing some of the other algorithm parameters, we can achieve even higher accuracy. ## Testing and improving computation time -The default method for estimating accuracy used by the `train` function is to test prediction on 25 bootstrap samples. This can result in long compute times. For examples, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. We can use the `system.time` function to estimate the how long it takes to run the algorithm once +The default method for estimating accuracy used by the `train` function is to test prediction on 25 bootstrap samples. This can result in long compute times. For example, if we are considering several values, say 10, of the tuning parameters, we will fit the algorithm 250 times. 
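A back-of-the-envelope calculation, using a made-up timing purely for illustration, shows how quickly this adds up:

```{r, eval = FALSE}
# Illustrative numbers only: if a single fit took about 60 seconds, 25 bootstrap
# samples times 10 tuning values would translate into roughly this many hours:
single_fit_seconds <- 60
25 * 10 * single_fit_seconds / 3600
```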
We can use the `system.time` function to estimate how long it takes to run the algorithm once: ```{r} system.time({fit_rf <- randomForest(x[, col_index], y, mtry = 9)}) @@ -385,9 +384,9 @@ train_rf <- train(x[, col_index], y, trControl = control) ``` -For random forest we can also speed up the training step by running less trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. +For random forest, we can also speed up the training step by running less trees per fit. After running the algorithm once, we can use the plot function to see how the error rate changes as the number of trees grows. -Here we can see that error rate stabilizes after about 200 trees. +Here we can see that error rate stabilizes after about 200 trees: ```{r} plot(fit_rf) @@ -438,7 +437,7 @@ for (i in ind[1:4]) { } ``` -By examining errors like this we often find specific weaknesses to algorithms or parameter choices and can try to correct them. +By examining errors like this, we often find specific weaknesses to algorithms or parameter choices and can try to correct them. ## Ensembles @@ -457,7 +456,7 @@ y_pred <- factor(apply(p, 1, which.max) - 1) confusionMatrix(y_pred, y_test)$overall["Accuracy"] ``` -We have just build an ensemble with just two algorithms. By combing more similarly performing, but uncorrelated, algorithms we can improve accuracy further. +We have just built an ensemble with just two algorithms. By combing more similarly performing, but uncorrelated, algorithms we can improve accuracy further. ## Exercises @@ -492,15 +491,15 @@ x <- tissue_gene_expression$x[ind, ] x <- x[, sample(ncol(x), 10)] ``` -4\. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model by looking at the `finalModel` component of the result of train. Notice there is a component called `means` that includes the estimate `means` of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm. +4\. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model by looking at the `finalModel` component of the result of train. Notice there is a component called `means` that includes the estimate `means` of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm. 5\. Repeat exercises 3 with QDA. Does it have a higher accuracy than LDA? 6\. Are the same predictors (genes) driving the algorithm? Make a plot as in exercise 3. -7\. One thing we see in the previous plot is that the value of predictors correlate in both groups: some predictors are low in both groups while others are high in both groups. The mean value of each predictor, `colMeans(x)`, is not informative or useful for prediction, and often for interpretation purposes it is useful to center or scale each column. This can be achieved with the `preProcessing` argument in `train`. Re-run LDA with `preProcessing = "scale"`. Note that accuracy does not change but see how it is easier to identify the predictors that differ more between groups in the plot made in exercise 4. +7\. One thing we see in the previous plot is that the value of predictors correlate in both groups: some predictors are low in both groups while others are high in both groups. 
## Exercises

@@ -492,15 +491,15 @@ x <- tissue_gene_expression$x[ind, ]
x <- x[, sample(ncol(x), 10)]
```

-4\. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model by looking at the `finalModel` component of the result of train. Notice there is a component called `means` that includes the estimate `means` of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm.
+4\. In this case, LDA fits two 10-dimensional normal distributions. Examine the fitted model by looking at the `finalModel` component of the result of `train`. Notice there is a component called `means` that includes the estimated means of both distributions. Plot the mean vectors against each other and determine which predictors (genes) appear to be driving the algorithm.

5\. Repeat exercise 3 with QDA. Does it have a higher accuracy than LDA?

6\. Are the same predictors (genes) driving the algorithm? Make a plot as in exercise 3.

-7\. One thing we see in the previous plot is that the value of predictors correlate in both groups: some predictors are low in both groups while others are high in both groups. The mean value of each predictor, `colMeans(x)`, is not informative or useful for prediction, and often for interpretation purposes it is useful to center or scale each column. This can be achieved with the `preProcessing` argument in `train`. Re-run LDA with `preProcessing = "scale"`. Note that accuracy does not change but see how it is easier to identify the predictors that differ more between groups in the plot made in exercise 4.
+7\. One thing we see in the previous plot is that the predictor values correlate across the two groups: some predictors are low in both groups while others are high in both groups. The mean value of each predictor, `colMeans(x)`, is not informative or useful for prediction and often, for interpretation purposes, it is useful to center or scale each column. This can be achieved with the `preProcess` argument in `train`. Re-run LDA with `preProcess = "scale"`. Note that the accuracy does not change, but see how it becomes easier to identify the predictors that differ more between groups in the plot made in exercise 4.

-8\. In the previous exercises we saw that both approaches worked well. Plot the predictor values for the two genes with the largest differences between the two groups in a scatterplot to see how they appear to follow a bivariate distribution as assumed by the LDA and QDA approaches. Color the points by the outcome.
+8\. In the previous exercises, we saw that both approaches worked well. Plot the predictor values for the two genes with the largest differences between the two groups in a scatterplot to see how they appear to follow a bivariate normal distribution, as assumed by the LDA and QDA approaches. Color the points by the outcome.

9\. Now we are going to increase the complexity of the challenge slightly: we will consider all the tissue types.

@@ -526,7 +525,7 @@ What accuracy do you get with LDA?

14\. Study the confusion matrix for the best fitting classification tree. What do you observe happening for placenta?

-15\. Notice that placentas are called endometrium more often than placenta. Note also that the number of placentas is just six, and that, by default, `rpart` requires 20 observations before splitting a node. Thus it is not possible with these parameters to have a node in which placentas are the majority. Rerun the above analysis but this time permit `rpart` to split any node by using the argument `control = rpart.control(minsplit = 0)`. Does the accuracy increase? Look at the confusion matrix again.
+15\. Notice that placentas are called endometrium more often than placenta. Note also that the number of placentas is just six, and that, by default, `rpart` requires 20 observations before splitting a node. Thus, it is not possible with these parameters to have a node in which placentas are the majority. Rerun the above analysis, but this time permit `rpart` to split any node by using the argument `control = rpart.control(minsplit = 0)`. Does the accuracy increase? Look at the confusion matrix again.

16\. Plot the tree from the best fitting model obtained in exercise 11.

@@ -547,11 +546,11 @@ tree_terms

What is the variable importance in the random forest call for these predictors? Where do they rank?

-20\. Extract the top 50 predictors based on importance, take a subset of `x` with just these predictors and apply the function `heatmap` to see how these genes behave across the tissues. We will introduce the `heatmap` function in Chapter @sec-clustering.
+20\. Extract the top 50 predictors based on importance, take a subset of `x` with just these predictors, and apply the function `heatmap` to see how these genes behave across the tissues. We will introduce the `heatmap` function in @sec-clustering.

-21\. Previously we have compared conditional probability give two predictors $p(x_1,x_2)$ to the fit $\hat{p}(x_1,x_2)$ obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of $x_1$ and $x_2$ for which the function is $0.5$.
+21\. Previously, we compared the conditional probability given two predictors, $p(x_1,x_2)$, to the fit $\hat{p}(x_1,x_2)$ obtained with a machine learning algorithm by making image plots. The following code can be used to make these images and include a curve at the values of $x_1$ and $x_2$ for which the function is $0.5$:

```{r}
#| eval: false
@@ -574,7 +573,7 @@ with(mnist_27$true_p, plot_cond_prob(x_1, x_2, p))

Fit a kNN model and make this plot for the estimated conditional probability. Hint: Use the argument `newdata = mnist_27$true_p` to obtain predictions for a grid of points.

-22\. Notice that, in the plot made in exercise 1, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this we could try loess. By reading through the available models part of the **caret** manual we see that we can use the `gamLoess` method. We see that we need to install the __gam__ package, if we have not done so already. We see that we have two parameters to optimize:
+22\. Notice that, in the plot made in exercise 21, the boundary is somewhat wiggly. This is because kNN, like the basic bin smoother, does not use a kernel. To improve this, we could try loess. By reading through the available models part of the **caret** manual, we see that we can use the `gamLoess` method. We need to install the __gam__ package, if we have not done so already. We see that we have two parameters to optimize:

```{r}
modelLookup("gamLoess")
```

@@ -610,7 +609,7 @@ We have not explained many of these, but apply them anyway using `train` with al

31\. If two methods give results that are the same, ensembling them will not change the results at all. For each pair of methods, compare the percent of time they call the same thing. Then use the `heatmap` function to visualize the results. Hint: use the `method = "binary"` argument in the `dist` function.

-32\. Note that each method can also produce an estimated conditional probability. Instead of majority vote we can take the average of these estimated conditional probabilities. For most methods, we can the use the `type = "prob"` in the train function. Note that some of the methods require you to use the argument `trControl=trainControl(classProbs=TRUE)` when calling train. Also these methods do not work if classes have numbers as names. Hint: change the levels like this:
+32\. Note that each method can also produce an estimated conditional probability. Instead of majority vote, we can take the average of these estimated conditional probabilities. For most methods, we can then use `type = "prob"` in the `train` function. Note that some of the methods require you to use the argument `trControl=trainControl(classProbs=TRUE)` when calling `train`. Also, these methods do not work if classes have numbers as names. Hint: change the levels like this:

```{r, eval = FALSE}
dat$train$y <- recode_factor(dat$train$y, "2"="two", "7"="seven")