# Decision Trees
## Load packages
```{r load_packages}
library(ggplot2)
library(rpart)
library(rpart.plot)
```
## Load data
Load the `train_x_class`, `train_y_class`, `test_x_class`, and `test_y_class` variables that we defined in 02-preprocessing.Rmd for this *classification* task.
```{r setup_data}
# Objects: train_x_class, train_y_class, test_x_class, test_y_class
load("data/preprocessed.RData")
```
## Overview
Decision trees are recursive partitioning methods that divide the predictor space into simpler regions and can be visualized in a tree-like structure. They predict an outcome variable Y by repeatedly splitting the data into subsets based on the predictors.
Let's see how a decision tree classifies whether a person suffers from heart disease (`target` = 1) or not (`target` = 0).
## Fit Model
```{r}
set.seed(3)
tree = rpart::rpart(train_y_class ~ ., data = train_x_class,
                    # Use method = "anova" for a continuous outcome.
                    method = "class",
                    # Use split = "gini" for the Gini index instead.
                    parms = list(split = "information"))
# https://stackoverflow.com/questions/4553947/decision-tree-on-information-gain
# Here is the text-based display of the decision tree. Yikes! :^(
print(tree)
```
Although interpreting the text can be intimidating, a decision tree's main strength is its tree-like plot, which is much easier to interpret.
## Investigate Results
```{r plot_tree}
rpart.plot::rpart.plot(tree)
```
We can also look inside `tree` to see what else it stores. `variable.importance` is one element we should check out!
```{r}
names(tree)
tree$variable.importance
```
Plot the variable importance:
```{r}
# Turn the tree$variable.importance vector into a dataframe
tree_varimp = data.frame(tree$variable.importance)
# Add rownames as their own column
tree_varimp$x = rownames(tree_varimp)
# Reorder columns
tree_varimp = tree_varimp[, c(2,1)]
# Reset row names
rownames(tree_varimp) = NULL
# Rename columns
names(tree_varimp) = c("Variable", "Importance")
tree_varimp
# Plot
ggplot(tree_varimp, aes(x = reorder(Variable, Importance),
                        y = Importance)) +
  geom_bar(stat = "identity") +
  theme_bw() + coord_flip() + xlab("")
```
In decision trees, the main hyperparameter (configuration setting) is the **complexity parameter** (CP). The name is a little counterintuitive: a high CP results in a simple decision tree with few splits, whereas a low CP results in a larger decision tree with many splits.
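To see this in practice, here is a minimal sketch (the `cp` values and object names are arbitrary choices for illustration, not part of the original lesson) that fits the same model with a high and a low complexity parameter and counts the resulting splits:
```{r cp_effect_sketch}
# Sketch: cp values chosen arbitrarily to contrast a simple vs. a complex tree.
tree_high_cp = rpart::rpart(train_y_class ~ ., data = train_x_class,
                            method = "class",
                            control = rpart.control(cp = 0.2))
tree_low_cp = rpart::rpart(train_y_class ~ ., data = train_x_class,
                           method = "class",
                           control = rpart.control(cp = 0.001))
# Number of splits = number of non-leaf nodes in each tree.
sum(tree_high_cp$frame$var != "<leaf>")
sum(tree_low_cp$frame$var != "<leaf>")
```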
`rpart` uses cross-validation internally to estimate the accuracy at various CP settings. We can review those to see what setting seems best.
Print the results for various CP settings - we want the one with the lowest "xerror". We can also plot the performance estimates for different CP settings.
```{r plotcp_tree}
# Show estimated error rate at different complexity parameter settings.
printcp(tree)
# Plot those estimated error rates.
plotcp(tree)
# Trees of similar sizes might appear to be tied for lowest "xerror", but a tree with fewer splits might be easier to interpret.
tree_pruned2 = prune(tree, cp = 0.028986) # 2 splits
tree_pruned6 = prune(tree, cp = 0.010870) # 6 splits
```
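Instead of reading the best CP off the printed table by hand, we can also pull the row with the lowest cross-validated error directly out of `tree$cptable`. A minimal sketch of that idiom (the `tree_pruned_best` name is just for illustration):
```{r prune_best_cp}
# cptable columns: CP, nsplit, rel error, xerror, xstd.
best_cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
best_cp
# Prune with the automatically selected complexity parameter.
tree_pruned_best = prune(tree, cp = best_cp)
rpart.plot::rpart.plot(tree_pruned_best)
```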
Print detailed results, variable importance, and summary of splits.
```{r}
summary(tree_pruned2)
rpart.plot(tree_pruned2)
```
```{r}
summary(tree_pruned6)
rpart.plot(tree_pruned6)
```
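Since 02-preprocessing.Rmd also gave us a held-out test set (`test_x_class`, `test_y_class`), here is a minimal sketch of how we might evaluate the pruned tree on it (not part of the original walkthrough):
```{r test_set_sketch}
# Predict classes on the held-out test set.
test_preds = predict(tree_pruned6, newdata = test_x_class, type = "class")
# Confusion matrix of predicted vs. actual classes.
table(predicted = test_preds, actual = test_y_class)
# Overall test-set accuracy.
mean(as.character(test_preds) == as.character(test_y_class))
```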
You can also get more fine-grained control via the `control` argument of the `rpart()` function. Type `?rpart` and `?rpart.control` to learn more.
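For instance, here is a minimal sketch (parameter values picked only to illustrate the knobs, not tuned) that passes a few `rpart.control()` settings:
```{r rpart_control_sketch}
tree_controlled = rpart::rpart(train_y_class ~ ., data = train_x_class,
                               method = "class",
                               control = rpart.control(minsplit = 20,  # min. observations needed to attempt a split
                                                       minbucket = 7,  # min. observations in any terminal node
                                                       maxdepth = 4,   # cap the depth of the tree
                                                       xval = 10))     # cross-validation folds for xerror
rpart.plot::rpart.plot(tree_controlled)
```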
Be sure to check out [this excellent gormanalysis overview](https://www.gormanalysis.com/blog/decision-trees-in-r-using-rpart/) to help internalize what you learned in this example.
## Challenge 2
Open Challenge 2 in the "Challenges" folder.