---
title: "10_MLPS_R_instance_based_learning"
author: "Zhe Zhang (TA - Heinz CMU PhD)"
date: "3/04/2017"
output:
  html_document:
    css: '~/Dropbox/avenir-white.css'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE, message = FALSE)
```
## Lecture 10: Instance Based Learning

Key specific tasks we covered in this lecture:

* KNN classifier, with different k
* More advanced KNN: different distance, kernel trick/transformation KNN
* kernel density local regression
* local linear regression
* locally weighted polynomial regression

### KNN
```{r}
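library(tidyverse)
library(class)

set.seed(10)  # arbitrary seed so the random split below is reproducible

df <- airquality %>%
  mutate(high_temp = ifelse(Temp > 75, "High", "Low")) %>%
  drop_na() %>%
  # Temp defines the label, so drop it to avoid leaking the answer
  select(-Month, -Day, -Temp) %>%
  # important to scale! as.numeric() keeps each column a plain vector
  mutate_if(is.numeric, function(x) as.numeric(scale(x)))

# assign each row to one of 3 groups at random; group 3 is the test set
train_sample <- sample(seq(3), nrow(df), replace = TRUE)
train <- df[train_sample != 3, ]
test <- df[train_sample == 3, ]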
knn_preds <- knn(train = train %>% select(-high_temp),
                 test = test %>% select(-high_temp),
                 cl = train$high_temp,
                 k = 4)

# measure performance: confusion matrix and accuracy
table(knn_preds, test$high_temp)
mean(knn_preds == test$high_temp)

# leave-one-out cross-validation with knn.cv, for k = 2, ..., 5
for (k in 2:5) {
  cv_preds <- knn.cv(df %>% select(-high_temp),
                     cl = df$high_temp, k = k)
  cat("k =", k, "LOOCV accuracy:", mean(cv_preds == df$high_temp), "\n")
}

# multi-class KNN: iris has three species
df <- iris %>%
  # important to scale!
  mutate_if(is.numeric, function(x) as.numeric(scale(x)))

train_sample <- sample(seq(3), nrow(df), replace = TRUE)
train <- df[train_sample != 3, ]
test <- df[train_sample == 3, ]

knn_preds <- knn(train = train %>% select(-Species),
                 test = test %>% select(-Species),
                 cl = train$Species,
                 k = 4)
table(knn_preds, test$Species)
mean(knn_preds == test$Species)
```
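
The "important to scale!" comments above can be illustrated with a small sketch. The `toy` data frame and the helper `nearest_to_A()` are hypothetical, made up for this illustration: income is measured in dollars and height in metres, so raw Euclidean distance is driven almost entirely by income.

```{r}
# Hypothetical toy data: height in metres, income in dollars
toy <- data.frame(
  height = c(1.60, 1.90, 1.61, 1.89),
  income = c(50000, 50010, 50500, 50510),
  row.names = c("A", "B", "C", "D")
)

# Name of the nearest neighbor of row "A" under Euclidean distance
nearest_to_A <- function(x) {
  d <- as.matrix(dist(x))["A", ]
  names(which.min(d[names(d) != "A"]))
}

nearest_raw <- nearest_to_A(toy)            # income dominates the distance
nearest_scaled <- nearest_to_A(scale(toy))  # both features count equally
c(raw = nearest_raw, scaled = nearest_scaled)
```

On the raw data the nearest neighbor of A is B (almost the same income, very different height); after scaling it is C (almost the same height).
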

### More Advanced KNN

See the `kknn` package and its `kknn()` function. It supports kernel-weighted neighbors and lets you set the order of the Minkowski distance. See the [manual](https://cran.r-project.org/web/packages/kknn/kknn.pdf).
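
As a minimal sketch of those two options, assuming the `kknn` package is installed: the split, the seed, and the choices `k = 5`, `distance = 1`, and `kernel = "triangular"` are arbitrary illustrations, not recommendations from the lecture.

```{r}
library(kknn)

set.seed(1)  # arbitrary seed so the split is reproducible
idx <- sample(nrow(iris), size = 100)
iris_train <- iris[idx, ]
iris_test <- iris[-idx, ]

# kernel = "triangular" gives closer neighbors more weight in the vote;
# distance = 1 selects the Manhattan (Minkowski p = 1) distance
fit <- kknn(Species ~ ., train = iris_train, test = iris_test,
            k = 5, distance = 1, kernel = "triangular")

kknn_preds <- fitted(fit)
table(kknn_preds, iris_test$Species)
mean(kknn_preds == iris_test$Species)
```

Note that `kknn()` standardizes the predictors by default (`scale = TRUE`), so no manual scaling is done here.
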

Using your own distance function may be possible with the `FastKNN` package; see this [StackOverflow question](http://stackoverflow.com/questions/23449726/find-k-nearest-neighbors-starting-from-a-distance-matrix).
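
Another route that needs no extra package is to start from any precomputed distance matrix and vote among the nearest rows. The sketch below uses base R only; `knn_from_dist()` is a hypothetical helper written for this illustration, and Manhattan distance stands in for "your own" metric.

```{r}
# Classify row i by majority vote among its k nearest neighbors,
# given any precomputed square distance matrix d
knn_from_dist <- function(d, labels, i, k) {
  d_i <- d[i, ]
  d_i[i] <- Inf                       # exclude the point itself
  neighbors <- order(d_i)[seq_len(k)]
  votes <- table(labels[neighbors])
  names(votes)[which.max(votes)]      # ties go to the first level
}

# Example: Manhattan distance on the scaled iris measurements
d <- as.matrix(dist(scale(iris[, 1:4]), method = "manhattan"))
loo_preds <- sapply(seq_len(nrow(iris)), function(i) {
  knn_from_dist(d, iris$Species, i, k = 5)
})
mean(loo_preds == iris$Species)
```

Because each row is excluded from its own neighbor set, the accuracy printed at the end is a leave-one-out estimate.
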

### Kernel Density Regression

Use `ksmooth` to fit a kernel regression; its `x.points` option sets the points at which predictions are returned.
```{r}
df <- airquality %>%
  mutate(high_temp = ifelse(Temp > 75, "High", "Low")) %>%
  drop_na()

# box kernel
box_kernel2_reg <- ksmooth(x = df$Wind, y = df$Temp,
                           kernel = 'box', bandwidth = 2)
# box kernel, evaluated only at the chosen x.points
box_kernel5_reg <- ksmooth(x = df$Wind, y = df$Temp,
                           kernel = 'box', bandwidth = 5,
                           x.points = c(5, 10, 15, 20))
box_kernel5_reg

# gaussian (normal) kernel
gauss_kernel3_reg <- ksmooth(x = df$Wind, y = df$Temp,
                             kernel = 'normal', bandwidth = 3)

# plot the fits: box kernel in red, gaussian kernel in blue
ggplot(df, aes(x = Wind, y = Temp)) + geom_point() +
  geom_line(data = as.data.frame(box_kernel2_reg), aes(x, y), color = 'red') +
  geom_line(data = as.data.frame(gauss_kernel3_reg), aes(x, y), color = 'blue')
```
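
To make the idea behind `ksmooth` concrete, here is a minimal sketch of a kernel-weighted average (the Nadaraya-Watson estimator) written by hand. `nw_estimate()` and its width parameter `h` are hypothetical names for this illustration. `h` is the window half-width for the box kernel and the standard deviation for the gaussian kernel, which is not the same scale as the `bandwidth` argument of `ksmooth`, so the numbers will not match the fits above exactly.

```{r}
# Nadaraya-Watson estimate at x0: a kernel-weighted average of y
nw_estimate <- function(x, y, x0, h, kernel = c("box", "gaussian")) {
  kernel <- match.arg(kernel)
  u <- (x - x0) / h
  w <- switch(kernel,
              box = as.numeric(abs(u) <= 1),  # equal weight inside the window
              gaussian = dnorm(u))            # weight decays with distance
  sum(w * y) / sum(w)
}

aq <- na.omit(airquality[, c("Wind", "Temp")])
sapply(c(5, 10, 15, 20), function(x0) {
  nw_estimate(aq$Wind, aq$Temp, x0, h = 2.5, kernel = "gaussian")
})
```

With the box kernel the estimate is simply the mean of `y` over the points within `h` of `x0`; it is undefined (`NaN`) when the window is empty.
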
### Locally Weighted Regression
```{r}
library(KernSmooth)

# local linear regression
df <- airquality %>%
  mutate(high_temp = ifelse(Temp > 75, "High", "Low")) %>%
  drop_na()
air_loclin <- locpoly(x = df$Wind, y = df$Temp,
                      degree = 1, bandwidth = 5)

# local polynomial (quadratic) regression
air_locpoly <- locpoly(x = df$Wind, y = df$Temp,
                       degree = 2, bandwidth = 5)

# plotted: local linear in red, local quadratic in blue
ggplot(df, aes(x = Wind, y = Temp)) + geom_point() +
  geom_line(data = as.data.frame(air_loclin), aes(x, y), color = 'red') +
  geom_line(data = as.data.frame(air_locpoly), aes(x, y), color = 'blue')
```
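
What a locally weighted fit does at a single point can be sketched with a weighted least-squares fit in base R. `local_poly_at()` is a hypothetical helper for this illustration, and the gaussian weights with standard deviation `h` are an assumption. `locpoly` uses a binned approximation, so its output will be close to, but not identical to, this direct version.

```{r}
# Local polynomial fit at x0: weighted least squares with gaussian
# weights centered on x0; the intercept is the fitted value at x0
local_poly_at <- function(x, y, x0, h, degree = 1) {
  w <- dnorm((x - x0) / h)
  fit <- lm(y ~ poly(x - x0, degree, raw = TRUE), weights = w)
  unname(coef(fit)[1])
}

aq <- na.omit(airquality[, c("Wind", "Temp")])
grid <- seq(min(aq$Wind), max(aq$Wind), length.out = 50)
loclin_by_hand <- sapply(grid, function(x0) {
  local_poly_at(aq$Wind, aq$Temp, x0, h = 5, degree = 1)
})
head(data.frame(Wind = grid, Temp_hat = loclin_by_hand))
```

Setting `degree = 2` gives the local quadratic version; each grid point gets its own small regression, which is why these methods count as instance based.
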