# Machine learning
## Learning Objectives {-#objectives-15}
1. Explain the main branches of machine learning and describe examples of the types of problems typically addressed by machine learning.
2. Explain and apply high-level concepts relevant to learning from data.
3. Describe and give examples of key supervised and unsupervised machine learning techniques, explaining the difference between regression and classification and between generative and discriminative models.
4. Explain in detail and use appropriate software to apply machine learning techniques (e.g. penalised regression and decision trees) to simple problems.
5. Demonstrate an understanding of the perspectives of statisticians, data scientists, and other quantitative researchers from non-actuarial backgrounds.
## Theory {-#theory-15}
**TO ADD THEORY ABOUT MACHINE LEARNING HERE**
## Machine learning topics
## Machine learning from data
## Supervised machine learning
## Unsupervised machine learning
## Penalised regression
## Decision trees
## Perspectives of non-actuarial professionals
## `R` Practice {-#practice-15}
```{r 15-machine-learning-01, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We are managing a `portfolio` of investments that contains 200 assets. In this `portfolio` we measure the following features:
* Price-to-Earnings Ratio ("_PE_"), labelled $x1$ with $x1 \sim \mathcal{N}(3,\,1)$
* Price-to-Book Ratio ("_PB_"), labelled $x2$ with 65% of the assets following $\mathcal{N}(10,\,1)$ and the remaining 35% following $\mathcal{N}(4,\,1)$
We replicate this in `R` as follows:
```{r 15-machine-learning-02, message=FALSE}
library(dplyr) # Data manipulation
set.seed(42) # Fix result
portfolio <- data.frame(
  x1 = rnorm(200, 3, 1),
  x2 = scale(
    c(
      rnorm(70, 4, 1),
      rnorm(130, 10, 1)
    )
  )
)
glimpse(portfolio)
```
Here the `scale()`{.R} function standardises the combined vector by subtracting its sample mean from each element and dividing by its sample standard deviation, so `x2` ends up with mean 0 and standard deviation 1.
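As a quick sanity check (the chunk below is a minimal illustration, not part of the exercise itself), the standardised column should have a sample mean of approximately 0 and a sample standard deviation of 1:

```{r 15-machine-learning-scale-check}
# Sanity check on the standardised column: the mean should be ~0 and the
# standard deviation ~1 (up to floating-point rounding)
mean(portfolio$x2)
sd(portfolio$x2)
```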
Next we want to explore whether these 200 assets can be divided into two clusters, which we will label arbitrarily `A` and `B`, based on the two metrics we have measured: PE (as $x1$) and PB (as $x2$).
We will first assign the assets evenly between these two clusters:
```{r 15-machine-learning-03}
# Initial (arbitrary) split: first 100 assets to cluster A, last 100 to cluster B
group_label_stage1 <- c(
  rep("A", 100),
  rep("B", 100)
)
portfolio <- portfolio %>%
  mutate(group_label_stage1 = group_label_stage1)
# Cluster centres: mean of (x1, x2) within each initial cluster
ClusterACentre <- c(
  mean(portfolio$x1[portfolio$group_label_stage1 == "A"]),
  mean(portfolio$x2[portfolio$group_label_stage1 == "A"])
)
ClusterBCentre <- c(
  mean(portfolio$x1[portfolio$group_label_stage1 == "B"]),
  mean(portfolio$x2[portfolio$group_label_stage1 == "B"])
)
glimpse(portfolio)
```
We have:
* The centre of cluster `A`, given by $(x1_A,\, x2_A)$, is `r round(ClusterACentre, 3)`, and
* The centre of cluster `B`, given by $(x1_B,\, x2_B)$, is `r round(ClusterBCentre, 3)`.
Next we want to calculate the Euclidean distance between:
* $(x1, x2)$ and the centre of cluster `A`, and
* $(x1, x2)$ and the centre of cluster `B`.
We will label these distances as `dist_A` and `dist_B` respectively.
The Euclidean distance is defined as:
* For `dist_A`: $\sqrt{(x1-x1_A)^2+(x2-x2_A)^2}$, and
* For `dist_B`: $\sqrt{(x1-x1_B)^2+(x2-x2_B)^2}$.
```{r 15-machine-learning-04}
dist_A <- sqrt(
  (portfolio$x1 - ClusterACentre[1])^2 +
    (portfolio$x2 - ClusterACentre[2])^2
)
dist_B <- sqrt(
  (portfolio$x1 - ClusterBCentre[1])^2 +
    (portfolio$x2 - ClusterBCentre[2])^2
)
portfolio <- portfolio %>%
  mutate(
    dist_A = dist_A,
    dist_B = dist_B
  )
glimpse(portfolio)
```
Now we will update the cluster labels (`A` and `B`) by assigning each asset to the cluster whose centre is nearer, i.e. whichever of `dist_A` and `dist_B` is smaller.
```{r 15-machine-learning-05}
portfolio <- portfolio %>%
  mutate(
    group_label_stage2 = ifelse(dist_A <= dist_B, "A", "B")
  )
glimpse(portfolio)
```
Let's generate a 2x2 matrix showing the number of assets in each combination of `group_label_stage1` (rows) and `group_label_stage2` (columns):
```{r 15-machine-learning-06}
combos <- portfolio %>%
  count(group_label_stage1, group_label_stage2)
# count() orders its rows by the stage-1 label and then the stage-2 label,
# so fill the matrix by row to keep stage 1 on the rows and stage 2 on the
# columns (this assumes all four label combinations occur at least once)
matrix(
  combos$n,
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    stage1 = c("A", "B"),
    stage2 = c("A", "B")
  )
)
```
Finally, let's plot `x1` against `x2`, coloured by the latest cluster labels:
```{r 15-machine-learning-07}
library(ggplot2)
portfolio %>%
  ggplot() +
  geom_point(
    aes(x1, x2, colour = group_label_stage2)
  )
```
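The steps above correspond to a single iteration of the $k$-means clustering algorithm with $k = 2$: assign each asset to its nearest cluster centre, then recompute the centres from the new assignments. As a minimal sketch of where repeating this leads (the chunk label and the object name `km_fit` below are illustrative choices), we can compare our single manual pass with base `R`'s `kmeans()` function from the `stats` package, which iterates the assign-and-recompute steps until the cluster memberships stop changing:

```{r 15-machine-learning-kmeans-check}
# Illustrative comparison only: kmeans() repeats the assign-to-nearest-centre
# and recompute-centres steps until the assignments stabilise.
set.seed(42) # kmeans() uses random starting centres, so fix the result
km_fit <- kmeans(portfolio[, c("x1", "x2")], centers = 2)

# Cross-tabulate the built-in clustering against our single manual iteration.
# kmeans() numbers its clusters arbitrarily (1 and 2), so agreement shows up
# as most assets falling on one diagonal of this table.
table(
  manual = portfolio$group_label_stage2,
  kmeans = km_fit$cluster
)
```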