-
Notifications
You must be signed in to change notification settings - Fork 7
/
chapter7.Rmd
749 lines (529 loc) · 24.7 KB
/
chapter7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
---
title: 'Probability distributions'
description: 'Most things in this world are more or less random, and not evenly distributed.'
---
## Random variables and probability distributions
```yaml
type: NormalExercise
key: ef878aeeea
lang: r
xp: 100
skills: 1
```
Welcome to chapter 7: **Probability distributions**. In this chapter you will learn about [random variables](https://en.wikipedia.org/wiki/Random_variable) and their [probability distributions](https://en.wikipedia.org/wiki/Probability_distribution).
Random variables can have a set of different values. Every value has a probability of occurance. The probabilities of the values form a probability distribution for the random variable. Random variables and probability distributions can be discrete or continuous.
There are things or events that are known to follow certain probability distributions (like the heights of people usually are normally distributed), but there are also many phenomenas that have their unique distributions. In R you can create a sample from distribution with the function `sample()`. See the example code to get the general idea.
`@instructions`
- Create variable `tomorrow` with events `cat attacks baby`, `world ends` and `I eat pasta`.
- Set probabilities to the events of variable `tomorrow`.
- Draw a [*sample*](https://en.wikipedia.org/wiki/Sample_(statistics)) from tomorrow when the sample size is 1. Are you suprised with the outcome?
- Draw another sample, this time with sample size = `1000`. Add the argument `replace = TRUE` to the `sample()` function. What do you think it the `replace` argument does? Save the sample to `my_sample`. In this exercise, do not change or insert additional code **before** creating `my_sample`.
- Visualize the sample with bar plot. Notice how the bar heights in the barplot are related to the probabilities we set earlier.
`@hint`
- Save the sample you created as `my_sample`.
- Did you specify the sample size correctly in `my_sample`? If not, the code will produce an error.
- Remember to use `replace = TRUE`.
- The argument `size` in the `sample()` function is used to determine the sample size.
`@pre_exercise_code`
```{r}
```
`@sample_code`
```{r}
# Create variable 'tomorrow'
tomorrow <- c("cat attacks baby","I eat pasta", "world ends")
# Set probabilities to events
probabilities <- c(0.2, 0.5, 0.3)
# Draw a sample. Do it a couple of times
sample(tomorrow, size = 1, prob = probabilities)
# Draw a sample 1000 times (do not change or insert new code before this)
# Visualize the distribution
barplot(table(my_sample))
```
`@solution`
```{r}
# Create variable 'tomorrow'
tomorrow <- c("cat attacks baby","I eat pasta", "world ends")
# Set probabilities to events
probabilities <- c(0.2, 0.5, 0.3)
# Draw a sample. Do it a couple of times
sample(tomorrow, size = 1, prob = probabilities)
# Draw a sample 1000 times (do not change or insert new code before this)
my_sample <- sample(tomorrow, size = 1000, prob = probabilities, replace = TRUE)
# Visualize the distribution
barplot(table(my_sample))
```
`@sct`
```{r}
# submission correctness tests
test_object("my_sample", incorrect_msg = "Did you save the sample to `my_sample`?")
test_error()
# Final message the student will see upon completing the exercise
success_msg("Good work! And now you know the probability of world ending...")
```
---
## Binomial distribution
```yaml
type: NormalExercise
key: 8194f8b921
lang: r
xp: 100
skills: 1
```
[Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) is a well known discrete probability distribution. The sum of favourable outcomes in a number of (independent) yes/no trials, where each trial has identical probability of success, follows a binomial distribution. The distribution hence has two parameters: the number of trials (*n*) and success probability in each trial (*p*).
An example of an binomial experiment could be tossing a coin 20 times to see how many heads occur. Some of the basic functions associated with binomial distribution are:
- `dbinom()`: point probabilities of binomial distribution, $P(X = x)$
- `pbinom()`: cumulative probabilities of binomial distribution, $P(X \leq x)$
- `qbinom()`: quantile function of binomial distribution
- `rbinom()`: (random) sample from binomial distribution
You can use the `help()` or `?` for any of these functions to see more about them.
`@instructions`
- Let's say that we know a probability of a customer buing a t-shirt from the store is 0.3. There are 8 people in the store looking around. The customers are not interacting with each other.
- What is the probability of two of them buing a t-shirt?
- What is the probability of seven of them buing a t-shirt?
- What is the probability that *at least* two of the customers buy a t-shirt? Note that you will need to use the `lower.tail` argument. The argument takes logical values: if it's `TRUE` (default), the probabilities are $P(X \le x)$, otherwise, $P(X > x)$.
`@hint`
- To get the probability of seven people buying a t-shirt, you just need to change the 2 to 7 in the `dbinom()` function on the line two.
- To get the probability of at least two people buying a t-shirt, you need to use the `pbinom()` function. Remember to use the `lower.tail` argument.
- Note that $P(X => 2)$ <=> $P(X > 1)$
`@pre_exercise_code`
```{r}
# pre exercise code here
```
`@sample_code`
```{r}
# What is the probability of 2 of them buying a t-shirt? P(X = 2)
dbinom(2, size = 8, prob = 0.3)
# What is the probability of 7 of them buying a t-shirt? P(X = 7)
# What is the probability that at least 2 of the customers buy a t-shirt? P(X => 2)
```
`@solution`
```{r}
# What is the probability of 2 of them buing a t-shirt? P(X = 2)
dbinom(2, size = 8, prob = 0.3)
# What is the probability of 7 of them buing a t-shirt? P(X = 7)
dbinom(7, size = 8, prob = 0.3)
# What is the probability that at least 2 of the customers buy a t-shirt? P(X => 2)
pbinom(1, size = 8, prob = 0.3, lower.tail = FALSE)
```
`@sct`
```{r}
# submission correctness tests
test_output_contains("dbinom(7, size = 8, prob = 0.3)", incorrect_msg = "Did you calculate the probability of 7 of the customers buying a t-shirt?")
test_output_contains("pbinom(1, size = 8, prob = 0.3, lower.tail=FALSE)", incorrect_msg = "Did you calculate the probability of at least 2 of the customers buying a t-shirt?")
# test if the students code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Wonderful! Keep going!")
```
---
## Normal distribution (1)
```yaml
type: NormalExercise
key: 6780cce146
lang: r
xp: 100
skills: 1
```
[Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) is continuous distribution and it is often used in statistics. It has two parameters (expected value and standard deviation) which determine the shape and place of the distribution. Many human attributes can be said to be normally distributed. For example, if you take a village full of people, their height will probably create a distribution that resembles a normal distribution.
In this exercise, we will focus on the `rnorm()` function. It draws a random sample from normal distribution (with specified parameters).
`@instructions`
- Execute the line for `normal_sample`. Note the mean and standard deviation values.
- Draw a histogram of `normal_sample`. Do you notice the shape of the distribution?
- Calculate the mean and standard deviation of `normal_sample`.
- Change the mean value to `35` in the `rnorm()` function and run all the codes again. What happened to the normal distribution?
- Change the sd value to `0.1` in the `rnorm()` function and run all the codes again. What happened to the normal distribution?
`@hint`
- The `hist()` function has a mandatory argument `breaks` that needs to be specified.
- You can calculate mean with the function `mean()`.
- You can calculate standard deviation with the function `sd()`.
`@pre_exercise_code`
```{r}
# pre exercise code here
```
`@sample_code`
```{r}
# Create a sample from normal distribution. In this case, the first argument (500) means the sample size.
normal_sample <- rnorm(500, mean = 5, sd = 2.5)
# Histogram of normal_sample
hist(normal_sample, breaks = 50)
# Calculate the mean and standard deviation of normal_sample
```
`@solution`
```{r}
# Create a sample from normal distribution. In this case, the first argument (500) means the sample size.
normal_sample <- rnorm(500, mean = 35, sd = 0.1)
# Histogram of normal_sample
hist(normal_sample, breaks = 50)
# Calculate the mean and standard deviation of normal_sample
mean(normal_sample)
sd(normal_sample)
```
`@sct`
```{r}
# submission correctness tests
test_function("mean", args=c("x"), incorrect_msg = "Did you calculate the mean of `normal_sample`?")
test_function("sd", args=c("x"), incorrect_msg = "Did you calculate the standard deviation of `normal_sample`?")
test_error()
# Final message the student will see upon completing the exercise
success_msg("Good work!")
```
---
## Normal distribution (2)
```yaml
type: NormalExercise
key: 271171917b
lang: r
xp: 100
skills: 1
```
It is good to know the formulas behind every calculations when you are dealing with probabilities. But in reality you rarely do the calculations by hand. And why would you, when you have R to calculate them for you!
In R the functions associated with normal distiribution are:
- `dnorm()`: density function of normal distribution
- `pnorm()`: cumulative probabilities of normal distribution
- `qnorm()`: quantile function of normal distribution
- `rnorm()`: (random) sample from normal distribution
You already used the function `rnorm()` in the previous exercise.
The function `pnorm()` can be used to compute probabilities from normal distributions. Let's see how it works!
`@instructions`
- The distribution of variable deep resembles a normal distribution (look at the density histogram of `deep`).
- Create object `deep`.
- Use the normal distribution and ppproximate the probability that deep learning score is *smaller* than 3 amongst the students.
- Use the normal distribution and approximate the probability that deep learning score is *greater* than 3 amongst the students. You can use the same code as above, but you need to change the `lower.tail` value to get the correct result.
- What is the probability that deep learning is greater than 4.5?
`@hint`
- Look at the help pages with `?pnorm` or `help(pnorm)`.
- Use the mean and standard deviation of variable `deep` as arguments for the `pnorm()` function.
- Remember to use the `lower.tail` argument in the `pnorm()` function. It determines which "side" you are calculating. The `lower.tail` argument can be either `TRUE` or `FALSE`.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
hist(learning2014$deep, freq = F, col='orange', ylim=c(0,.8))
x<-seq(0, 5, length = 50)
lines(x, dnorm(x, mean(learning2014$deep), sd(learning2014$deep)), type="l", col="orangered", lwd=2)
```
`@sample_code`
```{r}
# learning2014 is available
# Create object deep
deep <- learning2014$deep
# Approximate probability of deep being smaller than 3
pnorm(3, mean = mean(deep), sd = sd(deep), lower.tail = TRUE)
# Approximate probability of deep being greater than 3
# Approximate probability of deep being greater than 4.5
```
`@solution`
```{r}
# learning2014 is available
# Create object deep
deep <- learning2014$deep
# Approximate probability of deep being smaller than 3
pnorm(3, mean = mean(deep), sd = sd(deep), lower.tail = TRUE)
# Approximate probability of deep being greater than 3
pnorm(3, mean = mean(deep), sd = sd(deep), lower.tail = FALSE)
# Approximate probability of deep being greatet than 4.5
pnorm(4.5, mean = mean(deep), sd = sd(deep), lower.tail = FALSE)
```
`@sct`
```{r}
# submission correctness tests
test_output_contains("pnorm(3, mean = mean(deep), sd = sd(deep), lower.tail = FALSE)")
test_output_contains("pnorm(4.5, mean = mean(deep), sd = sd(deep), lower.tail = FALSE)")
# test if the students code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("Good work!")
```
---
## Uniform distribution
```yaml
type: MultipleChoiceExercise
key: 6c27606d58
lang: r
xp: 50
skills: 1
```
The uniform distribution can be a [discrete uniform distribution](https://en.wikipedia.org/wiki/Discrete_uniform_distribution) or a
[continuous uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)). The uniform distribution is a distribution that has constant probability for each outcome.
An example of event that follows a (discrete) uniform distribution is dice throwing (assuming the dice is not weighted). There are six possible values and the probability to every outcome is 1/6.
At the right you can see four plots. Can you guess which ones are the density plots of continuous and discrete uniform distributions?
`@possible_answers`
- Plot A is discrete uniform distribution and plot B is the density of continuous uniform distribution.
- There is no discrete uniform distribution, but plot C is the continuous uniform distribution.
- Plot D is discrete uniform distribution and plot A is continuous uniform distribution.
- Plot D is discrete uniform distribution and plot C is continuous uniform distribution.
`@hint`
- In uniform distribution, all the values of x get the same probability.
- Plot B is actually not a density, but a plot of cumulative distribution function.
- There are "gaps"" in the density if the distribution is discrete.
`@pre_exercise_code`
```{r}
# pre exercise code here
x <- seq(-5, 20, length = 50)
y<-'density'
color<-"orange"
par(lwd=2, mfrow=c(2,2), cex.lab=1.5, cex.main=1.2)
plot(x, dunif(x, 0, 15), type='s', ylim=c(0,.15), col=color, ylab=y, main='Plot A')
plot(x, punif(x, 0, 15), type='l', col=color, ylab=y, main='Plot B')
plot(x, dnorm(x, 5, 2), type='l',col=color, ylab=y, main='Plot C')
plot(x, dunif(x, 0, 15), type='h', ylim=c(0,.15), col=color, ylab=y, main='Plot D')
```
`@sct`
```{r}
msg1 = "Plot A is not discrete, sorry. Plot B is continuous, but it's a cumulative distribution function of uniform distribution."
msg2 = "The discrete one does exist, sorry. Also, plot C is normal distribution."
msg3 = "You are totally right! Proceed to the next exercise."
msg4 = "Sorry, plot C is normal distribution. But you got plot D right!"
test_mc(correct = 3, feedback_msgs = c(msg1,msg2,msg3,msg4))
# Final message the student will see upon completing the exercise
success_msg("Nicely done!")
```
---
## Quantiles
```yaml
type: NormalExercise
key: 50d3a1a7a6
lang: r
xp: 100
skills: 1
```
In this exercise, you will get to know the `qnorm()` function, which produces quantiles of the normal distribution. A quantile function is the inverse of the cumulative probability function.
A quantile function takes a probability value as input and produces a value as output. The produced value is a point at which the probability that the random variable takes on a value less than or equal to that point, is equal to the given probability.
As with all the *norm() functions, qnorm() has default arguments mean = 0 and sd = 1; corresponding to the standard normal distribution.
The quantile values of the standard normal variable are also called critical values. They will be very useful later on for interval estimation and hypothesis testing. The questions below relate to the standard normal distribution.
`@instructions`
- Compute the cumulative probability at point z = -1.64, rounded to 2 digits
- Compute the quantile point corresponding to the probability 0.05, rounded to 2 digits. Visualize the quantile point.
- Compute the cumulative probability at point z = -1.96, rounded to 3 digits.
- Compute the quantile point corresponding to the probabiliity 0.025, rounded to 2 digits. Visualize the quantile point.
`@hint`
- Follow the example codes
`@pre_exercise_code`
```{r}
# pre exercise code here
qnorm_plot <- function(alpha, twoway = F) {
par(mar = c(7,4,5,4))
a <- alpha
x <- (-50:50)/10
y <- dnorm(x)
q1 <- qnorm(a); q2 <- qnorm(1-a)
# draw the plot
if(twoway) alph <- "alpha/2 =" else alph <- "alpha ="
main <- paste("Critical values and regions of the N(0,1) distribution \n",
alph, a)
plot(x, y, type ="l", main = main, xlab = "", yaxt = "n", ylab = "", xaxt = "n")
axis(1, at = c(-3, -1, 0, 1, 3))
# mark the critical values with ticks
if(twoway) at <- c(q1, q2) else at <- q1
axis(1, at = round(at,2) , col.ticks = "red", las = 2)
# show the critical value with the call to qnorm()
mtext(paste0("- z = qnorm(",a,") = ",round(q1,2)),
side=1, line = 4, cex = 1.5)
# highlight critical regions, add matching percentages
x1 <- x[x<=q1]; x2 <- x[x>=q2]
if(twoway) {
polygon(c(min(x1),x1, max(x1), min(x2), x2, max(x2)),
c(0, dnorm(x1),0, 0, dnorm(x2), 0), col = "grey60")
text(x = c(-3.5, 3.5), y = c(0.08,0.08), labels = paste0(a*100,"%"), cex = 1.5)
text(x = 0, y = 0.08, labels=paste0(100*(1-alpha/2),"%"), cex = 1.5)
} else {
polygon(c(min(x1),x1, max(x1)),
c(0, dnorm(x1), 0), col = "grey60")
text(x = -3.5, y = 0.08, labels = paste0(a*100,"%"), cex = 1.5)
text(x = 0, y = 0.08, labels=paste0(100*(1-alpha),"%"), cex = 1.5)
}
}
```
`@sample_code`
```{r}
# qnorm_plot() is available. Z denotes a standard normal random variable.
# P(Z < - 1.64)
round(pnorm(-1.64), digits = 2)
# For which z: P(Z < z) = 0.05
round(qnorm(0.05), digits = 2)
# Visualize
qnorm_plot(0.05)
# P(Z < -1.96)
# For which z: P(Z < z) = 0.025
# Visualize
qnorm_plot(0.025)
```
`@solution`
```{r}
# qnorm_plot() is available. Z denotes a standard normal random variable.
# P(Z < - 1.64)
round(pnorm(-1.64), digits = 2)
# For which z: P(Z < z) = 0.05
round(qnorm(0.05), digits = 2)
# Visualize
qnorm_plot(0.05)
# P(Z < -1.96)
round(pnorm(-1.96), digits = 3)
# For which z: P(Z < z) = 0.025
round(qnorm(0.025), digits = 2)
# Visualize
qnorm_plot(0.025)
```
`@sct`
```{r}
test_output_contains("round(pnorm(-1.96), digits = 3)", incorrect_msg = "Please compute the asked probability using `pnorm()` and remember rounding!")
test_output_contains("round(qnorm(0.025), digits = 2)", incorrect_msg = "Please compute the asked critical value using `qnorm()`and remember rounding!")
test_error()
success_msg("Good work!")
```
---
## Two-way quantiles
```yaml
type: NormalExercise
key: fb1ef1bc2b
lang: r
xp: 100
skills: 1
```
In the previous exercise you learned how to use the `qnorm()` function to compute critical values related to a given *tail probability*.
Sometimes it is usefull to define a value which leaves a desired probability to **both tails** of the distribution (left and rightmost areas).
The normal distribution is symmetrical, which means that the values $-z$ and $z$ leave the same amount of probability to the left and right tails. This also means that $P(|Z| > z) = P(Z < -z) + P(Z > z) = 2 \cdot P(Z < -z)$.
`@instructions`
- Compute the probability $P(|Z| > 1.96) = P(Z < - 1.96) + P(Z > 1.96)$, rounded to 2 digits
- Compute the quantile (critical) value $z$ corresponding to a probability $\alpha = 0.05$ divided to both "tails" of the distribution, rounded to 2 digits
- Visualize the computation
- Compute the probability $P(|Z| > 2.58) = P(Z < - 2.58) + P(Z > 2.58)$, rounded to 2 digits
- Compute the quantile (critical) value z corresponding to probability $\alpha$ = 0.01 divided to both "tails" of the distribution, rounded to 2 digits
- Visualize the computation
`@hint`
- Follow the examples and use the same functions `pnorm()`, `qnorm()` and `qnorm_plot()`
`@pre_exercise_code`
```{r}
# pre exercise code here
qnorm_plot <- function(alpha, twoway = F) {
par(mar = c(7,4,5,4))
a <- alpha
x <- (-50:50)/10
y <- dnorm(x)
q1 <- qnorm(a); q2 <- qnorm(1-a)
# draw the plot
if(twoway) alph <- "alpha/2 =" else alph <- "alpha ="
main <- paste("Critical values and regions of the N(0,1) distribution \n",
alph, a)
plot(x, y, type ="l", main = main, xlab = "", yaxt = "n", ylab = "", xaxt = "n")
axis(1, at = c(-3, -1, 0, 1, 3))
# mark the critical values with ticks
if(twoway) at <- c(q1, q2) else at <- q1
axis(1, at = round(at,2) , col.ticks = "red", las = 2)
# show the critical value with the call to qnorm()
mtext(paste0("- z = qnorm(",a,") = ",round(q1,2)),
side=1, line = 4, cex = 1.5)
# highlight critical regions, add matching percentages
x1 <- x[x<=q1]; x2 <- x[x>=q2]
# twoway = TRUE
if(twoway) {
polygon(c(min(x1),x1, max(x1), min(x2), x2, max(x2)),
c(0, dnorm(x1),0, 0, dnorm(x2), 0), col = "grey60")
text(x = c(-3.5, 3.5), y = c(0.08,0.08), labels = paste0(a*100,"%"), cex = 1.5)
text(x = 0, y = 0.08, labels=paste0(100*(1-a*2),"%"), cex = 1.5)
}
# twoway = FALSE
else {
polygon(c(min(x1),x1, max(x1)),
c(0, dnorm(x1), 0), col = "grey60")
text(x = -3.5, y = 0.08, labels = paste0(a*100,"%"), cex = 1.5)
text(x = 0, y = 0.08, labels=paste0(100*(1-alpha),"%"), cex = 1.5)
}
}
```
`@sample_code`
```{r}
# qnorm_plot() is available. Z denotes a standard normal random variable.
# P(|Z| > 1.96)
round(2*pnorm(-1.96), digits = 2)
# For which z: P(|Z| > z) = 0.05
round(qnorm(0.05/2), digits = 2)
# Visualize
qnorm_plot(0.05/2, twoway = TRUE)
# P(|Z| > 2.58)
# For which z: P(|Z| > z) = 0.01
# Visualize
qnorm_plot(0.01/2, twoway = TRUE)
```
`@solution`
```{r}
# qnorm_plot() is available. Z denotes a standard normal random variable.
# P(|Z| > 1.96)
round(2*pnorm(-1.96), digits = 2)
# For which z: P(|Z| > z) = 0.05
round(qnorm(0.05/2), digits = 2)
# Visualize
qnorm_plot(0.05/2, twoway = TRUE)
# P(|Z| > 2.58)
round(2*pnorm(-2.58), digits = 2)
# For which z: P(|Z| > z) = 0.01
round(qnorm(0.01/2), digits = 2)
# Visualize
qnorm_plot(0.01/2, twoway = TRUE)
```
`@sct`
```{r}
test_output_contains("round(2*pnorm(-2.58), digits = 2)", incorrect_msg = "Please use `pnorm()` to compute the asked probability.")
test_output_contains("round(qnorm(0.01/2), digits = 2)", incorrect_msg = "Please use `qnorm()` to produce the asked critical value.")
test_error()
success_msg("Great work! That was a tough one.")
```
---
## Standardization
```yaml
type: NormalExercise
key: 8ca0f12d3d
lang: r
xp: 100
skills: 1
```
A standardized distribution has 0 as expected value and 1 as standard deviation. A handy thing to know is that in normal distributions about 68% of the observations are within one standard deviation away from the expected value and 95% within two. This can be convenient sometimes.
[Standardization](https://en.wikipedia.org/wiki/Standard_score) also makes it easier to compare different distributions. Any variable can be standardized. The formula for standardization is
$$Z = \frac{X - mean}{sd}$$
where $X$ is the object we want to standardize and $mean$ and $sd$ are the mean and standard deviation of $X$. If $X$ is a normally distributed random variable, the standardization produces a standard normal random variable: $Z \sim N(0, 1)$
`@instructions`
- Look at the histogram of `deep` from the `learning2014`. Note the x-axis of the histogram: the numbers are deep learning scores and you can see how they are distributed in the data.
- Standardize the variable `deep` by using the formulate given above and create object `std_deep`
- Compute the mean and standard deviation of `std_deep`
- Draw a histogram of `std_deep`. Note what happened to the x-axis.
`@hint`
- Remember to use `$`-mark when accessing variables from dataset.
- See `?hist()` if you don't remember how to draw histograms.
`@pre_exercise_code`
```{r}
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/learning2014.txt", sep = "\t", header = TRUE)
deep <- learning2014$deep
hist(deep, breaks=25, col="orange")
```
`@sample_code`
```{r}
# learning2014 is available
deep <- learning2014$deep
# Standardize variable 'deep'
std_deep <-
# Compute the mean and stardard deviation of std_deep
# Draw a histogram of standardized variable 'deep'
hist(std_deep, breaks=25, col="orange")
```
`@solution`
```{r}
# learning2014 is available
deep <- learning2014$deep
# Standardize variable 'deep'
std_deep <- (deep - mean(deep)) /sd(deep)
# Compute the mean and stardard deviation of std_deep
mean(std_deep)
sd(std_deep)
# Draw a histogram of standardized variable 'deep'
hist(std_deep, breaks=25, col="orange")
```
`@sct`
```{r}
# submission correctness tests
test_object("std_deep", incorrect_msg ="Did you standardize the deep variable and create object `std_deep`?")
test_output_contains("mean(std_deep)", incorrect_msg = "Did you compute the mean of `std_deep`?")
test_output_contains("sd(std_deep)", incorrect_msg = "Did you compute the standard deviation of `std_deep`?")
# test if the students code produces an error
test_error()
# Final message the student will see upon completing the exercise
success_msg("This was not an easy one. Well done!")
```