forked from PeterKDunn/SRM-Textbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path13-NumericalQuant.Rmd
executable file
·2878 lines (2299 loc) · 96.8 KB
/
13-NumericalQuant.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Numerical summaries: quantitative data {#NumericalQuant}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/13-NumericalQuant-HTML.Rmd'} else {'./introductions/13-NumericalQuant-LaTeX.Rmd'}}
```
## Introduction {#Chap13-Intro}
In the last chapter (Sect.\ \@ref(NHANESGraphs)), this RQ was posed:
> Among Americans, is the average direct HDL cholesterol concentration different for current smokers and non-smokers?
Graphs were used to understand the data in Sect.\ \@ref(NHANESGraphs).
Some features of the data displayed in graphs can be described *numerically*.
The purpose of this chapter is to learn how to numerically summarise *quantitative* data.
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/hamburger-2253349_640.jpg" height=140px width="200px"/>
</div>
::: {.example #DescribeQuantDataHDL name="Describing quantitative data"}
For the RQ above, the response variable (direct HDL cholesterol concentration) can be displayed using a histogram (Fig.\ \@ref(fig:NHANESdirectHDLHisto)).
What does the histogram tell us?
* *Average*: The average value is about 1.5 mmol/L.
* *Variation*: Values range from about 0.5 to 3 mmol/L, with some larger values.
* *Shape*: The [distribution](#GraphsOneQuant) is slightly skewed right.
* *Outliers*: Some large outliers are present (hard to see on the histogram).
Describing these features more precisely, with *numbers*, can be helpful.
:::
```{r NHANESdirectHDLHisto, fig.cap="The histogram of the direct HDL cholesterol concentration from the NHANES study. Large outliers exist but are hard to see.", fig.align="center", fig.width=5, fig.height=3.5}
hist(NHANES$DirectChol,
xlab = "Direct HDL cholesterol conc. (mmol/L)",
ylab = "Number of people",
las = 1,
ylim = c(0, 2500),
xlim = c(0, 4.5),
main = "Distribution of direct HDL cholesterol\nconc. in the NHANES data",
breaks = seq(0, 4.5, by = 0.25),
col = plot.colour)
box()
```
A number that describes a feature of a *population* is called a *parameter*.
The values of parameters are usually unknown.
In contrast, a number that describes a feature of a *sample* is called a *statistic* (App.\ \@ref(StatisticsAndParameters)).
That is:
* **P**opulations are numerically described by [**p**arameters](#StatisticsAndParameters), and values of parameters are usually unknown.
* **S**amples are numerically described by [**s**tatistics](#StatisticsAndParameters).
::: {.definition #Parameter name="Parameter"}
A [*parameter*](#StatisticsAndParameters) is a number, usually unknown, describing some feature of a population.
:::
::: {.definition #Statistic name="Statistic"}
A [*statistic*](#StatisticsAndParameters) is a number describing some feature of a sample (to estimate an unknown population *parameter*).
:::
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The RQ identifies the population, but in practice only one of the many possible samples is studied.
*Statistics* are estimates of *parameters*, and the value of the *statistic* is not the same for every possible *sample*.
:::
## Average values {#ComputeAverage}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-weather-109-240.png" height=30px width="50px"/>
</div>
The average (or *location*, or *central value*) for *quantitative sample data* can be described in many ways.
The most common are:
* the *sample mean* (or *sample arithmetic mean*), which estimates the population mean (Sect.\ \@ref(Mean)); and
* the *sample median*, which estimates the population median (Sect.\ \@ref(Median)).
In both cases, the population [parameter](#StatisticsAndParameters) is *estimated* by a sample [statistic](#StatisticsAndParameters).
Understanding [whether to use the mean or median is important](#CompareMeanMedian).
::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
The word 'average' can refer to either means, medians or other measures of centre.
Use the precise term 'mean' or 'median', rather than 'average', when possible!
:::
```{r}
data(MaryRiver)
Mn <- aggregate(MaryRiver$Mean,
by = list(MaryRiver$Month),
FUN = "mean",
na.rm = TRUE,
simplify = TRUE)
Mdn <- aggregate(MaryRiver$Mean,
by = list(MaryRiver$Month),
FUN = "median",
na.rm = TRUE,
simplify = TRUE)
Number <- aggregate(MaryRiver$Mean,
by = list(MaryRiver$Month),
FUN = function(x){length(x) - sum(is.na(x))},
simplify = TRUE)
MRsummary <- data.frame(Month = month.abb,
"Mean" = Mn[, 2],
"Median" = Mdn[, 2],
"Number of days" = Number[, 2])
```
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/river-4221472_640.jpg" height=140px width="200px"/>
</div>
::: {.example #Averages name="Averages"}
Consider the *daily* river flow volume ('streamflow') at the Mary River from 01 October 1959 to 17 January 2019.
The 'average' daily streamflow in February could be described using either the mean or the median:
* the *mean* daily streamflow is 1123.2ML.
* the *median* daily streamflow is 146.1ML.
These both summarise the same data (the 'average' daily streamflow in February), yet give very different answers.
This implies they measure the 'average' in different ways, and have different meanings.
Which is the best 'average' to use?
To decide, both measures of average will need to be studied.
:::
### Average: the mean {#Mean}
The mean of the population is denoted by $\mu$, and its value is almost always unknown.
Instead, the mean of the population is *estimated* by the mean of the sample, denoted by $\bar{x}$.
In this context, the unknown [*parameter*](#StatisticsAndParameters) is $\mu$, and the [*statistic*](#StatisticsAndParameters) is $\bar{x}$.
The sample mean *estimates* the population mean, and every one of the possible samples is likely to have a different sample mean.
:::: {.pronounceBox .pronounce data-latex="{iconmonstr-microphone-7-240.png}"}
::: {style="display: flex;"}
The Greek letter $\mu$ is pronounced 'myoo', as in a**mu**sing.
$\bar{x}$ is pronounced 'ex-bar'.
:::
::: {}
```{r}
htmltools::tags$video(src ="./Movies/mu.mp4",
width = "121",
loop = "FALSE",
controls = "controls",
loop = "loop",
style = "padding:5px; border: 2px solid gray;")
```
:::
::::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/jersey-cow-4269262_640.jpg" width="200px"/>
</div>
::: {.example #JerseyCows name="A small dataset to work with"}
To demonstrate ideas, consider a small dataset for answering this descriptive RQ: 'For mature Jersey cows, what is the average percentage butterfat in their milk?'
The population mean percentage butterfat (denoted $\mu$) from those cows is to be estimated.
Clearly, milk from every Jersey cow cannot be studied; a *sample* is studied [@data:hand:handbook; @data:sokal:biometry].
The unknown population mean is estimated using the sample mean ($\bar{x}$).
Measurements were taken from milk from 10 cows (Table\ \@ref(tab:BFatData)).
:::
```{r BFatData}
BFat <- c(4.8, 6.5, 5.2, 4.5, 5.2, 5.7, 5.4, 4.8, 5.2, 5.2)
BFat <- array( BFat,
dim = c(1, 10))
if( knitr::is_latex_output() ) {
kable(BFat,
format = "latex",
longtable = FALSE,
booktabs = TRUE,
caption = "The butterfat percentage from a sample of milk from 10 Jersey cows") %>%
add_header_above(header = c("Butterfat percentages" = 10),
bold = TRUE,
align = "c") %>%
kable_styling(font_size = 10)
}
if( knitr::is_html_output() ) {
out <- kable(BFat,
format = "html",
longtable = FALSE,
col.names = rep(" ", 10),
booktabs = TRUE,
caption = "The butterfat percentage from a sample of milk from 10 Jersey cows") %>%
add_header_above(header = c("Butterfat percentages" = 10),
bold = TRUE,
align = "c")
out
}
```
The sample mean is the 'balance point' of the
`r if (knitr::is_latex_output()) {
'observations (Figure\\ \\@ref(fig:MeansFigLATEX), left panel; the online version has an animation.)'
} else {
'observations. The animation below shows how the mean acts as the balance point.'
}`
Alternatively, the mean is the value such that the positive and negative distances of the observations from the mean add to
`r if (knitr::is_latex_output()) {
'zero (Fig.\\ \\@ref(fig:MeansFigLATEX), right panel; again, the online version has an animation.)'
} else {
'zero, as in the animation below.'
}`
Both of these explanations seem reasonable for identifying an 'average' of the data.
```{r animation.hook="gifski", interval=0.25, progress=TRUE, dev=if (is_latex_output()){"pdf"}else{"png"}}
source("R/showBalanceMean.R")
if (knitr::is_html_output()){
FL <- c( seq(5.0, 5.5, by = 0.025),
seq(5.5, 5.2, by =-0.025),
seq(5.2, 5.3, by = 0.025),
rep(5.25, 25) )
for (i in 1:length(FL)){
BalanceMean( FL[i],
numberImages = length(FL),
iteration = i )
}
}
```
```{r animation.hook="gifski", interval=0.20, dev=if (is_latex_output()){"pdf"}else{"png"}}
source("R/showFindMean.R")
if (knitr::is_html_output()){
for (i in 1:length(meanCandidates)){
FindMean( meanCandidates[i],
numberImages = length(meanCandidates), i)
}
}
```
```{r MeansFigLATEX, fig.align="center", fig.width=4.5, fig.height=3.25,fig.cap="Understandimg the mean. Left: The mean is the balance point of the data; right: the mean is the value such that the positive and negative distances sum to zero", out.width = c('50%', '5%', '40%'), fig.show='hold' }
if (knitr::is_latex_output()){
FL <- c( seq(5, 5.5, by = 0.025),
seq(5.5, 5.2, by = -0.025),
seq(5.2, 5.3, by = 0.025),
rep(5.25, 10) )
BalanceMean( FL[49],
numberImages = length(FL),
iteration = length(FL) )
}
knitr::include_graphics("OtherImages/SPACER.png")
if (knitr::is_latex_output()){
FindMean( meanCandidates[43],
numberImages = length(meanCandidates), 43 )
}
```
::: {.definition #Mean name="Mean"}
The *mean* is one way to measure the 'average' value of quantitative data.
The *arithmetic mean* can be considered as the 'balance point' of the data, or the value such that the positive and negative distances from the mean add to zero.
:::
To find the *value* of the sample mean:
* *Add* (denoted by $\sum$) all the observations (denoted by $x$); then
* *Divide* by the number of observations (denoted by $n$).
In symbols:
\[
\bar{x} = \frac{\sum x}{n}.
\]
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/jersey-cow-4269262_640.jpg" width="200px"/>
</div>
::: {.example #ComputeMean name="Computing a sample mean"}
For the Jersey cow data (Example\ \@ref(exm:JerseyCows)), an *estimate* of the population mean (i.e., the sample mean) percentage butterfat is found by summing all $n = 10$ observations and dividing by $n$:
\begin{align*}
\overline{x}
&= \frac{\sum x}{n} = \frac{4.8 + 6.5 + \cdots + 5.2}{10}\\
&= \frac{52.5}{10} = 5.25.
\end{align*}
The sample mean, the best estimate of the population mean, is 5.25 percent.
:::
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
For the butterfat data (Table\ \@ref(tab:BFatData)), what is the value of $\mu$, the *population* mean?\label{thinkBox:ValueOfMu}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
**We do not know!**
We know the value of the *sample* mean, but not the *population* mean.
We have an *estimate* of the value of the population mean by using the sample mean.
(If we already knew the value of the population mean, why would we *estimate* the value from an imperfect sample?)
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
Software (such as jamovi or SPSS) or a calculator (in *Statistics Mode*) is usually used to compute the sample mean.
However, knowing *how* these quantities are computed is important.
Software and calculators often produce numerical answers to many decimal places, some of which may not be meaningful or useful.
A simple but useful rule-of-thumb is to round to one or two more significant figures than the original data.
For example, the butterfat data are given to one decimal place.
The *sample mean* weight can be given to two decimal places: $\bar{x} = 5.25$%.
:::
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
A study of bats [@griffin1960echolocation] recorded the distance at which *Drosophila* were detected for $n = 11$ detections (Table\ \@ref(tab:BatData)).
Estimate the population mean distance at which bats detect the flies.\label{thinkBox:EstimateMu}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
The **estimate** of $\mu$ is $\bar{x} = 532/11 = 48.4$cm.
`r if (knitr::is_latex_output()) '-->'`
`r webexercises::unhide()`
:::
```{r BatData}
BatData <- c(62, 52, 68, 23, 34, 45, 27, 42, 83, 56, 40)
BatData <- array(BatData,
dim = c(1, 11))
if( knitr::is_latex_output() ) {
kable(BatData,
format = "latex",
longtable = FALSE,
booktabs = TRUE,
caption = "The distance at which small fruit flies were detected by bats, in cm") %>%
add_header_above(header = c("Detection distance (in cm)" = 11),
bold = TRUE,
align = "c") %>%
kable_styling(font_size = 10)
}
if( knitr::is_html_output() ) {
BatData <- c(62, 52, 68, 23, 34, 45, 27, 42, 83, 56, 40, NA)
BatData <- array(BatData,
dim = c(2,6))
kable(BatData,
format = "html",
longtable = FALSE,
booktabs = TRUE,
caption = "The distance at which small fruit flies were detected by bats, in cm",
col.names = rep("", 6) ) %>%
add_header_above(out,
header = c("Detection distance" = 6),
bold = TRUE,
align = "c")
}
```
### Average: the median {#Median}
A median is a value separating the largest $50$% of the data from the smallest $50$% of the data.
In a dataset with $n$ values, the median is *ordered observation number* $(n + 1)\div 2$.
The median is:
* **not** equal to $(n + 1)\div 2$.
* **not** halfway between the minimum and maximum values in the data.
::: {.tipBox .tip data-latex="{iconmonstr-info-6-240.png}"}
Many calculators cannot find the median.
The median has no commonly-used symbol.
:::
::: {.definition #Median name="Median"}
The **median** is one way to measure the 'average' value of data.
A *median* is a value such that half the values are larger than the median, and half the values are smaller than the median.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/jersey-cow-4269262_640.jpg" width="200px"/>
</div>
::: {.example #SampleMedian name="Find a sample median"}
To find a sample median for the Jersey cow data (Example\ \@ref(exm:JerseyCows)), first arrange the data *in numerical order* (Table\ \@ref(tab:JerseyCowsSorted)).
The median separates the larger 5 numbers from the smaller 5 numbers.
With $n = 10$ observations, the median is the ordered observation located between the fifth and sixth observations (i.e., at position $(10 + 1)/2 = 5.5$; the *median itself is not 5.5*).
So the sample median is between $5.2$ (ordered observation 5) and $5.2$ (ordered observation 6).
Since these values are the same, the sample median is $5.20$%.
:::
```{r JerseyCowsSorted}
BFat.sort <- sort( c(4.8, 6.5, 5.2, 4.5, 5.2, 5.7, 5.4, 4.8, 5.2, 5.2) )
BFat.sort <- matrix( BFat.sort,
ncol = 10,
byrow = TRUE)
if( knitr::is_latex_output() ) {
kable(BFat.sort,
format = "latex",
longtable = FALSE,
booktabs = TRUE,
caption = "The butterfat percentage from a sample of milk from 10 Jersey cows, in increasing order") %>%
add_header_above(header = c("Butterfat percentages" = 10),
bold = TRUE,
align = "c") %>%
kable_styling(font_size = 10)
}
if( knitr::is_html_output() ) {
out <- kable(BFat.sort,
format = "html",
longtable = FALSE,
col.names = rep(" ", 10),
booktabs = TRUE,
caption = "The butterfat percentage from a sample of milk from 10 Jersey cows, in increasing order")
if ( knitr::is_html_output()) {
add_header_above(out,
header = c("Butterfat percentages" = 10),
bold = TRUE,
align = "c")
} else {
out
}
}
```
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
For the butterfat data (Table\ \@ref(tab:JerseyCowsSorted)), what is the *population* median?\label{thinkBox:PopMedian}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
**We do not know!**
We know the value of the *sample* median, but not the *population* median.
We only have an *estimate* of the value of the population median.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
For the bat data (Table\ \@ref(tab:BatData)), estimate the population *median* distance at which bats detect the flies.\label{thinkBox:EstimateMedian}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
With $n = 11$, the median is the $(11 + 1)/2 = 6$th ordered value, which is 45cm.
`r if (knitr::is_latex_output()) '-->'`
`r webexercises::unhide()`
:::
To clarify:
* If the sample size $n$ is *odd* (such as the bats example), the median is the middle number when the observations are ordered.
* If the sample size $n$ is *even* (such as the butterfat example), the median is halfway between the two middle numbers, when the observations are ordered.
Some software uses slightly different rules when $n$ is even, producing slightly different values for the median.
### Which average to use? {#CompareMeanMedian}
```{r}
LessThanMean <- sum( MaryRiver$Mean[MaryRiver$Month == 2] < 1123.2,
na.rm = TRUE ) /
sum( !is.na(MaryRiver$Mean[MaryRiver$Month == 2]) )
```
Consider the daily streamflow at the Mary River (Bellbird Creek) during February (Example\ \@ref(exm:Averages)) again:
The *mean* daily streamflow is `r round(MRsummary[2, 2], 1)`ML, and the *median* daily streamflow is `r round(MRsummary[2, 3], 1)`ML.
Which is 'best' for measuring the average streamflow?
A dot chart of the daily streamflow (Fig.\ \@ref(fig:DailyStreamflow), using jittering) shows that the data are *very* highly right-skewed, with many *very* large outliers (presumably during flood events):
the maximum value is `r formatC( round(max(MaryRiver$Mean[MaryRiver$Month == 2], na.rm = TRUE), 0), format="fg", big.mark=',')`ML, more than *one hundred times* larger than the mean of `r formatC(round(MRsummary[2, 2], 1), big.mark=',')`ML.
In fact, about `r round(LessThanMean * 100)`% of the observations are *less* than the mean.
The extreme outliers are clear in the tabular summary too (Table\ \@ref(tab:DailyStreamflowTab)).
In contrast, about 50% the values are less than the median (by definition).
For these data, the mean is hardly a *central* value...
```{r DailyStreamflow, fig.cap="A dot plot of the daily streamflow at Mary River from 1960 to 2017, for February ($n = 1650$). The vertical grey line is the mean value. Many large outliers exist, so the data near zero are all squashed together. Note: The values have been jittered in the vertical direction.", fig.align="center", out.width='90%', fig.width=9, fig.height=3}
par(mfrow = c(1, 1))
set.seed(183763286)
stripchart(MaryRiver$Mean[MaryRiver$Month == 2],
main = "Dot plot of daily streamflow at\nMary Creek (Bell River) in Feb.",
xlab = "Daily streamflow (in ML)",
sub = "(From 01 October 1959 to 17 January 2019)",
method = "jitter",
pch = 19,
axes = FALSE,
col = blueTransparent,
jitter = 0.75,
ylim = c(0.2, 1.8),
cex = 0.5)
axis(side = 1,
at = seq(0, 160000, by = 50000),
labels = c("0",
"50,000",
"100,000",
"150,000"))
box()
abline( v = Mn[2,2],
col = "grey")
```
```{r DailyStreamflowTab}
out <- hist(MaryRiver$Mean[MaryRiver$Month == 2],
breaks = seq(0, 160000,
by = 20000),
right = FALSE,
plot = FALSE)
lowerLim <- out$breaks[ -length(out$breaks)]
upperLim <- out$breaks[ -1 ]
Intervals <- paste(format(lowerLim,
big.mark = ",",
scientific = FALSE),
"to under",
format(upperLim,
big.mark = ",",
scientific = FALSE) )
MaryRiverTable <- cbind(Intervals = Intervals,
Counts = out$counts)
if( knitr::is_latex_output() ) {
T1 <- knitr::kable(MaryRiverTable[1:4, ],
format = "latex",
col.names = c("Daily streamflow (ML)",
"Number of days"),
align = c("c", "r"),
booktabs = TRUE) %>%
row_spec(0, bold = TRUE)
T2 <- knitr::kable(MaryRiverTable[5:8, ],
format = "latex",
col.names = c("Daily streamflow (ML)",
"Number of days"),
align = c("c", "r"),
booktabs = TRUE) %>%
row_spec(0, bold = TRUE)
out2 <- knitr::kables(list(T1, T2),
format = "latex",
caption = "Mary River daily streamflow for February (in ML). The mean value is 1,123ML.",
label = "DailyStreamflowTab") %>%
kable_styling(font_size = 10)
prepareSideBySideTable(out2)
}
if( knitr::is_html_output() ) {
kable( MaryRiverTable,
format = "html",
col.names = c("Daily streamflow (ML)",
"Number of days"),
align = c("c", "r"),
booktabs = TRUE,
longtable = FALSE,
caption = "Mary River daily streamflow for February (in ML). The mean value is 1,123ML.") %>%
row_spec(0, bold = TRUE)
}
```
The streamflow data are *very* highly right skewed, which is important here:
* *Means* are best used for approximately symmetric data: the mean is influenced by outliers and skewness.
* *Medians* are best used for data that are highly skewed or contain outliers: the median is *not* influenced by outliers and skewness.
Means tend to be too large if the data contains large outliers or severe right skewness, and too small if the data contains small outliers or severe left skewness.
For the Mary River data, the large outliers---and the fact that they are so *extreme* and abundant---result in the mean being substantially influenced by the outliers.
This explains why the mean is much so larger than the median.
*The median is the better measure of average for these data.*
The mean is generally used if possible (for practical and mathematical reasons), and is the most commonly-used measure of location.
However, the mean is not always appropriate; the median *is not* influenced by outliers and skewness.
The mean and median are similar in approximately symmetric distributions.
Sometimes, quoting *both* the mean and the median may be appropriate.
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
An engineering study [@data:hald:statistical] examined a new building material to determine the average permeability time.
The time (in seconds) taken for water to permeate $n = 81$ pieces of material was taken.
Using a histogram of the data (Fig.\ \@ref(fig:Permeability)), estimate the value of the population mean and median.
Which 'average' would be best to use (for example, to quote an 'average' permeability time on a specification sheet)?\label{thinkBox:SpecSheet}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
The data are skewed, which suggests using the median.
In practice, a larger sample would be needed anyway before giving a value to use on a specification sheet.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
```{r Permeability, fig.cap="A histogram of the permeability of a type of building material", fig.align="center", fig.width=5, fig.height=3.5}
data(Perm)
hist(Perm$Perm,
xlab = "Permeability (in seconds)",
ylab = "Number of obs.",
las = 1,
xlim = c(0, 180),
breaks = seq(0, 180,
by = 10),
axes = FALSE,
col = plot.colour,
main = "Permeability of building material")
axis(side = 1,
at = seq(0, 180,
by = 20),
las = 2)
axis(side = 2,
las = 1)
```
## Variation {#Variation}
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Pics/iconmonstr-ruler-21-240.png" width="50px"/>
</div>
For quantitative data, the amount of *variation* in the bulk of the data should be described.
Many ways exist to measure the variation in a dataset, including:
* The *range*: very simple and simplistic, so not often used (Sect.\ \@ref(VariationRange)).
* The *standard deviation*: commonly used (Sect.\ \@ref(VariationStdDev)).
* The *interquartile range (or IQR)*: commonly used (Sect.\ \@ref(VariationIQR)).
* *Percentiles*: useful i0n specific situations (Sect.\ \@ref(VariationPercentiles)).
As always, a value computed from the *sample* (the [statistic](#StatisticsAndParameters)) estimates the unknown value in the *population* (the [parameter](#StatisticsAndParameters)), and every sample can produce a different estimate.
### Variation: the range {#VariationRange}
The range is the simplest measure of variation.
::: {.definition #Range name="Range"}
The range is the maximum value *minus* the minimum value.
:::
The range is not often used.
As it is computed only using the two extreme observations, it is highly influenced by outliers.
Sometimes, the *range* is given by stating both the maximum and the minimum value in the data instead of giving the *difference* between these values.
The range is measured in the same measurement units as the data, and is usually quoted with the median.
::: {.example #RangeEG name="The range"}
For Jersey cow data (Example\ \@ref(exm:JerseyCows)), the largest value is 6.5%, and the smallest value is 4.5%; hence
\[
\text{Range} = 6.5 - 4.5 = 2.0 \text{ percent}.
\]
So the sample median percentage butterfat is 5.20%, with a *range* of 2.00%.
:::
### Variation: the standard deviation {#VariationStdDev}
The population standard deviation is denoted by $\sigma$ (the [parameter](#StatisticsAndParameters)) and is estimated by the sample standard deviation $s$ (the [statistic](#StatisticsAndParameters)).
The standard deviation is the most commonly-used measure of variation, but is tedious to compute manually.
You will almost always find the sample standard deviation $s$ using computer software (e.g., jamovi or SPSS) or a calculator (in *Statistics Mode*)).
The *standard deviation* is (approximately) the mean distance that observations are from the mean.
This seems like a reasonable way to measure the amount of variation in data.
:::: {.pronounceBox .pronounce data-latex="{iconmonstr-microphone-7-240.png}"}
::: {style="display: flex;"}
The Greek letter $\sigma$ is pronounced 'sigma'.
:::
::: {}
```{r}
htmltools::tags$video(src ="./Movies/sigma.mp4",
width = "121",
loop = "FALSE",
controls = "controls",
loop = "loop",
style = "padding:5px; border: 2px solid gray;")
```
:::
::::
::: {.definition #StandardDeviation name="Standard deviation"}
The *standard deviation* is, approximately, the average distance of the observations from the mean.
:::
Even though *you do not have to use the formula* to calculate $s$, we will demonstrate to show exactly what $s$ calculates.
The formula is:
\[
s = \sqrt{ \frac{\sum(x - \bar{x})^2}{n - 1} },
\]
where $\bar{x}$ is the sample mean, $x$ represents the individual data values, $n$ is the sample size, and the symbol '$\sum$' means to *add up* (see Sect.\ \@ref(Mean)).
Using the formula requires these steps:
* calculate the sample mean: $\overline{x}$;
* calculate the *deviations* of each observation $x$ from the mean: $x - \bar{x}$;
* square these deviations (to make them all *positive* values): $(x - \bar{x})^2$;
* add these squared deviations: $\sum(x - \bar{x})^2$;
* divide the answer by $n - 1$;
* take the (positive) square root of the answer.
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
*You do not need to use the formula!*
You *must* know how to use software or a calculator to find the standard deviation, what the standard deviation measures, and how to use it.
:::
<div style="float:right; width: 222x; border: 1px; padding:10px">
<img src="Illustrations/jersey-cow-4269262_640.jpg" width="200px"/>
</div>
::: {.example #StdDev name="Standard deviation"}
For the Jersey cow data (Example\ \@ref(exm:JerseyCows)), the squared *deviations* of each observation from the mean of $5.25$ are shown in Fig.\ \@ref(fig:ShowVar).
The sum of the squared distances is 2.7650.
Then, the sample standard deviation is:
\[
s = \sqrt{\frac{2.765}{10 - 1}}
= \sqrt{ 0.3072222}
= 0.5542763.
\]
The sample mean butterfat is 5.25%, with a sample *standard deviation* of 0.554%.
:::
```{r ShowVar, fig.cap="The standard deviation is related to the sum of the squared-distances from the mean", fig.align="center",fig.width=6, out.width='60%'}
source("R/showVar.R")
showVar()
```
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
The standard deviation for Dataset\ A in Fig.\ \@ref(fig:TwoDatasets) is $s = 2$.
Will the standard deviation of Dataset\ B be: *smaller* than or *greater* than $2$?
Why?\label{thinkBox:CompareSD}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
The standard deviation is a bit like the average distance that observations are from the mean.
In Dataset\ B, more observations are closer to the mean, so the average distance would be a smaller number.
This suggests that the standard deviation for Dataset\ B will be **smaller** than the standard deviation for Dataset\ A.
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
```{r TwoDatasets, fig.height=2.0, fig.width=7.75, out.width='90%', fig.cap="Dotplots of two datasets", fig.align="center"}
### DOT CHARTS of two samples with similar mean, range but diff sd
set.seed(100010)
rescale <- function(x, from, to){
minx <- min(x)
maxx <- max(x)
slope <- (to - from) / ( maxx - minx )
intercept <- to - slope*maxx
y <- slope * x + intercept
y
}
len <- 50
tmp1 <- runif(len)
x1 <- rescale(tmp1, -4, 4)
y1.jitter <- jitter(rep(1, length(x1)))
par(mfrow = c(1, 2) )
par(mar = c(2, 0, 2, 1.5) + 0.1)
plot( y = y1.jitter,
x = x1,
pch = 1,
ylim = c( 0.95 * min(y1.jitter),
1.05 * max(y1.jitter) ),
ylab = "",
xlab = "Observations",
main = expression( bold(Dataset~A)*":"~standard~deviation~is~2),
axes = FALSE)
axis(side = 1)
tmp2 <- rt((len - 4), 5)
x2 <- c( rescale(tmp2, -1.5, 1.5), -2.5, 2.5, -4, 4)
y2.jitter <- jitter(rep(1, length(x2)))
par( mar = c (2, 1.5, 2, 0) + 0.1)
plot( y = y2.jitter,
x = x2,
pch = 1,
ylim = c( 0.95 * min(y2.jitter),
1.05 * max(y2.jitter) ),
ylab = "",
xlab = "Observations",
main = expression( bold(Dataset~B) ),
axes = FALSE)
axis(side = 1)
```
The sample standard deviation $s$ is:
* Positive (unless all observations are the same, when it is zero: *no* variation);
* Best used for (approximately) symmetric data;
* Usually quoted with the mean;
* The most commonly-used measure of variation;
* Measured in the same units as the data;
* Influenced by *skewness* and outliers, like the mean.
::: {.thinkBox .think data-latex="{iconmonstr-light-bulb-2-240.png}"}
Consider again the Jersey cow data (Example\ \@ref(exm:JerseyCows)).
Using your calculator's *Statistics Mode*, find the *population* standard deviation and the *sample* standard deviation.\label{thinkBox:MilkStats}
`r if (knitr::is_latex_output()) '<!--'`
`r webexercises::hide()`
The *population* standard deviation is unknown.
The best estimate is the *sample* standard deviation: $s = 0.554$%.
If you do not get this value, you may be pressing the wrong button on your calculator: seek help!
`r webexercises::unhide()`
`r if (knitr::is_latex_output()) '-->'`
:::
### Variation: the IQR {#VariationIQR}
The standard deviation uses the value of $\bar{x}$, so is affected by skewness like the sample mean.
A measure of variation *not* affected by skewness is the inter-quartile range, or IQR.
To understand the IQR, understanding *quartiles* is necessary.
::: {.definition #Quartiles name="Quartiles"}
*Quartiles* describe the shape of the data:
* The first quartile $Q_1$ is a value separating the smallest 25% of observations from the largest 75%.
The $Q_1$ is like the median of the *smaller* half of the data, halfway between the minimum value and the median.
* The second quartile $Q_2$ is a value separating the smallest 50% of observations from the largest 50%.
(This is the *median*.)
* The third quartile $Q_3$ is a value separating the smallest 75% of observations from the largest 25%.
The $Q_3$ is like the median of the *larger* half of the data, halfway between the median and the maximum value.
:::
Quartiles divide the data into four parts of approximately equal numbers of observations, and *a boxplot is a picture of the quartiles*.
The *inter-quartile range*, or the *IQR* is the difference between $Q_3$ and $Q_1$.
Since the IQR measures the range of the central 50% of the data, it is not influenced by outliers.
The IQR is measured in the same measurements units as the data.
::: {.definition #IQR name="IQR"}
The *IQR* is the range in which the middle 50% of the data lie: the difference between the third and the first quartiles.
:::
Quartiles were previously discussed in the context of boxplots (Sect.\ \@ref(Boxplot)).
For example, a boxplot of the egg-krill data (@data:Greenacre2016:reporting; Table\ \@ref(tab:KrillDataTable2)) was shown in Example \@ref(exm:KrillData), and repeated here (Fig.\ \@ref(fig:IQRKrill), left panel).
```{r KrillDataTable2}
Eggs.T <- c(0, 0, 1, 1, 3, 8, 8, 12, 18, 21, 26, 30, 35, 48, 50)
Eggs.C <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 8, 16, 20, 26, 31)
KrillEggs2 <- cbind( Eggs.T[1:4],
Eggs.T[5:8],
Eggs.T[9:12],
Eggs.T[13:16],
Eggs.C[1:4],
Eggs.C[5:8],
Eggs.C[9:12],
Eggs.T[13:16])
if( knitr::is_latex_output() ) {
kable( KrillEggs2,
format = "latex",
longtable = FALSE,
booktabs = TRUE,
align = c("c"),
caption = "The number of eggs laid by krill, for those in a treatment group and a control group") %>%
add_header_above(header = c("Treatment group " = 4,
" Control group" = 4),
bold = TRUE,
align = "c") %>%
kable_styling(font_size = 10)
}
if( knitr::is_html_output() ) {
out <- kable( KrillEggs2,
format = "html",
longtable = FALSE,
booktabs = TRUE,
align = c("c", "c"),
col.names = rep("", 8),
caption = "The number of eggs laid by krill, for those in a treatment group and for those in a control group")
if ( knitr::is_html_output()) {
kable_styling(out, full_width = FALSE) %>%
add_header_above(header = c("Treatment group" = 4,
"Control group" = 4),
bold = TRUE,
align = "c")
} else {
out
}
}
```
```{r}
KrillT.quantiles <- quantile(Eggs.T)
Eggs.C2 <- Eggs.C[ -which.max(Eggs.C)]
KrillC2.quantiles <- quantile(Eggs.C2)
```
```{r BoxplotKrillTreatment2HTML, fig.cap="A boxplot for the krill-egg data; the boxplot just for the treatment group", fig.width=5, fig.height=4, fig.align="center"}
if (knitr::is_html_output()) {
Eggs.quantiles <- quantile(Eggs.T)
out <- boxplot( list(Treatment = Eggs.T),
main = "A boxplot of the number of\neggs laid by krill: Treatment group",
las = 1,
pch = 19,