diff --git a/preparation/Broom_R_Package.Rmd b/preparation/Broom_R_Package.Rmd
new file mode 100644
index 0000000..e007b45
--- /dev/null
+++ b/preparation/Broom_R_Package.Rmd
@@ -0,0 +1,341 @@
+---
+title: "Correlation"
+subtitle: "Statistics With R"
+author: "R Workshop"
+output:
+  prettydoc::html_pretty:
+    theme: cayman
+    highlight: github
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+require(prettydoc)
+require(tidyverse)
+require(janitor)
+```
+
+Broom
+============
+
+#### Convert Statistical Analysis Objects into Tidy Data Frames
+
+Convert statistical analysis objects from R into tidy data frames, so that they can more easily be combined, reshaped and otherwise processed with tools like 'dplyr', 'tidyr' and 'ggplot2'.
+
+The package provides three S3 generics:
+
+* **``tidy``**: summarizes a model's statistical findings, such as the coefficients of a regression;
+* **``augment``**: adds columns to the original data, such as predictions, residuals and cluster assignments;
+* **``glance``**: provides a one-row summary of model-level statistics.
+
+
+
+```R
+# Run once, then comment out
+install.packages("broom")
+
+library(magrittr)
+
+library(broom)
+```
+
+    Installing package into '/home/nbcommon/R'
+    (as 'lib' is unspecified)
+
+
+### Fit a Linear Model
+* Use a simple example based on the built-in `mtcars` data set
+* Save the fitted model as `myModel`
+
+
+```R
+lm(mpg ~ wt + cyl, data=mtcars)
+```
+
+
+
+    Call:
+    lm(formula = mpg ~ wt + cyl, data = mtcars)
+
+    Coefficients:
+    (Intercept)           wt          cyl
+         39.686       -3.191       -1.508
+
+
+
+
+```R
+myModel <- lm(mpg ~ wt + cyl, data=mtcars)
+summary(myModel)
+```
+
+
+
+    Call:
+    lm(formula = mpg ~ wt + cyl, data = mtcars)
+
+    Residuals:
+        Min      1Q  Median      3Q     Max
+    -4.2893 -1.5512 -0.4684  1.5743  6.1004
+
+    Coefficients:
+                Estimate Std. Error t value Pr(>|t|)
+    (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
+    wt           -3.1910     0.7569  -4.216 0.000222 ***
+    cyl          -1.5078     0.4147  -3.636 0.001064 **
+    ---
+    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+
+    Residual standard error: 2.568 on 29 degrees of freedom
+    Multiple R-squared:  0.8302,  Adjusted R-squared:  0.8185
+    F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
+
+
+
+#### 1. The `tidy()` function
+
+**``tidy()``** constructs a data frame that summarizes the model's statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
+
+
+
+```R
+tidy(myModel)
+```
+
+| term | estimate | std.error | statistic | p.value |
+|---|---|---|---|---|
+| (Intercept) | 39.686261 | 1.7149840 | 23.140893 | 3.043182e-20 |
+| wt | -3.190972 | 0.7569065 | -4.215808 | 2.220200e-04 |
+| cyl | -1.507795 | 0.4146883 | -3.635972 | 1.064282e-03 |
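Because `tidy()` returns an ordinary data frame, the usual data-frame tools apply to it directly. The following is a minimal sketch, assuming the `dplyr` package is installed; the name `myTidyCI` is introduced here for illustration only. It asks `tidy()` for confidence intervals through its `conf.int` argument and then keeps the non-intercept terms with a p-value below 0.05.

```R
library(broom)
library(dplyr)

# Refit the model from above so this chunk runs on its own
myModel <- lm(mpg ~ wt + cyl, data = mtcars)

# conf.int = TRUE adds conf.low and conf.high columns (95% level by default)
myTidyCI <- tidy(myModel, conf.int = TRUE)

# Keep the slope terms that are significant at the 5% level
myTidyCI %>%
  filter(term != "(Intercept)", p.value < 0.05) %>%
  select(term, estimate, conf.low, conf.high)
```

Both `wt` and `cyl` pass the filter here, in line with the p-values shown in the table above.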
+
+
+```R
+myTidyModel <- tidy(myModel)
+```
+
+
+```R
+class(myTidyModel)
+```
+
+
+'data.frame'
+
+
+
+```R
+names(myTidyModel)
+```
+
+1. 'term'
+2. 'estimate'
+3. 'std.error'
+4. 'statistic'
+5. 'p.value'
+
+
+```R
+myTidyModel$p.value %>% round(4)
+```
+
+1. 0
+2. 2e-04
+3. 0.0011
+
+
+#### 2. The `augment()` function
+
+**``augment()``** adds columns to the original data that was modeled.
+This includes predictions, residuals, and cluster assignments.
+
+
+
+```R
+myModel <- lm(mpg ~ wt + cyl, data = mtcars)
+augment(myModel)
+```
+
+| .rownames | mpg | wt | cyl | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
+|---|---|---|---|---|---|---|---|---|---|---|
+| Mazda RX4 | 21.0 | 2.620 | 6 | 22.27914 | 0.6011667 | -1.27914467 | 0.05482311 | 2.601105 | 0.0050772590 | -0.51244825 |
+| Mazda RX4 Wag | 21.0 | 2.875 | 6 | 21.46545 | 0.4976294 | -0.46544677 | 0.03756521 | 2.611423 | 0.0004442585 | -0.18478693 |
+| Datsun 710 | 22.8 | 2.320 | 4 | 26.25203 | 0.7252444 | -3.45202624 | 0.07978891 | 2.522911 | 0.0567764620 | -1.40157794 |
+| Hornet 4 Drive | 21.4 | 3.215 | 6 | 20.38052 | 0.4602669 | 1.01948376 | 0.03213611 | 2.605613 | 0.0018029260 | 0.40360828 |
+| Hornet Sportabout | 18.7 | 3.440 | 8 | 16.64696 | 0.7752706 | 2.05304242 | 0.09117599 | 2.581072 | 0.0235271472 | 0.83877393 |
+| Valiant | 18.1 | 3.460 | 6 | 19.59873 | 0.5178496 | -1.49872807 | 0.04068001 | 2.596911 | 0.0050205614 | -0.59597493 |
+| Duster 360 | 14.3 | 3.570 | 8 | 16.23213 | 0.7267482 | -1.93213120 | 0.08012014 | 2.585079 | 0.0178733213 | -0.78461743 |
+| Merc 240D | 24.4 | 3.190 | 4 | 23.47588 | 1.0000172 | 0.92411952 | 0.15170109 | 2.606073 | 0.0091033181 | 0.39078743 |
+| Merc 230 | 22.8 | 3.150 | 4 | 23.60352 | 0.9793969 | -0.80351937 | 0.14550945 | 2.607793 | 0.0065061176 | -0.33855530 |
+| Merc 280 | 19.2 | 3.440 | 6 | 19.66255 | 0.5108741 | -0.46254751 | 0.03959146 | 2.611439 | 0.0004643600 | -0.18382951 |
+| Merc 280C | 17.8 | 3.440 | 6 | 19.66255 | 0.5108741 | -1.86254751 | 0.03959146 | 2.588159 | 0.0075293380 | -0.74022926 |
+| Merc 450SE | 16.4 | 4.070 | 8 | 14.63665 | 0.6544576 | 1.76335487 | 0.06497359 | 2.590136 | 0.0116847953 | 0.71025562 |
+| Merc 450SL | 17.3 | 3.730 | 8 | 15.72158 | 0.6819424 | 1.57842434 | 0.07054547 | 2.594578 | 0.0102875723 | 0.63767089 |
+| Merc 450SLC | 15.2 | 3.780 | 8 | 15.56203 | 0.6718159 | -0.36202705 | 0.06846591 | 2.612000 | 0.0005228914 | -0.14609271 |
+| Cadillac Fleetwood | 10.4 | 5.250 | 8 | 10.87130 | 1.1525645 | -0.47129800 | 0.20151356 | 2.611060 | 0.0035498738 | -0.20542284 |
+| Lincoln Continental | 10.4 | 5.424 | 8 | 10.31607 | 1.2633704 | 0.08393115 | 0.24212252 | 2.612898 | 0.0001501537 | 0.03755005 |
+| Chrysler Imperial | 14.7 | 5.345 | 8 | 10.56816 | 1.2125441 | 4.13184435 | 0.22303287 | 2.458216 | 0.3189363624 | 1.82570047 |
+| Fiat 128 | 32.4 | 2.200 | 4 | 26.63494 | 0.7270859 | 5.76505710 | 0.08019462 | 2.353101 | 0.1592990291 | 2.34122168 |
+| Honda Civic | 30.4 | 1.615 | 4 | 28.50166 | 0.8820281 | 1.89833840 | 0.11801538 | 2.584888 | 0.0276449872 | 0.78728146 |
+| Toyota Corolla | 33.9 | 1.835 | 4 | 27.79965 | 0.7988791 | 6.10035227 | 0.09681350 | 2.314308 | 0.2233281268 | 2.50007531 |
+| Toyota Corona | 21.5 | 2.465 | 4 | 25.78934 | 0.7380797 | -4.28933528 | 0.08263810 | 2.472103 | 0.0913548207 | -1.74424120 |
+| Dodge Challenger | 15.5 | 3.520 | 8 | 16.39168 | 0.7442464 | -0.89167980 | 0.08402476 | 2.607023 | 0.0040263378 | -0.36287242 |
+| AMC Javelin | 15.2 | 3.435 | 8 | 16.66291 | 0.7773252 | -1.46291244 | 0.09165987 | 2.596811 | 0.0120218543 | -0.59783451 |
+| Camaro Z28 | 13.3 | 3.840 | 8 | 15.37057 | 0.6623197 | -2.07056872 | 0.06654404 | 2.581383 | 0.0165559199 | -0.83469849 |
+| Pontiac Firebird | 19.2 | 3.845 | 8 | 15.35461 | 0.6616629 | 3.84538614 | 0.06641213 | 2.502378 | 0.0569730451 | 1.55006265 |
+| Fiat X1-9 | 27.3 | 1.935 | 4 | 27.48055 | 0.7700721 | -0.18055052 | 0.08995733 | 2.612717 | 0.0001790454 | -0.07371481 |
+| Porsche 914-2 | 26.0 | 2.140 | 4 | 26.82640 | 0.7322422 | -0.82640123 | 0.08133608 | 2.607877 | 0.0033281614 | -0.33581456 |
+| Lotus Europa | 30.4 | 1.513 | 4 | 28.82714 | 0.9282190 | 1.57285924 | 0.13069974 | 2.593440 | 0.0216355209 | 0.65704006 |
+| Ford Pantera L | 15.8 | 3.170 | 8 | 17.50852 | 0.9023791 | -1.70852005 | 0.12352416 | 2.590102 | 0.0237336584 | -0.71078295 |
+| Ferrari Dino | 19.7 | 2.770 | 6 | 21.80050 | 0.5342815 | -2.10049885 | 0.04330261 | 2.581252 | 0.0105550987 | -0.83641545 |
+| Maserati Bora | 15.0 | 3.570 | 8 | 16.23213 | 0.7267482 | -1.23213120 | 0.08012014 | 2.601659 | 0.0072685192 | -0.50035506 |
+| Volvo 142E | 21.4 | 2.780 | 4 | 24.78418 | 0.8176667 | -3.38417906 | 0.10142065 | 2.524357 | 0.0727399065 | -1.39047125 |
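The per-observation columns that `augment()` adds can be sorted and filtered like any others. As a rough sketch, assuming `dplyr` is installed, the chunk below picks out the three cars with the largest Cook's distance. The names `myAugmented` and `mostInfluential` are illustrative. `any_of()` is used for `.rownames` because that column may not be present in every broom version.

```R
library(broom)
library(dplyr)

# Refit the model from above so this chunk runs on its own
myModel <- lm(mpg ~ wt + cyl, data = mtcars)

# One row per car; .cooksd holds Cook's distance for that observation
myAugmented <- augment(myModel)

# The three observations with the largest influence on the fit
mostInfluential <- myAugmented %>%
  arrange(desc(.cooksd)) %>%
  select(any_of(".rownames"), mpg, wt, cyl, .fitted, .resid, .cooksd) %>%
  head(3)

mostInfluential
```

The same data frame can be passed straight to `ggplot2`, for example to plot `.resid` against `.fitted`.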
+
+
+#### 3. The ``glance()`` function
+
+**``glance()``** constructs a concise one-row summary of the model. This typically contains values such as $R^2$, adjusted $R^2$, and residual standard error that are computed once for the entire model.
+
+
+
+```R
+glance(myModel)
+```
+
+| p.value | df | logLik |
+|---|---|---|
+| 6.808955e-12 | 3 | -74.00503 |
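Since `glance()` always returns a single row, the summaries of several models stack into one comparison table. The sketch below is illustrative only: `modelWt`, `modelWtCyl` and `modelComparison` are names introduced here, `dplyr` is assumed to be installed, and the column names `r.squared`, `adj.r.squared` and `AIC` are those that `glance()` reports for `lm` fits.

```R
library(broom)
library(dplyr)

# Two candidate models on the same data; the second is the model used above
modelWt    <- lm(mpg ~ wt, data = mtcars)
modelWtCyl <- lm(mpg ~ wt + cyl, data = mtcars)

# glance() gives one row per model, so the rows stack cleanly
modelComparison <- bind_rows(
  list(wt = glance(modelWt), wt_cyl = glance(modelWtCyl)),
  .id = "model"
)

modelComparison %>% select(model, r.squared, adj.r.squared, AIC)
```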
+
+
+## K-Means
+
+`kmeans()` starts from randomly chosen centres, so cluster sizes and labels can differ from one call to the next. The outputs below come from separate runs.
+
+
+```R
+kmeans(iris[, 1:4], centers = 3)
+```
+
+
+    K-means clustering with 3 clusters of sizes 96, 33, 21
+
+    Cluster means:
+      Sepal.Length Sepal.Width Petal.Length Petal.Width
+    1     6.314583    2.895833     4.973958   1.7031250
+    2     5.175758    3.624242     1.472727   0.2727273
+    3     4.738095    2.904762     1.790476   0.3523810
+
+    Clustering vector:
+      [1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2
+     [38] 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1
+     [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1
+    [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
+    [149] 1 1
+
+    Within cluster sum of squares by cluster:
+    [1] 118.651875   6.432121  17.669524
+     (between_SS / total_SS =  79.0 %)
+
+    Available components:
+
+    [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
+    [6] "betweenss"    "size"         "iter"         "ifault"
+
+
+
+```R
+KMmodel <- kmeans(iris[, 1:4], centers = 3)
+summary(KMmodel)
+```
+
+
+                 Length Class  Mode
+    cluster      150    -none- numeric
+    centers       12    -none- numeric
+    totss          1    -none- numeric
+    withinss       3    -none- numeric
+    tot.withinss   1    -none- numeric
+    betweenss      1    -none- numeric
+    size           3    -none- numeric
+    iter           1    -none- numeric
+    ifault         1    -none- numeric
+
+
+
+```R
+tidy(KMmodel)
+```
+
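+| x1 | x2 | x3 | x4 | size | withinss | cluster |
+|---|---|---|---|---|---|---|
+| 5.006000 | 3.428000 | 1.462000 | 0.246000 | 50 | 15.15100 | 1 |
+| 5.901613 | 2.748387 | 4.393548 | 1.433871 | 62 | 39.82097 | 2 |
+| 6.850000 | 3.073684 | 5.742105 | 2.071053 | 38 | 23.87947 | 3 |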
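The k-means outputs in this section come from separate random starts, which is why cluster sizes and numbering are not the same throughout. One way to get repeatable results is to fix the random seed before each fit. In this sketch the seed value, the `nstart = 25` setting and the names `fitA` and `fitB` are arbitrary choices made for illustration. Depending on the broom version, the centre columns may be labelled with the original variable names rather than `x1` to `x4`.

```R
library(broom)

# kmeans() picks random starting centres, so fix the seed for repeatable output
set.seed(2024)  # arbitrary seed, chosen only for illustration
fitA <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

set.seed(2024)
fitB <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Same seed, same data, same settings: the cluster assignments match
identical(fitA$cluster, fitB$cluster)

# One row per cluster: centre coordinates, size and within-cluster sum of squares
tidy(fitA)
```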
+
+
+```R
+# General form: augment(model, data)
+```
+
+
+```R
+augment(KMmodel, iris[, 1:4]) %>% head()
+```
+
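+| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | .cluster |
+|---|---|---|---|---|
+| 5.1 | 3.5 | 1.4 | 0.2 | 1 |
+| 4.9 | 3.0 | 1.4 | 0.2 | 1 |
+| 4.7 | 3.2 | 1.3 | 0.2 | 1 |
+| 4.6 | 3.1 | 1.5 | 0.2 | 1 |
+| 5.0 | 3.6 | 1.4 | 0.2 | 1 |
+| 5.4 | 3.9 | 1.7 | 0.4 | 1 |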
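Once each row carries its `.cluster` label, the assignments can be compared with a known grouping. The following sketch passes the full `iris` data frame to `augment()` so that the `Species` column travels with the labels. The seed and the names `clustered` and `clusterTable` are illustrative assumptions.

```R
library(broom)

set.seed(2024)  # arbitrary seed for repeatable clusters
KMmodel <- kmeans(iris[, 1:4], centers = 3)

# Augment the full data set so Species is kept next to .cluster
clustered <- augment(KMmodel, iris)

# Cross-tabulate assigned cluster against the known species
clusterTable <- table(cluster = clustered$.cluster, species = clustered$Species)
clusterTable
```

Cluster numbers are arbitrary labels, so a good clustering shows up as one dominant species per row rather than as matching row and column positions.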
+
+
+```R
+glance(KMmodel)
+```
+
+| totss | tot.withinss | betweenss | iter |
+|---|---|---|---|
+| 681.3706 | 78.85144 | 602.5192 | 2 |
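The one-row `glance()` summary is convenient when comparing fits with different numbers of clusters. As a hedged sketch, the chunk below fits k-means for k from 1 to 6 and collects `tot.withinss` for each fit. The range of k, the seed, `nstart = 25` and the names `kValues` and `elbow` are assumptions made for illustration, and `dplyr` is assumed to be installed.

```R
library(broom)
library(dplyr)

set.seed(2024)  # arbitrary seed for repeatable clusters

# Fit k-means for k = 1..6 and keep the one-row glance() summary of each fit
kValues <- 1:6
elbow <- bind_rows(lapply(kValues, function(k) {
  fit <- kmeans(iris[, 1:4], centers = k, nstart = 25)
  glance(fit) %>% mutate(k = k)
}))

# tot.withinss shrinks as k grows; look for where the drop levels off
elbow %>% select(k, tot.withinss, betweenss)
```

Plotting `tot.withinss` against `k` gives the usual elbow plot for choosing the number of clusters.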