-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnw_eda_project.rmd
831 lines (661 loc) · 37.9 KB
/
nw_eda_project.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
Data Analysis of Red Wine Quality by Nathaniel Wharton
========================================================
## Introduction
A tidy dataset of variants of the Portuguese "Vinho Verde" wine were used for this analysis. The dataset comes from a 2009 study. It consists largely of sensory (output) variables and physicochemical (input) variables. This was used as the source for e.g: field descriptions:
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages
# suppressMessages(install.packages('ggplot2')) # for plots
# suppressMessages(install.packages('GGally')) # for correlation plots
#suppressMessages(install.packages('memisc')) # for linear model
#suppressMessages(install.packages('gridExtra')) # for linear model
library('ggplot2')
library('GGally')
library('memisc')
library('gridExtra')
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data}
# Load the red wine data
wine <- read.csv('wineQualityReds.csv')
```
# Univariate Plots Section
## About the Data
### Column Names
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_Column_Names}
names(wine)
```
### Summary of the Column Data and Data Types
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_Column_Data}
str(wine)
```
The data includes 1,599 observations and thirteen variables.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_Summary_Data}
summary(wine)
```
## Let's Look at histograms and boxplots of the columns:
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_fixed.acitity_Histogram}
grid.arrange(
ggplot(aes(x=fixed.acidity), data=wine) +
geom_histogram(bins=50) +
ggtitle("Distribution of Fixed Acidity"),
ggplot(aes(x="fixed.acidity", y=fixed.acidity),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Fixed Acidity"),
ncol = 2
)
```
Fixed.acidity is measured as the concentration (in g/dm^3) of tartaric acid. Most acids in wine fall in this category. We see a fairly normal distribution here.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_volatile.acidity_Histogram}
grid.arrange(
ggplot(aes(x=volatile.acidity), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Volatile Acidity"),
ggplot(aes(x="volatile.acidity", y=volatile.acidity),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Volatile Acidity"),
ncol=2
)
```
Volatile.acidity is the concentration (in g/dm^3) of acetic acid in the wine. Higher levels of this can lead to, "an unpleasant, vinegar taste.". Again, a fairly normal distribution with a bit of a long tail, slightly bi-nodal.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_citric.acid_Histogram}
grid.arrange(
ggplot(aes(x=citric.acid), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Citric Acid"),
ggplot(aes(x="citric.acid", y=citric.acid),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Citric Acid"),
ncol=2
)
```
Citric acid concentration is in (g/dm^3). Apparently it, "adds 'freshness' and flavor to wines". This looks like a sightly positively-skewed data. There appear to be a number of observations with very low levels of citric acid, and spikes at 0.25 g/dm^3 and 0.50 (g/dm^3). The outlier at 1.0 gives a longer-tail.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_residual.sugar_Histogram}
grid.arrange(
ggplot(aes(x=residual.sugar), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Residual Sugar"),
ggplot(aes(x="residual.sugar", y=residual.sugar),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Residual Sugar"),
ggplot(aes(x=residual.sugar), data=wine) +
geom_histogram(bins=70) +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
ggtitle("Distribution of log_10( Residual Sugar)"),
ncol=2
)
```
Residual sugar concentration (g/dm^3) has an early spike and a very long tail. It's the amount of sugar remaining after fermentation stops.
Taking the log of the residual sugar concentration smoothes out the distribution a bit.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_chlorides_Histogram}
grid.arrange(
ggplot(aes(x=chlorides), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Chlorides"),
ggplot(aes(x="chlorides", y=chlorides),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Chlorides"),
ggplot(aes(x=chlorides), data=wine) +
geom_histogram(bins=70) +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
ggtitle("Distribution of log10( Chlorides )"),
ncol=2
)
```
Chlorides represent the amount of salt in the wine. The concentration of sodium chloride (in g/dm^3). There is a spike of chlorides and a long tail of ourliers.
Transforming the value via a log, we see more of a spike with outliers than a bell curve.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_free.sulfur.dioxide_Histogram}
grid.arrange(
ggplot(aes(x=free.sulfur.dioxide), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Free Sulfur Dioxide"),
ggplot(aes(x="free.sulfur.dioxide", y=free.sulfur.dioxide),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Free Sulfur Dioxide"),
ncol=2
)
```
A positively-skewed distribution of free.sulfur.dioxide (in (mg/dm^3)) with a long tail is observed. Free.sulfur.dioxide prevents microbial growth and wine oxidation.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_total.sulfur.dioxide_Histogram}
grid.arrange(
ggplot(aes(x=total.sulfur.dioxide), data=wine) +
geom_histogram(bins=70) +
ggtitle("Distribution of Total Sulfur Dioxide"),
ggplot(aes(x="total.sulfur.dioxide", y=total.sulfur.dioxide),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Total Sulfur Dioxide"),
ggplot(aes(x=total.sulfur.dioxide), data=wine) +
geom_histogram(bins=50) +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
ggtitle("Distribution of log10 (Total Sulfur Dioxide)"),
ncol=2
)
```
The total amount of sulfur dioxide includes free and bound forms of S02 (in (mg/dm^3)). At concentrations of free SO2 over 50 ppm, SO2 becomes evident in the taste and nose of a wine. Here we see a positively-skewed distribution.
Taking the log, we see a more-normal distribution of the total.sulfur.dioxide.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_density_Histogram}
grid.arrange(
ggplot(aes(x=density), data=wine) +
geom_histogram(bins=60) +
ggtitle("Distribution of Density"),
ggplot(aes(x="density", y=density),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Density"),
ncol=2
)
```
Without transformation, we see a nice bell curve for density. Density here is in (g/cm^3). Apparently the density depends a bit on the percent of alcohol and sugar content.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_pH_Histogram}
grid.arrange(
ggplot(aes(x=pH), data=wine) +
geom_histogram(bins=60) +
ggtitle("Distribution of pH"),
ggplot(aes(x="pH", y=pH),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot pH"),
ncol = 2
)
```
We see anormal distribution of wines by pH. Not being previously familiar with the pH of wine, I was surprised to see it so ascidic (neutral water has a pH of 7).
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_sulphates_Histogram}
grid.arrange(
ggplot(aes(x=sulphates), data=wine) +
geom_histogram(bins=60) +
ggtitle("Distribution of Sulphates"),
ggplot(aes(x="sulphates", y=sulphates),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Sulphates"),
ggplot(aes(x=sulphates), data=wine) +
geom_histogram(bins=50) +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
ggtitle("Distribution of Log10( Sulphates )"),
ncol = 2
)
```
Sulphates measures concentration of potassium sulphate (in g/dm3). It's an additive that can contribute to S02 gas levels, and acts as an antimicrobial and antioxidant.
Sulphates are bit more normalized with log scale applied.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_alcohol_Histogram}
grid.arrange(
ggplot(aes(x=alcohol), data=wine) +
geom_histogram(bins=60) +
ggtitle("Distribution of % Alcohol"),
ggplot(aes(x="alcohol", y=alcohol),
data = wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_boxplot(alpha=1/5, color = 'red') +
ggtitle("Boxplot Alcohol"),
ggplot(aes(x=alcohol), data=wine) +
geom_histogram(bins=60) +
scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))) +
ggtitle("Distribution of log10 (% Alcohol)"),
ncol = 2
)
```
Alcohol (in % by volume) is positively skewed.
Alcohol maintains its positive skew even with a log transformation.
```{r echo=FALSE, message=FALSE, warning=FALSE, Univariate_Plots_quality_bar}
ggplot(aes(x=quality), data=wine) +
geom_bar() +
scale_x_continuous(breaks = seq(0,10,1)) +
ggtitle("Distribution of Quality")
```
Finally we get to quality scores (on a scale of 0 to 10). We see that overwhelmingly, most wines received a 5 or 6. Strikingly, no values for the extremes: 0,1,2 or 9 and 10 are represented.
# Univariate Analysis
### Dataset Structure
The dataset for red wine is in a tidy format with separate observations of a particular wine on one row. The dataframe is wide with separate columns for each variable.
The 'X' column is an id stored as an integer and Quality (the, "output variable") is stored as an integer value. All other columns ("input variables") contain measurements stored as double precision floating point numbers.
### What is/are the main feature(s) of interest in your dataset?
The main feature of interest in this dataset is quality. Quality is the single "output" variable that we can try to determine using the various "input" variables. Strikingly, though quality is supposed to be rated on a scale of 0-10, there are no observed values of 0,1,2 or 9 and 10 in the dataset. Most values are in the middle, either 5,6, or a 7, and there are a few observations of 3,4, and 8.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
Though potentially, any of the, "input variables" could help us understand the quality "output variables", the notes, proclaim we may see either positive or negative correlations between certain inputs and quality: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt,
Particularly we can look to see these:
- volatile acidity - when too high can lead to an, "unpleasant, vinegar taste".
- citric acid - can add, 'freshness' and flavor".
- total sulfur dioxide - "at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine""
### Did you create any new variables from existing variables in the dataset?
Later we see that 3 groupings were made for quality (rather than using the 0-10 scale present), but other than that, no new variables were created from the dataset (except to create a subset of data for correlations that didn't include the observation identifier).
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
Alcohol has an interesting non-normal distribution that wasn't much affected by log tranformations. Some plots were re-arranged to cast-out outliers, but generally few operations were performed to change the form of the data.
# Bivariate Plots Section
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_matrix_of_correlations}
theme_set(theme_minimal(4))
# set seed for reproducable results
set.seed(1836)
wine_subset <- wine[,c(2:13)]
# make correlation numbers smaller and viewable
g <- ggpairs(data = wine_subset,
upper = list(continuous = wrap("cor", size = 2))
)
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_matrix_of_correlations_2}
theme_set(theme_minimal(12))
# The same correlations can be seen clearer in this heatmap:
ggcorr(wine_subset, label = TRUE, hjust = 0.9, size=3)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_vs_volatile.acidity}
ggplot(aes(x=factor(quality), y=volatile.acidity), data=wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_smooth(method='lm', aes(group = 1)) +
geom_boxplot(alpha = 0.5,color = 'blue') +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
ggtitle("Quality vs. Volatile Acidity")
```
```{r echo=FALSE, message=FALSE, warning=FALSE,Bivariate_Plots_quality_cor_volatile.acidity}
cor.test(wine$quality, wine$volatile.acidity, method="pearson")$estimate
```
We see that (as predicted), **lower** mean volatile.acidity is correlated with higher wine quality.
This is in line with the general expectation that volatile.acidity is, "unpleasant" when increased. The orange line represents the mean. The blue dotted lines represent quantiles of 10%, 50% (the median), and 90%. This pattern will be repeated below.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_vs_citric_acid}
ggplot(aes(x=factor(quality), y=citric.acid), data=wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_smooth(method='lm', aes(group = 1)) +
ggtitle("Quality vs. Citric Acid Concentration") +
geom_boxplot(alpha = 0.5,color = 'blue') +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, ?Bivariate_Plots_quality_cor_citric.acid}
cor.test(wine$quality, wine$citric.acid, method="pearson")$estimate
```
We see that (as predicted), there is a small correlation that higher citric acid is correlated with higher wine quality.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_vs_total.sulfur.dioxide}
ggplot(aes(x=factor(quality), y=total.sulfur.dioxide), data=wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
ggtitle("Quality vs. Total Sulfur Dioxide") +
geom_boxplot(alpha = 0.5,color = 'blue') +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_cor_total.sulfur.dioxide}
cor.test(wine$quality, wine$total.sulfur.dioxide, method="pearson")$estimate
```
We see a small negative correlation of total.sulfur.dioxide, in-line with expectations.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_vs_alcohol}
ggplot(aes(x=factor(quality), y=alcohol), data=wine) +
geom_point(alpha=1/5, position = position_jitter()) +
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
ggtitle("Quality vs. % Alcohol") +
geom_boxplot(alpha = 0.5,color = 'blue') +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_quality_cor_alcohol}
cor.test(wine$quality, wine$alcohol, method="pearson")$estimate
```
The strongest correlation for quality was between % Alcohol Content and Quality with a Pearson's r of 0.48. The mean value of alcohol % increases with quality.
## Other Relationships Explored
Since the correlation between Alcohol and Quality was strongest, it was examined first. The strongest correlation appearred to be between it and density.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_alcohol_vs_density}
ggplot(aes(x=alcohol, y=density), data=wine) +
geom_point(alpha=1/10) +
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.1),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.5),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.9),
linetype = 2, # dashed line
color = 'blue'
) +
ggtitle("% Alcohol vs. Density")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_alcohol_cor_density}
cor.test(wine$alcohol, wine$density, method="pearson")$estimate
```
A negative correlation (r = -.50) was found between percent of alcohol and density. Thus, the more alcohol, the less density was seen.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_volatile.acidity_vs_citric.acid}
ggplot(aes(x=volatile.acidity, y=citric.acid), data=wine) +
geom_point(alpha=1/5) +
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
xlim(c(0,1.2)) + # trim outliers
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.1),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.5),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.9),
linetype = 2, # dashed line
color = 'blue'
) +
ggtitle("Volatile Acidity vs. Citric Acid")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_volatile.acidity_cor_citric.acid}
cor.test(wine$volatile.acidity, wine$citric.acid, method="pearson")$estimate
```
There was a fairly significant (r = -.55) negative correlation between volatile acidity and citric acid.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_free.sulfur.dioxide_vs_total.sulfur.dioxide}
ggplot(aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide), data=wine) +
geom_point(alpha=1/3) +
geom_smooth(method='lm', aes(group = 1)) +
xlim(c(0,45)) + # trim outliers
geom_line(stat = 'summary', fun.y= mean, color='orange') +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.1),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.5),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.9),
linetype = 2, # dashed line
color = 'blue'
) +
ggtitle("Free Sulfur Dioxide vs. Total Sulfur Dioxide")
```
```{r echo=FALSE, Bivariate_Plots_free.sulfur.dioxide_cor_free.sulfur.dioxide}
cor.test(wine$free.sulfur.dioxide, wine$total.sulfur.dioxide, method="pearson")$estimate
```
There was a significant positive correlation (r=.67) between free and total sulfur dioxide, though this was to be expected.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_density_vs_fixed_acidity}
ggplot(aes(x=density, y=fixed.acidity), data=wine) +
geom_point(alpha=1/3) +
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
xlim(c(0.993, 1.002)) + # trimmed outliers
# note removed quantiles because graph was unreadable with them.
ggtitle("Density vs Fixed Acidity")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_density_cor_fixed.acidity}
cor.test(wine$density, wine$fixed.acidity, method="pearson")$estimate
```
Another strong positive correlation (r = .67) was found between density and fixed acidity.
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_pH_vs_fixed_acidity}
ggplot(aes(x=pH, y=fixed.acidity), data=wine) +
geom_point(alpha=1/3) +
xlim(c(3,3.6)) + # trimmed outliers
geom_smooth(method='lm', aes(group = 1)) +
geom_line(stat = 'summary', fun.y= mean, color='orange') +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.1),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.5),
linetype = 2, # dashed line
color = 'blue'
) +
geom_line(stat = 'summary', fun.y= quantile,
fun.args = list(probs = 0.9),
linetype = 2, # dashed line
color = 'blue'
) +
ggtitle("pH vs. Fixed Acidity")
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots_pHy_cor_fixed.acidity}
cor.test(wine$pH, wine$fixed.acidity, method="pearson")$estimate
```
A strong negative correlation (r = -.68) in the data was found between pH and Fixed Acidity. This makes sense as lower pH is used to measure higher acidity.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
To summarize, correlations were discovered between Quality (the feature of interest) and:
- alcohol (r= .48) (strongest positive correlation)
- citric.acid (r = .23)
- volatile.acidity (r = -.39) (strongest negative correlation)
- total.sulfur.dioxide (r= -.19)
Generally each of these were expected except for the strong correlation with alcohol content and quality.
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
Correlations were also found between:
- alcohol and density (r=-0.50)
- volatile.acidity and citric acid (r= -.55)
- free.sulfur.dioxide and total.sulfur.dioxide (r = 0.67)
- density and fixed.acidity (r = .67)
- pH and fixed.acidity (r = -.68)
### What was the strongest relationship you found?
The highest correlation (r = -.68) in the data was the negative correlation found between pH and Fixed Acidity. This makes sense as lower pH values are used as a measurement of higher acidity.
# Multivariate Plots Section
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots_quality_vs_alcohol_vs_volatile.acidity}
alcohol_median <- median(wine$alcohol)
volatile.acidity_median <- median(wine$volatile.acidity)
ggplot(aes(x=alcohol, y=volatile.acidity, color=factor(quality)), data=wine) +
geom_point(alpha = 0.8, size=1) +
geom_smooth(method = "lm", se = FALSE, size=1) +
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=volatile.acidity_median), linetype = 2) + # dashed line
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) +
theme(plot.background = element_rect(fill = '#DDDDDD')) + # grey background
ggtitle("Multivariate: Alcohol vs Volatile Acidity vs Quality")
```
Three variables are plotted here -- analysis is below.
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots_quality_vs_grouped_alcohol_vs_volatile.acidity}
# group wine by quality:
wine$quality.bucket <- cut(wine$quality, c(0, 4, 6, 10))
alcohol_median <- median(wine$alcohol)
volatile.acidity_median <- median(wine$volatile.acidity)
ggplot(aes(x=alcohol, y=volatile.acidity, color=factor(quality.bucket)), data=wine) +
geom_point(alpha = 1) +
xlab("% Alcohol") +
ylab("Volatile Acidity (Acetic Acid in g/dm^3)") +
labs(color='Quality Rating') +
geom_smooth(method = "lm", se = FALSE, size=1) +
scale_color_brewer(type='seq') + # use palette = 'Set1' for vibrant colors
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=volatile.acidity_median), linetype = 2) + # dashed line
theme(plot.background = element_rect(fill = '#DEDEDE')) + # grey background
ggtitle("Red Wine: Alcohol vs Volatile Acidity vs Quality Rating Group")
```
Rather than just using variable quality levels, here the only difference is that quality was plotted in distinct groups. The aim was for the groupings to "pop out" a more. It is possible to see that higher qualities (the 6-10 bucket) tend to appear to the lower right of the graph. The 0-4 qualities tend to appear to the upper left, and the 4-6, average qualities are found in-between. The overall finding is that higher percent alcohol and lower volatile acidity tends to be associated with higher rated wine.
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots_q_vs_alcohol_vs_citric.acid}
alcohol_median <- median(wine$alcohol)
citric.acid_median <- median(wine$citric.acid)
ggplot(aes(x=alcohol, y=citric.acid, color=factor(quality)), data=wine) +
geom_point(alpha = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) +
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=citric.acid_median), linetype = 2) + # dashed line
theme(plot.background = element_rect(fill = '#DDDDDD')) + # grey background
ggtitle("Multivariate: Alcohol vs Citric Acid vs Quality")
```
Three variables are again plotted here -- analysis is below.
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots_grouped_q_vs_alcohol_vs_citric.acid}
alcohol_median <- median(wine$alcohol)
citric.acid_median <- median(wine$citric.acid)
ggplot(aes(x=alcohol, y=citric.acid, color=quality.bucket), data=wine) +
geom_point(alpha = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
xlab("Percent Alcohol") +
ylab("Citric Acid (g/dm^3)") +
labs(color='Quality Rating') +
scale_color_brewer(type='seq') + # use palette = 'Set1' for vibrant colors
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=citric.acid_median), linetype = 2) + # dashed line
theme(plot.background = element_rect(fill = '#DEDEDE')) + # grey background
ggtitle("Alcohol vs Citric Acid vs Quality Rating")
```
Here the quality bins are again used to show how citric acid, and percent alcohol affect quality. The high quality grouping lies to the upper right, and the low quality lies to the lower left (with a high variance in this case). Thus, we see again that higher citric acid levels and higher percent alcohol are both generally correlated with higher quality ratings. Of note, also is that there are many observations with no citric acid at all.
```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots_fsd_vs_tsd}
free.sulfur.dioxide_median <- median(wine$free.sulfur.dioxide)
total.sulfur.dioxide_median <- median(wine$total.sulfur.dioxide)
ggplot(aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide, color=quality.bucket), data=wine) +
geom_jitter(alpha=1/2, aes(color=factor(quality.bucket))) +
xlab("Free Sulfur Dioxide (mg/dm^3)") +
ylab("Total Sulfur Dioxide (mg/dm^3)") +
xlim(c(0,45)) + # trim outliers
ylim(c(0,150)) + # trim outliers
labs(color='Quality Rating') +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(palette = 'Set1') +
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) + # palette = 'Set1'
theme(plot.background = element_rect(fill = '#DDDDDD')) + # grey background
geom_vline(aes(xintercept=free.sulfur.dioxide_median), linetype = 2) + # dashed
geom_hline(aes(yintercept=total.sulfur.dioxide_median), linetype = 2) + # dashed
ggtitle("Red Wine: Free Sulfur Dioxide vs Total Sulfur Dioxide vs Quality Rating")
```
Analysis is below.
```{r echo=FALSE, Multivariate_Plots_linear_model}
# linear model
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + total.sulfur.dioxide)
m4 <- update(m3, ~ . + density)
mtable(m1, m2, m3, m4, sdigits = 3)
```
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
Looking at the plot of, "Alcohol vs Volatile Acidity vs Quality", it's evident that the higher quality wines tend to fall to the lower right of the graph and the lower quality wines fall to the upper left. Thus higher alcohol content and lower volatile.acidity is associated with higher-quality wines.
Also plotted was: Alcohol vs Citric Acid vs Quality and it's observed that the higher quality wines tended to the upper right quadrant - and lower quality wines fell in the lower left -- in line with expectations.
Finally, for, "Free Sulfur Dioxide vs Total Sulfur Dioxide vs Quality Rating"
More detail is given on the relationship between total sulfur dioxide and free sulfur dioxide (r = .67). The plot reveals that there is no clear relationship between them and quality as we observe great variance in quality plots. This result was not unexpected, but it was interesting to see what the plot looked like.
### Were there any interesting or surprising interactions between features?
### Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
As an exercise, a basic linear model was created, composed of alcohol, volatile.acidity, total.sulfur.dioxide, and density inputs. It largely did not perform very well, as its R-squared value was just 0.325. Oddly, adding citric.acid (r = .23) as an input didn't appear to improve the results in spite of its correlation with quality. Also attempted was adding the logarithm of total.sulfur.dioxide, but it did not improve the results.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One}
# group wine by quality:
wine$quality.bucket <- cut(wine$quality, c(0, 4, 6, 10))
alcohol_median <- median(wine$alcohol)
volatile.acidity_median <- median(wine$volatile.acidity)
ggplot(aes(x=alcohol, y=volatile.acidity, color=factor(quality.bucket)), data=wine) +
geom_point(alpha = 1) +
xlab("% Alcohol") +
ylab("Volatile Acidity (Acetic Acid in g/dm^3)") +
labs(color='Quality Rating') +
geom_smooth(method = "lm", se = FALSE, size=1) +
scale_color_brewer(type='seq') + # use palette = 'Set1' for vibrant colors
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=volatile.acidity_median), linetype = 2) + # dashed line
theme(plot.background = element_rect(fill = '#DDDDDD')) + # grey background
ggtitle("Red Wine: Alcohol vs Volatile Acidity vs Quality Rating Group")
```
### Description One
The previous analysis of this graph will be not be repeated here (see above for the previous description). We additionally see that by analyzing trend lines, with low quality wines (tending to appear to the upper left), volitile acidity tends to increase with an increased percent of alcohol, while the same trend is slightly opposite for average quality wines, and quite flat for high quality wines (tending to the lower right).
### Plot Two
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Two}
alcohol_median <- median(wine$alcohol)
citric.acid_median <- median(wine$citric.acid)
ggplot(aes(x=alcohol, y=citric.acid, color=quality.bucket), data=wine) +
geom_point(alpha = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) +
xlab("Percent Alcohol") +
ylab("Citric Acid (g/dm^3)") +
labs(color='Quality Rating') +
scale_color_brewer(type='seq') + # use palette = 'Set1' for vibrant colors
geom_vline(aes(xintercept=alcohol_median), linetype = 2) + # dashed line
geom_hline(aes(yintercept=citric.acid_median), linetype = 2) + # dashed line
theme(plot.background = element_rect(fill = '#DEDEDE')) + # grey background
ggtitle("Red Wine: Alcohol vs Citric Acid vs Quality Rating")
```
### Description Two
Adding to the previous analysis of this graph (see above), we see (via the trend lines) for the low and high quality wines, a decrease in percent of citric acid concentration with an increase in alcohol percentage. Oddly, there is little affect for the average quality wines. The position of the high quality wines having higher citric acid is clearly distinguished from low quality wines via the trend lines, though average quality wines tend almost converge with high quality wines at a level of 14% alcohol concentration.
### Plot Three
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three}
free.sulfur.dioxide_median <- median(wine$free.sulfur.dioxide)
total.sulfur.dioxide_median <- median(wine$total.sulfur.dioxide)
ggplot(aes(x=free.sulfur.dioxide, y=total.sulfur.dioxide, color=quality.bucket), data=wine) +
geom_jitter(alpha=1/2, aes(color=factor(quality.bucket))) +
xlab("Free Sulfur Dioxide (mg/dm^3)") +
ylab("Total Sulfur Dioxide (mg/dm^3)") +
xlim(c(0,45)) + # trim outliers
ylim(c(0,150)) + # trim outliers
labs(color='Quality Rating') +
geom_smooth(method = "lm", se = FALSE,size=1) +
scale_color_brewer(palette = 'Set1') +
scale_color_brewer(type='seq',
guide=guide_legend(title='Quality')) + # palette = 'Set1'
theme(plot.background = element_rect(fill = '#DDDDDD')) + # grey background
geom_vline(aes(xintercept=free.sulfur.dioxide_median), linetype = 2) + # dashed
geom_hline(aes(yintercept=total.sulfur.dioxide_median), linetype = 2) + # dashed
ggtitle("Red Wine: Free Sulfur Dioxide vs Total Sulfur Dioxide vs Quality Rating")
```
### Description Three
Again, past analysis will not be revisited here, but it can be clearly seen that the slope of the relationship between free sulfur dioxide and total sulfur dioxide is consistent for all three quality trendlines, thus giving further evidence of that we're seeing a relationship of dependent variables i.e: that we may be seeing different variables that hold a similar relationship.
------
# Reflection
Many of the data revalations in this study came from the correlation matrix between variables. The data plots largely verified / confirmed the correlations / distributions were valid and revealed additional data variances and outliers. It was interesting to see that though e.g: density had a relatively high correlation with alcohol (r = .50) and that alcohol had a high correlation with quality (r = .48) that density (and other influencing variables) did not have a high correlation with quality (r = -.18 ). The linear model did not work out as well as would have been ideal, as an R-squared of 0.325 has limited predictive utility.
Revealing plots were created for highly-correlated variables, such alcohol, volatile acidity and quality. Adding trendlines in the graph proved fruitful, further revealed the clear distinctions among different quality levels.
While the trendlines in the graph, "Multivariate: Alcohol vs Citric Acid vs Quality" were complex, trends were made clearer by grouping qualities in the graph, "Alcohol vs Citric Acid vs Quality Rating", a success.
I struggled for quite a while, researching how to adjust the font-size of the correlation matrix to make it readable. It was surprisingly complex to get a readable plot. The ggcorr function produced a more-useable heatmap/correlation matrix, though there were still some issues with the text (to the lower left).
I also struggled a bit with color palates and using the factor() function to enable proper distinct plotting of trendlines.
Other researched items in the sources (below) indicate other areas where internet research was used to implement fixes.
As for future work, it would be useful to have a richer dataset to test with e.g: the type of grape, the geographical location of the vineyard, the vintage, the type of cask the grapes were stored in, the label of the grape, to know which reviewer gave a review for each grape: e.g: reviewer A could have different tastes than reviewer B.,
### Sources:
Boxplot with 1 axis: https://stackoverflow.com/a/40700387/234975
Smaller text for correlation matrix: https://stackoverflow.com/a/39716408/234975
Legend text manipulation: https://stackoverflow.com/a/38938781/234975
Color palette setting: https://books.google.fr/books?id=_iVFgKTRYrQC&lpg=PA74&ots=XY9TTXtbgC&dq=ggplot%20%22smaller%20points%22&pg=PA77#v=onepage&q=ggplot%20%22smaller%20points%22&f=false
horizontal / vertical lines: http://www.cookbook-r.com/Graphs/Lines_(ggplot2)/
log ticks: http://ggplot2.tidyverse.org/reference/annotation_logticks.html
background color: https://stackoverflow.com/a/6736412/234975