forked from acatlin/FALL2020TIDYVERSE
-
Notifications
You must be signed in to change notification settings - Fork 0
/
M_Skonberg Tidyverse CREATE.Rmd
98 lines (62 loc) · 4.99 KB
/
M_Skonberg Tidyverse CREATE.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
---
title: "DATA 607 Tidyverse CREATE"
author: "Magnus Skonberg"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
toc_float: yes
number_sections: no
theme: cerulean
highlight: tango
font-family: Arial
pdf_document:
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Background
The purpose of this assignment is to create a programming sample "vignette" that demonstrates how to use one or more of the capabilities of the selected Tidyverse package with our selected dataset.
#### Data Source
The source of data is cited APA-style below:
* Marcos Pesotto. (2018). **Happiness and Alcohol Consumption** [Data file]. Retrieved from https://www.kaggle.com/marcospessotto/happiness-and-alcohol-consumption?select=HappinessAlcoholConsumption.csv
The data considers 122 nations and explores the correlation between happiness (via Happiness as well as other social indices) and alcohol consumption (via beer, spirit, and wine consumption per capita).
I thought it might be interesting to explore / visualize whether there was any relationship between happiness and alcohol consideration.
> Does drinking make us merry? OR does it drown down our mood?
That is the question ...
--------------------------------------------------------------------------------
### Load tidyverse library
First things first: load the tidyverse library.
```{r load-packages, include=TRUE}
library(tidyverse)
```
### Read .csv and display data
After downloading the .csv file from Kaggle and uploading it to Github, we read the corresponding data (in raw form) and then familiarize ourselves with the dataset by displaying column names and the 1st 6 observations.
```{r}
#Read .csv data
happy_alc <- read_csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/tidyverse/HappinessAlcoholConsumption.csv")
#Familiarize ourselves with the dataset
colnames(happy_alc)
head(happy_alc)
```
### Explore ggplot2
To steal from [the tidyverse site](https://ggplot2.tidyverse.org/):
*It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).*
Prior to visualizing our data, we mutated our dataset by adding a column. The operation performed, is available via the dplyr() library which is also a part of the tidyverse. The point of mutating our dataset (adding a column) was to gather the sum of 3 existing columns and store this sum in a new column. In this case, the sum of Beer_PerCapita, Spirit_PerCapita, and Wine_PerCapita was taken to form the $TotalAlc$ column. $TotalAlc$ was used to represent total alcohol consumption per capita.
Once we had our desired variables, we visualized our data by plotting $TotalAlc$ (on the y axis) vs. $HappinessScore$ (on the x axis). Each data point represents a different nation, but rather than having all 122 nations as different colors which would take away from being able to display the plot (because of a HUGE color scheme), we colored our data points based on region to provide guidance in interpreting regional happiness and drinking habits.
The labels of the plot were then updated using the labs() function. An informative title, subtitle, x, and y axis were allocated with the aim of reading these axes providing a casual reader / observer the ability to understand what the plot shows.
Once we had points, plot, and labels, all that remained was to fit a line to the plot to more clearly observe any trends in data. To do so, we used the geom_smooth() function to add a smoothed conditional regression line (where "lm" is provided as an argument to fit a linear model).
The result is shown below:
```{r}
#Take the sum of 3 columns (Beer_PerCapita, Spirit_PerCapita, and Wine_PerCapita) to form 1 column: Total
happy_alc_rev <- happy_alc %>%
mutate(TotalAlc = select(., Beer_PerCapita:Wine_PerCapita) %>% rowSums(na.rm = TRUE))
ggplot(happy_alc_rev, aes(x=HappinessScore, y=TotalAlc, color=Region)) +
geom_point() +
labs(title = "Alcohol Consumption vs. Happiness", subtitle = "(A visualization by region for 122 nations)", x = "Happiness Score", y = "Total Alcohol Consumption") +
geom_smooth(method=lm, color = "black")
```
### Analysis
Based on the above plot it seems there's a positive correlation between happiness and alcohol consumption. Thus drinking *may* make us merry. This could be for a number of reasons: better quality of life where there's expendable income to drink with, an active social scene where individuals drink, etc.
With that said, the spread of data is quite vast with high standard deviation and residual values. Thus **further investigation would be required to produce any conclusive findings.**