-
Notifications
You must be signed in to change notification settings - Fork 109
/
Copy pathice_cream_survey.Rmd
205 lines (151 loc) · 9.53 KB
/
ice_cream_survey.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
# Ice Cream Survey
Jake Stamell
```{r, include=FALSE}
knitr::opts_chunk$set(echo=T, warning=F, message=F)
```
## Overview
### Description
For my community contribution, I sent out a short (4 question) survey to the class on their ice cream preferences. I asked age, country of origin, cup or cone, and favorite flavor. Country and flavor were open text responses, which I hoped would cause some variation in input requiring cleaning (misspellings, alternative flavor names, etc.).
### Goals of this community contribution
1) Share the ice cream preferences of our class!
2) Demonstrate how to clean messy text responses in order to ease analysis and visualization
3) Provide an example of visualizing multiple categorical variables
## Loading packages and reading in data
```{r}
library(googlesheets) # For accessing the responses
library(tidyverse) # For data cleaning and visualizing
library(data.table) # Alternative to dplyr
library(countrycode) # For handling country names
library(gridExtra) # For visualizations
```
```{r}
# This key should allow anyone to access the raw results
ice_cream_key <- "18uYxylZazzedLVzo4hSLEemQNKZ0uNse_V2aVyARvj8"
ice_cream <- gs_key(ice_cream_key)
ice_cream_responses <- ice_cream %>% gs_read(ws="Form Responses 1")
setDT(ice_cream_responses) # using data.table instead of dplyr
names(ice_cream_responses) <- c("Timestamp","Age","Country","Method","Flavor")
```
## Understanding what cleaning is required
For country, respondents were allowed to input whatever they wanted. This caused issues with USA and Mexico, where the former was submitted in multiple formats and the latter included the name with and without accents. A bigger issue is the number of countries with only one respondent, indicating that we will need to combine responses in some way.
```{r}
ice_cream_responses[,.N,by=Country][order(-N)]
```
For flavor, again respondents could input any text. As expected, this caused many variations on the same flavors and even one misspelling. Even after accounting for this, the same problem remains of having many categories with few responses. The approach I will take for summarizing this will be to group similar flavors (e.g. chocolate chip and chocolate chip cookie dough). I consider myself somewhat of an ice cream expert (making it is a hobby of mine); I will leverage this knowledge in grouping ice cream flavors.
```{r}
ice_cream_responses[,.N,by=Flavor][order(-N)]
```
Lastly, there are no issues with the data entry for age; however, we need to group it in some way as well.
```{r}
ice_cream_responses[order(Age),.N,by=Age]
```
## Cleaning and prepping the data
### Country
We start by removing the accent in Mexico. Then, we can take care of two issues at once: duplicate country names and too many countries with small number of responses. By leveraging the countrycode package, we can group countries by continent. This leaves a small number of categories, which will ease our visualizations. Unfortunately, I did not have too many European respondents for this survey so I will create a secondary continent variable that groups them with the Americas.
```{r}
ice_cream_responses[,country_encoding := Encoding(Country)]
ice_cream_responses[country_encoding=="UTF-8", Country := iconv(Country,from="UTF-8",to="ASCII//TRANSLIT")]
ice_cream_responses[,country_encoding := NULL]
ice_cream_responses[,Continent := fct_infreq(countrycode(sourcevar= Country, origin= "country.name", destination= "continent"))]
ice_cream_responses[Continent=="Asia", Continent2 := "Asia"]
ice_cream_responses[Continent!="Asia", Continent2 := "Americas\nand Europe"]
ice_cream_responses[,Continent2 := fct_infreq(Continent2)]
ice_cream_responses[,.N,by=Continent][order(-N)]
```
### Flavor
For flavor, we start by converting everything to lower case and making sure there is no extra whitespace. The approach for grouping flavors is to use regex to find specific strings and rename the flavors accordingly. Note the order used in "flavors_to_identify". Some of these will be matched multiple times (e.g. choc and chocolate chip will match chocolate chip cookie dough). Therefore, I have ordered it so that the last match is the one I want to assign and will be the one used. Also, note the use of "choc" to deal with the mispelling of "chocolate" in one response. (This is a little bit of a quick workaround so as to not have to explicitly deal with the issue.) I again create a secondary flavor variable with only 3 categories to compare chocolate-based ice cream against vanilla-based.
```{r}
ice_cream_responses[,Flavor := str_squish(str_to_lower(Flavor))]
flavors_to_identify <- c('strawberry', 'choc', 'vanilla', 'chocolate chip', 'mint')
flavor_names <- c('Strawberry', 'Chocolate', 'Vanilla', 'Choc chip', 'Mint')
flavor_names2 <- c('Other', 'Chocolate', 'Vanilla', 'Vanilla', 'Other')
for(i in seq(flavors_to_identify)){
ice_cream_responses[str_which(Flavor, flavors_to_identify[i]),
`:=` (Flavor_group=flavor_names[i],
Flavor_group2=flavor_names2[i])]
}
ice_cream_responses[is.na(Flavor_group), `:=` (Flavor_group="Other",
Flavor_group2="Other")]
ice_cream_responses[, `:=`(Flavor_group= fct_relevel(fct_infreq(Flavor_group),"Other",after=Inf),
Flavor_group2= fct_relevel(fct_infreq(Flavor_group2),"Other",after=Inf))]
ice_cream_responses[,.N,by=Flavor_group][order(-N)]
```
### Age
This is quick to clean as we can just split into 2 groups (basically recent grads and those who have some work experience).
```{r}
ice_cream_responses[Age < 23, Age_group := "<23"]
ice_cream_responses[Age >= 23, Age_group := "23+"]
ice_cream_responses[,Age_group := factor(Age_group, levels = c("<23","23+"))]
```
## Visualizing the data
### Getting an overview
We start with a basic plot to understand the distribution of the transformed variables.
```{r, fig.width=9, echo=F}
age_summary <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Age_group)) +
labs(title="Distribution of ages")
location_summary <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Continent)) +
labs(title="Distribution of continents")
method_summary <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Method)) +
labs(title="Distribution of cup vs. cone")
flavor_summary <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Flavor_group)) +
labs(title="Distribution of flavors") +
scale_y_continuous(breaks=scales::pretty_breaks())
grid.arrange(age_summary, location_summary, method_summary, flavor_summary)
```
Now using the simplified categories.
```{r, fig.width=9, echo=F}
location_summary2 <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Continent2)) +
labs(title="Distribution of continents")
flavor_summary2 <- ggplot(ice_cream_responses) +
geom_bar(aes(x=Flavor_group2)) +
labs(title="Distribution of flavors") +
scale_y_continuous(breaks=scales::pretty_breaks())
grid.arrange(age_summary, location_summary2, method_summary, flavor_summary2)
```
### Ice cream preferences by continent and age
Unforunately, even with the simplified bucketing of the variables, splitting the data by all 4 variables of interest reduces each category to a very small count.
```{r}
ice_cream_responses[,.N,by=.(Age_group,Continent2,Method,Flavor_group2)]
```
Therefore, we will need to compare 2 or 3 variables a at a time to find the interesting patterns.
```{r,echo=F, fig.width=5, fig.height=5}
ggplot(ice_cream_responses[,.N,by=.(Age_group,Continent2,Flavor_group2)]) +
geom_col(aes(x=Age_group,y=N,fill=Flavor_group2), position="dodge") +
facet_wrap(vars(Continent2),nrow=2) +
labs(y="count",
title="Flavor preference by age and continent")
ggplot(ice_cream_responses[,.N,by=.(Age_group,Continent2,Method)]) +
geom_col(aes(x=Age_group,y=N,fill=Method), position="dodge") +
facet_wrap(vars(Continent2),nrow=2) +
labs(y="count",
title="Cup/Cone preference by age and continent")
```
Even breaking down the responses by continent produces buckets that are a little too small for comparison. Still, it looks like respondents from Asia are more likely to favor "Other" flavor than the classics as compared to respondents from Americas/Europe. Additionally, <23/Asia and 23+/Americas,Europe may like cups more than the others. However, maybe there is something else at play?
```{r,echo=F, fig.width=5, fig.height=3}
ggplot(ice_cream_responses[,.N,by=.(Method,Flavor_group2)]) +
geom_col(aes(x=Flavor_group2,y=N,fill=Method), position="dodge") +
labs(y="count",
title="Cup/Cone preference by flavor preference")
```
Turns out people who like vanilla are more likely to also prefer a cup! Let's dig in to age now
```{r,echo=F, fig.width=5, fig.height=3}
ggplot(ice_cream_responses[,.N,by=.(Age_group,Flavor_group2)]) +
geom_col(aes(x=Age_group,y=N,fill=Flavor_group2), position="dodge") +
labs(y="count",
title="Flavor preference by age")
ggplot(ice_cream_responses[,.N,by=.(Age_group,Method)]) +
geom_col(aes(x=Age_group,y=N,fill=Method), position="dodge") +
labs(y="count",
title="Cup/Cone preference by age")
```
Age doesn't apear to play a major role in preferences. Maybe this isn't too surprising since I chose an arbitrary split for the two groups!
## Takeaways
N = 35 is very small when you have 4 variables of interest with multiple categories each. Analyzing respones required bucketing each variable of interest into 2 or 3 categories.
Chocolate and vanilla based ice creams are about equally preferred.
Cones are much more popular than cups. Respondents who liked vanilla were more likely to prefer cups.
Age did not play a major role in preferences, likely due to a small age range of 17 years.