# Part 2 Agenda ------------------------------------------------------------
# Installing and loading packages
# Loading datasets
# Data Manipulation with dplyr
# Graphing with ggplot2
# Statistical analyses: t-test, linear regression
# Troubleshooting
# Installing and Loading Packages -----------------------------------------
# You can use existing functions in base R or create your own functions, and you
# can *also* use functions written by other people. Recall that R is
# open-source, meaning anyone can write functions that can be freely used by
# other people. It's like Wikipedia in that anyone can write something, but
# there are editors who check that edits make sense. When people publish their
# functions to be used by others, those functions are contained in *packages*.
# To use a package and the functions therein, you need to (1) install the
# package, (2) attach the package to your R session (what we have open right now
# is an R session).
# You only need to install the package once (or when it gets updates), but you
# need to attach it to every session you want to use it in. It's like any other
# software: you need to download Google Chrome once to have it on your computer
# (and sometimes you need to redownload to update it), but every time you want
# to use it you have to open it.
# Let's start with dplyr, a package for manipulating data. We'll install
# ggplot2, a package for graphing (short for "grammar of graphics 2"), later on.
# Don't worry about the details of the graphing right now.
# to install, need to do this once
install.packages("dplyr") # you need the quotation marks
# to attach to your session, need to do this every session
library(dplyr)
library("dplyr") # you can use quotation marks or not when attaching
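# A minimal sketch of one common pattern (not the only way to do it): install a
# package only when it is missing, so re-running the script doesn't reinstall it
# every time. requireNamespace() returns TRUE if the package is already
# installed.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)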
# dplyr stands for "data pliers," and is used for manipulating data. Let's
# practice on an example dataset, mtcars, which ships with base R (in the
# datasets package). This has information about cars' fuel consumption.
# Loading datasets --------------------------------------------------------
# mtcars is part of the datasets package, which R attaches automatically, so
# it's already available to us. A few useful functions for looking at your data
# in R:
head(mtcars) # look at the first few rows
tail(mtcars) # look at the last few rows
summary(mtcars) # get an overview of values
str(mtcars) # overview of data types
View(mtcars) # view the entire dataset
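# A few more base R functions that can help when getting to know a dataset,
# shown here as a quick sketch:
dim(mtcars)    # number of rows, then number of columns
names(mtcars)  # column names
mtcars$mpg     # pull out a single column with $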
# Data Manipulation with dplyr --------------------------------------------
# Unlike in Excel, you can't edit the data directly when Viewing it. Why?
# Because it's better to change it using your code in a reproducible way (no
# wondering if values were changed, forgetting what was changed or moved, etc.),
# and that's where dplyr comes in! It's an R package used for data manipulation,
# with a few main functions: filter (choosing certain rows), select (choosing
# certain columns), mutate (adding/changing columns), arrange (order of rows),
# summarize (summarizing, as the name suggests).
# Filter
# Our dataset has information about cars' number of cylinders. Generally, more
# cylinders = more power. We can use equality/relational statements to filter
# the dataset to cars that have more than/less than/equal to a certain number of
# cylinders.
filter(mtcars, cyl == 6) # equal to, notice the two ==
filter(mtcars, cyl < 6) # less than 6
filter(mtcars, cyl > 6) # greater than 6
filter(mtcars, cyl <= 6) # less than or equal to
# Take cyl == 6. We can see that there are 7 rows. But if we call mtcars, it's
# still the full dataset.
mtcars
# That's because we didn't save it to a new dataset. If we want to work with a
# new dataset, we would need to save it to a variable.
mtcars_6cyl <- filter(mtcars, cyl == 6)
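# A sketch of combining conditions in one filter: separate them with a comma
# (or &) for "and", use | for "or", and %in% to match any of several values.
library(dplyr) # attached earlier; repeated so this chunk runs on its own
filter(mtcars, cyl == 6, mpg > 20)   # 6 cylinders AND more than 20 mpg
filter(mtcars, cyl == 4 | cyl == 6)  # 4 OR 6 cylinders
filter(mtcars, cyl %in% c(4, 6))     # same rows as the line above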
# Select
# So filter chooses certain rows. Select chooses certain columns! Let's say we
# just wanted to look at the car's miles per gallon (mpg) and cylinders (cyl).
# either way works
select(mtcars, mpg, cyl)
select(mtcars, c(mpg, cyl))
# again, if we wanted a new dataset, we would have to save it
mtcars_mpg_cyl <- select(mtcars, c(mpg, cyl))
# You can also *deselect* certain columns using a minus sign.
select(mtcars, -mpg) # removes mpg
select(mtcars, -c(mpg, cyl)) # removes mpg and cyl
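# select also understands ranges of columns and "helper" functions. A short
# sketch of two of them (see ?select for the full list):
library(dplyr)
select(mtcars, mpg:hp)            # every column from mpg through hp
select(mtcars, starts_with("d"))  # columns whose names start with "d"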
# Mutate
# Mutate adds or changes column values. For example, we have the variable qsec
# which is the time it takes a car to go a quarter of a mile, in seconds. Maybe
# we want a rough estimate of how long it takes to go one mile (assuming the car
# keeps the same average pace). Let's multiply by four.
mutate(mtcars, milesec = 4*qsec)
# Group By
# Grouping is a powerful operation for data manipulation. Sometimes you want
# information about an entire group, for example maybe we want the average
# horsepower for each number of cylinders. To do that, we first group the
# dataset by number of cylinders, then use the mutate function to calculate
# average horsepower for each group, and then ungroup and go on with our day.
ungroup(mutate(group_by(mtcars, cyl), hp_average = mean(hp)))
# You might notice that it's getting a little difficult to read! This is where
# the "pipe" operator comes in. The original pipe looks like this %>% and comes
# from the magrittr package; you have to attach dplyr or magrittr to your
# session to use it. In R 4.1 and newer, there is also a pipe built into base R,
# which looks like |>. The pipe makes it easier to read your code, because it
# takes the result of what you just ran, and passes it on as the first argument
# of the next thing you're going to run. Our same operation looks like this
# using the pipe:
mtcars %>%                          # take the dataset and pass it forward
  group_by(cyl) %>%                 # group the dataset by cyl and pass it forward
  mutate(hp_average = mean(hp)) %>% # calculate the mean horsepower
  ungroup()                         # ungroup the dataset and print the result
# again, if you want to save the result of your calculation, you need to save it
# to a variable
mtcars_hp <- mtcars %>%
  group_by(cyl) %>%
  mutate(hp_average = mean(hp)) %>%
  ungroup()
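# For comparison, a sketch of the same calculation with the base R pipe |>.
# This assumes R 4.1 or newer; the name mtcars_hp_native is just for
# illustration.
library(dplyr)
mtcars_hp_native <- mtcars |>
  group_by(cyl) |>
  mutate(hp_average = mean(hp)) |>
  ungroup()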
# Arrange
# Arrange organizes the rows in ascending or descending order. The default is
# ascending, i.e., lowest values at the top. If we want to see the lowest
# horsepower cars first:
arrange(mtcars, hp)
mtcars %>% arrange(hp) # with the pipe
# highest horsepower, use desc() for descending
arrange(mtcars, desc(hp))
mtcars %>% arrange(desc(hp))
# Summarize/Summarise (either spelling works)
# Summarize creates a new data frame summarizing some data. If the data are
# ungrouped, it returns one row. If the data are grouped, it returns one row per
# group. See some functions in the documentation: ?summarize.
mtcars %>%
  summarize(mean = mean(hp),
            n = n())
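# A sketch of the grouped case: grouping by cyl first gives one row per number
# of cylinders. The names hp_by_cyl and mean_hp are just for illustration.
library(dplyr)
hp_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_hp = mean(hp),
            n = n())
hp_by_cyl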
# There are a bunch of other functions, as well, like renaming columns, removing
# them, doing rowwise operations, etc. https://dplyr.tidyverse.org/
# Graphing with ggplot2 ---------------------------------------------------
install.packages("ggplot2") # you have to do this once
library(ggplot2) # have to attach to session every time
# Note that I usually put ALL packages I'll be using at the very top of the
# document, along with the data I'll be using.
# ggplot stands for "grammar of graphics" plotting, which is the idea that just
# like languages have grammars and once you know that you can construct totally
# new sentences, graphs also have a "grammar," fundamental elements that you can
# use to build all kinds of graphs.
# The main ingredients in a graph are (1) the plotting region (this would be
# like a piece of paper, in the physical world) and (2) the shapes you're using
# in your graph (this is your type of graph). In ggplot2, the shapes are
# generally called "geom"s, and some examples of geoms are histograms,
# scatterplots, bar graphs, line graphs, boxplots, violin plots. Let's
# investigate some graphing with a second dataset, starwars, which is included
# with dplyr (where it has 87 characters and 14 variables) and has information
# about Star Wars characters like their height, mass, sex, species, etc. I
# saved it as a csv so we can practice loading files from outside of R.
# Option 1: relative filepath. RECOMMENDED.
starwars <- read.csv("starwars.csv")
# Option 2: absolute filepath. NOT RECOMMENDED. Your code will not work when
# someone else loads it because it will be referencing a specific filepath on
# *your* computer.
# starwars <- read.csv("/Users/yourname/R_tutorial/starwars.csv")
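# Relative filepaths are resolved against R's working directory. If read.csv
# complains that it cannot find the file, a quick check like this sketch can
# help narrow it down:
getwd()                       # the folder R is currently working in
file.exists("starwars.csv")   # TRUE if the file is in that folder
list.files(pattern = "csv$")  # csv files R can see in that folder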
# first few rows
head(starwars)
# ggplot() says "I'm making a plot", like getting out a piece of paper.
# geom_point() is the "point" geom, for a scatterplot. mapping = aes(...)
# describes what's on your axes.
ggplot(starwars) +
  geom_point(mapping = aes(x = height, y = mass))
# you can use mapping = aes() in the ggplot() function, in which case it applies
# to ALL geoms unless overwritten, or in an individual geom (just applies to
# that geom).
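# A sketch of that difference, using mtcars so it runs without the csv: x and y
# are set in ggplot() and shared by both geoms, while colour is set only inside
# geom_point(), so the fitted line ignores it. The name p is just for
# illustration.
library(ggplot2)
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(colour = factor(cyl))) +
  geom_smooth(method = "lm")
p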
# There's a lot of graphing, we're not going to come close to doing everything
# possible. There are some resources in the tutorial notes document for further
# learning. The main things we're going to touch on are a few basic shapes and
# colours.
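# Graph of height by sex. geom_col() stacks one bar segment per character, so
# each bar shows the summed heights for that sex.
ggplot(starwars, mapping = aes(x = sex, y = height)) +
  geom_col()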
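# If the average height per sex is what you want, one possible approach (a
# sketch, not the only way) is to summarize first and then plot the summary.
# The names height_by_sex and mean_height are just for illustration.
library(dplyr)
library(ggplot2)
height_by_sex <- starwars %>%
  group_by(sex) %>%
  summarize(mean_height = mean(height, na.rm = TRUE)) # na.rm skips missing heights
ggplot(height_by_sex, mapping = aes(x = sex, y = mean_height)) +
  geom_col()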
# Some geoms distinguish between colour (the outside edge) and fill (the inside)
ggplot(starwars, mapping = aes(x = sex, colour = sex)) +
  geom_bar()
ggplot(starwars, mapping = aes(x = sex, fill = sex)) +
  geom_bar()
# some don't distinguish
ggplot(starwars) +
  geom_point(mapping = aes(x = height, y = mass, colour = gender))
# you can also do data manipulation and then pipe into a graph. NOTE THE
# DIFFERENCE: ggplot uses plus signs (+) to join things together, not pipes
# (%>%).
starwars %>%
  ggplot() +
  geom_point(mapping = aes(x = height, y = mass, colour = gender))
starwars %>%
  filter(!is.na(height)) %>%
  ggplot() +
  geom_point(mapping = aes(x = height, y = mass, colour = gender))
# Statistical Analyses ----------------------------------------------------
# t-test
# Let's use a t-test to assess whether humans' masses are significantly
# different from those of other species. First, we need to create new datasets
# for humans and others.
humans <- starwars %>%
  filter(species == "Human") %>%
  select(name, mass, species)
others <- starwars %>%
  filter(species != "Human") %>%
  select(name, mass, species)
# Then we can compare weights using the t.test function included in base R via
# the stats package.
t.test(humans$mass, others$mass)
# Some assumption checks
# normality
hist(humans$mass)
hist(others$mass)
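# A sketch of the same comparison using t.test's formula interface, with the
# result saved so individual pieces can be pulled out. The names
# starwars_groups, is_human and result are just for illustration.
library(dplyr)
starwars_groups <- starwars %>%
  filter(!is.na(species), !is.na(mass)) %>%
  mutate(is_human = species == "Human")
result <- t.test(mass ~ is_human, data = starwars_groups)
result$p.value   # the p-value on its own
result$conf.int  # confidence interval for the difference in means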
# Linear regression
# you can use lm, also in the base stats package
# How are height and mass related?
model <- lm(mass ~ height, data = starwars)
summary(model)
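# A sketch of pulling numbers out of a fitted model instead of reading them off
# the summary. The model is refit under the illustrative name height_model so
# this chunk runs on its own; 180 is an arbitrary example height.
library(dplyr)
height_model <- lm(mass ~ height, data = starwars)
coef(height_model)     # intercept and slope
confint(height_model)  # 95% confidence intervals for both
predict(height_model, newdata = data.frame(height = 180))  # predicted mass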
# a linear regression with a categorical predictor is equivalent to an ANOVA
# for example, the effect of gender on height
model2 <- lm(height ~ gender, data = starwars)
summary(model2)
# R uses the first level alphabetically (feminine) as the reference, so the
# intercept is the mean height of feminine characters. gendermasculine is the
# difference between masculine and feminine characters, and the effect is not
# significant.
# Troubleshooting ---------------------------------------------------------
# You WILL run into problems/errors/bugs while coding. It's all part of the
# process, and you get better at coding by running into problems and then
# solving them. It sounds basic, but truly a huge part of troubleshooting and
# debugging is just Googling. There are tons of answers on StackExchange,
# Reddit, etc., and you can also ask ChatGPT. A good first step is to copy-paste
# your exact error message into Google. A few things to note:
# (1) Try to avoid copy-pasting solutions. Instead, look at what it says, try to
# understand, and then implement your own solution. It's fine to copy-paste
# sometimes, just try to copy-paste less as time goes on. It will help you
# really understand your code.
# (2) With ChatGPT, same thing: try to understand (or ask it to explain until
# you understand), and note that I personally have seen ChatGPT give nonsense,
# not-working code. You need to understand what you're trying to do before you
# can code it up or recognize when there are mistakes.
# (3) If you're asking a question on a forum like StackExchange, make sure you
# are very detailed about what you're trying to do, what solutions you have
# tried already, and include data that replicates your problem. Also make sure
# you've searched to see if someone has asked that question already. If you
# don't do these four things, you're less likely to get an answer.