---
title: "5: Transforming data"
subtitle: "Introduction to R for health data"
author: Andrea Mazzella [(GitHub)](https://github.com/andreamazzella)
---
------------------------------------------------------------------------
# Content
- Data transformation
  - calculate new variables
  - calculate age
  - categorise continuous variables
  - combine levels of a categorical variable
  - create a dummy variable
  - combine values of two categorical variables
------------------------------------------------------------------------
# Recap from topic 4
Using the built-in dataset `Theoph`, identify the time point(s) at which participant 11 had a plasma theophylline concentration higher than 7.5 mg/L (without scrolling through the dataset manually).
------------------------------------------------------------------------
Load packages
```{r}
library(tidyverse)
```
------------------------------------------------------------------------
# Data transformation
During data analysis, it's very common to need to derive new information from the raw data. For example:
- We have dates of birth, but we don't have ages.
- BMI is continuous, but we might want to analyse it divided into clinical categories.
- We have data on disability with three levels: able-bodied, mild disability, severe disability, but we want to change it to a binary variable (disability yes/no).
------------------------------------------------------------------------
## Calculate new variables
In `dplyr`, the function `mutate()` adds new variables (or changes existing ones). The first argument is the dataset (here supplied through the pipe), followed by the new variable name, an equals sign, and the formula used to calculate the values of the new variable. Formulae can use values from other columns.
As an example, let's convert height from its current unit (cm) into metres.
The following code reads: take dataset "dm", change it by adding a new column called "height_m", and populate its values with the corresponding "height" value on the same row, divided by 100.
```{r}
# dplyr
dm |> mutate(height_m = height / 100)
```
This change is temporary: the result is printed, but `dm` itself is not modified. If you want the column to be permanently added to the dataset, you need to assign the result back to `dm`.
```{r}
dm <- dm |> mutate(height_m = height / 100)
```
NB: There is no output to this. If you want to check whether it worked, you can have a look at the dataset. By default, new variables are appended at the end.
This can also be done in base R by exposing the variable with `$`, doing the calculation, and assigning it to a new *variable name* (not to the dataset).
```{r}
# Base R
# dm$height_m <- dm$height / 100
```
Whenever you calculate a new variable, I recommend having a quick sanity check to make sure that it did what you intended.
```{r}
# Sanity check
dm |> select(height, height_m)
```
*Exercise 1.*
Convert weight into pounds (1 kg = 2.20462 lbs) and add this to the dataset.
(Note: if you make a mistake and want to recalculate the variable, you don't need to drop the old one first. When you assign it again, R will overwrite the old variable.)
```{r}
#| label: new_variable
```
------------------------------------------------------------------------
## Calculate age
Now we can calculate each observation's age at the study start - let's say this is 05 Sep 2022.
Dealing with dates is tricky in any programming language. We'll come back to this in the next topic; for now, please know that:
- Computers think of dates as *number of days* from a specific time point.
- `lubridate::dmy()` transforms text to an R date.
```{r}
# Create a calculated age variable
dm <- dm |>
  mutate(
    # Calculate age as a period
    age_interval = interval(start = date_birth, end = dmy("05-Sep-2022")),
    age_period = as.period(age_interval),
    # Convert age period into decimal years
    age_duration = as.duration(age_period),
    age_decimal = as.numeric(age_duration, "years"),
    # Drop the decimal digits
    age_years = floor(age_decimal)
  )
# Sanity check
dm |> select(date_birth, starts_with("age"))
```
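
As a quick check of the two points about dates above, here is a minimal sketch on toy values. The date of birth `dob_toy` is made up for illustration, and the `%/% years(1)` shortcut is an alternative way of counting completed years, not the approach used in the chunk above.

```{r}
library(lubridate)

# dmy() turns day-month-year text into an R Date
study_start <- dmy("05-Sep-2022")
class(study_start)

# Under the hood, a Date is stored as the number of days since 1 Jan 1970
as.numeric(study_start)

# So subtracting two dates gives a number of days
study_start - dmy("04-Sep-2022")

# Hypothetical date of birth, one day short of a 32nd birthday
dob_toy <- dmy("06-Sep-1990")

# Alternative shortcut: number of completed years in the interval
age_toy <- interval(dob_toy, study_start) %/% years(1)
age_toy
```

Because the toy birthday falls one day after the study start, the completed age is still 31 rather than 32.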
*Exercise 2.*
1. Explore the distribution of age with a histogram.
2. What are the median age, Q1 and Q3?
```{r}
#| label: histogram
```
------------------------------------------------------------------------
## Categorise continuous variables
We might also want to transform a continuous variable into a categorical one, for example to reflect clinically meaningful groups.
Imagine we want to categorise BMI into underweight (BMI lower than 18.5), normal weight (18.5 to less than 25), overweight (25 to less than 30) and obese (30 or above). Let's first check the minimum and maximum BMI:
```{r}
summary(dm$bmi)
```
We can then use `mutate()`, this time with helper function `cut()` - we need to specify the `breaks`, i.e., which values delimit groups.
```{r}
dm <- dm |>
  mutate(bmi_group = cut(bmi, breaks = c(0, 18.5, 25, 30, Inf), right = FALSE))
# Check that the new variable was created correctly
dm |>
  group_by(bmi_group) |>
  summarise(min(bmi),
            max(bmi),
            n())
```
By default `cut()`:
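- creates a left-open and right-closed interval, i.e., (25,30] excludes 25 and includes 30. This is why the code above sets `right = FALSE`: the intervals become [25,30), so a BMI of exactly 25 counts as overweight.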
- creates labels in mathematical notation.
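
To see what the `right` argument changes, here is a small sketch on three made-up BMI values (`bmi_toy`) that sit exactly on the breaks. The object names are only for illustration.

```{r}
# Toy values sitting exactly on the category boundaries
bmi_toy <- c(18.5, 25, 30)
bmi_breaks <- c(0, 18.5, 25, 30, Inf)

# Default (right = TRUE): each boundary value falls in the lower group
default_groups <- cut(bmi_toy, breaks = bmi_breaks)
default_groups

# right = FALSE: each boundary value falls in the upper group
left_closed_groups <- cut(bmi_toy, breaks = bmi_breaks, right = FALSE)
left_closed_groups
```

With the default, a BMI of exactly 30 would be labelled (25,30], i.e. overweight; with `right = FALSE` it lands in [30,Inf).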
We can add custom labels to each group with the argument `labels =`. NB: `labels` should have one fewer element than `breaks`.
```{r}
dm <- dm |>
  mutate(bmi_group_new = cut(bmi, right = FALSE,
                             breaks = c(0, 18.5, 25, 30, Inf),
                             labels = c("underweight",
                                        "regular weight",
                                        "overweight",
                                        "obese")
  ))
dm |> count(bmi_group, bmi_group_new)
```
NB: categorising continuous variables, and especially dichotomising them, is potentially highly problematic - see this paper: Turner E, Dobson J & Pocock S. (2010). [Categorisation of continuous risk factors in epidemiological publications: A survey of current practice](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2972292/).
*Exercise 3.*
Categorise age into groups.
```{r}
#| label: categories
```
------------------------------------------------------------------------
## Combine levels of a categorical variable
Sometimes we might need to combine two or more levels of a categorical variable.
For example, the `continent` variable contains both South and North America; imagine we want to combine them into a single level, Americas.
To do this, we can use `mutate()` to create the new variable and `fct_collapse()` to combine the levels of the old one. This function, from package `forcats`, takes these arguments:
- the old variable
- the name of the first new level, followed by `=`, followed by a vector of the old levels to combine under it.
- (any other new manually defined names)
```{r}
# Explore levels
dm |> count(continent)
# Regroup
dm <- dm |>
  mutate(macrocontinent = fct_collapse(continent,
                                       americas = c("n_america", "s_america"),
                                       eurasia = c("europe", "asia")))
# Check it worked
dm |> count(continent, macrocontinent)
```
You might also want to lump together the levels with few observations - for example, there are fewer than 10 people in this dataset doing swimming, yoga, or gymnastics; let's lump them into an "other" category.
For this, we can use the `fct_lump_*()` family of functions from `forcats`.
```{r}
# Check before
dm |> count(exercise_type) |> arrange(desc(n))
# Lump
dm <- dm |> mutate(exercise_type_2 = fct_lump_min(exercise_type, 10))
# Sanity check
dm |> count(exercise_type, exercise_type_2) |> arrange(desc(n))
```
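
A minimal sketch of what `fct_lump_min()` does, on a made-up factor (`exercise_toy`, not taken from `dm`). By default the new level is called "Other"; the `other_level` argument is used here to rename it.

```{r}
library(forcats)

# Toy factor: two common levels and two rare ones
exercise_toy <- factor(c(rep("running", 12), rep("cycling", 10),
                         rep("yoga", 3), rep("swimming", 2)))

# Levels with fewer than 10 observations are lumped together
lumped <- fct_lump_min(exercise_toy, min = 10, other_level = "other")
table(lumped)
```

Cycling has exactly 10 observations, so it is kept; only levels strictly below the minimum are lumped.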
------------------------------------------------------------------------
### Create a dummy variable
If you want to reduce the levels to two (dummy variable), I recommend creating a logical variable (i.e., that contains only TRUE or FALSE), rather than a "yes"/"no" or 1/0 variable.
For example, the "disability" variable has three levels: "able-bodied", "mild" and "severe"; let's turn this into a `disabled` variable that can be either TRUE (= yes) or FALSE (= no).
```{r}
# Data exploration
dm |> count(disability)
# Regroup
dm <- dm |> mutate(disabled = disability == "mild" | disability == "severe")
# Check it worked
dm |> count(disability, disabled)
```
NB: do *not* use the `%in%` shortcut in these circumstances, otherwise missing values will be incorrectly recoded as FALSE!
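
The difference is easy to see on a toy vector (`disability_toy`, made up for illustration) that contains one missing value:

```{r}
# Toy vector with one missing value
disability_toy <- c("able-bodied", "mild", "severe", NA)

# Comparison with == keeps the missing value as NA
with_equals <- disability_toy == "mild" | disability_toy == "severe"
with_equals

# %in% never returns NA: the missing value becomes FALSE
with_in <- disability_toy %in% c("mild", "severe")
with_in
```

In the second result, a person with unknown disability status would be counted as not disabled.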
*Exercise 4.* (Pick one)
1. Create a new variable with three levels: eats meat, eats fish, doesn't eat either, using `diet` as reference.
2. Create a new binary variable that indicates whether a patient is inactive, using `exercise` as reference.
```{r}
#| label: group_levels
```
------------------------------------------------------------------------
## Combine values of two categorical variables
We can also create a new categorical variable that summarises two other categorical variables.
For example, we can create a new variable, `frailty`, that reflects the combination of disability and lack of exercise. Suppose we want to specify that someone with a mild disability is moderately frail if they don't exercise, and only mildly frail if they do.
To do this, we may use `case_when()`. This allows us to check for many logical statements and give specific values when these are met.
```{r}
# Cross-tabulation
dm |> count(disability, exercise)
# Create the new variable
dm <- dm |>
  mutate(frailty = case_when(disability == "severe" ~ "severely frail",
                             disability == "mild" & exercise == "none" ~ "moderately frail",
                             disability == "mild" & exercise != "none" ~ "mildly frail",
                             disability == "able-bodied" ~ "not frail",
                             is.na(disability) | is.na(exercise) ~ NA_character_
  ))
# Check it worked
dm |> count(disability, exercise, frailty)
```
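
Two properties of `case_when()` are worth keeping in mind: conditions are checked in order, and rows that match no condition get `NA`. A minimal sketch on a toy tibble (`frailty_toy`; the values, including the "daily" exercise level, are made up for illustration):

```{r}
library(dplyr)

# Toy data: one row per combination of interest, plus missing values
frailty_toy <- tibble(
  disability = c("severe", "mild", "mild", "able-bodied", NA),
  exercise   = c(NA, "none", "daily", "none", "daily")
)

frailty_toy <- frailty_toy |>
  mutate(frailty = case_when(
    disability == "severe" ~ "severely frail",
    disability == "mild" & exercise == "none" ~ "moderately frail",
    disability == "mild" & exercise != "none" ~ "mildly frail",
    disability == "able-bodied" ~ "not frail"
  ))

frailty_toy
```

The first row is matched by the first condition even though `exercise` is missing, while the last row matches nothing and is therefore `NA`.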
------------------------------------------------------------------------
*Exercise 5.*
1. Import the `bristol.rds` dataset.
2. Explore the data.
3. Categorise the type of stool in the Bristol scale into three levels: "constipation" (type 1-2), "normal" (type 3-5), "diarrhoea" (type 6-7).
4. Save a copy as `bristol_v2.rds`.
```{r}
#| label: recap
# Import
# Explore
# Categorise
# Export
```
------------------------------------------------------------------------
# Solutions
```{r}
#| label: solution_recap_topic4
theoph <- datasets::Theoph
theoph |> filter(Subject == 11 & conc > 7.5)
```
```{r}
#| label: solution_new_variable
dm <- dm |> mutate(pounds = weight * 2.20462)
```
```{r}
#| label: solution_histogram
# age_years is the age variable calculated earlier in this topic
ggplot(dm, aes(age_years, fill = gender)) + geom_histogram(bins = 20)
summary(dm$age_years)
```
```{r}
#| label: solution_categorise
# Summarise age
summary(dm$age_years)
# Categorise (age_years is the age variable calculated earlier in this topic)
dm <- dm |>
  mutate(age_grp = cut(age_years,
                       breaks = c(16, 18, 30, 60, 81),
                       labels = c("underage", "young adult", "adult", "older adult"),
                       right = FALSE))
# Sanity check
dm |>
  group_by(age_grp) |>
  summarise(min(age_years),
            max(age_years),
            n())
```
```{r}
#| label: solution_group_levels
dm |> count(diet)
dm |> count(exercise)
dm <- dm |>
  mutate(diet_new = fct_collapse(diet, no_meat = c("vegan", "vegetar")),
         inactive = exercise == "none")
dm |> count(exercise, inactive)
dm |> count(diet, diet_new)
```
```{r}
#| label: solution_recap
# Import
bristol <- read_rds("data/bristol.rds")
# Explore
View(bristol)
glimpse(bristol)
count(bristol, bristol_type)
# Categorise
bristol <- bristol |>
  mutate(stool_type = case_when(bristol_type <= 2 ~ "constipation",
                                between(bristol_type, 3, 5) ~ "normal",
                                bristol_type >= 6 ~ "diarrhoea"
  ))
# NB: this may also be done with cut()
# Check
bristol |> count(bristol_type, stool_type)
# Export
# bristol |> write_rds("data/bristol_v2.rds")
```
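
As the comment in the solution above notes, `cut()` could do the same job. A minimal sketch on a toy vector (`bristol_toy`), assuming the Bristol types are stored as whole numbers from 1 to 7:

```{r}
# Toy vector with the seven Bristol types
bristol_toy <- 1:7

# Left-closed intervals: [1,3) = types 1-2, [3,6) = types 3-5, [6,8) = types 6-7
stool_toy <- cut(bristol_toy,
                 breaks = c(1, 3, 6, 8),
                 right = FALSE,
                 labels = c("constipation", "normal", "diarrhoea"))

data.frame(bristol_toy, stool_toy)
```

Unlike `case_when()` with text values, `cut()` returns a factor, with levels in the order given in `labels`.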
------------------------------------------------------------------------