# Grattan's coding style and the `tidyverse` {#coding-style}
```{r message = FALSE, echo = FALSE}
library(tidyverse)
```
The benefits of a common coding style are well explained [by Hadley Wickham](http://r-pkgs.had.co.nz/style.html):
> Good style is important because while your code only has one author, it’ll usually have multiple readers. This is especially true when you’re writing code with others. In that case, it’s a good idea to agree on a common style up-front.
Below we describe the **key** elements of Grattan coding style, without being too tedious about it all. There are many elements of coding style we don't cover in this guide; if you're unsure about anything, [consult the `tidyverse` guide](https://style.tidyverse.org/).
You should also see the [Using R at Grattan](#organising-projects) page for guidelines about setting up your project.
A core principle for coding at Grattan is that your code should be **readable by humans**. For this reason, we prefer to use the `tidyverse` set of packages for data reading, manipulation and visualisation.
## The `tidyverse` {#tidyverse}
```{r tidyverse-hex, echo=FALSE}
knitr::include_graphics("atlas/tidyverse.png")
```
The `tidyverse` is central to our work at Grattan. The `tidyverse` is a [collection of related R packages](https://www.tidyverse.org/packages/) for importing, wrangling, exploring and visualising data in R. The packages are designed to work well together.
### Why do we use the tidyverse?
The `tidyverse` makes life easier!
The core `tidyverse` packages, like `ggplot2`, `dplyr`, and `tidyr`, are extremely popular. The `tidyverse` is probably the most popular 'dialect' of R. This means that any problem you encounter with the `tidyverse` will have been encountered many times before by other R users, so a solution will only be a Google search away.
The `tidyverse` packages are all designed to work well together, with a consistent underlying philosophy and design. This means that coding habits you learn with one `tidyverse` package, like `dplyr`, are also applicable to other packages, like `tidyr`.
They're designed to work with data frames^[The tidyverse works with 'tibbles', which are a tidyverse-specific variant of the data frame. Don't worry about the difference between tibbles and data frames.], a rectangular data object that will be familiar to spreadsheet users and that is intuitive and convenient for the sort of work we do at Grattan. In particular, the `tidyverse` is built around the concept of [*tidy data*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), which has a specific meaning in this context that we'll come to later. The fact that `tidyverse` packages are all built around one type of data object makes them easier to work with.
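To give a flavour of what 'tidy' means before we get there, here's a small sketch with made-up enrolment numbers: the same data stored in 'wide' form, then reshaped so that each row is a single observation. (The names and numbers are invented for illustration.)
```{r, eval = FALSE}
# The same (made-up) data in 'wide' form...
enrolments_wide <- tribble(
  ~school,    ~year_2020, ~year_2021,
  "School A",        250,        260,
  "School B",        410,        395
)

# ...reshaped into tidy form: one row per school-year observation
enrolments_tidy <- enrolments_wide %>%
  pivot_longer(cols = starts_with("year_"),
               names_to = "year",
               names_prefix = "year_",
               values_to = "enrolments")
```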
The creator of the `tidyverse`, Hadley Wickham, places great value on code that is expressive and comprehensible to humans. This means that code written in the `tidyverse` idiom is often able to be understood even if you haven't encountered the functions before. For example, look at this chunk of code:
```{r, eval = FALSE}
my_data %>%
  filter(age >= 30) %>%
  mutate(relative_income = income / mean(income))
```
Without knowing what `my_data` looks like, and even if you haven't encountered these functions before, this should be reasonably intuitive. We're taking some data, and then^[you can read the pipe, `%>%`, as 'and then'] only keeping observations that relate to people aged 30 and older, then calculating a new variable, `relative_income`. The name of a `tidyverse` function - like `filter`, `group_by`, `summarise`, and so on - generally gives you a pretty good idea what the function is going to do with your data, which isn't always the case with other approaches.
Here's one way to do the same thing in base R:
```{r, eval = FALSE}
transform(my_data[my_data$age >= 30, ],
          relative_income = income / mean(income))
```
The base R code gets the job done, but it's clunkier, less expressive, and harder to read. A core principle of coding at Grattan is that you should strive to make your work **human readable**.
Code written with `tidyverse` functions isn't always as fast as its base R equivalents. But most of our work at Grattan is with small-to-medium sized datasets (with fewer than a million rows or so), so speed isn't usually a major concern anyway.^[When working with very large datasets, it might be worth gaining speed using other packages, such as [`data.table`](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html). Fortunately, using the `dtplyr` package you can get most of the speed benefits of `data.table` while sticking to familiar `dplyr` syntax.]
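If you do hit one of those rare speed walls, the `dtplyr` approach mentioned in the footnote looks something like this (a sketch only -- `big_data`, `age`, `gender` and `income` are placeholder names):
```{r, eval = FALSE}
library(dtplyr)
library(data.table)

# Wrap the large data frame in a 'lazy' data.table
big_data_dt <- lazy_dt(big_data)

# Write ordinary dplyr; dtplyr translates it to data.table behind the scenes
big_data_dt %>%
  filter(age >= 30) %>%
  group_by(gender) %>%
  summarise(mean_income = mean(income, na.rm = TRUE)) %>%
  as_tibble()   # computation happens here
```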
The most valuable resource we deal with at Grattan is our time. Computers are cheap, people are not. If your code executes quickly, but it takes your colleague many hours to decipher it, the cost of the extra QC time more than outweighs the saving through faster computation. The `tidyverse` packages strike a balance between expressive, comprehensible code and computational efficiency that suits the nature of our work at Grattan. This balance is the right one for most of our work, most of the time.
Most R scripts at Grattan should start with `library(tidyverse)`. Most of your work will be in data frames, and most of the time the `tidyverse` contains the core tools you'll need to do that work.
## Load packages first
Our analysis scripts will almost always involve loading some [packages](#packages). These should be loaded at the top of a script, in one block like this:
```{r eval=FALSE}
library(tidyverse)
library(grattantheme)
```
If you're loading a package from Github, it's a good idea to leave a [comment](#use-comments) to say where it came from, like this:
```{r eval=FALSE}
library(tidyverse)
library(grattantheme) # remotes::install_github("grattan/grattantheme")
library(strayr) # remotes::install_github("runapp-aus/strayr")
```
Don't scatter `library()` calls throughout your script - put them all at the top.
The only thing that should come before loading your packages is the script preamble.
## Script preamble
Describe what your script does in the first few lines using comments or within an RMarkdown document.
**Good**
```{r, eval = FALSE}
# This script reads ABS data downloaded from TableBuilder and combines into a single data object containing post-secondary education levels by age and gender by SA3.
```
Your preamble might also pose a research question that the script will answer.
**Good**
```{r, eval = FALSE}
# Do women have higher levels of educational attainment than men, within the same geographical areas and age groups?
```
Your preamble shouldn't be a terse, inscrutable comment.
**Bad**
```{r, eval = FALSE}
# make ABS ed data graph
```
If it's hard to concisely describe what your script does in a few lines of plain English, that might be a sign that your script does too many things. Consider breaking your analysis into a series of scripts. See [Organising R Projects at Grattan](#organising-projects) for more.
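As a rough illustration (these file names are hypothetical), a multi-step analysis might be split into a few numbered scripts, each with its own short preamble:
```{r, eval = FALSE}
# 01_import.R  - read the raw ABS TableBuilder extracts
# 02_tidy.R    - clean the extracts and combine them into analysis-ready data
# 03_analyse.R - calculate attainment by age, gender and SA3
# 04_charts.R  - produce the charts for the report
```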
Your preamble should anticipate and answer any questions other people might have when reviewing your script. For example:
**Good**
```{r, eval = FALSE}
# This script calculates average income by age group and sex using the ABS Household Expenditure Survey and joins this to health information by age groups and sex from the National Health Survey. Note that we can't use the income variable in the NHS for this purpose, as it only contains information about respondents' income decile, not the income itself.
```
The preamble should pertain to the code contained in the specific script. If you have comments or information about your analysis as a whole, put them in your [README file](#README).
## Use comments {#use-comments}
Comments are necessary where the code _alone_ doesn't tell the full story. Comments should tell the reader **why** you're doing something, rather than just **what** you're doing.
For example, comments are important when groups are coded with numbers rather than character strings, because this might not be obvious to someone reading your script:
**Necessary to comment**
```{r, eval = FALSE}
data %>%
  filter(gender == 1,  # Keep only male observations
         age == "05")  # Keep only 35-39 year-olds.
```
Without the comment, readers of your code might not be aware that `1` in this dataset corresponds to `male`, or that `age == "05"` refers to 35-39 year olds. Without the comment, the code is not self-explanatory.
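An alternative (or complement) to commenting is to recode these values into self-explanatory labels near the top of your script, so that later code documents itself. A sketch, assuming the coding above (`1` = male and `"05"` = 35-39 year-olds):
```{r, eval = FALSE}
data <- data %>%
  mutate(gender = if_else(gender == 1, "Male", "Female"),  # assumes only codes 1 and 2
         age_group = case_when(age == "05" ~ "35-39",
                               TRUE        ~ "Other"))

# Later filters then read naturally without comments:
data %>%
  filter(gender == "Male",
         age_group == "35-39")
```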
If your code _is_ self-explanatory, you can include or omit comments as you see fit.
**Not necessary (but okay if included)**
```{r, eval = FALSE}
# We want to only look at women aged 35-39
data %>%
  filter(gender == "Female",
         age >= 35 & age <= 39)
```
You should also include comments where your code is more complex and may not be easily understood by the reader. If you're using a function from a package that isn't commonly used at Grattan, include a comment to explain what it does.
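For example (a hypothetical snippet -- `fct_lump_n()` comes from the `forcats` package and won't be familiar to everyone):
```{r, eval = FALSE}
data %>%
  # fct_lump_n() keeps the 5 most common industries and lumps the rest into "Other"
  mutate(industry = fct_lump_n(industry, n = 5))
```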
_Err on the side of commenting more_, rather than less, throughout your code. Something may seem obvious to you when you're writing your code, but it might not be obvious to the person reading your code, even if that person is you in the future. Better to over-comment than under-comment.
Comments can go above code chunks, or next to code - there are examples of both above.
Try to keep them up-to-date.
```{r code-comments, echo = FALSE, out.width='100%'}
knitr::include_graphics("atlas/code-comments.png")
```
## Breaking your script into chunks
It's useful to break a lengthy script into chunks with `-----` (four or more hyphens).
**Good**
```{r, eval = FALSE}
# Read file A -----
a <- read_csv("data/a.csv")
# Read file B -----
b <- read_csv("data/b.csv")
# Combine files A and B ----
ab <- bind_rows(a, b)
```
(In practice, you'll have more than one line of code in each block.)
This helps you, and others, navigate your code using the navigation tool built into RStudio. At the bottom left of the script editor pane, there's a little navigation menu that lets you jump straight to any named section of your script.
```{r rstudio_nav, echo = FALSE}
knitr::include_graphics("atlas/rstudio_navigation.png")
```
Breaking your script into chunks with `-----` also makes your code easier to read.
## Assigning values to objects
In R, you work with objects. An object might be a data frame, or a vector of numbers or letters, or a list. Functions can be objects, too.
**Use the `<-` operator to assign values to objects**. Here are some **good** examples:
```{r assignment-good, eval=FALSE}
schools <- read_csv("data/schools_data.csv")
three_letters <- c("a", "b", "c")
lf <- labour_force %>%
  filter(status != "NILF")
```
Avoid `->`, `=` and `assign()`. Here are some **bad** examples:
```{r assignment-bad, eval=FALSE}
schools = read_csv("data/schools_data.csv")
assign("three_letters", c("a", "b", "c"))
labour_force %>%
  filter(status != "NILF") -> lf
```
All these bad operators will work, but they are best avoided. The `=` operator is avoided for reasons of visual consistency, style, and to avoid confusion. `assign()` is avoided because it can lead to unexpected behaviour, and is usually not the best way to do what you want to do. The `->` operator is avoided because it's easy to miss when skimming over code.
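Part of the confusion is that `=` already has a separate job inside function calls, where it matches arguments rather than creating objects:
```{r, eval = FALSE}
mean(x = 1:5)   # passes 1:5 to mean()'s `x` argument; no object `x` is created
mean(x <- 1:5)  # same result, but also quietly creates an object `x` in your workspace
```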
The `<<-` operator should also be avoided.
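The problem with `<<-` is that it reaches outside the function it's used in and modifies objects in the wider environment as a side effect, which makes code hard to reason about. A minimal illustration:
```{r, eval = FALSE}
total <- 0

add_to_total <- function(x) {
  total <<- total + x   # silently changes `total` in the global environment
}

add_to_total(5)
total   # now 5 -- changed as a side effect of calling add_to_total()
```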
## Naming objects and variables
It's important to be consistent when naming things. This saves you time when writing code. If you use a consistent naming convention, you don't need to stop to remember if your object is called `ed_by_age` or `edByAge` or `ed.by.age`. Having a consistent naming convention across Grattan also makes it easy to read and QC each other's code.
Grattan uses _words separated by underscores_ `_` (aka 'snake_case') to name objects and variables. This is [common practice across the Tidyverse](https://style.tidyverse.org/syntax.html#object-names).
Object names should be descriptive and not-too-long. This is a trade-off, and one that's sometimes hard to get right. However, using snake_case provides consistency:
**Good object names**
```{r, eval = FALSE}
sa3_population
gdp_growth_vic
uni_attainment
```
**Bad object names**
```{r, eval = FALSE}
sa3Pop
GDPgrowthVIC
uni.attainment
```
Variable names face a similar trade-off. Again, try to be descriptive and short using snake_case:
**Good variable names**
```{r, eval = FALSE}
gender
gdp_growth
highest_edu
```
**Bad variable names**
```{r, eval = FALSE}
s801LHSAA
gdp.growth
highEdu
chaosVar_name.silly
var2
```
When you load data from outside Grattan, such as ABS microdata, variables will often have bad names. It is worth taking the time at the top of your script to [rename your variables](https://dplyr.tidyverse.org/reference/select.html), giving them consistent, descriptive, short, snake_case names. An easy way to do this is with the `clean_names()` function from the `janitor` package:
```{r janitor-example}
df_with_bad_names <- data.frame(firstColumn = c(1:3),
                                Second.column = c(4:6))
df_with_good_names <- janitor::clean_names(df_with_bad_names)
df_with_good_names
```
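`clean_names()` fixes the formatting of names, but it won't make cryptic names descriptive. For that, `dplyr::rename()` is the usual tool -- a sketch with hypothetical survey data (`survey_raw` and its columns are made up, reusing the cryptic names shown elsewhere in this guide):
```{r, eval = FALSE}
survey <- survey_raw %>%
  rename(gender = s801LHSAA,       # new_name = old_name
         highest_edu = high.ed)
```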
The most important thing is that your code is internally consistent - you should stick to one naming convention for all your objects and variables. Using snake_case, which we strongly recommend, reduces friction for other people reading and editing your code. Using short names saves effort when coding. Using descriptive names makes your code easier to read and understand.
## Spacing
Giving your code room to breathe greatly helps readability for future-you and others who will have to read your code. Code without ample whitespace is hard to read, justasitishardertoreadEnglishsentenceswithoutspaces.
### Assign and equals
Put a space on each side of the assignment operator `<-`, the equals sign `=`, and other 'infix operators' (`==`, `+`, `-`, and so on).
**Good**
```{r, eval = FALSE}
uni_attainment <- filter(data, age == 25, gender == "Female")
```
**Bad**
```{r, eval = FALSE}
uni_attainment<-filter(data,age==25,gender=="Female")
```
Exceptions are operators that _directly connect_ to an object, package or function, which should **not** have spaces on either side: `::`, `$`, `@`, `[`, `[[`, etc.
**Good**
```{r, eval = FALSE}
uni_attainment$gender
uni_attainment$age[1:10]
readabs::read_abs()
```
**Bad**
```{r, eval = FALSE}
uni_attainment $ gender
uni_attainment$ age [ 1 : 10]
readabs :: read_abs()
```
### Commas
Always put a space _after_ a comma and not before, just like in regular English.
**Good**
```{r, eval = FALSE}
select(data, age, gender, sa2, sa3)
```
**Bad**
```{r, eval = FALSE}
select(data,age,gender,sa2,sa3)
select(data ,age ,gender ,sa2 ,sa3)
```
### Parentheses
Do not use spaces around parentheses in most cases:
**Good**
```{r, eval = FALSE}
mean(x, na.rm = TRUE)
```
**Bad**
```{r, eval = FALSE}
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
```
For spacing rules around `if`, `for`, `while`, and `function`, see [the Tidyverse guide](https://style.tidyverse.org/syntax.html#parentheses).
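Briefly, the Tidyverse convention is a space before the opening parenthesis of `if`, `for` and `while` (unlike a function call), and a space before the opening `{`:
```{r, eval = FALSE}
if (income > 0) {
  income_status <- "positive"
} else {
  income_status <- "zero or negative"
}

for (year in 2019:2021) {
  print(year)
}
```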
## Short lines, line indentation and the pipe `%>%`
Keeping your lines of code short and indenting them in a consistent way can help make reading code much easier. If you are supplying multiple arguments to a function, it's generally a good idea to put each argument on a new line - hit enter/return after the comma, like in the `rename` and `filter` examples below. Indentation makes it clear where a code block starts and finishes.
Using pipes (`%>%`) instead of nesting functions also makes things clearer.^[The pipe is from the `magrittr` package and is used to chain functions together, so that the output from one function becomes the input to the next function. The pipe is loaded as part of the [`tidyverse`](#tidyverse).] The pipe should always have a space before it, and should generally be followed by a new line, as in this example:
**Good: short lines and indentation**
```{r, eval = FALSE}
young_qual_income <- data %>%
  rename(gender = s801LHSAA,
         uni_attainment = high.ed) %>%
  filter(income > 0,
         age >= 25 & age <= 34) %>%
  group_by(gender,
           uni_attainment) %>%
  summarise(mean_income = mean(income,
                               na.rm = TRUE))
```
Without indentation, the code is harder to read. It's not clear where the chunk starts and finishes, and which bits of code are arguments to which functions.
**Bad: short lines, no indentation**
```{r, eval = FALSE}
young_qual_income <- data %>%
rename(gender = s801LHSAA,
uni_attainment = high.ed) %>%
filter(income > 0,
age >= 25 & age <= 34) %>%
group_by(gender, uni_attainment) %>%
summarise(mean_income = mean(income, na.rm = TRUE))
```
Long lines are also bad and hard to read.
**Bad: long lines**
```{r, eval = FALSE}
young_qual_income <- data %>% rename(gender = s801LHSAA, uni_attainment = high.ed) %>% filter(income > 0, age >= 25 & age <= 34) %>% group_by(gender, uni_attainment) %>% summarise(mean_income = mean(income, na.rm = TRUE))
```
When you want to take the output of a function and pass it as the input to another function, use the pipe (`%>%`). Don't write ugly, inscrutable code like this, where multiple functions are wrapped around other functions.
**War-crime bad: long lines without pipes**
```{r, eval = FALSE}
young_qual_income<-summarise(group_by(filter(rename(data,gender=s801LHSAA,uni_attainment=high.ed),income>0,age>=25&age<=34),uni_attainment),mean_income=mean(income,na.rm=TRUE))
```
Writing clear code chunks, where functions are strung together with a pipe (`%>%`), makes your code much more expressive and easier to read and understand. This is another reason to favour R over something like Excel, which pushes people to piece together functions into Frankenstein's monsters like this:
```{r, eval=FALSE}
=IF($G16 = "All day", INDEX(metrics!$D$8:$H$66, MATCH(INDEX(correspondence!$B$2:$B$23, MATCH('convert to tibble'!M$4, correspondence!$A$2:$A$23, 0)), metrics!$B$8:$B$66, 0), MATCH('convert to tibble'!$E16, metrics!$D$4:$H$4, 0)), "NA")
```
I just threw up in my mouth a little bit.
The pipe `%>%` can make code easier to write and read. But it can also create the temptation to string together lots and lots of functions into one block of code, which makes things harder to read and understand.
Resist the urge to use the pipe to make code blocks too long. A block of code should generally do one thing, or a small number of things.
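One way to keep blocks focused is to break a long pipeline into a couple of well-named intermediate objects. For example, the `young_qual_income` pipeline above could be split like this:
```{r, eval = FALSE}
# Step 1: tidy up the raw data
young_graduates <- data %>%
  rename(gender = s801LHSAA,
         uni_attainment = high.ed) %>%
  filter(income > 0,
         age >= 25 & age <= 34)

# Step 2: summarise income by gender and attainment
young_qual_income <- young_graduates %>%
  group_by(gender, uni_attainment) %>%
  summarise(mean_income = mean(income, na.rm = TRUE))
```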
## Omit needless code
Don't retain code that ultimately didn't lead anywhere. If you produced a graph that ended up not being used, don't keep the code in your script - if you want to save it, move it to a subfolder named 'archive' or similar. Your code should include the steps needed to go from your raw data to your output - and not extraneous steps. If you ask someone to QC your work, they shouldn't have to wade through 1000 lines of code just to find the 200 lines that are actually required to produce your output.
When you're doing data analysis, you'll often give R interactive commands to help you understand what your data looks like. For example, you might inspect a data frame with `View(mydf)` or `str(mydf)`. This is fine, and often necessary, when you're doing your analysis. **Don't keep these commands in your script**. These types of commands should usually be entered straight into the R console, not saved in a script. If they're in your script, delete them.