Writing good quality code makes life easier for other people and your future self. It can also reduce the risk of errors (h/t to Best Coding Practices for R by Vikram Singh Rawat for much of this content).
Work within an RStudio Project.
- RStudio Projects act as your working directory, where the root is the directory created, or chosen, when setting up a new project.
- Once you create a project, it is easier to manage your files and folders, and easier to share them with somebody else.
- With bigger projects, consider creating sub-folders for e.g. code (.R, .Rmd), data (.csv, .data, .rds) and outputs (.docx, .png, .pdf).
Once you have arranged the files and folders in a logical way, the code itself should be arranged so that it is easy to parse. Always remember: code is read more often than it is written.
"When you feel the need to write a comment, first try to refactor the code so that any comment becomes superfluous." – Martin Fowler, Refactoring
Do:
- Explain why the code is doing what it is doing. This makes it easier for users to know what the code is for, e.g. `data %>% select(1:161) # column 161 is where the data stops and it becomes footnotes`
Don't:
- Just explain what the code is doing, e.g. `data %>% select(1:161) # selecting columns`
- Keep commented-out, just-in-case code that isn't needed
- Include #TODO notes within the code. Keep these separately in a README or in the GitHub issues tab
- Rely on (un)commenting code to change behaviour. You can quickly lose track!
RStudio lets you create a section with Ctrl + Shift + R, or you can create one by adding four dashes (-) or four hash symbols (#) after a comment. The same applies in R Markdown. For example:
```
# Load data ----
# Data wrangling ####
```
You can jump between sections using Shift + Alt + J, and fold them at will. This helps not only with navigation but also with maintaining the layout of the entire script.
You can also create sub-sections by adding extra hash symbols at the start of a section comment:
```
## some comment ----
### another comment ----
#### and yet another comment ----
```
When you write code, there are standard practices used across domains that you should follow. Set up your script in this order:
- Call your libraries.
- Set all default variables or global options and all the path variables.
- Source all the code.
- Call all data files.
This coherence keeps all your code easy to find.
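As a rough illustration of that ordering, the top of a script might look like the sketch below. The option value, folder names and file names (`R/functions.R`, `data/survey.csv`) are hypothetical, and the `file.exists()` guards are only there so the sketch runs outside a real project.

```r
# 1. Call your libraries ----
library(stats)
library(utils)

# 2. Set global options and path variables ----
options(digits = 4)                       # hypothetical option choice
code_dir <- "R"                           # hypothetical folder names
data_dir <- "data"
functions_path <- file.path(code_dir, "functions.R")
survey_path <- file.path(data_dir, "survey.csv")

# 3. Source all the code ----
if (file.exists(functions_path)) {
  source(functions_path)
}

# 4. Call all data files ----
survey <- if (file.exists(survey_path)) {
  read.csv(survey_path)
} else {
  NULL  # placeholder so the sketch still runs without the file
}
```

Keeping the paths in named variables near the top means a moved folder only needs changing in one place.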
Indentation makes code readable. No matter what language you work in, your code should be properly indented so that the nature of the code is understandable.
```
foo <- function(first_arg, second_arg, third_arg){
create_file <- readxl::read_excel(path = first_arg, sheet = second_arg, range = third_arg)
}
```
```
foo <- function(
  first_arg, second_arg, third_arg
) {
  create_file <- readxl::read_excel(path = first_arg,
                                    sheet = second_arg,
                                    range = third_arg)
}
```
```
y=ts(data=c(23,391,728,512,10),start=2010)
```
```
y = ts(data = c(23, 391, 728, 512, 10),
       start = 2010)
```
In RStudio you can press Ctrl + Shift + A to autoformat your code.
"Code is read more often than it is written." — Guido van Rossum (creator of Python)
"Programs are meant to be read by humans and only incidentally for computers to execute."
— Donald Knuth, The Art of Computer Programming
Writing code that humans can understand is an investment in maintainable and reusable code. If it is easy to read, maintenance and edits will be much quicker in the future, and it will be much easier for other people to work with your code. The shortest, most efficient code for the computer is often not the best code for human readability.
Proper naming conventions help collaboration in big teams and make the code easier to maintain. The three most widely used conventions amongst programming communities are:
- camelCase: Names start with a lower-case letter and every subsequent word starts with an upper-case letter
- PascalCase: Similar to camelCase, but the first letter of the name is also upper case
- snake_case: All names are lower case with an underscore between words
Whichever convention you choose, make sure to stick with it for the duration of your project.
`height_in_metres()` is better than `converter()` for a function that converts height into metres. This makes it more obvious what your code is doing and makes your code more readable.
Most of the data we read is from either an Excel file or a poorly designed database, thus we see column names with spaces and dots and other rogue characters. Remember these rules for column names:
- Stick to a consistent naming convention (see above)
- Assume everything is case-sensitive.
- Do not use special characters ever.
- Do not add spaces.
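One way to apply these rules to imported data is to clean the column names once, straight after reading the file. The helper below is a minimal base R sketch: `clean_column_names()` is a hypothetical name rather than a function from any package, the example data are made up, and it assumes snake_case is the chosen convention.

```r
clean_column_names <- function(df) {
  cleaned <- tolower(names(df))
  # Replace every run of non-alphanumeric characters (spaces, dots, symbols) with "_"
  cleaned <- gsub("[^a-z0-9]+", "_", cleaned)
  # Drop any leading or trailing underscores left behind
  cleaned <- gsub("^_+|_+$", "", cleaned)
  names(df) <- cleaned
  df
}

# Made-up data with the kind of names an Excel export might produce
raw <- data.frame(
  `Local Authority` = c("Fife", "Moray"),
  `Total.Pop (2021)` = c(1, 2),
  check.names = FALSE
)

names(clean_column_names(raw))
# "local_authority" "total_pop_2021"
```

The sketch does not handle names that start with a digit or that become duplicates after cleaning, so those cases would still need checking by hand.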
In R you can use numbers to refer to columns, but (to paraphrase a noted scholar) just because you can doesn’t mean you should.
```
# In base R
mtcars[,c(1,3:5,8)]
# Using tidyverse
mtcars %>% select(1, 3:5, 8)
```
```
# In base R
mtcars[, c("mpg", "disp", "hp", "drat", "vs")]
# Using tidyverse
mtcars %>%
  select(mpg, disp, hp, drat, vs)
```
Do:
- Use `case_when()` from the tidyverse collection for conditional statements. This is a more readable way of dealing with many `ifelse` statements. Standard `if_else` statements can be very useful but become hard to understand if too many are used.
Don't:
- Use deeply nested `for` loops or `if_else` statements
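To illustrate the difference, the sketch below bands a numeric vector first with nested `ifelse()` calls and then with `dplyr::case_when()`. It assumes the dplyr package is installed; the `age` values and band labels are made up for the example.

```r
library(dplyr)

age <- c(4, 17, 35, 70)

# Nested ifelse(): works, but each extra condition adds another level
band_nested <- ifelse(age < 16, "child",
                      ifelse(age < 65, "working age", "pension age"))

# case_when(): conditions are checked in order, one per line
band_case_when <- case_when(
  age < 16 ~ "child",
  age < 65 ~ "working age",
  TRUE     ~ "pension age"
)

# Both approaches give the same result
identical(band_nested, band_case_when)
```

Adding a fourth band to the `case_when()` version is a one-line change, whereas the nested version needs another level of brackets.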
The default for Scottish Government is the Tidyverse style guide.
Lint your code:
- Linting: The automated checking of your source code for programmatic and stylistic errors
- The lintr package checks your code for style, syntax errors and possible semantic issues
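As a small illustration, lintr can be run on a whole file or on a string of code. The sketch assumes the lintr package is installed and uses its default linters; the file name `analysis.R` is hypothetical.

```r
library(lintr)

# Lint a snippet directly. It uses "=" for assignment and has no spaces
# around operators or after commas, so the default linters should flag it.
lints <- lint(text = "y=ts(data=c(23,391,728,512,10),start=2010)")
print(lints)

# Lint a whole script (hypothetical file name)
# lint("analysis.R")
```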
Do:
- Use relative file paths
- Use the here package: `here()` finds your project files based on the current working directory at the time when the package is loaded
Don't:
- Use absolute file paths, as this reduces the reproducibility of the code
  - It won't work on another user's computer
  - It won't work if any files move
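As a sketch of the relative-path approach, `here::here()` builds a full path from the project root. It assumes the here package is installed; the `data` folder and `survey.csv` file are hypothetical.

```r
library(here)

# Build the path from the project root rather than typing it out in full
survey_path <- here("data", "survey.csv")
survey_path

# The same line works on any machine that has a copy of the project folder:
# survey <- read.csv(survey_path)
```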
Do:
- Ideally keep scripts to fewer than 250 lines. This makes them more manageable
- Break up scripts and `source()` sections like data processing, and variable or function assigning
  - For example, break up the UI, Server & Global sections of a large Shiny app
Don't:
- Keep everything for a large project in one script
"You should consider writing a function whenever you've copied and pasted a block of code more than twice." – H. Wickham (our lord and saviour).
Functions allow you to automate common tasks rather than repeatedly writing the same code. R for Data Science provides a good explanation for using functions.
(tl;dr: Don't Repeat Yourself (DRY))
Do:
- Use several smaller functions rather than one large one. This includes small, well-named helper functions
- Use built-in functions to simplify checks: `if (is.numeric(x))` is better than `if (class(x) == "numeric" || class(x) == "integer")`
- Use the double colon when calling functions if you have not already loaded a library, e.g. `readxl::read_xlsx()`
  - This makes it more obvious where the function is coming from
  - And avoids ambiguous code when two packages have a function with the same name
Don't:
- Repeat yourself
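A minimal sketch of these points, using made-up data: the scaling calculation is written once as a small, well-named helper (`rescale_01()` is a hypothetical name) instead of being copied for each column, and `is.numeric()` replaces the manual class comparison.

```r
# Small helper: rescale a numeric vector to the range 0 to 1
rescale_01 <- function(x) {
  if (!is.numeric(x)) {          # TRUE for both integer and double vectors
    stop("`x` must be numeric")
  }
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- data.frame(a = c(1, 5, 9), b = c(10L, 20L, 40L))

# Apply the helper to every column instead of repeating the calculation
df[] <- lapply(df, rescale_01)
df
```

If the calculation ever needs to change, for example to handle a constant column, there is only one place to edit.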
Automate checks on new data so you don't miss anything unexpected, especially if you expect the data to be updated.
For example, automate checks for:
- Outliers
- NAs
- Class
- Expected column names
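These checks could be sketched in base R as below. `check_new_data()`, the expected column names and the plausible range for `value` are all hypothetical and would need adapting to the real data.

```r
check_new_data <- function(df) {
  expected_cols <- c("area", "year", "value")   # hypothetical column names

  # Expected column names: stop if any are missing
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }

  # Class: stop if a column has an unexpected type
  stopifnot(is.character(df$area), is.numeric(df$year), is.numeric(df$value))

  # NAs: warn rather than stop, so the analyst can decide what to do
  if (any(is.na(df[expected_cols]))) {
    warning("Data contains NA values")
  }

  # Outliers: flag values outside a plausible range (hypothetical limits)
  outliers <- which(df$value < 0 | df$value > 1000)
  if (length(outliers) > 0) {
    warning("Possible outliers in rows: ", paste(outliers, collapse = ", "))
  }

  invisible(df)
}

# Made-up example data that passes every check
new_data <- data.frame(
  area = c("Fife", "Moray"),
  year = c(2023, 2023),
  value = c(120, 95),
  stringsAsFactors = FALSE
)
check_new_data(new_data)
```

Running the function at the top of the pipeline each time the data are refreshed means a renamed column or an unexpected value is caught before it reaches the outputs.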
Coding in the open encourages:
- The use of good code practice so others can read your code
- Collaboration
- Code reviews and pull requests
Do:
- Code in the open if you can, e.g. use Scottish Government Analysis
Don't:
- Share sensitive information such as unpublished data, API keys or passwords
Here is the Gov Data Science guidance on writing READMEs.
READMEs should help the user:
- Understand what the project is
- Learn how to use the project
A README should include:
- Contact details
- How to run the code
- A licence (Scottish Government default is Open Government Licence)
- An estimate of run time if it takes longer than a couple of minutes
Make a note of the version number of each package you use in case future updates to a package break the functionality of your code. That way, anyone looking at the code in the future can see which versions it last worked with and begin troubleshooting from there.
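A minimal way to record versions using base R only is sketched below. The output file name `session_info.txt` is hypothetical, and the sketch writes to a temporary folder so that it runs anywhere; in a real project the file would sit alongside the code or outputs.

```r
# Version of a single package (here "stats", which ships with R)
packageVersion("stats")

# R version, platform and all attached packages with their versions
info <- sessionInfo()
print(info)

# Save a copy so the versions are recorded with the project
# (hypothetical file name, written to a temporary folder for this sketch)
info_path <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(print(info)), info_path)
```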
All content is available under the Open Government Licence v3.0, except where otherwise stated.