---
title: ''
output:
  html_document:
    toc: true
    theme: united
---
# Understandable {#understand}
It is crucial that we write code in a way that can be understood easily by others (and our future selves) and passed on to other staff with minimal additional instruction.
## Code structure {#structure2}
There should be clear logical separation of parameters, assumptions, data and code, with appropriate documentation.
### Data {-}
* It should be easy for others to change parameters and data without needing to understand the full codebase.
* Where possible, use lookup/reference files rather than hard-coding variables. An alternative approach is to store all parameters in one script, making it easy for the variables to be located and updated.
* Small data files that are not MoJ data can be committed to the GitHub repo (e.g. parameters, CSVs containing assumptions, mock data for unit tests, lookup tables) - these should be stored in a 'data' folder within the project folder.
* Prefer text formats (e.g. CSV) to Excel.
* Please note that unpublished MoJ data should never be committed to your GitHub repository - even a 'private' repo isn't secure. Instead, data should be stored in an [S3 bucket](https://user-guidance.services.alpha.mojanalytics.xyz/amazon-s3.html#amazon-s3-2) or in Athena.
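
As a minimal sketch of keeping parameters out of the code, the Python below reads them from a small lookup CSV. The `load_parameters` helper, the `name` and `value` column names, the example parameters and the `data/parameters.csv` path mentioned in the comment are all hypothetical and chosen only for illustration.

```python
import csv
import io


def load_parameters(csv_file):
    """Read name/value pairs from an open CSV file into a dict of floats."""
    reader = csv.DictReader(csv_file)
    return {row["name"]: float(row["value"]) for row in reader}


# In a real project this text would live in a file such as
# data/parameters.csv (a hypothetical path) and be opened with
# open(path, newline="", encoding="utf-8").
example_csv = io.StringIO("name,value\ndiscount_rate,0.035\nforecast_years,5\n")

parameters = load_parameters(example_csv)
print(parameters["discount_rate"])
```

With this layout, someone who needs to change a parameter edits the CSV and never has to read the analysis code.
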
### Documentation {-#readme}
* Your project should include a README - by writing a README for your project you're helping others (and your future self) to understand and run your project. Find our template README [here](https://github.com/moj-analytical-services/our-coding-standards/blob/master/README_template.md).
* You should add a description to your GitHub repository and tag it with appropriate [topics](https://help.github.com/en/github/administering-a-repository/classifying-your-repository-with-topics). This will allow your project to be discoverable and reusable by others. GitHub will also automatically tag it according to the [language](https://help.github.com/en/github/creating-cloning-and-archiving-repositories/about-repository-languages) used.
* Any functions you create should have [appropriate documentation](#functions).
* Try to make your Git commit messages as informative as possible.
* Your code should be appropriately commented. Comments are for explaining why something is needed, not how it works. Note that there are no hard and fast rules on how to do this - [this blog](https://blog.ndepend.com/correct-way-comment-code/) gives an overview of some of the differing opinions.

> "Good code is its own best documentation. As you're about to add a comment, ask yourself, 'How can I improve the code so that this comment isn't needed?' Improve the code and then document it to make it even clearer." - Steve McConnell
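
To illustrate the difference between a "why" comment and a "how" comment, here is a small Python sketch. The `pages_needed` function and its arguments are made up for this example.

```python
import math


def pages_needed(record_count, page_size):
    # A "how" comment such as "divide and apply ceil" would only repeat the code.
    # The "why" is worth stating: a final, partly filled page still has to be
    # produced, so we round up rather than down.
    return math.ceil(record_count / page_size)


print(pages_needed(101, 50))
```
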
### Abstractions {-#functions}
Abstractions (e.g. functions, packages, modules etc.) should be used where possible. This makes code easier to understand, maintain and extend.
It will often be the case that there is a pre-existing package or module that contains the functions you need for your project. The best way to find these is usually through a quick Google search and, if you're struggling to find what you're looking for, it's worth asking on the relevant Slack channel (e.g. [#r](https://app.slack.com/client/T1PU1AP6D/C1PUCG719) and [#python](https://app.slack.com/client/T1PU1AP6D/C1Q09V86S) in the ASD workspace).
If you end up using a piece of code three times, it's probably worth turning it into a function and moving it into a separate script. For more information on how to write functions, see [here][211]. As a general rule, functions should be less than about 50 lines long.
All non-trivial functions should be documented using the programming language’s accepted standard:
* For Python follow [PEP8](https://www.python.org/dev/peps/pep-0008/) and particularly [PEP257](https://www.python.org/dev/peps/pep-0257/).
* For R, use [roxygen2](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html).
If your functions are short and well documented, there is often little need for additional code comments. See the [Tidyverse Style Guide](https://style.tidyverse.org/functions.html#naming) for some basic guidance on how to make your functions understandable.
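
As a rough illustration of a documented Python function, the sketch below uses a docstring with a one-line summary followed by more detail, as PEP 257 recommends. The `Args` / `Returns` / `Raises` layout is one common convention rather than something PEP 257 requires, and `percentage_change` is a hypothetical function.

```python
def percentage_change(old_value, new_value):
    """Return the percentage change from old_value to new_value.

    Args:
        old_value: The baseline figure; must not be zero.
        new_value: The figure being compared with the baseline.

    Returns:
        The change as a percentage of old_value.

    Raises:
        ValueError: If old_value is zero, because the change is undefined.
    """
    if old_value == 0:
        raise ValueError("old_value must not be zero")
    return (new_value - old_value) / old_value * 100


print(percentage_change(80, 100))
```
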
[211]:https://github.com/moj-analytical-services/writing_functions_in_r
## Code style {#style}
There are often many ways of tackling a given problem. As a team, it makes sense to standardise our approach, not because one approach is necessarily better than all others, but because collaboration is easier if there is more commonality in our approaches.
### Defaults {-#defaults}
This section sets out sensible defaults which you are expected to follow. They are not strict rules, but you will be expected to explain the benefits of alternative approaches if you want to do something different.
+-------------------------+---------------------------------------------------------------+
| Language | Defaults |
+=========================+===============================================================+
| R | * Follow the [Tidyverse Style Guide][213] |
| | * Default to packages from the [Tidyverse](http://tidyverse.org/), because they have been carefully designed to work together effectively as part of a modern data analysis workflow. More info can be found here: [R for Data Science by Hadley Wickham](http://r4ds.had.co.nz). For example: |
| |     * Prefer tibbles to data.frames |
| |     * Use ggplot2 rather than base graphics |
| |     * Use the pipe `%>%` where it makes code easier to read, but not everywhere, e.g. see [here](https://twitter.com/hadleywickham/status/603883121197514752). |
| |     * Prefer `purrr` to the `apply` family of functions. See [here](http://r4ds.had.co.nz/iteration.html#the-map-functions) |
| | * Use the package name when calling a function, for example `dplyr::mutate()` rather than just `mutate()` |
| | * R Packages are the [fundamental unit of reproducible R code.](http://r-pkgs.had.co.nz/) |
+-------------------------+---------------------------------------------------------------+
| Python | * Follow [PEP8](https://www.python.org/dev/peps/pep-0008/) |
| | * Use Python 3 |
| | * Use pandas for data analysis |
| | * Use `loc` and `iloc` to write to data frames |
| | * Use Altair for basic data visualisation |
| | * Use Scikit Learn for machine learning |
| | * Use SQLAlchemy and pandas for database interactions, rather than writing your own SQL |
+-------------------------+---------------------------------------------------------------+
| Encoding and CSVs | Use Unicode. This means you should convert inputs that include non-ASCII characters to Unicode as early as possible in your data processing workflow. If you are outputting to text files, these should be encoded in UTF-8. |
+-------------------------+---------------------------------------------------------------+
| SQL | Use Postgres or SQLite where possible, rather than other SQL database backends. |
+-------------------------+---------------------------------------------------------------+
| GIS | * Use PostGIS as your GIS backend |
| | * Prefer conducting your GIS analysis in code, e.g. using SQL, rather than point and click in a GUI |
| | * Use QGIS if you need a GUI. |
+-------------------------+---------------------------------------------------------------+
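
The encoding default in the table can be sketched in Python as follows: decode incoming bytes to text at the input boundary, work with text throughout, and state UTF-8 explicitly when writing. The sample data, the file name `towns.csv` and the assumption that the input arrives as Latin-1 are illustrative only; in practice the source encoding has to be known or detected.

```python
import csv
import io
import os
import tempfile

# Bytes as they might arrive from a legacy system (Latin-1 is an assumption).
raw_bytes = "name,town\nZoë,Orléans\n".encode("latin-1")

# Decode to text (Unicode) as early as possible.
text = raw_bytes.decode("latin-1")
rows = list(csv.reader(io.StringIO(text)))

# Write text output with an explicit UTF-8 encoding.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "towns.csv")
    with open(path, "w", encoding="utf-8", newline="") as output_file:
        csv.writer(output_file).writerows(rows)
    with open(path, "rb") as output_file:
        utf8_bytes = output_file.read()

print(utf8_bytes.decode("utf-8"))
```
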
In addition to the above coding defaults, the following actions will help you to write understandable code.
### Naming conventions {-#names}
* Use meaningful names - well-named functions and variables can remove the need for a comment and make life a little easier for other readers, including your future self. Avoid meaningless names like ‘obj’ / ‘result’ / ‘foo’.
* Don’t be cute or jokey when naming things.
* Use single-letter variables only where the letter represents a well-known mathematical or physical quantity (e.g. `E = mc^2`), or where their meaning is otherwise clear.
### Clear and concise code {-#ccc}
* Choose clarity over cleverness - use advanced language tricks with care.
* Avoid unnecessary repetition in your code. For example, if you end up using the same piece of code three times, it's probably worth turning it into a function.
* Less code is usually better - but not at the expense of clarity.
* Use code comments well (see [above](#readme)).
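
As a small sketch of the points above, the Python below replaces a cleaning step that would otherwise be repeated for every column with one clearly named function. The function name and the column names are hypothetical.

```python
def standardise_column_name(raw_name):
    """Return a lower-case, underscore-separated version of a column name."""
    return raw_name.strip().lower().replace(" ", "_")


raw_column_names = [" Court Name", "Case Type ", "Receipt Date"]

# One call per column, rather than the same strip/lower/replace chain
# copied out three times.
column_names = [standardise_column_name(name) for name in raw_column_names]
print(column_names)
```

The descriptive names (`standardise_column_name`, `raw_column_names`) carry the meaning that a comment would otherwise have to supply.
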
### Linters {-#linter}
A linter is a tool that analyses code to check for programmatic and stylistic errors. You should apply a linter to your code, so that our coding style is consistent across projects and the code is easier for others to understand.
In RStudio, the keyboard shortcut 'ctrl+shift+A' will reformat your code and automate some of the process of passing the linter. If you apply the linter as you work, rather than at the end, you will find it much easier to write code that passes the linter first time.
* For R, use [lintr][212] and follow the [Tidyverse Style Guide][213].
* For Python, use [pylint][214] and follow [PEP8][215].
If possible, set up your linters and tests to run automatically on all pull requests, using [GitHub Actions](https://help.github.com/en/actions/automating-your-workflow-with-github-actions).
[212]:https://cran.r-project.org/web/packages/lintr/readme/README.html
[213]:https://style.tidyverse.org/index.html
[214]:http://pylint.pycqa.org/en/latest/
[215]:https://www.python.org/dev/peps/pep-0008/
### Error messages {-#errors}
Errors will occur, so write your error messages in a way that provides useful information to end users and people working on the code.
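
One way to do this, sketched in Python, is to say what was expected, what was actually received and where to look. The `read_age` function, the 'age' column and the 0 to 120 range are assumptions made purely for the example.

```python
def read_age(raw_value):
    """Convert a raw age field to an int, failing with an informative message."""
    try:
        age = int(raw_value)
    except ValueError:
        # Replace the generic int() error with one that names the field,
        # shows the offending value and suggests where to look.
        raise ValueError(
            f"Age must be a whole number, but got {raw_value!r}. "
            "Check the 'age' column of the input file for text or blank values."
        ) from None
    if not 0 <= age <= 120:
        raise ValueError(f"Age must be between 0 and 120, but got {age}.")
    return age


print(read_age("42"))
```
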