
50 normalise book #220

Merged: 40 commits, Mar 10, 2025
Commits
2a8685b
added oxford commas for consistency
beth-e-jones Oct 21, 2024
d83f87c
punctuation and typo changes
beth-e-jones Oct 21, 2024
e2015a2
minor punctuation edits on Principles
beth-e-jones Oct 23, 2024
fd20a7d
punctuation and grammatical edits
beth-e-jones Oct 23, 2024
a7730df
punctuation consistency in Readable Code
beth-e-jones Oct 23, 2024
27f0d8e
consistency edits for Structure chapter
beth-e-jones Oct 23, 2024
276f42d
minor edits on code documentation chapter
beth-e-jones Oct 24, 2024
680e77f
punctuation edits as far as Vignettes in project_docs
beth-e-jones Oct 24, 2024
8a56c84
edits for continuity in punctuation
beth-e-jones Oct 24, 2024
ebb8bf4
minor edits for punctuation and style consistency
beth-e-jones Oct 24, 2024
b61174b
consistency in formatting for data chapter
beth-e-jones Oct 24, 2024
8184c8e
removed duplicated paragrapjs
beth-e-jones Oct 24, 2024
3f3919d
consistency in style and punctuation for CI chapter
beth-e-jones Oct 24, 2024
4b1eb79
Merge branch 'main' of https://github.com/best-practice-and-impact/qa…
beth-e-jones Oct 29, 2024
6299459
reduced passive sentences in introduction
beth-e-jones Oct 29, 2024
3fdd502
removed several passive sentences in glossary
beth-e-jones Oct 29, 2024
4fb416d
removed passive sentences in managers guide
beth-e-jones Oct 29, 2024
7c9a4bb
removed passive sentences in Principles chapter
beth-e-jones Oct 30, 2024
00fd4a1
reduced passive sentences in Modular code chapter
beth-e-jones Oct 30, 2024
a152fcc
reduced passive sentences in Readable Code
beth-e-jones Oct 30, 2024
5495130
edited passive sentences in modular code section
beth-e-jones Oct 30, 2024
fddbd6d
reduced passive sentences in Project Structure chapter
beth-e-jones Oct 31, 2024
09fa804
reduced passive sentences in Documenting Code
beth-e-jones Oct 31, 2024
cd49e18
reduced passive sentences in project documentation section
beth-e-jones Oct 31, 2024
203cfba
reduced passive sentences in Documenting Projects section
beth-e-jones Oct 31, 2024
c831519
editing passive sentences in version control section
beth-e-jones Oct 31, 2024
1215d1e
editing passive sentences in version control (done to line 600)
beth-e-jones Oct 31, 2024
227b773
reduced passive sentences in version control
beth-e-jones Nov 4, 2024
442f339
reduced passive sentences in configuration section
beth-e-jones Nov 4, 2024
7df59bc
reduced passive sentences in data management section
beth-e-jones Nov 4, 2024
5e156ee
reduced passive sentences in peer review chapter
beth-e-jones Nov 4, 2024
c6985b3
edited passive voice and simplified some parts in automating qa chapter
ellie-o Nov 20, 2024
a1f04b2
Merging branch to pick up testing chapter edits so book can be normal…
ellie-o Jan 13, 2025
71ec243
Reduced passive voice in testing chapter. Some grammar edits and simp…
ellie-o Jan 15, 2025
4c055fe
converted some passive sentences to active in testing chapter
beth-e-jones Jan 20, 2025
0879cac
fix broken internal link in version control chapter
sarahcollyer Feb 20, 2025
d37b251
remove reference to broken link
sarahcollyer Feb 20, 2025
1cd0ebb
Proof read all chapters Martin R
zarbspace Feb 27, 2025
526f34d
Fix comments highlighted in review
sarahcollyer Mar 10, 2025
0ceb1a7
Update config to work with auto generated header anchors (#221)
sarahcollyer Mar 10, 2025
1 change: 1 addition & 0 deletions book/_config.yml
@@ -26,6 +26,7 @@ sphinx:
todo_include_todos: true
language: en
html_show_copyright: false
myst_heading_anchors: 3

latex:
latex_documents:
6 changes: 3 additions & 3 deletions book/checklist_higher.md
@@ -101,7 +101,7 @@ Quality assurance checklist from [the quality assurance of code for analysis and
### Testing

- Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/).
- Code based tests are run regularly.
- Code based tests are run regularly and after every significant change to the code.
- Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
- Informal tests are recorded near to the code.
- Stakeholder or user acceptance sign-offs are recorded near to the code.
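As a sketch of the first item in this checklist, a minimal pytest-style unit test might look like the following. The function under test, `add_vat`, is a hypothetical stand-in, not an example from the book.

```python
# Hypothetical stand-in function and pytest-style unit tests.
# pytest would discover and run the test_ functions automatically;
# they are called directly here so the example is self-contained.
def add_vat(price, rate=0.2):
    """Return the price including VAT, rounded to pennies."""
    return round(price * (1 + rate), 2)

def test_add_vat_default_rate():
    assert add_vat(100) == 120.0

def test_add_vat_zero_rate():
    assert add_vat(100, rate=0) == 100.0

test_add_vat_default_rate()
test_add_vat_zero_rate()
print("tests passed")
```

Running `pytest` against a file containing these functions would collect and report both tests, which is what "unit tested as code" means in practice.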
@@ -117,7 +117,7 @@ Quality assurance checklist from [the quality assurance of code for analysis and
- Required libraries and packages are documented, including their versions.
- Working operating system environments are documented.
- Example configuration files are provided.
- Where appropriate, code runs independent of operating system (e.g. suitable management of file paths).
- Where appropriate, code runs independently of the operating system (for example there is suitable management of file paths for different operating systems).
- Dependencies are managed separately for users, developers, and testers.
- There are as few dependencies as possible.
- Package dependencies are managed using an environment manager such as
@@ -250,7 +250,7 @@ Quality assurance checklist from

- [ ] Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and
[`testthat` for R](https://testthat.r-lib.org/).
- [ ] Code based tests are run regularly.
- [ ] Code based tests are run regularly and after every significant change to the code base.
- [ ] Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
- [ ] Informal tests are recorded near to the code.
- [ ] Stakeholder or user acceptance sign-offs are recorded near to the code.
4 changes: 2 additions & 2 deletions book/checklists.md
@@ -3,12 +3,12 @@
This section aims to provide a checklist for quality assurance of analytical projects in government.

As per the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government),
quality assurance should be proportional to the complexity and risk of your analysis.
quality assurance should be proportionate to the complexity and risk of your analysis.
With this in mind, we have provided checklists for three levels of quality assurance.

We recommend that you consider the risk and complexity associated with your project.
Given this assessment, you should select and tailor the checklists that we have provided.
The values and risk tolerance varies between government departments, so it is important that these are considered when deciding what quality assurance is adequate.
Risk tolerance varies between government departments, so it is important that you consider the operational context for the code when deciding what quality assurance is adequate.
You may choose to select elements from each level of quality assurance to address the specific risks associated with your work.


120 changes: 61 additions & 59 deletions book/code_documentation.md

Large diffs are not rendered by default.

81 changes: 39 additions & 42 deletions book/configuration.md
@@ -2,15 +2,15 @@

Configuration describes how your code runs when you execute it.

In analysis, we may want to run our analysis code using different inputs or parameters.
In analysis, we often want to run our analysis code using different inputs or parameters.
And we likely want other analysts to be able to run our code on different machines, for example, to reproduce our results.
This section describes how we can define analysis configuration that is easy to update and can remain separate from the logic in our analysis.


## Basic configuration

Configuration for your analysis code should include high level parameters (settings) that can be used to easily adjust how your analysis runs.
This might include paths to input and output files, database connection settings and model parameters that are likely to be adjusted between runs.
This might include paths to input and output files, database connection settings, and model parameters that are likely to be adjusted between runs.

In early development of our analysis, let's imagine that we have a script that looks something like this:

@@ -36,7 +36,7 @@ prediction.to_csv("outputs/predictions.csv")
```{code-tab} r R
# Note: this is not an example of good practice
# This is intended as example of what early pipeline code might look like
data <- read.csv("C:/a/very/specific/path/to/input_data.csv")
data <- utils::read.csv("C:/a/very/specific/path/to/input_data.csv")

set.seed(42)
split <- caTools::sample.split(data, SplitRatio = .3)
@@ -47,10 +47,9 @@ test_data <- data[!split, ]
model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "model.frame")
# model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "glm.fit")


prediction <- predict(model, test_data, type = "response")

write.csv(prediction, "outputs/predictions.csv")
utils::write.csv(prediction, "outputs/predictions.csv")

```

@@ -60,23 +59,23 @@ Here we're reading in some data and splitting it into subsets for training and t
We use one subset of variables and outcomes to train our model and then use the subset to test the model.
Finally, we write the model's predictions to a `.csv` file.

The file paths that are used to read and write data in our script are particular to our working environment.
These files and paths may not exist on an other analyst's machine.
As such, other analysts would need to read through the script and replace these paths in order to run our code.
The file paths we use to read and write data in our script are particular to our working environment.
These files and paths may not exist on another analyst's machine.
As such, to run our code, other analysts need to read through the script and replace these paths.
As we'll demonstrate below, collecting flexible parts of our code together makes it easier for others to update them.

When splitting our data and using our model to make predictions, we've provided some parameters to the functions that we have used to perform these tasks.
Eventually, we might reuse some of these parameters elsewhere in our script (e.g. the random seed)
Eventually, we might reuse some of these parameters elsewhere in our script (e.g., the random seed)
and we are likely to adjust these parameters between runs of our analysis.
To make it easier to adjust these consistently throughout our script, we should store them in variables.
We should store them in variables to make it easier to adjust these consistently throughout our script.
We should also store these variables with any other parameters and options, so that it's easy to identify where they should be adjusted.

Note that in this example we've tried our model prediction twice, with different parameters.
We've used comments to switch between which of these lines of code runs.
This practice is common, especially when we want to make a number of changes when developing how our analysis should run.
However, commenting sections of code in this way makes it difficult for others to understand our code and reproduce our results.
Another analyst would not be sure which set of parameters was used to produce a given set of predictions, so we should avoid this form of ambiguity.
Below, we'll look at some better alternatives to storing and switching our analysis parameters.
We should avoid this form of ambiguity because another analyst would not be sure which set of parameters was used to produce a given set of predictions.
Below, we'll look at some better alternatives for storing and switching analysis parameters.

````{tabs}

@@ -122,7 +121,7 @@ test_split_proportion = .3
model_method = "glm.fit"

#analysis
data <- read.csv(input_path)
data <- utils::read.csv(input_path)

set.seed(random_seed)
split <- caTools::sample.split(data, SplitRatio = test_split_proportion)
@@ -134,7 +133,7 @@ model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), d

prediction <- predict(model, test_data, type = "response")

write.csv(prediction, output_path)
utils::write.csv(prediction, output_path)
```

````
@@ -144,32 +143,30 @@ We're able to use basic objects (like lists and dictionaries) to group related p
We then reference these objects in the analysis section of our script.

Our configuration could be extended to include other parameters, including which variables we're selecting to train our model.
However, it is important that we keep the configuration simple and easy to maintain.
Before moving aspects of code to the configuration it's good to consider whether it improves your workflow.
If it is something that is dependent on the computer that you are using (e.g. file paths) or is likely to change between runs of your analysis,
then it's a good candidate for including in your configuration.
However, we must keep the configuration simple and easy to maintain.
Before moving aspects of code to the configuration, consider whether it improves your workflow.
You should include things that are dependent on the computer that you are using (e.g., file paths) or are likely to change between runs of your analysis, in your configuration.
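The Python tab of the revised script is collapsed in this diff. A minimal sketch of the same pattern, using only stand-in data and the standard library (so the parameter values are illustrative, not the book's exact example), would be:

```python
# Illustrative sketch: configuration gathered at the top of the script.
# The parameter names mirror the R example; the data is a stand-in list.
import random

# Configuration: settings likely to change between runs or machines
input_path = "C:/a/very/specific/path/to/input_data.csv"  # placeholder path
output_path = "outputs/predictions.csv"                   # placeholder path
test_split_proportion = 0.3
random_seed = 42

# Analysis: reference the parameters rather than repeating literal values
random.seed(random_seed)
rows = list(range(10))  # stand-in for the loaded data
test_size = round(len(rows) * test_split_proportion)
test_rows = random.sample(rows, test_size)
train_rows = [r for r in rows if r not in test_rows]
print(len(train_rows), len(test_rows))  # 7 3
```

Because the seed and split proportion appear only once, at the top, changing them for a new run is a one-line edit rather than a hunt through the analysis logic.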


## Use separate configuration files

We can use independent configuration files to take our previous example one step further.
We can take our previous example one step further using independent configuration files.
We simply take our collection of variables, containing parameters and options for our analysis, and move them to a separate file.
As we'll describe in the following subsections, these files can be written in the same language as your code or other simple languages.
These files can be written in the same language as your code or other simple languages, as we'll describe in the following subsections.

Storing our analysis configuration in a separate file to the analysis code is a useful separation.
It means that we can version control our code based solely on changes to the overall logic - when we fix bugs or add new features.
We can then keep a separate record of which configuration files were used with our code to generate specific results.
We can easily switch between multiple configurations, by providing our analysis code with different configuration files.
We can easily switch between multiple configurations by providing our analysis code with different configuration files.

You may not want to version control your configuration file,
for example if it includes file paths that are specific to your machine or references to sensitive data.
In this case, you should include a sample or example configuration file, so that others can use this as a template to configure the analysis for their own environment.
It is key that this template is kept up to date, so that it is compatible with your code.
You may not want to version control your configuration file if it includes file paths that are specific to your machine or references to sensitive data.
In this case, include a sample or example configuration file, so others can use this as a template to configure the analysis for their own environment.
It is key to keep this template up to date, so that it is compatible with your code.


### Use code files for configuration

To use another code script as our configuration file, we can copy our parameter variables directly from our scripts.
We can copy our parameter variables directly from our scripts to use another code script as our configuration file.
Because these variables are defined in the programming language that our analysis uses, it's easy to access them in our analysis script.
In Python, variables from these config files can be imported into your analysis script.
In R, your script might `source()` the config file to read the variables into the R environment.
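The surrounding example is collapsed in this diff, but a hedged sketch of the Python approach is below. The file name `config.py` is an assumption, and the file is created in a temporary directory purely so the example is self-contained.

```python
# Sketch: a Python script used as a configuration file, then imported.
# "config.py" is an assumed name; it is written to a temporary directory
# here only so the example runs on its own.
import importlib.util
import pathlib
import tempfile

config_source = (
    'input_path = "input_data.csv"\n'
    'test_split_proportion = 0.3\n'
    'random_seed = 42\n'
)

config_file = pathlib.Path(tempfile.mkdtemp()) / "config.py"
config_file.write_text(config_source)

# Import the configuration module and access its variables by name
spec = importlib.util.spec_from_file_location("config", config_file)
config = importlib.util.module_from_spec(spec)
spec.loader.exec_module(config)

print(config.input_path, config.random_seed)  # input_data.csv 42
```

In an ordinary project this reduces to a plain `import config` with `config.py` sitting next to the analysis script.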
@@ -180,7 +177,7 @@ In R, your script might `source()` the config file to read the variables into th
Many other file formats can be used to store configuration parameters.
You may have come across data-serialisation languages (including YAML, TOML, JSON and XML), which can be used independently of your programming language.

If we were to represent our example configuration from above in YAML, this would look like:
If we represent our example configuration from above in YAML, it would look like this:

```yaml
input_path: "C:/a/very/specific/path/to/input_data.csv"
@@ -194,8 +191,8 @@ prediction_parameters:
max_v: 1000
```

Configuration files that are written in other languages may need to be read using relevant libraries.
The YAML example above could be read into our analysis as follows:
You can use relevant libraries to read configuration files that are written in other languages.
For example, we could read the YAML example into our analysis like this:

````{tabs}

@@ -221,7 +218,7 @@ data <- read.csv(config$input_path)
Configuration file formats like YAML and TOML are compact and human-readable.
This makes them easy to interpret and update, even without knowledge of the underlying code used in the analysis.
Reading these files in produces a single object containing all of the `key:value` pairs defined in our configuration file.
In our analysis, we can then select our configuration parameters using their keys.
We can then select our configuration parameters using their keys in our analysis.


## Use configuration files as arguments
@@ -231,8 +228,8 @@ Although this allows us to separate our configuration from the main codebase, we
This is not ideal, as for the code to be run on another machine the configuration file must be saved on the same path.
Furthermore, if we want to switch the configuration file that the analysis uses we must change this path or replace the configuration file at the specified path.

To overcome this, we can adjust our analysis script to take the configuration file path as an argument when the analysis script is run.
This can be achieved in a number of ways, but we'll discuss a minimal example here:
We can adjust our analysis script to take the configuration file path as an argument when the analysis script is run to overcome this.
We can achieve this in a number of ways, but we'll discuss a minimal example here:

````{tabs}

@@ -283,7 +280,7 @@ This means that we don't need to change our code to account for changes to the c

```{note}
It is possible to pass configuration options directly as arguments in this way, instead of referencing a configuration file.
However, use of configuration files should be preferred as they allow us to document which configuration
However, you should use configuration files as they allow us to document which configuration
has been used to produce our analysis outputs, for reproducibility.
```
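A hedged sketch of the minimal pattern, reading the configuration file path from the command-line arguments, is below; the script and file names are hypothetical.

```python
# Sketch: taking the configuration file path as a command-line argument.
# The script name "analysis.py" and the config file name are hypothetical.
import sys

def main(argv=None):
    # Expect the configuration file path as the first argument, e.g.:
    #   python analysis.py config.yaml
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        raise SystemExit("Usage: python analysis.py <config_file>")
    config_path = argv[1]
    # ...load the config file and run the analysis here...
    return config_path

# Simulate the command-line call for illustration
print(main(["analysis.py", "my_config.yaml"]))  # my_config.yaml
```

Switching configurations then means changing only the argument at the command line, not the code.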

@@ -293,17 +290,17 @@ has been used to produce our analysis outputs, for reproducibility.

Environment variables are variables that are available in a particular environment.
In most analysis contexts, our environment is the user environment that we are running our code from.
This might be your local machine or dedicated analysis platform.
This might be your local machine or an analysis platform.

If your code depends on credentials of some kind, these must not be written in your code.
Passwords and keys could be stored in configuration files, but there is a risk that these files may be included in [version control](version_control.md).
To avoid this risk, it is best to store this information in local environment variables.
If your code depends on credentials of some kind, do not write these in your code.
You can store passwords and keys in configuration files, but there is a risk that these files may be included in [version control](version_control.md).
To avoid this risk, store this information in local environment variables.

Environment variables can also be useful for storing other environment-dependent variables.
For example, the location of a database or a software dependency.
This might be preferred over a configuration file if very few other options are required by the code.
We might prefer this over a configuration file if the code requires very few other options.

In Unix systems (e.g. Linux and Mac), environment variables can be set in the terminal using `export` and deleted using `unset`:
In Unix systems (e.g., Linux and Mac), you can set environment variables in the terminal using `export` and delete them using `unset`:

```none
export SECRET_KEY="mysupersecretpassword"
@@ -317,9 +314,9 @@ setx SECRET_KEY "mysupersecretpassword"
reg delete HKCU\Environment /F /V SECRET_KEY
```

These can alternatively be defined using a graphical interface under `Edit environment variables for your account` in your Windows settings.
You can alternatively define them using a graphical interface under `Edit environment variables for your account` in your Windows settings.

Once stored in environment variables, these variables will remain available in your environment until they are deleted.
Once stored in environment variables, these variables will remain available in your environment until you delete them.

You can access this variable in your code like so:

@@ -337,4 +334,4 @@ my_key <- Sys.getenv("SECRET_KEY")

````
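The Python tab of the access example is collapsed in this diff; a sketch of the equivalent access via `os.environ` would be the following. The variable is set in-process here only so the example is self-contained; normally it would already exist in your shell environment.

```python
# Sketch: reading a credential from an environment variable in Python.
# The variable is set in-process only for illustration; in real use it
# comes from the shell, as shown with export/setx above.
import os

os.environ["SECRET_KEY"] = "mysupersecretpassword"

my_key = os.environ["SECRET_KEY"]  # raises KeyError if the variable is unset
# .get() returns a default instead of raising when the variable is missing
fallback = os.environ.get("HYPOTHETICAL_UNSET_KEY", "")

print(my_key, repr(fallback))  # mysupersecretpassword ''
```

Using `os.environ.get()` with a default is a common way to make the script fail gracefully, or fall back to a non-secret value, when the variable has not been set.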

It is then safer for this code to be shared with others, as it is not possible to acquire your credentials without access to your environment.
It is then safer for this code to be shared with others, as they can't acquire your credentials without access to your environment.