
50 normalise book #220

Merged: 40 commits, Mar 10, 2025
Commits
2a8685b
added oxford commas for consistency
beth-e-jones Oct 21, 2024
d83f87c
punctuation and typo changes
beth-e-jones Oct 21, 2024
e2015a2
minor punctuation edits on Principles
beth-e-jones Oct 23, 2024
fd20a7d
punctuation and grammatical edits
beth-e-jones Oct 23, 2024
a7730df
punctuation consistency in Readable Code
beth-e-jones Oct 23, 2024
27f0d8e
consistency edits for Structure chapter
beth-e-jones Oct 23, 2024
276f42d
minor edits on code documentation chapter
beth-e-jones Oct 24, 2024
680e77f
punctuation edits as far as Vignettes in project_docs
beth-e-jones Oct 24, 2024
8a56c84
edits for continuity in punctuation
beth-e-jones Oct 24, 2024
ebb8bf4
minor edits for punctuation and style consistency
beth-e-jones Oct 24, 2024
b61174b
consistency in formatting for data chapter
beth-e-jones Oct 24, 2024
8184c8e
removed duplicated paragrapjs
beth-e-jones Oct 24, 2024
3f3919d
consistency in style and punctuation for CI chapter
beth-e-jones Oct 24, 2024
4b1eb79
Merge branch 'main' of https://github.com/best-practice-and-impact/qa…
beth-e-jones Oct 29, 2024
6299459
reduced passive sentences in introduction
beth-e-jones Oct 29, 2024
3fdd502
removed several passive sentences in glossary
beth-e-jones Oct 29, 2024
4fb416d
removed passive sentences in managers guide
beth-e-jones Oct 29, 2024
7c9a4bb
removed passive sentences in Principles chapter
beth-e-jones Oct 30, 2024
00fd4a1
reduced passive sentences in Modular code chapter
beth-e-jones Oct 30, 2024
a152fcc
reduced passive sentences in Readable Code
beth-e-jones Oct 30, 2024
5495130
edited passive sentences in modular code section
beth-e-jones Oct 30, 2024
fddbd6d
reduced passive sentences in Project Structure chapter
beth-e-jones Oct 31, 2024
09fa804
reduced passive sentences in Documenting Code
beth-e-jones Oct 31, 2024
cd49e18
reduced passive sentences in project documentation section
beth-e-jones Oct 31, 2024
203cfba
reduced passive sentences in Documenting Projects section
beth-e-jones Oct 31, 2024
c831519
editing passive sentences in version control section
beth-e-jones Oct 31, 2024
1215d1e
editing passive sentences in version control (done to line 600)
beth-e-jones Oct 31, 2024
227b773
reduced passive sentences in version control
beth-e-jones Nov 4, 2024
442f339
reduced passive sentences in configuration section
beth-e-jones Nov 4, 2024
7df59bc
reduced passive sentences in data management section
beth-e-jones Nov 4, 2024
5e156ee
reduced passive sentences in peer review chapter
beth-e-jones Nov 4, 2024
c6985b3
edited passive voice and simplified some parts in automating qa chapter
ellie-o Nov 20, 2024
a1f04b2
Merging branch to pick up testing chapter edits so book can be normal…
ellie-o Jan 13, 2025
71ec243
Reduced passive voice in testing chapter. Some grammar edits and simp…
ellie-o Jan 15, 2025
4c055fe
converted some passive sentences to active in testing chapter
beth-e-jones Jan 20, 2025
0879cac
fix broken internal link in version control chapter
sarahcollyer Feb 20, 2025
d37b251
remove reference to broken link
sarahcollyer Feb 20, 2025
1cd0ebb
Proof read all chapters Martin R
zarbspace Feb 27, 2025
526f34d
Fix comments highlighted in review
sarahcollyer Mar 10, 2025
0ceb1a7
Update config to work with auto generated header anchors (#221)
sarahcollyer Mar 10, 2025
1 change: 1 addition & 0 deletions book/_config.yml
@@ -26,6 +26,7 @@ sphinx:
todo_include_todos: true
language: en
html_show_copyright: false
myst_heading_anchors: 3

latex:
latex_documents:
6 changes: 3 additions & 3 deletions book/checklist_higher.md
@@ -101,7 +101,7 @@ Quality assurance checklist from [the quality assurance of code for analysis and
### Testing

- Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/).
- Code based tests are run regularly.
- Code based tests are run regularly and after every significant change to the code.
- Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
- Informal tests are recorded near to the code.
- Stakeholder or user acceptance sign-offs are recorded near to the code.
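As a sketch of the first item in this checklist, a minimal pytest-style unit test might look like the following. The function under test, `add_vat`, is a hypothetical stand-in, not an example from the book.

```python
# Hypothetical stand-in function and pytest-style unit tests.
# pytest would discover and run the test_ functions automatically;
# they are called directly here so the example is self-contained.
def add_vat(price, rate=0.2):
    """Return the price including VAT, rounded to pennies."""
    return round(price * (1 + rate), 2)

def test_add_vat_default_rate():
    assert add_vat(100) == 120.0

def test_add_vat_zero_rate():
    assert add_vat(100, rate=0) == 100.0

test_add_vat_default_rate()
test_add_vat_zero_rate()
print("tests passed")
```

Running `pytest` against a file containing these functions would collect and report both tests, which is what "unit tested as code" means in practice.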
@@ -117,7 +117,7 @@ Quality assurance checklist from [the quality assurance of code for analysis and
- Required libraries and packages are documented, including their versions.
- Working operating system environments are documented.
- Example configuration files are provided.
- Where appropriate, code runs independent of operating system (e.g. suitable management of file paths).
- Where appropriate, code runs independently of the operating system (for example there is suitable management of file paths for different operating systems).
- Dependencies are managed separately for users, developers, and testers.
- There are as few dependencies as possible.
- Package dependencies are managed using an environment manager such as
@@ -250,7 +250,7 @@ Quality assurance checklist from

- [ ] Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and
[`testthat` for R](https://testthat.r-lib.org/).
- [ ] Code based tests are run regularly.
- [ ] Code based tests are run regularly and after every significant change to the code base.
- [ ] Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
- [ ] Informal tests are recorded near to the code.
- [ ] Stakeholder or user acceptance sign-offs are recorded near to the code.
4 changes: 2 additions & 2 deletions book/checklists.md
@@ -3,12 +3,12 @@
This section aims to provide a checklist for quality assurance of analytical projects in government.

As per the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government),
quality assurance should be proportional to the complexity and risk of your analysis.
quality assurance should be proportionate to the complexity and risk of your analysis.
With this in mind, we have provided checklists for three levels of quality assurance.

We recommend that you consider the risk and complexity associated with your project.
Given this assessment, you should select and tailor the checklists that we have provided.
The values and risk tolerance varies between government departments, so it is important that these are considered when deciding what quality assurance is adequate.
Risk tolerance varies between government departments, so it is important that you consider the operational context for the code when deciding what quality assurance is adequate.
You may choose to select elements from each level of quality assurance to address the specific risks associated with your work.


120 changes: 61 additions & 59 deletions book/code_documentation.md

Large diffs are not rendered by default.

81 changes: 39 additions & 42 deletions book/configuration.md
@@ -2,15 +2,15 @@

Configuration describes how your code runs when you execute it.

In analysis, we may want to run our analysis code using different inputs or parameters.
In analysis, we often want to run our analysis code using different inputs or parameters.
And we likely want other analysts to be able to run our code on different machines, for example, to reproduce our results.
This section describes how we can define analysis configuration that is easy to update and can remain separate from the logic in our analysis.


## Basic configuration

Configuration for your analysis code should include high level parameters (settings) that can be used to easily adjust how your analysis runs.
This might include paths to input and output files, database connection settings and model parameters that are likely to be adjusted between runs.
This might include paths to input and output files, database connection settings, and model parameters that are likely to be adjusted between runs.

In early development of our analysis, let's imagine that we have a script that looks something like this:

@@ -36,7 +36,7 @@ prediction.to_csv("outputs/predictions.csv")
```{code-tab} r R
# Note: this is not an example of good practice
# This is intended as example of what early pipeline code might look like
data <- read.csv("C:/a/very/specific/path/to/input_data.csv")
data <- utils::read.csv("C:/a/very/specific/path/to/input_data.csv")

set.seed(42)
split <- caTools::sample.split(data, SplitRatio = .3)
@@ -47,10 +47,9 @@ test_data <- data[!split, ]
model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "model.frame")
# model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "glm.fit")


prediction <- predict(model, test_data, type = "response")

write.csv(prediction, "outputs/predictions.csv")
utils::write.csv(prediction, "outputs/predictions.csv")

```

@@ -60,23 +59,23 @@ Here we're reading in some data and splitting it into subsets for training and t
We use one subset of variables and outcomes to train our model and then use the subset to test the model.
Finally, we write the model's predictions to a `.csv` file.

The file paths that are used to read and write data in our script are particular to our working environment.
These files and paths may not exist on an other analyst's machine.
As such, other analysts would need to read through the script and replace these paths in order to run our code.
The file paths we use to read and write data in our script are particular to our working environment.
These files and paths may not exist on another analyst's machine.
As such, to run our code, other analysts need to read through the script and replace these paths.
As we'll demonstrate below, collecting flexible parts of our code together makes it easier for others to update them.

When splitting our data and using our model to make predictions, we've provided some parameters to the functions that we have used to perform these tasks.
Eventually, we might reuse some of these parameters elsewhere in our script (e.g. the random seed)
Eventually, we might reuse some of these parameters elsewhere in our script (e.g., the random seed)
and we are likely to adjust these parameters between runs of our analysis.
To make it easier to adjust these consistently throughout our script, we should store them in variables.
We should store them in variables to make it easier to adjust these consistently throughout our script.
We should also store these variables with any other parameters and options, so that it's easy to identify where they should be adjusted.

Note that in this example we've tried our model prediction twice, with different parameters.
We've used comments to switch between which of these lines of code runs.
This practice is common, especially when we want to make a number of changes when developing how our analysis should run.
However, commenting sections of code in this way makes it difficult for others to understand our code and reproduce our results.
Another analyst would not be sure which set of parameters was used to produce a given set of predictions, so we should avoid this form of ambiguity.
Below, we'll look at some better alternatives to storing and switching our analysis parameters.
We should avoid this form of ambiguity because another analyst would not be sure which set of parameters was used to produce a given set of predictions.
Below, we'll look at some better alternatives for storing and switching analysis parameters.

````{tabs}

@@ -122,7 +121,7 @@ test_split_proportion = .3
model_method = "glm.fit"

#analysis
data <- read.csv(input_path)
data <- utils::read.csv(input_path)

set.seed(random_seed)
split <- caTools::sample.split(data, SplitRatio = test_split_proportion)
@@ -134,7 +133,7 @@ model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), d

prediction <- predict(model, test_data, type = "response")

write.csv(prediction, output_path)
utils::write.csv(prediction, output_path)
```

````
@@ -144,32 +143,30 @@ We're able to use basic objects (like lists and dictionaries) to group related p
We then reference these objects in the analysis section of our script.

Our configuration could be extended to include other parameters, including which variables we're selecting to train our model.
However, it is important that we keep the configuration simple and easy to maintain.
Before moving aspects of code to the configuration it's good to consider whether it improves your workflow.
If it is something that is dependent on the computer that you are using (e.g. file paths) or is likely to change between runs of your analysis,
then it's a good candidate for including in your configuration.
However, we must keep the configuration simple and easy to maintain.
Before moving aspects of code to the configuration, consider whether it improves your workflow.
You should include things that are dependent on the computer that you are using (e.g., file paths) or are likely to change between runs of your analysis, in your configuration.
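The Python tab of the revised script is collapsed in this diff. A minimal sketch of the same pattern, using only stand-in data and the standard library (so the parameter values are illustrative, not the book's exact example), would be:

```python
# Illustrative sketch: configuration gathered at the top of the script.
# The parameter names mirror the R example; the data is a stand-in list.
import random

# Configuration: settings likely to change between runs or machines
input_path = "C:/a/very/specific/path/to/input_data.csv"  # placeholder path
output_path = "outputs/predictions.csv"                   # placeholder path
test_split_proportion = 0.3
random_seed = 42

# Analysis: reference the parameters rather than repeating literal values
random.seed(random_seed)
rows = list(range(10))  # stand-in for the loaded data
test_size = round(len(rows) * test_split_proportion)
test_rows = random.sample(rows, test_size)
train_rows = [r for r in rows if r not in test_rows]
print(len(train_rows), len(test_rows))  # 7 3
```

Because the seed and split proportion appear only once, at the top, changing them for a new run is a one-line edit rather than a hunt through the analysis logic.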


## Use separate configuration files

We can use independent configuration files to take our previous example one step further.
We can take our previous example one step further using independent configuration files.
We simply take our collection of variables, containing parameters and options for our analysis, and move them to a separate file.
As we'll describe in the following subsections, these files can be written in the same language as your code or other simple languages.
These files can be written in the same language as your code or other simple languages, as we'll describe in the following subsections.

Storing our analysis configuration in a separate file to the analysis code is a useful separation.
It means that we can version control our code based solely on changes to the overall logic - when we fix bugs or add new features.
We can then keep a separate record of which configuration files were used with our code to generate specific results.
We can easily switch between multiple configurations, by providing our analysis code with different configuration files.
We can easily switch between multiple configurations by providing our analysis code with different configuration files.

You may not want to version control your configuration file,
for example if it includes file paths that are specific to your machine or references to sensitive data.
In this case, you should include a sample or example configuration file, so that others can use this as a template to configure the analysis for their own environment.
It is key that this template is kept up to date, so that it is compatible with your code.
You may not want to version control your configuration file if it includes file paths that are specific to your machine or references to sensitive data.
In this case, include a sample or example configuration file, so others can use this as a template to configure the analysis for their own environment.
It is key to keep this template up to date, so that it is compatible with your code.


### Use code files for configuration

To use another code script as our configuration file, we can copy our parameter variables directly from our scripts.
We can copy our parameter variables directly from our scripts to use another code script as our configuration file.
Because these variables are defined in the programming language that our analysis uses, it's easy to access them in our analysis script.
In Python, variables from these config files can be imported into your analysis script.
In R, your script might `source()` the config file to read the variables into the R environment.
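The surrounding example is collapsed in this diff, but a hedged sketch of the Python approach is below. The file name `config.py` is an assumption, and the file is created in a temporary directory purely so the example is self-contained.

```python
# Sketch: a Python script used as a configuration file, then imported.
# "config.py" is an assumed name; it is written to a temporary directory
# here only so the example runs on its own.
import importlib.util
import pathlib
import tempfile

config_source = (
    'input_path = "input_data.csv"\n'
    'test_split_proportion = 0.3\n'
    'random_seed = 42\n'
)

config_file = pathlib.Path(tempfile.mkdtemp()) / "config.py"
config_file.write_text(config_source)

# Import the configuration module and access its variables by name
spec = importlib.util.spec_from_file_location("config", config_file)
config = importlib.util.module_from_spec(spec)
spec.loader.exec_module(config)

print(config.input_path, config.random_seed)  # input_data.csv 42
```

In an ordinary project this reduces to a plain `import config` with `config.py` sitting next to the analysis script.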
@@ -180,7 +177,7 @@ In R, your script might `source()` the config file to read the variables into th
Many other file formats can be used to store configuration parameters.
You may have come across data-serialisation languages (including YAML, TOML, JSON and XML), which can be used independently of your programming language.

If we were to represent our example configuration from above in YAML, this would look like:
If we represent our example configuration from above in YAML, it would look like this:

```yaml
input_path: "C:/a/very/specific/path/to/input_data.csv"
@@ -194,8 +191,8 @@ prediction_parameters:
max_v: 1000
```

Configuration files that are written in other languages may need to be read using relevant libraries.
The YAML example above could be read into our analysis as follows:
You can use relevant libraries to read configuration files that are written in other languages.
For example, we could read the YAML example into our analysis like this:

````{tabs}

@@ -221,7 +218,7 @@ data <- read.csv(config$input_path)
Configuration file formats like YAML and TOML are compact and human-readable.
This makes them easy to interpret and update, even without knowledge of the underlying code used in the analysis.
Reading these files in produces a single object containing all of the `key:value` pairs defined in our configuration file.
In our analysis, we can then select our configuration parameters using their keys.
We can then select our configuration parameters using their keys in our analysis.


## Use configuration files as arguments
@@ -231,8 +228,8 @@ Although this allows us to separate our configuration from the main codebase, we
This is not ideal, as for the code to be run on another machine the configuration file must be saved on the same path.
Furthermore, if we want to switch the configuration file that the analysis uses we must change this path or replace the configuration file at the specified path.

To overcome this, we can adjust our analysis script to take the configuration file path as an argument when the analysis script is run.
This can be achieved in a number of ways, but we'll discuss a minimal example here:
We can adjust our analysis script to take the configuration file path as an argument when the analysis script is run to overcome this.
We can achieve this in a number of ways, but we'll discuss a minimal example here:

````{tabs}

@@ -283,7 +280,7 @@ This means that we don't need to change our code to account for changes to the c

```{note}
It is possible to pass configuration options directly as arguments in this way, instead of referencing a configuration file.
However, use of configuration files should be preferred as they allow us to document which configuration
However, you should use configuration files as they allow us to document which configuration
has been used to produce our analysis outputs, for reproducibility.
```
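A hedged sketch of the minimal pattern, reading the configuration file path from the command-line arguments, is below; the script and file names are hypothetical.

```python
# Sketch: taking the configuration file path as a command-line argument.
# The script name "analysis.py" and the config file name are hypothetical.
import sys

def main(argv=None):
    # Expect the configuration file path as the first argument, e.g.:
    #   python analysis.py config.yaml
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        raise SystemExit("Usage: python analysis.py <config_file>")
    config_path = argv[1]
    # ...load the config file and run the analysis here...
    return config_path

# Simulate the command-line call for illustration
print(main(["analysis.py", "my_config.yaml"]))  # my_config.yaml
```

Switching configurations then means changing only the argument at the command line, not the code.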

@@ -293,17 +290,17 @@ has been used to produce our analysis outputs, for reproducibility.

Environment variables are variables that are available in a particular environment.
In most analysis contexts, our environment is the user environment that we are running our code from.
This might be your local machine or dedicated analysis platform.
This might be your local machine or an analysis platform.

If your code depends on credentials of some kind, these must not be written in your code.
Passwords and keys could be stored in configuration files, but there is a risk that these files may be included in [version control](version_control.md).
To avoid this risk, it is best to store this information in local environment variables.
If your code depends on credentials of some kind, do not write these in your code.
You can store passwords and keys in configuration files, but there is a risk that these files may be included in [version control](version_control.md).
To avoid this risk, store this information in local environment variables.

Environment variables can also be useful for storing other environment-dependent variables.
For example, the location of a database or a software dependency.
This might be preferred over a configuration file if very few other options are required by the code.
We might prefer this over a configuration file if the code requires very few other options.

In Unix systems (e.g. Linux and Mac), environment variables can be set in the terminal using `export` and deleted using `unset`:
In Unix systems (e.g., Linux and Mac), you can set environment variables in the terminal using `export` and delete them using `unset`:

```none
export SECRET_KEY="mysupersecretpassword"
@@ -317,9 +314,9 @@ setx SECRET_KEY "mysupersecretpassword"
reg delete HKCU\Environment /F /V SECRET_KEY
```

These can alternatively be defined using a graphical interface under `Edit environment variables for your account` in your Windows settings.
You can alternatively define them using a graphical interface under `Edit environment variables for your account` in your Windows settings.

Once stored in environment variables, these variables will remain available in your environment until they are deleted.
Once stored in environment variables, these variables will remain available in your environment until you delete them.

You can access this variable in your code like so:

@@ -337,4 +334,4 @@ my_key <- Sys.getenv("SECRET_KEY")

````
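The Python tab of the access example is collapsed in this diff; a sketch of the equivalent access via `os.environ` would be the following. The variable is set in-process here only so the example is self-contained; normally it would already exist in your shell environment.

```python
# Sketch: reading a credential from an environment variable in Python.
# The variable is set in-process only for illustration; in real use it
# comes from the shell, as shown with export/setx above.
import os

os.environ["SECRET_KEY"] = "mysupersecretpassword"

my_key = os.environ["SECRET_KEY"]  # raises KeyError if the variable is unset
# .get() returns a default instead of raising when the variable is missing
fallback = os.environ.get("HYPOTHETICAL_UNSET_KEY", "")

print(my_key, repr(fallback))  # mysupersecretpassword ''
```

Using `os.environ.get()` with a default is a common way to make the script fail gracefully, or fall back to a non-secret value, when the variable has not been set.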

It is then safer for this code to be shared with others, as it is not possible to acquire your credentials without access to your environment.
It is then safer for this code to be shared with others, as they can't acquire your credentials without access to your environment.