Merge pull request #183 from alanocallaghan/carp-review-comments
Carpentries review comments
catavallejos authored Dec 23, 2024
2 parents 1956d9e + 6af2cd3 commit d88b308
Showing 37 changed files with 154 additions and 4,322 deletions.
29 changes: 14 additions & 15 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -36,7 +36,7 @@ knitr_fig_path("01-")
```


# What are high-dimensional data?
## What are high-dimensional data?

*High-dimensional data* are defined as data with many features (variables observed).
In recent years, advances in information technology have allowed large amounts of data to
@@ -48,8 +48,7 @@ blood test results, behaviours, and general health. An example of what high-dime
in a biomedical study is shown in the figure below.



```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. Each row contains the data for individual patients."}
```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many columns representing features related to health, such as blood pressure, heart rate and respiratory rate. Each row contains the data for an individual patient. This type of high-dimensional data could contain hundreds or thousands of columns (features/variables) and thousands or even millions of rows (observations/samples/patients)."}
knitr::include_graphics("../fig/intro-table.png")
```
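To build intuition for the table above, a "high-dimensional" dataset is simply one whose matrix is much wider than it is tall. The sketch below uses simulated numbers (not the lesson's data) to construct such a matrix:

```r
# A minimal sketch with simulated data (not a real study):
# 20 observations (rows) but 1000 features (columns), so p >> n.
set.seed(1)
n_obs <- 20
n_features <- 1000
X <- matrix(rnorm(n_obs * n_features), nrow = n_obs)
dim(X)  # 20 rows (patients) and 1000 columns (features)
```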

@@ -65,7 +64,7 @@ for practical high-dimensional data analysis in the biological sciences.



> ## Challenge 1
> ### Challenge 1
>
> Descriptions of four research questions and their datasets are given below.
> Which of these scenarios use high-dimensional data?
@@ -84,7 +83,7 @@ for practical high-dimensional data analysis in the biological sciences.
> (age, weight, BMI, blood pressure) and cancer growth (tumour size,
> localised spread, blood test results).
>
> > ## Solution
> > ### Solution
> >
> > 1. No. The number of features is relatively small (4 including the response variable since this is an observed variable).
> > 2. Yes, this is an example of high-dimensional data. There are 200,004 features.
@@ -98,7 +97,7 @@ Now that we have an idea of what high-dimensional data look like we can think
about the challenges we face in analysing them.


# Why is dealing with high-dimensional data challenging?
## Why is dealing with high-dimensional data challenging?

Most classical statistical methods are set up for use on low-dimensional data
(i.e. with a small number of features, $p$).
@@ -118,7 +117,7 @@ of the challenges we are facing when working with high-dimensional data. For ref
the lesson are described in the [data page](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html).


> ## Challenge 2
> ### Challenge 2
>
> For illustrative purposes, we start with a simple dataset that is not technically
> high-dimensional but contains many features. This will illustrate the general problems
@@ -139,7 +138,7 @@ the lesson are described in the [data page](https://carpentries-incubator.github
> 3. Plot the relationship between the variables (hint: see the `pairs()` function). What problem(s) with
> high-dimensional data analysis does this illustrate?
>
> > ## Solution
> > ### Solution
> >
> >
> > ```{r dim-prostate, eval = FALSE}
@@ -171,7 +170,7 @@ Note that function documentation and information on function arguments will be u
this lesson. We can access these easily in R by running `?` followed by the package name.
For example, the documentation for the `dim` function can be accessed by running `?dim`.
> ## Locating data with R - the **`here`** package
> ### Locating data with R - the **`here`** package
>
> It is often desirable to access external datasets from inside R and to write
> code that does this reliably on different computers. While R has an inbuilt
@@ -213,7 +212,7 @@ in these datasets makes high correlations between variables more likely. Let's
explore why high correlations might be an issue in a Challenge.


> ## Challenge 3
> ### Challenge 3
>
> Use the `cor()` function to examine correlations between all variables in the
> `prostate` dataset. Are some pairs of variables highly correlated using a threshold of
@@ -227,7 +226,7 @@ explore why high correlations might be an issue in a Challenge.
> Fit a multiple linear regression model predicting patient age using both
> variables. What happened?
>
> > ## Solution
> > ### Solution
> >
> > Create a correlation matrix of all variables in the `prostate` dataset
> >
@@ -289,7 +288,7 @@ regularisation, which we will discuss in the lesson on high-dimensional
regression.
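As a hedged sketch of the collinearity problem discussed above (simulated data, not the `prostate` dataset): when one predictor is an exact linear function of another, `lm()` cannot estimate both coefficients and reports `NA` for the aliased one.

```r
# Simulated illustration of perfect collinearity (hypothetical data):
set.seed(42)
x1 <- rnorm(20)
x2 <- 2 * x1               # x2 is an exact linear function of x1
y  <- x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                  # the coefficient for x2 is NA (aliased)
```

With real, noisy data the coefficients are usually estimable but highly unstable, which is one motivation for the regularised regression mentioned above.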
# What statistical methods are used to analyse high-dimensional data?
## What statistical methods are used to analyse high-dimensional data?
We have discussed so far that high-dimensional data analysis can be challenging since variables are difficult to visualise,
leading to challenges in identifying relationships between variables and suitable response variables; we may have
@@ -336,7 +335,7 @@ through clustering cells with similar gene expression patterns. The *K-means*
episode will explore a specific method to perform clustering analysis.
> ## Using Bioconductor to access high-dimensional data in the biosciences
> ### Using Bioconductor to access high-dimensional data in the biosciences
>
> In this workshop, we will look at statistical methods that can be used to
> visualise and analyse high-dimensional biological data using packages available
@@ -392,7 +391,7 @@ episode will explore a specific method to perform clustering analysis.
> common challenge in analysing high-dimensional genomics data.
{: .callout}
# Further reading
## Further reading
- Buhlman, P. & van de Geer, S. (2011) Statistics for High-Dimensional Data. Springer, London.
- [Buhlman, P., Kalisch, M. & Meier, L. (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application](https://doi.org/10.1146/annurev-statistics-022513-115545).
@@ -406,7 +405,7 @@ methods that could be used to analyse high-dimensional data. See
Some related (and important!) content is also available in
[Responsible machine learning](https://carpentries-incubator.github.io/machine-learning-responsible-python/).
# Other resources suggested by former students
## Other resources suggested by former students
- [Josh Starmer's](https://www.youtube.com/c/joshstarmer) youtube channel.
52 changes: 26 additions & 26 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -35,7 +35,7 @@ source(here("bin/chunk-options.R"))
knitr_fig_path("02-")
```

# DNA methylation data
## DNA methylation data

For the following few episodes, we will be working with human DNA
methylation data from flow-sorted blood samples, described in [data](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html). DNA methylation assays
@@ -138,12 +138,12 @@ Heatmap(methyl_mat_ord,
top_annotation = columnAnnotation(age = age_ord))
```

> ## Challenge 1
> ### Challenge 1
>
> Why can we not just fit many linear regression models relating every combination of feature
> (`colData` and assays) and draw conclusions by associating all variables with significant model p-values?
>
> > ## Solution
> > ### Solution
> >
> > There are a number of problems that this kind of approach presents.
> > For example:
@@ -178,7 +178,7 @@ have a single outcome (age) which will be predicted using 5000 covariates
The examples in this episode will focus on the first type of problem, whilst
the next episode will focus on the second.

> ## Measuring DNA Methylation
> ### Measuring DNA Methylation
>
> DNA methylation is an epigenetic modification of DNA. Generally, we
> are interested in the proportion of methylation at many sites or
@@ -214,7 +214,7 @@ the next episode will focus on the second.
> therefore can be easier to work with in statistical models.
{: .callout}

# Regression with many outcomes
## Regression with many outcomes

In high-throughput studies, it is common to have one or more phenotypes
or groupings that we want to relate to features of interest (eg, gene
@@ -299,7 +299,7 @@ And, of course, we often have an awful lot of features and need to
prioritise a subset of them! We need a rigorous way to prioritise genes
for further analysis.

# Fitting a linear model
## Fitting a linear model

So, in the data we have read in, we have a matrix of methylation values
$X$ and a vector of ages, $y$. One way to model this is to see if we can
@@ -342,7 +342,7 @@ outlined previously. Before we introduce this approach, let's go into
detail about how we generally check whether the results of a linear
model are statistically significant.

# Hypothesis testing in linear regression
## Hypothesis testing in linear regression

Using the linear model we defined above, we can ask questions based on the
estimated value for the regression coefficients. For example, do individuals
@@ -477,7 +477,7 @@ we're estimating and the uncertainty we have in that effect. A large effect with
uncertainty may not lead to a small p-value, and a small effect with
small uncertainty may lead to a small p-value.

> ## Calculating p-values from a linear model
> ### Calculating p-values from a linear model
>
> Manually calculating the p-value for a linear model is a little bit
> more complex than calculating the t-statistic. The intuition posted
@@ -509,12 +509,12 @@ small uncertainty may lead to a small p-value.
> ```
{: .callout}
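For instance, a two-sided p-value can be obtained from a t-statistic and the residual degrees of freedom using the t distribution. The numbers below are made-up illustrative values, not the model fitted in this episode:

```r
# Sketch: converting a t-statistic to a two-sided p-value.
# tstat and df are hypothetical illustrative values.
tstat <- 2.5
df <- 35                   # residual degrees of freedom of the fitted model
p <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)
p
```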
> ## Challenge 2
> ### Challenge 2
>
> In the model we fitted, the estimate for the intercept is 0.902 and its associated
> p-value is 0.0129. What does this mean?
>
> > ## Solution
> > ### Solution
> >
> > The first coefficient in a linear model like this is the intercept, which measures
> > the mean of the outcome (in this case, the methylation value for the first CpG)
@@ -535,7 +535,7 @@ small uncertainty may lead to a small p-value.
> {: .solution}
{: .challenge}
# Fitting a lot of linear models
## Fitting a lot of linear models
In the linear model above, we are generally interested in the second regression
coefficient (often referred to as *slope*) which measures the linear relationship
@@ -557,7 +557,7 @@ efficient, and it would also be laborious to do programmatically. There are ways
to get around this, but first let us talk about what exactly we are doing when
we look at significance tests in this context.

# Sharing information across outcome variables
## Sharing information across outcome variables

We are going to introduce an idea that allows us to
take advantage of the fact that we carry out many tests at once on
@@ -609,7 +609,7 @@ may have seen when running linear models. Here, we define a *model matrix* or
coefficients that should be fit in each linear model. These are used in
similar ways in many different modelling libraries.

> ## What is a model matrix?
> ### What is a model matrix?
> R fits a regression model by choosing the vector of regression coefficients
> that minimises the differences between outcome values and predicted values
> using the covariates (or predictor variables). To get predicted values,
@@ -715,12 +715,12 @@ continuous measures like these, it is often convenient to obtain a list
of features which we are confident have non-zero effect sizes. This is
made more difficult by the number of tests we perform.

> ## Challenge 3
> ### Challenge 3
>
> The effect size estimates are very small, and yet many of the p-values
> are well below a usual significance level of p \< 0.05. Why is this?
>
> > ## Solution
> > ### Solution
> >
> > Because age has a much larger range than methylation levels, the
> > unit change in methylation level even for a strong relationship is
@@ -799,15 +799,15 @@ understand, but it is useful to develop an intuition about why these approaches
precise and sensitive than the naive approach of fitting a model to each
feature separately.

> ## Challenge 4
> ### Challenge 4
>
> 1. Try to run the same kind of linear model with smoking status as
> covariate instead of age, and making a volcano plot. *Note:
> smoking status is stored as* `methylation$smoker`.
> 2. We saw in the example in the lesson that this information sharing
> can lead to larger p-values. Why might this be preferable?
>
> > ## Solution
> > ### Solution
> >
> > 1. The following code runs the same type of model with smoking
> > status:
@@ -859,7 +859,7 @@ feature separately.
{: .challenge}
```

> ## Shrinkage
> ### Shrinkage
>
> Shrinkage is an intuitive term for an effect of information sharing,
> and is something observed in a broad range of statistical models.
@@ -902,7 +902,7 @@ feature separately.
# todo: callout box explaining DESeq2
```

# The problem of multiple tests
## The problem of multiple tests

With such a large number of features, it would be useful to decide which
features are "interesting" or "significant" for further study. However,
@@ -943,7 +943,7 @@ threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.

> ## Challenge 5
> ### Challenge 5
>
> 1. If we run `r nrow(methylation)` tests, even if there are no true differences,
> how many of them (on average) will be statistically significant at
@@ -955,7 +955,7 @@ simply due to chance.
> 3. How could we account for a varying number of tests to ensure
> "significant" changes are truly different?
>
> > ## Solution
> > ### Solution
> >
> > 1. By default we expect
> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$
@@ -974,7 +974,7 @@ simply due to chance.
> {: .solution}
{: .challenge}
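The expectation in this challenge can be checked by simulation. The sketch below uses made-up data (not the methylation dataset): when the null hypothesis is true for every test, roughly 5% of p-values still fall below 0.05.

```r
# Simulate 1000 t-tests where there is no true difference between groups:
set.seed(1)
pvals <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
mean(pvals < 0.05)  # close to 0.05, i.e. roughly 50 false positives
```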

# Adjusting for multiple tests
## Adjusting for multiple tests

When performing many statistical tests to categorise features, we are
effectively classifying features as "non-significant" or "significant", the latter meaning those for
@@ -996,7 +996,7 @@ make falls into four categories:
little data, we can't detect large differences. However, both can be
argued to be "true".

| | Label as different | Label as not different |
| True outcome| Label as different | Label as not different |
|--------------------:|-------------------:|-----------------------:|
| Truly different | True positive | False negative |
| Truly not different | False positive | True negative |
Expand Down Expand Up @@ -1062,7 +1062,7 @@ experiment over and over.
| \- Very conservative | \- Does not control probability of making errors |
| \- Requires larger statistical power | \- May result in false discoveries |
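In R, both kinds of correction are available through `p.adjust()`. The p-values below are made-up illustrative numbers, not results from the lesson's data:

```r
# Sketch: Bonferroni vs Benjamini-Hochberg adjustment of five p-values.
p <- c(0.0001, 0.001, 0.01, 0.04, 0.2)
p.adjust(p, method = "bonferroni")  # multiply by number of tests, capped at 1
p.adjust(p, method = "BH")          # controls the false discovery rate
```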

> ## Challenge 6
> ### Challenge 6
>
> 1. At a significance level of 0.05, with 100 tests performed, what is
> the Bonferroni significance threshold?
@@ -1075,7 +1075,7 @@ experiment over and over.
> Compare these values to the raw p-values and the Bonferroni
> p-values.
>
> > ## Solution
> > ### Solution
> >
> > 1. The Bonferroni threshold for this significance threshold is $$
> > \frac{0.05}{100} = 0.0005
@@ -1109,7 +1109,7 @@ experiment over and over.
> ## Feature selection
> ### Feature selection
>
> In this episode, we have focussed on regression in a setting where there are more
> features than observations. This approach is relevant if we are interested in the