Merge pull request #183 from alanocallaghan/carp-review-comments
Carpentries review comments
catavallejos authored Dec 23, 2024
2 parents 1956d9e + 6af2cd3 commit d88b308
Showing 37 changed files with 154 additions and 4,322 deletions.
29 changes: 14 additions & 15 deletions _episodes_rmd/01-introduction-to-high-dimensional-data.Rmd
@@ -36,7 +36,7 @@ knitr_fig_path("01-")
```


# What are high-dimensional data?
## What are high-dimensional data?

*High-dimensional data* are defined as data with many features (variables observed).
In recent years, advances in information technology have allowed large amounts of data to
@@ -48,8 +48,7 @@ blood test results, behaviours, and general health. An example of what high-dime
in a biomedical study is shown in the figure below.



```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. Each row contains the data for individual patients."}
```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many columns representing features related to health, such as blood pressure, heart rate and respiratory rate. Each row contains the data for an individual patient. This type of high-dimensional data could contain hundreds or thousands of columns (features/variables) and thousands or even millions of rows (observations/samples/patients)."}
knitr::include_graphics("../fig/intro-table.png")
```
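To build intuition for the table above, a "high-dimensional" dataset is simply one whose matrix is much wider than it is tall. The sketch below uses simulated numbers (not the lesson's data) to construct such a matrix:

```r
# A minimal sketch with simulated data (not a real study):
# 20 observations (rows) but 1000 features (columns), so p >> n.
set.seed(1)
n_obs <- 20
n_features <- 1000
X <- matrix(rnorm(n_obs * n_features), nrow = n_obs)
dim(X)  # 20 rows (patients) and 1000 columns (features)
```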

@@ -65,7 +64,7 @@ for practical high-dimensional data analysis in the biological sciences.



> ## Challenge 1
> ### Challenge 1
>
> Descriptions of four research questions and their datasets are given below.
> Which of these scenarios use high-dimensional data?
@@ -84,7 +83,7 @@ for practical high-dimensional data analysis in the biological sciences.
> (age, weight, BMI, blood pressure) and cancer growth (tumour size,
> localised spread, blood test results).
>
> > ## Solution
> > ### Solution
> >
> > 1. No. The number of features is relatively small (4 including the response variable since this is an observed variable).
> > 2. Yes, this is an example of high-dimensional data. There are 200,004 features.
@@ -98,7 +97,7 @@ Now that we have an idea of what high-dimensional data look like we can think
about the challenges we face in analysing them.


# Why is dealing with high-dimensional data challenging?
## Why is dealing with high-dimensional data challenging?

Most classical statistical methods are set up for use on low-dimensional data
(i.e. with a small number of features, $p$).
@@ -118,7 +117,7 @@ of the challenges we are facing when working with high-dimensional data. For ref
the lesson are described in the [data page](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html).


> ## Challenge 2
> ### Challenge 2
>
> For illustrative purposes, we start with a simple dataset that is not technically
> high-dimensional but contains many features. This will illustrate the general problems
@@ -139,7 +138,7 @@ the lesson are described in the [data page](https://carpentries-incubator.github
> 3. Plot the relationship between the variables (hint: see the `pairs()` function). What problem(s) with
> high-dimensional data analysis does this illustrate?
>
> > ## Solution
> > ### Solution
> >
> >
> > ```{r dim-prostate, eval = FALSE}
@@ -171,7 +170,7 @@ Note that function documentation and information on function arguments will be u
this lesson. We can access these easily in R by running `?` followed by the package name.
For example, the documentation for the `dim` function can be accessed by running `?dim`.
> ## Locating data with R - the **`here`** package
> ### Locating data with R - the **`here`** package
>
> It is often desirable to access external datasets from inside R and to write
> code that does this reliably on different computers. While R has an inbuilt
@@ -213,7 +212,7 @@ in these datasets makes high correlations between variables more likely. Let's
explore why high correlations might be an issue in a Challenge.


> ## Challenge 3
> ### Challenge 3
>
> Use the `cor()` function to examine correlations between all variables in the
> `prostate` dataset. Are some pairs of variables highly correlated using a threshold of
@@ -227,7 +226,7 @@ explore why high correlations might be an issue in a Challenge.
> Fit a multiple linear regression model predicting patient age using both
> variables. What happened?
>
> > ## Solution
> > ### Solution
> >
> > Create a correlation matrix of all variables in the `prostate` dataset
> >
@@ -289,7 +288,7 @@ regularisation, which we will discuss in the lesson on high-dimensional
regression.
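As a hedged sketch of the collinearity problem discussed above (simulated data, not the `prostate` dataset): when one predictor is an exact linear function of another, `lm()` cannot estimate both coefficients and reports `NA` for the aliased one.

```r
# Simulated illustration of perfect collinearity (hypothetical data):
set.seed(42)
x1 <- rnorm(20)
x2 <- 2 * x1               # x2 is an exact linear function of x1
y  <- x1 + rnorm(20)
fit <- lm(y ~ x1 + x2)
coef(fit)                  # the coefficient for x2 is NA (aliased)
```

With real, noisy data the coefficients are usually estimable but highly unstable, which is one motivation for the regularised regression mentioned above.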
# What statistical methods are used to analyse high-dimensional data?
## What statistical methods are used to analyse high-dimensional data?
We have discussed so far that high-dimensional data analysis can be challenging since variables are difficult to visualise,
leading to challenges in identifying relationships between variables and suitable response variables; we may have
@@ -336,7 +335,7 @@ through clustering cells with similar gene expression patterns. The *K-means*
episode will explore a specific method to perform clustering analysis.
> ## Using Bioconductor to access high-dimensional data in the biosciences
> ### Using Bioconductor to access high-dimensional data in the biosciences
>
> In this workshop, we will look at statistical methods that can be used to
> visualise and analyse high-dimensional biological data using packages available
@@ -392,7 +391,7 @@ episode will explore a specific method to perform clustering analysis.
> common challenge in analysing high-dimensional genomics data.
{: .callout}
# Further reading
## Further reading
- Buhlman, P. & van de Geer, S. (2011) Statistics for High-Dimensional Data. Springer, London.
- [Buhlman, P., Kalisch, M. & Meier, L. (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application](https://doi.org/10.1146/annurev-statistics-022513-115545).
@@ -406,7 +405,7 @@ methods that could be used to analyse high-dimensional data. See
Some related (and important!) content is also available in
[Responsible machine learning](https://carpentries-incubator.github.io/machine-learning-responsible-python/).
# Other resources suggested by former students
## Other resources suggested by former students
- [Josh Starmer's](https://www.youtube.com/c/joshstarmer) youtube channel.
52 changes: 26 additions & 26 deletions _episodes_rmd/02-high-dimensional-regression.Rmd
@@ -35,7 +35,7 @@ source(here("bin/chunk-options.R"))
knitr_fig_path("02-")
```

# DNA methylation data
## DNA methylation data

For the following few episodes, we will be working with human DNA
methylation data from flow-sorted blood samples, described in [data](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html). DNA methylation assays
@@ -138,12 +138,12 @@ Heatmap(methyl_mat_ord,
top_annotation = columnAnnotation(age = age_ord))
```

> ## Challenge 1
> ### Challenge 1
>
> Why can we not just fit many linear regression models relating every combination of feature
> (`colData` and assays) and draw conclusions by associating all variables with significant model p-values?
>
> > ## Solution
> > ### Solution
> >
> > There are a number of problems that this kind of approach presents.
> > For example:
@@ -178,7 +178,7 @@ have a single outcome (age) which will be predicted using 5000 covariates
The examples in this episode will focus on the first type of problem, whilst
the next episode will focus on the second.

> ## Measuring DNA Methylation
> ### Measuring DNA Methylation
>
> DNA methylation is an epigenetic modification of DNA. Generally, we
> are interested in the proportion of methylation at many sites or
@@ -214,7 +214,7 @@ the next episode will focus on the second.
> therefore can be easier to work with in statistical models.
{: .callout}

# Regression with many outcomes
## Regression with many outcomes

In high-throughput studies, it is common to have one or more phenotypes
or groupings that we want to relate to features of interest (eg, gene
@@ -299,7 +299,7 @@ And, of course, we often have an awful lot of features and need to
prioritise a subset of them! We need a rigorous way to prioritise genes
for further analysis.

# Fitting a linear model
## Fitting a linear model

So, in the data we have read in, we have a matrix of methylation values
$X$ and a vector of ages, $y$. One way to model this is to see if we can
@@ -342,7 +342,7 @@ outlined previously. Before we introduce this approach, let's go into
detail about how we generally check whether the results of a linear
model are statistically significant.

# Hypothesis testing in linear regression
## Hypothesis testing in linear regression

Using the linear model we defined above, we can ask questions based on the
estimated value for the regression coefficients. For example, do individuals
@@ -477,7 +477,7 @@ we're estimating and the uncertainty we have in that effect. A large effect with
uncertainty may not lead to a small p-value, and a small effect with
small uncertainty may lead to a small p-value.

> ## Calculating p-values from a linear model
> ### Calculating p-values from a linear model
>
> Manually calculating the p-value for a linear model is a little bit
> more complex than calculating the t-statistic. The intuition posted
@@ -509,12 +509,12 @@ small uncertainty may lead to a small p-value.
> ```
{: .callout}
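For instance, a two-sided p-value can be obtained from a t-statistic and the residual degrees of freedom using the t distribution. The numbers below are made-up illustrative values, not the model fitted in this episode:

```r
# Sketch: converting a t-statistic to a two-sided p-value.
# tstat and df are hypothetical illustrative values.
tstat <- 2.5
df <- 35                   # residual degrees of freedom of the fitted model
p <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)
p
```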
> ## Challenge 2
> ### Challenge 2
>
> In the model we fitted, the estimate for the intercept is 0.902 and its associated
> p-value is 0.0129. What does this mean?
>
> > ## Solution
> > ### Solution
> >
> > The first coefficient in a linear model like this is the intercept, which measures
> > the mean of the outcome (in this case, the methylation value for the first CpG)
@@ -535,7 +535,7 @@ small uncertainty may lead to a small p-value.
> {: .solution}
{: .challenge}
# Fitting a lot of linear models
## Fitting a lot of linear models
In the linear model above, we are generally interested in the second regression
coefficient (often referred to as *slope*) which measures the linear relationship
@@ -557,7 +557,7 @@ efficient, and it would also be laborious to do programmatically. There are ways
to get around this, but first let us talk about what exactly we are doing when
we look at significance tests in this context.

# Sharing information across outcome variables
## Sharing information across outcome variables

We are going to introduce an idea that allows us to
take advantage of the fact that we carry out many tests at once on
@@ -609,7 +609,7 @@ may have seen when running linear models. Here, we define a *model matrix* or
coefficients that should be fit in each linear model. These are used in
similar ways in many different modelling libraries.

> ## What is a model matrix?
> ### What is a model matrix?
> R fits a regression model by choosing the vector of regression coefficients
> that minimises the differences between outcome values and predicted values
> using the covariates (or predictor variables). To get predicted values,
@@ -715,12 +715,12 @@ continuous measures like these, it is often convenient to obtain a list
of features which we are confident have non-zero effect sizes. This is
made more difficult by the number of tests we perform.

> ## Challenge 3
> ### Challenge 3
>
> The effect size estimates are very small, and yet many of the p-values
> are well below a usual significance level of p \< 0.05. Why is this?
>
> > ## Solution
> > ### Solution
> >
> > Because age has a much larger range than methylation levels, the
> > unit change in methylation level even for a strong relationship is
@@ -799,15 +799,15 @@ understand, but it is useful to develop an intuition about why these approaches
precise and sensitive than the naive approach of fitting a model to each
feature separately.

> ## Challenge 4
> ### Challenge 4
>
> 1. Try to run the same kind of linear model with smoking status as
> covariate instead of age, and making a volcano plot. *Note:
> smoking status is stored as* `methylation$smoker`.
> 2. We saw in the example in the lesson that this information sharing
> can lead to larger p-values. Why might this be preferable?
>
> > ## Solution
> > ### Solution
> >
> > 1. The following code runs the same type of model with smoking
> > status:
@@ -859,7 +859,7 @@ feature separately.
{: .challenge}
```

> ## Shrinkage
> ### Shrinkage
>
> Shrinkage is an intuitive term for an effect of information sharing,
> and is something observed in a broad range of statistical models.
@@ -902,7 +902,7 @@ feature separately.
# todo: callout box explaining DESeq2
```

# The problem of multiple tests
## The problem of multiple tests

With such a large number of features, it would be useful to decide which
features are "interesting" or "significant" for further study. However,
@@ -943,7 +943,7 @@ threshold in a real experiment, it is likely that we would identify many
features as associated with age, when the results we are observing are
simply due to chance.

> ## Challenge 5
> ### Challenge 5
>
> 1. If we run `r nrow(methylation)` tests, even if there are no true differences,
> how many of them (on average) will be statistically significant at
@@ -955,7 +955,7 @@ simply due to chance.
> 3. How could we account for a varying number of tests to ensure
> "significant" changes are truly different?
>
> > ## Solution
> > ### Solution
> >
> > 1. By default we expect
> > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$
@@ -974,7 +974,7 @@ simply due to chance.
> {: .solution}
{: .challenge}
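The expectation in this challenge can be checked by simulation. The sketch below uses made-up data (not the methylation dataset): when the null hypothesis is true for every test, roughly 5% of p-values still fall below 0.05.

```r
# Simulate 1000 t-tests where there is no true difference between groups:
set.seed(1)
pvals <- replicate(1000, t.test(rnorm(10), rnorm(10))$p.value)
mean(pvals < 0.05)  # close to 0.05, i.e. roughly 50 false positives
```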

# Adjusting for multiple tests
## Adjusting for multiple tests

When performing many statistical tests to categorise features, we are
effectively classifying features as "non-significant" or "significant", the latter meaning those for
@@ -996,7 +996,7 @@ make falls into four categories:
little data, we can't detect large differences. However, both can be
argued to be "true".

| | Label as different | Label as not different |
| True outcome| Label as different | Label as not different |
|--------------------:|-------------------:|-----------------------:|
| Truly different | True positive | False negative |
| Truly not different | False positive | True negative |
Expand Down Expand Up @@ -1062,7 +1062,7 @@ experiment over and over.
| \- Very conservative | \- Does not control probability of making errors |
| \- Requires larger statistical power | \- May result in false discoveries |
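In R, both kinds of correction are available through `p.adjust()`. The p-values below are made-up illustrative numbers, not results from the lesson's data:

```r
# Sketch: Bonferroni vs Benjamini-Hochberg adjustment of five p-values.
p <- c(0.0001, 0.001, 0.01, 0.04, 0.2)
p.adjust(p, method = "bonferroni")  # multiply by number of tests, capped at 1
p.adjust(p, method = "BH")          # controls the false discovery rate
```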

> ## Challenge 6
> ### Challenge 6
>
> 1. At a significance level of 0.05, with 100 tests performed, what is
> the Bonferroni significance threshold?
@@ -1075,7 +1075,7 @@ experiment over and over.
> Compare these values to the raw p-values and the Bonferroni
> p-values.
>
> > ## Solution
> > ### Solution
> >
> > 1. The Bonferroni threshold for this significance threshold is $$
> > \frac{0.05}{100} = 0.0005
@@ -1109,7 +1109,7 @@ experiment over and over.
> ## Feature selection
> ### Feature selection
>
> In this episode, we have focussed on regression in a setting where there are more
> features than observations. This approach is relevant if we are interested in the