diff --git a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd index 2ab29dfc..6842fbb1 100644 --- a/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd +++ b/_episodes_rmd/01-introduction-to-high-dimensional-data.Rmd @@ -36,7 +36,7 @@ knitr_fig_path("01-") ``` -# What are high-dimensional data? +## What are high-dimensional data? *High-dimensional data* are defined as data with many features (variables observed). In recent years, advances in information technology have allowed large amounts of data to @@ -48,8 +48,7 @@ blood test results, behaviours, and general health. An example of what high-dime in a biomedical study is shown in the figure below. - -```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many features in individual columns relating to health data such as blood pressure, heart rate and respiratory rate. Each row contains the data for individual patients."} +```{r table-intro, echo = FALSE, fig.cap = "Example of a high-dimensional data table with features in the columns and individual observations (patients) in rows.", fig.alt = "Table displaying a high-dimensional data set with many columns representing features related to health, such as blood pressure, heart rate and respiratory rate. Each row contains the data for an individual patient. This type of high-dimensional data could contain hundreds or thousands of columns (features/variables) and thousands or even millions of rows (observations/samples/patients)."} knitr::include_graphics("../fig/intro-table.png") ``` @@ -65,7 +64,7 @@ for practical high-dimensional data analysis in the biological sciences. -> ## Challenge 1 +> ### Challenge 1 > > Descriptions of four research questions and their datasets are given below. > Which of these scenarios use high-dimensional data? @@ -84,7 +83,7 @@ for practical high-dimensional data analysis in the biological sciences. > (age, weight, BMI, blood pressure) and cancer growth (tumour size, > localised spread, blood test results). > -> > ## Solution +> > ### Solution > > > > 1. No. The number of features is relatively small (4 including the response variable since this is an observed variable). > > 2. Yes, this is an example of high-dimensional data. There are 200,004 features. @@ -98,7 +97,7 @@ Now that we have an idea of what high-dimensional data look like we can think about the challenges we face in analysing them. -# Why is dealing with high-dimensional data challenging? +## Why is dealing with high-dimensional data challenging? Most classical statistical methods are set up for use on low-dimensional data (i.e. with a small number of features, $p$). @@ -118,7 +117,7 @@ of the challenges we are facing when working with high-dimensional data. For ref the lesson are described in the [data page](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html). -> ## Challenge 2 +> ### Challenge 2 > > For illustrative purposes, we start with a simple dataset that is not technically > high-dimensional but contains many features. This will illustrate the general problems @@ -139,7 +138,7 @@ the lesson are described in the [data page](https://carpentries-incubator.github > 3. Plot the relationship between the variables (hint: see the `pairs()` function). 
What problem(s) with > high-dimensional data analysis does this illustrate? > -> > ## Solution +> > ### Solution > > > > > > ```{r dim-prostate, eval = FALSE} @@ -171,7 +170,7 @@ Note that function documentation and information on function arguments will be u this lesson. We can access these easily in R by running `?` followed by the package name. For example, the documentation for the `dim` function can be accessed by running `?dim`. -> ## Locating data with R - the **`here`** package +> ### Locating data with R - the **`here`** package > > It is often desirable to access external datasets from inside R and to write > code that does this reliably on different computers. While R has an inbulit @@ -213,7 +212,7 @@ in these datasets makes high correlations between variables more likely. Let's explore why high correlations might be an issue in a Challenge. -> ## Challenge 3 +> ### Challenge 3 > > Use the `cor()` function to examine correlations between all variables in the > `prostate` dataset. Are some pairs of variables highly correlated using a threshold of @@ -227,7 +226,7 @@ explore why high correlations might be an issue in a Challenge. > Fit a multiple linear regression model predicting patient age using both > variables. What happened? > -> > ## Solution +> > ### Solution > > > > Create a correlation matrix of all variables in the `prostate` dataset > > @@ -289,7 +288,7 @@ regularisation, which we will discuss in the lesson on high-dimensional regression. -# What statistical methods are used to analyse high-dimensional data? +## What statistical methods are used to analyse high-dimensional data? We have discussed so far that high-dimensional data analysis can be challenging since variables are difficult to visualise, leading to challenges identifying relationships between variables and suitable response variables; we may have @@ -336,7 +335,7 @@ through clustering cells with similar gene expression patterns. The *K-means* episode will explore a specific method to perform clustering analysis. -> ## Using Bioconductor to access high-dimensional data in the biosciences +> ### Using Bioconductor to access high-dimensional data in the biosciences > > In this workshop, we will look at statistical methods that can be used to > visualise and analyse high-dimensional biological data using packages available @@ -392,7 +391,7 @@ episode will explore a specific method to perform clustering analysis. > common challenge in analysing high-dimensional genomics data. {: .callout} -# Further reading +## Further reading - Buhlman, P. & van de Geer, S. (2011) Statistics for High-Dimensional Data. Springer, London. - [Buhlman, P., Kalisch, M. & Meier, L. (2014) High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application](https://doi.org/10.1146/annurev-statistics-022513-115545). @@ -406,7 +405,7 @@ methods that could be used to analyse high-dimensional data. See Some related (an important!) content is also available in [Responsible machine learning](https://carpentries-incubator.github.io/machine-learning-responsible-python/). -# Other resources suggested by former students +## Other resources suggested by former students - [Josh Starmer's](https://www.youtube.com/c/joshstarmer) youtube channel. 
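
The correlation problem explored in Challenge 3 of this episode can also be reproduced without downloading the `prostate` data. The sketch below is illustrative only — simulated data and base R, not part of the lesson files — and shows how two nearly identical predictors make their individual coefficient estimates and standard errors unstable when both enter a linear model, even though either predictor alone recovers the true effect.

```r
## Illustrative sketch with simulated data (not the lesson's prostate dataset):
## two nearly collinear predictors make individual coefficient estimates
## unstable, even though either predictor on its own predicts y well.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # x2 carries almost the same information as x1
y  <- 2 * x1 + rnorm(n)

fit_both <- lm(y ~ x1 + x2)
fit_one  <- lm(y ~ x1)

coef(fit_both)                                   # unstable estimates for x1 and x2 (true slope is 2)
coef(fit_one)                                    # a single predictor recovers a slope near 2
summary(fit_both)$coefficients[, "Std. Error"]   # standard errors are greatly inflated...
summary(fit_one)$coefficients[, "Std. Error"]    # ...compared with the single-predictor fit
```

If `x2` were an exact copy of `x1`, `lm()` would instead drop one coefficient and report it as `NA` because of a singular design matrix — the situation revisited at the start of the regularisation episode.
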
diff --git a/_episodes_rmd/02-high-dimensional-regression.Rmd b/_episodes_rmd/02-high-dimensional-regression.Rmd index 579a662e..9f9eb0c5 100644 --- a/_episodes_rmd/02-high-dimensional-regression.Rmd +++ b/_episodes_rmd/02-high-dimensional-regression.Rmd @@ -35,7 +35,7 @@ source(here("bin/chunk-options.R")) knitr_fig_path("02-") ``` -# DNA methylation data +## DNA methylation data For the following few episodes, we will be working with human DNA methylation data from flow-sorted blood samples, described in [data](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html). DNA methylation assays @@ -138,12 +138,12 @@ Heatmap(methyl_mat_ord, top_annotation = columnAnnotation(age = age_ord)) ``` -> ## Challenge 1 +> ### Challenge 1 > > Why can we not just fit many linear regression models relating every combination of feature > (`colData` and assays) and draw conclusions by associating all variables with significant model p-values? > -> > ## Solution +> > ### Solution > > > > There are a number of problems that this kind of approach presents. > > For example: @@ -178,7 +178,7 @@ have a single outcome (age) which will be predicted using 5000 covariates The examples in this episode will focus on the first type of problem, whilst the next episode will focus on the second. -> ## Measuring DNA Methylation +> ### Measuring DNA Methylation > > DNA methylation is an epigenetic modification of DNA. Generally, we > are interested in the proportion of methylation at many sites or @@ -214,7 +214,7 @@ the next episode will focus on the second. > therefore can be easier to work with in statistical models. {: .callout} -# Regression with many outcomes +## Regression with many outcomes In high-throughput studies, it is common to have one or more phenotypes or groupings that we want to relate to features of interest (eg, gene @@ -299,7 +299,7 @@ And, of course, we often have an awful lot of features and need to prioritise a subset of them! We need a rigorous way to prioritise genes for further analysis. -# Fitting a linear model +## Fitting a linear model So, in the data we have read in, we have a matrix of methylation values $X$ and a vector of ages, $y$. One way to model this is to see if we can @@ -342,7 +342,7 @@ outlined previously. Before we introduce this approach, let's go into detail about how we generally check whether the results of a linear model are statistically significant. -# Hypothesis testing in linear regression +## Hypothesis testing in linear regression Using the linear model we defined above, we can ask questions based on the estimated value for the regression coefficients. For example, do individuals @@ -477,7 +477,7 @@ we're estimating and the uncertainty we have in that effect. A large effect with uncertainty may not lead to a small p-value, and a small effect with small uncertainty may lead to a small p-value. -> ## Calculating p-values from a linear model +> ### Calculating p-values from a linear model > > Manually calculating the p-value for a linear model is a little bit > more complex than calculating the t-statistic. The intuition posted @@ -509,12 +509,12 @@ small uncertainty may lead to a small p-value. > ``` {: .callout} -> ## Challenge 2 +> ### Challenge 2 > > In the model we fitted, the estimate for the intercept is 0.902 and its associated > p-value is 0.0129. What does this mean? 
> -> > ## Solution +> > ### Solution > > > > The first coefficient in a linear model like this is the intercept, which measures > > the mean of the outcome (in this case, the methylation value for the first CpG) @@ -535,7 +535,7 @@ small uncertainty may lead to a small p-value. > {: .solution} {: .challenge} -# Fitting a lot of linear models +## Fitting a lot of linear models In the linear model above, we are generally interested in the second regression coefficient (often referred to as *slope*) which measures the linear relationship @@ -557,7 +557,7 @@ efficient, and it would also be laborious to do programmatically. There are ways to get around this, but first let us talk about what exactly we are doing when we look at significance tests in this context. -# Sharing information across outcome variables +## Sharing information across outcome variables We are going to introduce an idea that allows us to take advantage of the fact that we carry out many tests at once on @@ -609,7 +609,7 @@ may have seen when running linear models. Here, we define a *model matrix* or coefficients that should be fit in each linear model. These are used in similar ways in many different modelling libraries. -> ## What is a model matrix? +> ### What is a model matrix? > R fits a regression model by choosing the vector of regression coefficients > that minimises the differences between outcome values and predicted values > using the covariates (or predictor variables). To get predicted values, @@ -715,12 +715,12 @@ continuous measures like these, it is often convenient to obtain a list of features which we are confident have non-zero effect sizes. This is made more difficult by the number of tests we perform. -> ## Challenge 3 +> ### Challenge 3 > > The effect size estimates are very small, and yet many of the p-values > are well below a usual significance level of p \< 0.05. Why is this? > -> > ## Solution +> > ### Solution > > > > Because age has a much larger range than methylation levels, the > > unit change in methylation level even for a strong relationship is @@ -799,7 +799,7 @@ understand, but it is useful to develop an intuition about why these approaches precise and sensitive than the naive approach of fitting a model to each feature separately. -> ## Challenge 4 +> ### Challenge 4 > > 1. Try to run the same kind of linear model with smoking status as > covariate instead of age, and making a volcano plot. *Note: @@ -807,7 +807,7 @@ feature separately. > 2. We saw in the example in the lesson that this information sharing > can lead to larger p-values. Why might this be preferable? > -> > ## Solution +> > ### Solution > > > > 1. The following code runs the same type of model with smoking > > status: @@ -859,7 +859,7 @@ feature separately. {: .challenge} ``` -> ## Shrinkage +> ### Shrinkage > > Shrinkage is an intuitive term for an effect of information sharing, > and is something observed in a broad range of statistical models. @@ -902,7 +902,7 @@ feature separately. # todo: callout box explaining DESeq2 ``` -# The problem of multiple tests +## The problem of multiple tests With such a large number of features, it would be useful to decide which features are "interesting" or "significant" for further study. However, @@ -943,7 +943,7 @@ threshold in a real experiment, it is likely that we would identify many features as associated with age, when the results we are observing are simply due to chance. -> ## Challenge 5 +> ### Challenge 5 > > 1. 
If we run `r nrow(methylation)` tests, even if there are no true differences, > how many of them (on average) will be statistically significant at @@ -955,7 +955,7 @@ simply due to chance. > 3. How could we account for a varying number of tests to ensure > "significant" changes are truly different? > -> > ## Solution +> > ### Solution > > > > 1. By default we expect > > $`r nrow(methylation)` \times 0.05 = `r nrow(methylation) * 0.05`$ @@ -974,7 +974,7 @@ simply due to chance. > {: .solution} {: .challenge} -# Adjusting for multiple tests +## Adjusting for multiple tests When performing many statistical tests to categorise features, we are effectively classifying features as "non-significant" or "significant", that latter meaning those for @@ -996,7 +996,7 @@ make falls into four categories: little data, we can't detect large differences. However, both can be argued to be "true". -| | Label as different | Label as not different | +| True outcome| Label as different | Label as not different | |--------------------:|-------------------:|-----------------------:| | Truly different | True positive | False negative | | Truly not different | False positive | True negative | @@ -1062,7 +1062,7 @@ experiment over and over. | \- Very conservative | \- Does not control probability of making errors | | \- Requires larger statistical power | \- May result in false discoveries | -> ## Challenge 6 +> ### Challenge 6 > > 1. At a significance level of 0.05, with 100 tests performed, what is > the Bonferroni significance threshold? @@ -1075,7 +1075,7 @@ experiment over and over. > Compare these values to the raw p-values and the Bonferroni > p-values. > -> > ## Solution +> > ### Solution > > > > 1. The Bonferroni threshold for this significance threshold is $$ > > \frac{0.05}{100} = 0.0005 @@ -1109,7 +1109,7 @@ experiment over and over. -> ## Feature selection +> ### Feature selection > > In this episode, we have focussed on regression in a setting where there are more > features than observations. This approach is relevant if we are interested in the diff --git a/_episodes_rmd/03-regression-regularisation.Rmd b/_episodes_rmd/03-regression-regularisation.Rmd index c4292650..e8d99f68 100644 --- a/_episodes_rmd/03-regression-regularisation.Rmd +++ b/_episodes_rmd/03-regression-regularisation.Rmd @@ -32,7 +32,7 @@ source(here("bin/chunk-options.R")) knitr_fig_path("03-") ``` -# Introduction +## Introduction This episode is about **regularisation**, also called **regularised regression** @@ -85,7 +85,7 @@ when naively fitting linear regression models with a large number of features (i often since the model cannot distinguish between the effects of many, correlated features or when we have more features than observations. -> ## Singularities +> ### Singularities > > The message that `lm()` produced is not necessarily the most intuitive. What > are "singularities" and why are they an issue? A singular matrix @@ -113,7 +113,7 @@ when we have more features than observations. > ``` {: .callout} -> ## Correlated features -- common in high-dimensional data +> ### Correlated features -- common in high-dimensional data > > In high-dimensional datasets, there > are often multiple features that contain redundant information (correlated features). @@ -142,7 +142,7 @@ when we have more features than observations. -> ## Challenge 1 +> ### Challenge 1 > > Consider or discuss in groups: > @@ -151,7 +151,7 @@ when we have more features than observations. > 1. 
Why might correlated features be a problem when fitting linear models? > 1. What issue might correlated features present when selecting features to include in a model one at a time? > -> > ## Solution +> > ### Solution > > > > 1. Many of the features in biological data represent very similar > > information biologically. For example, sets of genes that form complexes @@ -174,7 +174,7 @@ going on when we fit a linear model. -# Coefficient estimates of a linear model +## Coefficient estimates of a linear model When we fit a linear model, we're finding the line through our data that minimises the sum of the squared residuals. We can think of that as finding @@ -281,7 +281,7 @@ caution when estimating effect sizes (coefficients). That is, we might want to avoid estimating very large effect sizes. -> ## Challenge 2 +> ### Challenge 2 > > Discuss in groups: > @@ -289,7 +289,7 @@ avoid estimating very large effect sizes. > 2. Why are we minimising this objective? What assumptions are we making > about our data when we do so? > -> > ## Solution +> > ### Solution > > > > 1. When we fit a linear model we are minimising the squared error. > > In fact, the standard linear model estimator is often known as @@ -311,7 +311,7 @@ avoid estimating very large effect sizes. {: .challenge} -# Model selection using training and test sets +## Model selection using training and test sets Sets of models are often compared using statistics such as adjusted $R^2$, AIC or BIC. These show us how well the model is learning the data used in fitting that same model [^1]. @@ -404,12 +404,12 @@ err_lm_train The training error appears very low here -- on average we're only off by about a year! -> ## Challenge 3 +> ### Challenge 3 > > For the fitted model above, calculate the mean squared error for the test set. > > -> > ## Solution +> > ### Solution > > > > First, let's find the predicted values for the 'unseen' test data: > > ```{r modelpreds} @@ -444,7 +444,7 @@ good, the dots should follow a line. Regularisation can help us to make the model more generalisable, improving predictions for the test dataset (or any other dataset that is not used when fitting our model). -# Regularisation +## Regularisation Regularisation can be used to reduce correlation between predictors, the number of features, and improve generalisability by restricting model parameter estimates. Essentially, we @@ -497,7 +497,7 @@ points( par(mfrow = c(1, 1)) ``` -# Ridge regression +## Ridge regression There are multiple ways to define the distance that our solution must fall in. The one we've plotted above controls the squared sum of the @@ -600,14 +600,14 @@ abline(v = log(chosen_lambda), lty = "dashed") -> ## Challenge 4 +> ### Challenge 4 > > 1. Which performs better, ridge or OLS? > 2. Plot predicted ages for each method against the true ages. > How do the predictions look for both methods? Why might ridge be > performing better? > -> > ## Solution +> > ### Solution > > > > 1. Ridge regression performs significantly better on unseen data, despite > > being "worse" on the training data. @@ -635,7 +635,7 @@ abline(v = log(chosen_lambda), lty = "dashed") {: .challenge} -# LASSO regression +## LASSO regression *LASSO* is another type of regularisation. In this case we use the $L^1$ norm, or the sum of the absolute values of the coefficients. @@ -725,7 +725,7 @@ drawplot() ``` -> ## Challenge 5 +> ### Challenge 5 > > 1. Use `glmnet()` to fit a LASSO model (hint: set `alpha = 1`). > 2. Plot the model object. 
Remember that for ridge regression, @@ -733,7 +733,7 @@ drawplot() > relationship between the two plots? > 3. How do the coefficient paths differ to the ridge case? > -> > ## Solution +> > ### Solution > > > > 1. Fitting a LASSO model is very similar to a ridge model, we just need > > to change the `alpha` setting. @@ -759,7 +759,7 @@ drawplot() {: .challenge} -# Cross-validation to find the best value of $\lambda$ +## Cross-validation to find the best value of $\lambda$ Ideally, we want $\lambda$ to be large enough to reduce the complexity of our model, thus reducing the number of and correlations between the features, and improving generalisability. However, @@ -798,7 +798,7 @@ selected_coefs We can see that cross-validation has selected a value of $\lambda$ resulting in `r length(selected_coefs)-1` features and the intercept. -# Elastic nets: blending ridge regression and the LASSO +## Elastic nets: blending ridge regression and the LASSO So far, we've used ridge regression (where `alpha = 0`) and LASSO regression, (where `alpha = 1`). What if `alpha` is set to a value between zero and one? @@ -895,7 +895,7 @@ plot_elastic(0.5) plot_elastic(0.75) ``` -> ## Challenge 6 +> ### Challenge 6 > > 1. Fit an elastic net model (hint: alpha = 0.5) without cross-validation and plot the model > object. @@ -907,7 +907,7 @@ plot_elastic(0.75) > 4. Discuss: how could we pick an `alpha` in the range (0, 1)? Could we justify > choosing one *a priori*? > -> > ## Solution +> > ### Solution > > 1. Fitting an elastic net model is just like fitting a LASSO model. > > You can see that coefficients tend to go exactly to zero, > > but the paths are a bit less @@ -948,7 +948,7 @@ plot_elastic(0.75) {: .challenge} -> ## The bias-variance tradeoff +> ### The bias-variance tradeoff > > When we make predictions in statistics, there are two sources of error > that primarily influence the (in)accuracy of our predictions. these are *bias* @@ -992,7 +992,7 @@ plot_elastic(0.75) -> ## Other types of outcomes +> ### Other types of outcomes > > You may have noticed that `glmnet` is written as `glm`, not `lm`. > This means we can actually model a variety of different outcomes @@ -1030,7 +1030,7 @@ plot_elastic(0.75) {: .callout} -> ## tidymodels +> ### tidymodels > > A lot of the packages for fitting predictive models like regularised > regression have different user interfaces. To do predictive modelling, it's @@ -1107,14 +1107,14 @@ plot_elastic(0.75) > ``` {: .callout} -# Further reading +## Further reading - [An introduction to statistical learning](https://www.statlearning.com/). - [Elements of statistical learning](https://web.stanford.edu/~hastie/ElemStatLearn/). - [glmnet vignette](https://glmnet.stanford.edu/articles/glmnet.html). - [tidymodels](https://www.tidymodels.org/). -# Footnotes +## Footnotes [^1]: Model selection including $R^2$, AIC and BIC are covered in the additional feature selection for regression episode of this course. [^2]: [Epigenetic Predictor of Age, Bocklandt et al. (2011)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014821) diff --git a/_episodes_rmd/04-principal-component-analysis.Rmd b/_episodes_rmd/04-principal-component-analysis.Rmd index baec398a..0370683a 100644 --- a/_episodes_rmd/04-principal-component-analysis.Rmd +++ b/_episodes_rmd/04-principal-component-analysis.Rmd @@ -44,7 +44,7 @@ knitr_fig_path("05-") ``` -# Introduction +## Introduction If a dataset contains many variables ($p$), it is likely that some of these variables will be highly correlated. 
Variables may even be so highly correlated @@ -66,7 +66,7 @@ resulting principal component could also be used as an effect in further analysi -> ## Challenge 1 +> ### Challenge 1 > > Descriptions of three datasets and research questions are given below. For > which of these might PCA be considered a useful tool for analysing data so @@ -85,7 +85,7 @@ resulting principal component could also be used as an effect in further analysi > severity. > 3. Both of the above. > -> > ## Solution +> > ### Solution > > > > > > In the first case, a regression model would be more suitable; perhaps a @@ -98,7 +98,7 @@ resulting principal component could also be used as an effect in further analysi {: .challenge} -# Principal component analysis +## Principal component analysis PCA transforms a dataset of continuous variables into a new set of uncorrelated variables called "principal components". The first principal component derived explains the largest amount of the variability in the underlying dataset. The second principal component derived explains the second largest amount of variability in the dataset and so on. Once the dataset has been transformed into principal components, we can extract a subset of the principal components in order of the variance they explain (starting with the first principal component that by definition explains the most variability, and then the second), giving new variables that explain a lot of the variability in the original dataset. Thus, PCA helps us to produce a lower dimensional dataset while keeping most of the information in the original dataset. @@ -182,7 +182,7 @@ This is explained in more detail on [this Q&A website](https://stats.stackexchan knitr::include_graphics("../fig/pendulum.gif") ``` -> ## Mathematical description of PCA +> ### Mathematical description of PCA > Mathematically, each principal component is a linear combination > of the variables in the dataset. That is, the first principal > component values or _scores_, $Z_1$, are a linear combination @@ -197,7 +197,7 @@ In summary, the principal components' values are called _scores_. The loadings c be thought of as the degree to which each original variable contributes to the principal component scores. In this episode, we will see how to perform PCA to summarise the information in high-dimensional datasets. -# How do we perform a PCA? +## How do we perform a PCA? To illustrate how to perform PCA initially, we revisit the low-dimensional `prostate` dataset. The data come from a study which examined the @@ -242,7 +242,7 @@ pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) ``` -## Do we need to scale the data? +### Do we need to scale the data? PCA derives principal components based on the variance they explain in the data. Therefore, we may need to apply some pre-processing to scale variables in our dataset if we want to ensure that each variable is considered equally by the PCA. Scaling is essential if we want to avoid the PCA ignoring variables that may be important to our analysis just because they take low values and have low variance. We do not need to scale if we want variables with low variance to carry less weight in the PCA. @@ -263,7 +263,7 @@ standardise all five variables to have a mean of 0 and a standard deviation of 1. -> ## Challenge 2 +> ### Challenge 2 > > > 1. Why might it be necessary to standardise variables before performing a PCA? @@ -279,7 +279,7 @@ deviation of 1. > variables? > > -> > ## Solution +> > ### Solution > > > > 1. 
Scaling the data isn't guaranteed to make the results more interesting. > > It also won't affect how quickly the output will be calculated, whether @@ -297,7 +297,7 @@ deviation of 1. > {: .solution} {: .challenge} -## Performing PCA +### Performing PCA Throughout this episode, we will carry out PCA using the Bioconductor package **`PCAtools`**. This package provides several functions that are useful for exploring data via PCA and @@ -337,7 +337,7 @@ pca.pros <- pca(pros2t, scale = TRUE, center = TRUE, metadata = pmetadata) summary(pca.pros) ``` -# How many principal components do we need? +## How many principal components do we need? We have calculated one principal component for each variable in the original dataset. How do we choose how many of these are necessary to represent the true @@ -415,7 +415,7 @@ The principal component scores are obtained by carrying out matrix multiplicatio (usually centered and scaled) original data times the loadings. The following callout demonstrates this. -> ## Computing a PCA "by hand" +> ### Computing a PCA "by hand" > The rotation matrix obtained in a PCA is identical to the eigenvectors > of the covariance matrix of the data. Multiplying these with the (centered and scaled) > data yields the principal component scores: @@ -470,7 +470,7 @@ loadings are scaled by the standard deviation of the principal components (`pca.pros$sdev`) times the square root the number of observations. -# Advantages and disadvantages of PCA +## Advantages and disadvantages of PCA Advantages: * It is a relatively easy to use and popular method. @@ -492,7 +492,7 @@ Disadvantages: especially when including them in further analyses (e.g. inclusion in a linear regression). -# Using PCA to analyse gene expression data +## Using PCA to analyse gene expression data The [dataset](https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html) @@ -564,7 +564,7 @@ against each other as well as understanding what each principal component represents. -> ## Challenge 3 +> ### Challenge 3 > > Apply a PCA to the cancer gene expression data using the `pca()` function from > **`PCAtools`**. @@ -576,7 +576,7 @@ represents. > As in the example using `prostate` data above, examine the first 5 rows and > columns of rotated data and loadings from your PCA. > -> > ## Solution +> > ### Solution > > > > ```{r pca-ex} > > pc <- pca(mat, metadata = metadata) # implement a PCA @@ -618,7 +618,7 @@ represents. > {: .solution} {: .challenge} -> ## Scaling variables for PCA +> ### Scaling variables for PCA > > When running `pca()` above, we kept the default setting, `scale=FALSE`. That means genes with higher variation in > their expression levels should have higher loadings, which is what we are interested in. @@ -640,14 +640,14 @@ components and whether these principal components represent a reasonable amount of the variation. The proportion of variance explained should sum to one. -> ## Challenge 4 +> ### Challenge 4 > > This time using the `screeplot()` function in **`PCAtools`**, create a scree plot to show > proportion of variance explained by the first 20 principal component (hint: `components = 1:20`). > Explain the output of the scree plot in terms of proportion of the variance in the data explained > by each principal component. > -> > ## Solution +> > ### Solution > > > > ```{r scree-ex, fig.cap="A scree plot of the gene expression data.", fig.alt="A bar and line plot showing the variance explained by principal components (PCs) of gene expression data. 
Blue bars depict the variance explained by each PC, while a red line depicts the cumulative variance explained by the PCs. The first principal component explains roughly 30% of the variance, while succeeding PCs explain less than 10%."} > > pc <- pca(mat, metadata = metadata) @@ -698,7 +698,7 @@ are two functions called `biplot()`, one in the package **`PCAtools`** and one i **`stats`**. Both functions produce biplots but their scales are different! -> ## Challenge 5 +> ### Challenge 5 > > Create a biplot of the first two principal components from your PCA > using `biplot()` function in **`PCAtools`**. See `help("PCAtools::biplot")` for @@ -706,7 +706,7 @@ are two functions called `biplot()`, one in the package **`PCAtools`** and one i > > Examine whether the data appear to form clusters. Explain your results. > -> > ## Solution +> > ### Solution > > > > ```{r biplot-ex, fig.cap="A biplot of PC2 against PC1 in the gene expression data, coloured by Grade.", fig.alt="A biplot of PC2 against PC1 in the gene expression data, coloured by Grade. The points on the scatter plot separate clearly on PC1, but there is no clear grouping of samples based on Grade across these two groups."} > > PCAtools::biplot(pc, lab = NULL, colby = 'Grade', legendPosition = 'top') @@ -725,7 +725,7 @@ are two functions called `biplot()`, one in the package **`PCAtools`** and one i Let's consider this biplot in more detail, and also display the loadings: -> ## Challenge 6 +> ### Challenge 6 > > Use `colby` and `lab` arguments in `biplot()` to explore whether these two > groups may cluster by patient age or by whether or not the sample expresses @@ -735,7 +735,7 @@ Let's consider this biplot in more detail, and also display the loadings: > labels but little space for plotting. This is not usually a serious problem - > not all labels will be shown. > -> > ## Solution +> > ### Solution > > > > ```{r pca-biplot-ex2, fig.cap="A biplot of PC2 against PC1 in the gene expression data, coloured by ER status with patient ages overlaid.", fig.alt="A biplot of PC2 against PC1 in the gene expression data, coloured by ER status. The points on the scatter plot separate clearly on PC1, but there is no clear grouping of samples based on ER across these two groups, although there are more ER- samples in the rightmost cluster. Patient ages are overlaid as text near the points, but there is again no clear pattern."} > > PCAtools::biplot(pc, @@ -751,7 +751,7 @@ So far, we have only looked at a biplot of PC1 versus PC2 which only gives part of the picture. The `pairplots()` function in **`PCAtools`** can be used to create multiple biplots including different principal components. -```{r pairsplot, fig.cap="Multiple biplots produced by pairsplot().", fig.alt="An upper triangular grid of scatter plots of each principal component versus the others."} +```{r pairsplot, fig.cap="Multiple biplots produced by pairsplot().", fig.alt="A triangular grid of scatter plots. The grid is the upper right triangle of a square, where each entry of the grid corresponds to a plot of one principal component against another. For example, the plot in the upper left corner of the plot corresponds to principal component one plotted against principal component 2, and the plot to the right of this plots principal component 1 against principal component 3. Points correspond to samples, and are coloured arbitrarily from light blue to dark blue."} pairsplot(pc) ``` @@ -782,7 +782,7 @@ extreme loadings. 
This is because all loadings selected for any principal compon are shown for all other principal components. For instance, it is plausible that features which have high loadings on PC1 may have lower ones on PC2. -# Using PCA output in further analysis +## Using PCA output in further analysis The output of PCA can be used to interpret data or can be used in further analyses. For example, the PCA outputs new variables (principal components) @@ -793,7 +793,7 @@ regressions. Because the principal components are uncorrelated (and independent) they can be included together in a single linear regression. -> ## Principal component regression +> ### Principal component regression > > PCA is often used to reduce large numbers of correlated variables into fewer > uncorrelated variables that can then be included in linear regression or @@ -813,7 +813,7 @@ they can be included together in a single linear regression. {: .callout} -# Further reading +## Further reading - James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R. Chapter 6.3 (Dimension Reduction Methods), Chapter 10 (Unsupervised Learning). diff --git a/_episodes_rmd/05-factor-analysis.Rmd b/_episodes_rmd/05-factor-analysis.Rmd index 8b103375..388de471 100644 --- a/_episodes_rmd/05-factor-analysis.Rmd +++ b/_episodes_rmd/05-factor-analysis.Rmd @@ -27,7 +27,7 @@ source(here("bin/chunk-options.R")) knitr_fig_path("06-") ``` -# Introduction +## Introduction Biologists often encounter high-dimensional datasets from which they wish to extract underlying features – they need to carry out dimensionality @@ -57,7 +57,7 @@ exploratory data analysis methods (including PCA) to provide an initial estimate of how many factors adequately explain the variation observed in a dataset. In practice, a range of different values is usually tested. -## Motivating example: student scores +### Motivating example: student scores One scenario for using FA would be whether student scores in different subjects can be summarised by certain subject categories. Take a look at the hypothetical @@ -78,7 +78,7 @@ component representing a different linear combination of features. The principal components are ordered by the amount of variance they account for. -# Prostate cancer patient data +## Prostate cancer patient data We revisit the [`prostate`]((https://carpentries-incubator.github.io/high-dimensional-stats-r/data/index.html) dataset of 97 men who have prostate cancer. @@ -109,7 +109,7 @@ pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) ``` -# Performing exploratory factor analysis +## Performing exploratory factor analysis EFA may be implemented in R using the `factanal()` function from the **`stats`** package (which is a built-in package in base R). This @@ -118,12 +118,12 @@ data matrix as input. The number of factors to be fitted in the analysis is specified by the user using the `factors` argument. -> ## Challenge 1 (3 mins) +> ### Challenge 1 (3 mins) > > Use the `factanal()` function to identify the minimum number of factors > necessary to explain most of the variation in the data > -> > ## Solution +> > ### Solution > > > > ```{r ex1} > > # Include one factor only @@ -157,7 +157,7 @@ factors, while negative values show a negative relationship between variables and factors. Loading values are missing for some variables because R does not print loadings less than 0.1. -# How many factors do we need? +## How many factors do we need? 
There are numerous ways to select the “best” number of factors. One is to use the minimum number of features that does not leave a significant amount of @@ -179,7 +179,7 @@ Like PCA, the fewer factors that can explain most of the variation in the dataset, the better. It is easier to explore and interpret results using a smaller number of factors which represent underlying features in the data. -# Variance accounted for by factors - communality and uniqueness +## Variance accounted for by factors - communality and uniqueness The *communality* of a variable is the sum of its squared loadings. It represents the proportion of the variance in a variable that is accounted @@ -196,7 +196,7 @@ apply(pros_fa$loadings^2, 1, sum) #communality 1 - apply(pros_fa$loadings^2, 1, sum) #uniqueness ``` -# Visualising the contribution of each variable to the factors +## Visualising the contribution of each variable to the factors Similar to a biplot as we produced in the PCA episode, we can “plot the loadings”. This shows how each original variable contributes to each of the factors we chose to visualise. @@ -227,7 +227,7 @@ text( ``` -> ## Challenge 2 (3 mins) +> ### Challenge 2 (3 mins) > > Use the output from your factor analysis and the plots above to interpret > the results of your analysis. @@ -235,7 +235,7 @@ text( > What variables are most important in explaining each factor? Do you think > this makes sense biologically? Consider or discuss in groups. > -> > ## Solution +> > ### Solution > > > > This plot suggests that the variables `lweight` and `lbph` are associated with > > high values on factor 2 (but lower values on factor 1) and the variables @@ -251,7 +251,7 @@ text( > {: .solution} {: .challenge} -# Advantages and disadvantages of Factor Analysis +## Advantages and disadvantages of Factor Analysis There are several advantages and disadvantages of using FA as a dimensionality reduction method. @@ -279,7 +279,7 @@ Disadvantages: -# Further reading +## Further reading - Gundogdu et al. (2019) Comparison of performances of Principal Component Analysis (PCA) and Factor Analysis (FA) methods on the identification of cancerous and healthy colon tissues. International Journal of Mass Spectrometry 445:116204. - Kustra et al. (2006) A factor analysis model for functional genomics. BMC Bioinformatics 7: doi:10.1186/1471-2105-7-21. diff --git a/_episodes_rmd/06-k-means.Rmd b/_episodes_rmd/06-k-means.Rmd index c4d036f0..dd75e6a5 100644 --- a/_episodes_rmd/06-k-means.Rmd +++ b/_episodes_rmd/06-k-means.Rmd @@ -30,7 +30,7 @@ source(here("bin/chunk-options.R")) knitr_fig_path("08-") ``` -# Introduction +## Introduction As we saw in previous episodes, visualising high-dimensional data with a large amount of features is difficult and can @@ -63,7 +63,7 @@ proposed clusters. Using this process, we can also iteratively update clusters so that we become more confident about the shape and size of the clusters. -# What is K-means clustering? +## What is K-means clustering? **K-means clustering** groups data points into a user-defined number of distinct, non-overlapping clusters. @@ -93,7 +93,7 @@ to understand), it does have some disadvantages, particularly difficulties in id initial clusters which observations belong to and the need for the user to specify the number of clusters that the data should be partitioned into. 
-> ## Initialisation +> ### Initialisation > > The algorithm used in K-means clustering finds a *local* rather than a > *global* optimum, so that results of clustering are dependent on the initial @@ -113,7 +113,7 @@ number of clusters that the data should be partitioned into. {: .callout} -# Believing in clusters +## Believing in clusters When using clustering, it's important to realise that data may seem to group together even when these groups are created randomly. It's especially @@ -158,7 +158,7 @@ that the clusters are 'real'. But how do we tell if clusters identified visually are 'real'? -> ## Initialisation +> ### Initialisation > > The algorithm used in K-means clustering finds a *local* rather than a > *global* optimum, so that results of clustering are dependent on the initial @@ -177,7 +177,7 @@ visually are 'real'? > {: .callout} -# K-means clustering applied to the single-cell RNA sequencing data +## K-means clustering applied to the single-cell RNA sequencing data Let's carry out K-means clustering in `R` using some real high-dimensional data. We're going to work with single-cell RNA sequencing data, @@ -226,13 +226,13 @@ We can see that this produces a sensible-looking partition of the data. However, is it totally clear whether there might be more or fewer clusters here? -> ## Challenge 1 +> ### Challenge 1 > > Perform clustering to group the data into $k=5$ clusters, and plot it using `plotReducedDim()`. > Save this with a variable name that's different to what we just used, > because we'll use this again later. > -> > ## Solution +> > ### Solution > > > > ```{r kmeans-ex, fig.cap="Scatter plot of principal component 2 against principal component 1 in the scRNAseq data, coloured by clusters produced by k-means clustering.", fig.alt="A scatter plot of principal component 1 versus principal component 2 of the scrnaseq data. Each point is one of five colours, representing cluster membership. Points of the same colour appear in the same areas of the plot, showing five distinct clusters in the data."} > > set.seed(42) @@ -244,7 +244,7 @@ here? > {: .solution} {: .challenge} -> ## K-medoids (PAM) +> ### K-medoids (PAM) > > One problem with K-means is that using the mean to define cluster centroids > means that clusters can be very sensitive to outlying observations. @@ -270,7 +270,7 @@ here? {: .callout} -# Cluster separation +## Cluster separation When performing clustering, it is important for us to be able to measure how well our clusters are separated. One measure to test this is silhouette width, which measures how similar a data point is to points in the same cluster compared @@ -280,7 +280,7 @@ within the same cluster. For each cluster, the silhouette widths can then be averaged or an overall average can be taken. -> ## More detail on silhouette widths +> ### More detail on silhouette widths > In more detail, each observation’s silhouette width is computed as follows: > 1. Compute the average distance between the focal observation and all other > observations in the same cluster. @@ -337,14 +337,14 @@ This makes sense, as points that are "between" two clusters may be more similar to points in another cluster than they are to the points in the cluster one they belong to. -> ## Challenge 2 +> ### Challenge 2 > > Calculate the silhouette width for the k of 5 clustering we did earlier. > Are 5 clusters appropriate? Why/why not? > > Can you identify where the differences lie? 
> -> > ## Solution +> > ### Solution > > > > ```{r silhouette-ex, fig.cap="Scatter plot of principal component 2 against principal component 1 in the scRNAseq data, coloured by clusters produced by k-means clustering.", fig.alt="A scatter plot of principal component 1 versus principal component 2 of the scrnaseq data. Each point is one of five colours, representing cluster membership. Points of the same colour appear in the same areas of the plot, showing five distinct clusters in the data."} > > sil5 <- silhouette(cluster5$cluster, dist = dist_mat) @@ -365,7 +365,7 @@ belong to. -> ## Gap statistic +> ### Gap statistic > > Another measure of how good our clustering is is the "gap statistic". > This compares the observed squared distance between observations in a cluster @@ -386,7 +386,7 @@ belong to. {: .callout} -# Cluster robustness: bootstrapping +## Cluster robustness: bootstrapping When we cluster data, we want to be sure that the clusters we identify are not a result of the exact properties of the input data. That is, we want @@ -485,12 +485,12 @@ pheatmap(ratios, Yellow boxes indicate values slightly greater than 1, which may be observed. These are “good” (despite missing in the colour bar). -> ## Challenge 3 +> ### Challenge 3 > > Repeat the bootstrapping process with k=5. Do the results appear better or worse? > Can you identify where the differences occur on the `plotReducedDim()`? > -> > ## Solution +> > ### Solution > > > > ```{r bs-ex, fig.cap="Grid of empirical cluster swapping behaviour estimated by the bootstrap samples.", fig.alt="Grid of 25 squares labelled 1-5 on each of the x and y axes. The diagonal and off-diagonal squares of the grid are coloured in green, indicating the highest scoring value of 1 according to the legend, with the exception of the square corresponding to (4, 5), which is slightly darker green indicating a lower value. The lower triangular squares are coloured in grey, indicating NA values since these would be the same as the upper triangular squares."} > > km_fun5 <- function(x) { @@ -516,7 +516,7 @@ These are “good” (despite missing in the colour bar). {: .challenge} -> ## Consensus clustering +> ### Consensus clustering > > One useful and generic method of clustering is *consensus clustering*. > This method can use k-means or other clustering methods. @@ -533,7 +533,7 @@ These are “good” (despite missing in the colour bar). {: .callout} -> ## Speed +> ### Speed > > It's worth noting that a lot of the methods we've discussed here are very > computationally demanding. diff --git a/_episodes_rmd/07-hierarchical.Rmd b/_episodes_rmd/07-hierarchical.Rmd index d8e60689..7a404102 100644 --- a/_episodes_rmd/07-hierarchical.Rmd +++ b/_episodes_rmd/07-hierarchical.Rmd @@ -36,7 +36,7 @@ knitr_fig_path("09-") ``` -# Why use hierarchical clustering on high-dimensional data? +## Why use hierarchical clustering on high-dimensional data? When analysing high-dimensional data in the life sciences, it is often useful to identify groups of similar data points to understand more about the relationships @@ -64,7 +64,7 @@ In this episode we will explore hierarchical clustering for identifying clusters in high-dimensional data. We will use *agglomerative* hierarchical clustering (see box) in this episode. -> ## Agglomerative and Divisive hierarchical clustering +> ### Agglomerative and Divisive hierarchical clustering > > There are two main methods of carrying out hierarchical clustering: > agglomerative clustering and divisive clustering. 
@@ -78,7 +78,7 @@ clustering (see box) in this episode. {: .callout} -# The agglomerative hierarchical clustering algorithm +## The agglomerative hierarchical clustering algorithm To start with, we measure distance (or dissimilarity) between pairs of observations. Initially, and at the bottom @@ -99,7 +99,7 @@ knitr::include_graphics("../fig/hierarchical_clustering_1.png") knitr::include_graphics("../fig/hierarchical_clustering_2.png") ``` -# A motivating example +## A motivating example To motivate this lesson, let's first look at an example where hierarchical clustering is really useful, and then we can understand how to apply it in more @@ -164,7 +164,7 @@ cause of this is -- it could be a batch effect, or a known grouping (e.g., old vs young samples). However, clustering like this can be a useful part of exploratory analysis of data to build hypotheses. -# Hierarchical clustering +## Hierarchical clustering Now, let's cover the inner workings of hierarchical clustering in more detail. Hierarchical clustering is a type of clustering that also allows us to estimate the number @@ -172,7 +172,7 @@ of clusters. There are two things to consider before carrying out clustering: * how to define dissimilarity between observations using a distance matrix, and * how to define dissimilarity between clusters and when to fuse separate clusters. -# Defining the dissimilarity between observations: creating the distance matrix +## Defining the dissimilarity between observations: creating the distance matrix Agglomerative hierarchical clustering is performed in two steps: calculating the distance matrix (containing distances between pairs of observations) and iteratively grouping observations into clusters using this matrix. @@ -195,7 +195,7 @@ clustering can have a big effect on the resulting tree. The decision of which distance matrix to use before carrying out hierarchical clustering depends on the type of data and question to be addressed. -# Defining the dissimilarity between clusters: Linkage methods +## Defining the dissimilarity between clusters: Linkage methods The second step in performing hierarchical clustering after defining the distance matrix (or another function defining similarity between data points) @@ -221,7 +221,7 @@ it sets their dissimilarity to the maximum dissimilarity value observed between any of these clusters' constituent points. The two clusters with smallest dissimilarity value are then fused. -# Computing a dendrogram +## Computing a dendrogram Dendrograms are useful tools that plot the grouping of points and clusters into bigger clusters. We can create and plot dendrograms in R using `hclust()` which takes @@ -258,14 +258,14 @@ dist_m <- dist(example_data, method = "euclidean") Hierarchical clustering carried out on the data can be used to produce a dendrogram showing how the data is partitioned into clusters. But how do we interpret this dendrogram? Let's explore this using our example data in a Challenge. -> ## Challenge 1 +> ### Challenge 1 > > Use `hclust()` to implement hierarchical clustering using the > distance matrix `dist_m` and > the `complete` linkage method and plot the results as a dendrogram using > `plot()`. Why is hierarchical clustering and the resulting dendrogram useful for performing clustering this case? > -> > ## Solution: +> > ### Solution: > > > > ```{r plotclustex, fig.cap="A dendrogram of the randomly-generated data.", fig.alt="A line plot depicting a dendrogram --- a tree structure representing the hierarchical structure of the data. 
The data broadly fit into three clusters, with one sample (14) being quite dissimilar to all others, and the rest of the data comprising two other clusters (one larger than the other)."} > > clust <- hclust(dist_m, method = "complete") @@ -289,7 +289,7 @@ points that fuse at the bottom of the tree, which are quite similar. You can see this by comparing the position of similar/dissimilar points according to the scatterplot with their position on the tree. -# Identifying the number of clusters +## Identifying the number of clusters To identify the number of clusters, we can make a horizontal cut through the dendrogram at a user-defined height. The sets of observations beneath this cut and single points representing clusters above can be thought of as distinct clusters. Equivalently, @@ -297,7 +297,7 @@ we can count the vertical lines we encounter crossing the horizontal cut and the example, a cut at height 10 produces `r length(unique(cutree(clust, h=10)))` clusters for the dendrogram in Challenge 1, while a cut at height 4 produces `r length(unique(cutree(clust, h=4)))` clusters. -# Dendrogram visualisation +## Dendrogram visualisation We can first visualise cluster membership by highlight branches in dendrograms. In this example, we calculate a distance matrix between @@ -363,12 +363,12 @@ ggplot(example_cl, aes(x = cg01644850, y = cg01656216, color = factor(cluster))) Note that this cut produces `r length(unique(four_cut))` clusters (zero before the cut and another four downstream of the cut). -> ## Challenge 2: +> ### Challenge 2: > > Identify the value of `k` in `cutree()` that gives the same > output as `h = 36` > -> > ## Solution: +> > ### Solution: > > > > ```{r h-k-ex-plot, fig.cap="A dendrogram of the randomly-generated data.", fig.alt="A line plot depicting a dendrogram --- a tree structure representing the hierarchical structure of the data. The data broadly fit into three clusters, with one sample (14) being quite dissimilar to all others, and the rest of the data comprising two other clusters (one larger than the other). A dashed horizontal line at a height of 5 indicates the cut point used to divide the data into clusters."} > > plot(clust) @@ -397,7 +397,7 @@ downstream of the cut). {: .challenge} -# The effect of different linkage methods +## The effect of different linkage methods Now let us look into changing the default behaviour of `hclust()`. First, we'll load a synthetic dataset, comprised of two variables representing two crescent-shaped point clouds: @@ -415,7 +415,7 @@ plot(cres, col=cresClass) # colour scatterplot by partition ``` -> ## Challenge 3 +> ### Challenge 3 > > Carry out hierarchical clustering on the `cres` data that we generated above. > Try out different linkage methods and use `cutree()` to split each resulting @@ -426,7 +426,7 @@ plot(cres, col=cresClass) # colour scatterplot by partition > > Hint: Check `?hclust` to see the possible values of the argument `method` (the linkage method used). > -> > ## Solution: +> > ### Solution: > > > > ```{r plot-clust-comp, fig.cap="A scatter plot of synthetic data coloured by cluster.", fig.alt="A scatter plot of synthetic data, comprising two variables, with points forming two crescent-shaped clusters. Points are coloured based on hierarchical clustering with single linkage, with two clusters, corresponding to the two crescent-shaped clouds."} > > #?hclust @@ -474,7 +474,7 @@ points into the same cluster (thus resolving one cluster per crescent). 
Complete other hand recognises that some points a the tip of a crescent are much closer to points in the other crescent and so it splits both crescents. -# Using different distance methods +## Using different distance methods So far, we've been using Euclidean distance to define the dissimilarity or distance between observations. However, this isn't always the best @@ -513,8 +513,8 @@ being quite different in their pattern across the different features. In contrast, `sample_a` and `sample_c` are very distant, despite having *exactly* the same pattern across the different features. -```{r heatmap-cor-example, fig.cap="Heatmap of simulated data.", fig.alt="Heatmap of simulated data: feature versus sample. The grid cells of the heatmap are coloured from red (high) to blue (low) according to value of the simulated data."} -Heatmap(as.matrix(cor_example)) +```{r heatmap-cor-example, fig.cap="Heatmap of simulated data.", fig.alt="Heatmap of simulated data: feature versus sample. The grid cells of the heatmap are coloured from red (high) to blue (low) according to value of the simulated data. Samples A and B are both primarily composed of blue (low) values and form a cluster distinct from sample C on the column dendrogram."} +pheatmap(as.matrix(cor_example)) ``` We can see that more clearly if we do a line plot: @@ -572,7 +572,7 @@ or unusual data. It's often possible to use correlation and other custom dissimilarity measures in functions that perform hierarchical clustering, such as `pheatmap()` and `stats::heatmap()`: -```{r heatmap-cor-cor-example, fig.cap="Heatmaps of features versus samples clustered in the samples according to correlation.", fig.alt="Heatmaps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot."} +```{r heatmap-cor-cor-example, fig.cap="Heatmaps of features versus samples clustered in the samples according to correlation.", fig.alt="Heatmaps of features versus samples, coloured by simulated value. The columns (samples) are clustered according to the correlation. Samples a and b have mostly low values, delineated by blue in the first plot and yellow in the second plot. Sample c has mostly high values, delineated by red in the first plot and brown in the second plot. Samples A and C form a cluster separate from sample B in the column dendrogram."} ## pheatmap allows you to select correlation directly pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation") ## Using the built-in stats::heatmap @@ -583,7 +583,7 @@ heatmap( ``` -# Validating clusters +## Validating clusters Now that we know how to carry out hierarchical clustering, how do we know how many clusters are optimal for the dataset? @@ -627,12 +627,12 @@ dunn(distance = distmat, cut) The value of the Dunn index has no meaning in itself, but is used to compare between sets of clusters with larger values being preferred. -> ## Challenge 4 +> ### Challenge 4 > > Examine how changing the `h` or `k` arguments in `cutree()` > affects the value of the Dunn index. > -> > ## Solution: +> > ### Solution: > > > > ```{r dunn-ex} > > library("clValid") @@ -699,7 +699,7 @@ As we said before (see previous episode), clustering is a non-trivial task. 
It is important to think about the nature of your data and your expectations rather than blindly using a some algorithm for clustering or cluster validation. -# Further reading +## Further reading - Dunn, J. C. (1974) Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4(1):95–104. - Halkidi, M., Batistakis, Y. & Vazirgiannis, M. (2001) On clustering validation techniques. Journal of Intelligent Information Systems 17(2/3):107-145. diff --git a/_extras/guide.md b/_extras/guide.md index 5ea51dfb..83ffc4ab 100644 --- a/_extras/guide.md +++ b/_extras/guide.md @@ -4,7 +4,7 @@ title: "Instructor Notes" Thank you for teaching high dimensional statistics with R! We hope you enjoy teaching the lesson. This page contains additional information for instructors. -The materials for each episode are self-contained and can be found through the episode links on the home page. The slides in the Extras section can be used to supplement these materials. +The materials for each episode are self-contained and can be found through the episode links on the home page. In previous rounds of teaching, the lesson was taught in four sessions, each lasting 2 hours and 50 minutes. We also advise allowing around 40 minutes of additional time for breaks. The recommended timings for each session are as follows: @@ -23,4 +23,6 @@ Session 4: - K-means (episode 6): 60 minutes of teaching time, 20 minutes for exercises (total: 80 minutes). - Hierarchical clustering (episode 7): 70 minutes of teaching time, 20 minutes for exercises (total: 90 minutes). +This lesson was designed for learners with good familiarity with R programming and foundational knowledge of statistics. To manage cognitive load, it may be necessary to omit later chapters, especially if some learners have difficulties with the programming techniques shown here or foundational statistical concepts. In this case, we recommend omitting Chapter 5: Factor analysis and one or both of the clustering chapters (Chapter 6: K-means and Chapter 7: Hierarchical clustering). 
+ {% include links.md %} diff --git a/_extras/slides.md b/_extras/slides.md deleted file mode 100644 index cc2e534f..00000000 --- a/_extras/slides.md +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: Lecture slides ---- - -{% include base_path.html %} - - -{% for p in site.slides %} -- [{{p.title}}]({{ relative_root_path }}/{{p.url | replace: "Rmd", "html"}}) -{% endfor %} - - -{% include links.md %} diff --git a/_slides/.gitignore b/_slides/.gitignore deleted file mode 100644 index ab53cf43..00000000 --- a/_slides/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -*.log -*.tex diff --git a/_slides/01-introduction-to-high-dimensional-data.Rmd b/_slides/01-introduction-to-high-dimensional-data.Rmd deleted file mode 100644 index 3f4af608..00000000 --- a/_slides/01-introduction-to-high-dimensional-data.Rmd +++ /dev/null @@ -1,58 +0,0 @@ ---- -title: "Introduction to high-dimensional data" -output: - xaringan::moon_reader: - self_contained: true ---- -```{r, echo=FALSE, message=FALSE} -library("here") -``` -# plot of chunk table-intro -```{r, out.width="100%", echo=FALSE} -# , fig.cap=' -knitr::include_graphics(here("fig/intro-table.png")) -``` - ---- -# plot of chunk pairs-prostate -```{r, out.width="100%", echo=FALSE} -# , fig.cap='plot of chunk pairs-prostate -knitr::include_graphics(here("fig/rmd-01-pairs-prostate-1.png")) -``` - ---- -# plot of chunk intro-figure -```{r, out.width="100%", echo=FALSE} -# , fig.cap=' -knitr::include_graphics(here("fig/intro-scatterplot.png")) -``` - ---- -# plot of chunk plot-lm -```{r, out.width="100%", echo=FALSE} -# , fig.cap='plot of chunk plot-lm -knitr::include_graphics(here("fig/rmd-01-plot-lm-4.png")) -``` - ---- -# plot of chunk plot-random -```{r, out.width="100%", echo=FALSE} -# , fig.cap='plot of chunk plot-random -knitr::include_graphics(here("fig/rmd-01-plot-random-1.png")) -``` - ---- -# plot of chunk plot-random -```{r, out.width="100%", echo=FALSE} -# , fig.cap='plot of chunk plot-random -knitr::include_graphics(here("fig/rmd-01-plot-random-2.png")) -``` - ---- -# plot of chunk plot-random -```{r, out.width="100%", echo=FALSE} -# , fig.cap='plot of chunk plot-random -knitr::include_graphics(here("fig/rmd-01-plot-random-3.png")) -``` - ---- diff --git a/_slides/01-introduction-to-high-dimensional-data.html b/_slides/01-introduction-to-high-dimensional-data.html deleted file mode 100644 index 3b98ea99..00000000 --- a/_slides/01-introduction-to-high-dimensional-data.html +++ /dev/null @@ -1,297 +0,0 @@ - - -