Added Method Regression Analysis to exercises.qmd #591

### *bonus* Regression charts to visualize microbiome data

Regression analysis is a powerful statistical method that can help examine
the relationship between two or more variables. In microbiome research,
regression charts help in visualizing how changes in microbial communities
correlate with various environmental factors, health outcomes, or other
biological variables.

Let us do a regression analysis exercise on the `GlobalPatterns` dataset using
the Shannon index ("shannon"), a common measure of alpha diversity that
quantifies the diversity within each sample. This index will serve as the
dependent variable in our regression model examples.
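
For reference, the Shannon index of a sample with $S$ observed taxa and
relative abundances $p_i$ is

$$
H' = -\sum_{i=1}^{S} p_i \ln p_i
$$

so it increases with both richness (more taxa) and evenness (more balanced
abundances).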

We use `SampleType`, a categorical (discrete) variable from the dataset's
metadata, and coverage, a continuous alpha diversity estimate computed with
`addAlpha()`, as predictors/independent variables in our linear regression
models.

Steps:
1. Import the dataset into R; here we are using a microbiome dataset.
2. Calculate Alpha Diversity (Shannon Index), i.e. the dependent variable and
define the covariates, the independent variable.
3. Fit a linear model using the discrete variable "SampleType", where Shannon
diversity is the response variable and SampleType is the predictor.
4. Visualize the results using a boxplot. This plot helps in understanding the
distribution of diversity across discrete sample types.
5. Fit a linear model using the continuous variable "coverage", where Shannon
diversity is the response variable and coverage is the predictor.
6. Visualize the model with a scatterplot. In R, you can also use
[`ggplot2`](https://ggplot2.tidyverse.org/) for
plotting and `lm()` function for linear modeling.
7. Examine the slope, intercept, and R-squared value to understand the
relationship between the variables.
8. Summarize results with `gtsummary`. The
[`gtsummary`](https://www.danieldsjoberg.com/gtsummary/) package is used to
generate a summary table of the regression model, which is easy to interpret
and present in reports or publications.
9. Visualize coefficients with `jtools`. The
[`jtools`](https://cran.r-project.org/web/packages/jtools/index.html) package
provides a visual representation of the model's coefficients, including their
confidence intervals, allowing for a clear understanding of the effect sizes.
10. Finally, interpret the analysis outcomes.

Visualization methods from packages like
[`miaViz`](https://bioconductor.org/packages/release/bioc/html/miaViz.html)
and [`scater`](https://bioconductor.org/packages/release/bioc/html/scater.html) can be used.

```{r}
#| label: extra-regressionchart
#| code-fold: true
#| code-summary: "Show solution"
#| eval: false

# Load the necessary libraries
library(mia) # also loads TreeSummarizedExperiment
library(scater)
library(gtsummary)
library(jtools)

# Load the example dataset
data("GlobalPatterns")
tse <- GlobalPatterns

# Calculate alpha diversity (Shannon index); the result is stored in colData
tse <- addAlpha(tse, index = "shannon", name = "shannon")
shannon <- colData(tse)$shannon

# Fit a linear model using the discrete predictor 'SampleType'
model <- lm(shannon ~ SampleType, data = colData(tse))

# Visualize using a boxplot
plotColData(tse, x = "SampleType", y = "shannon", color_by = "SampleType",
    show_boxplot = TRUE) +
    theme_minimal()
```
What do we interpret from the above regression analysis?

1. The boxplot shows the distribution of Shannon diversity across different
sample types. If the boxes are well separated and there are significant
differences in medians, it indicates that SampleType has a strong effect on
diversity. This would be supported by significant p-values in the regression
summary.
2. The p-value associated with each coefficient tells us whether the effect
of that sample type on the Shannon diversity index is statistically
significant. A common threshold for significance is 0.05.
3. The R-squared value indicates the proportion of the variance in the
Shannon diversity index that is explained by the sample type. A higher
R-squared value indicates a better fit of the model.
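
These quantities can be read directly off the fitted model; a minimal sketch,
assuming `model` has been fitted as in the solution above (the chunk label is
ours):

```{r}
#| label: sub-inspect-model-summary

# Coefficient table: estimates, standard errors, t-values, and p-values
model_summary <- summary(model)
model_summary$coefficients

# Proportion of variance in Shannon diversity explained by SampleType
model_summary$r.squared
```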

Now, let's check some linear modeling assumptions:
```{r}
#| label: sub-checking-modeling-assumptions

# Check linear modeling assumptions
# - Residual diagnostics to visually check for normality and homoscedasticity
# - Durbin-Watson test for independence of residuals
library(lmtest)

par(mfrow = c(2, 2))
plot(model)
dw_result <- dwtest(model)
print(dw_result)
```
Here,

1. Residual diagnostics (normality and homoscedasticity):
    a. The Residuals vs Fitted plot provides insight into the relationship
    between residuals and fitted values. Ideally, the residuals should be
    randomly scattered around the horizontal line (residuals = 0) without any
    discernible pattern. This indicates that the model's assumptions of
    linearity and homoscedasticity (constant variance of residuals) are likely
    met. If you observe any systematic patterns (such as a funnel shape), it
    may suggest heteroscedasticity, meaning the variance of residuals is not
    constant.
    b. The Normal Q-Q plot assesses whether the residuals are normally
    distributed. In this plot, residuals should align closely with the diagonal
    line. Deviations from this line, especially at the tails, indicate that the
    residuals are not normally distributed, which could violate the assumption
    of normality.
    c. The Scale-Location (or Spread-Location) plot also helps check the
    homoscedasticity assumption. It plots the square root of standardized
    residuals against the fitted values.
    d. The Residuals vs Leverage plot helps identify influential data points
    (outliers) that might disproportionately affect the model. Points outside
    the dashed lines (Cook's distance) are potential outliers that may have a
    strong influence on the model's coefficients.

2. Durbin-Watson test checks for autocorrelation in the residuals. A value
close to 2 indicates no autocorrelation. Values significantly different
from 2 suggest autocorrelation.

In a nutshell, if all the assumptions hold (as indicated by the diagnostic plots
and the statistical tests), you can be confident that the linear regression model
is appropriate for the data, and the results can be interpreted reliably.
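
The visual diagnostics can be complemented with formal tests; a minimal
sketch using base R's `shapiro.test()` and `bptest()` from the `lmtest`
package (the same package that provides `dwtest()`):

```{r}
#| label: sub-formal-assumption-tests

library(lmtest)

# Shapiro-Wilk test for normality of the residuals
# (a small p-value is evidence against normality)
shapiro.test(residuals(model))

# Breusch-Pagan test for homoscedasticity
# (a small p-value is evidence of heteroscedastic residuals)
bptest(model)
```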

Further, let's explore a continuous covariate, "coverage".

```{r}
#| label: sub-lm-continuous-variable

# 'coverage' is not part of the GlobalPatterns metadata, so compute it first
# as an alpha diversity estimate and store it in colData
tse <- addAlpha(tse, index = "coverage", name = "coverage")

# Fit a linear model using the continuous predictor 'coverage'
model_continuous <- lm(shannon ~ coverage, data = colData(tse))

# Visualization for the continuous predictor
plotColData(tse, x = "coverage", y = "shannon") +
    theme_minimal()
```

Here, when you perform a linear regression with Shannon diversity as the outcome
variable and coverage as the predictor variable, both are treated as continuous.
The regression model will estimate how changes in the continuous variable coverage are
associated with changes in the continuous variable shannon.

Thus, this example illustrates a linear relationship between two continuous
variables, where the slope of the regression line indicates the rate of change
in Shannon diversity per unit change in coverage, unlike our previous example,
in which the predictor is categorical (SampleType) and the coefficients
represent differences between group means.
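
The slope and intercept discussed above can be pulled out of the fitted object
directly; a minimal sketch, assuming `model_continuous` has been fitted as in
the previous chunk:

```{r}
#| label: sub-coef-continuous

# Intercept and slope (change in Shannon diversity per unit of coverage)
coef(model_continuous)

# 95% confidence intervals for both estimates
confint(model_continuous)
```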

Additionally, let's summarise results with
[`gtsummary`](https://www.danieldsjoberg.com/gtsummary/) and visualize the
coefficients with
[`jtools`](https://cran.r-project.org/web/packages/jtools/index.html).

```{r}
#| label: sub-gtsummary-jtools

# Summarize results with gtsummary
# (here for the categorical 'SampleType' model fitted earlier)
gt_model <- tbl_regression(model)
print(gt_model)

# Summarize and visualize coefficients with jtools
summ(model, confint = TRUE, ci.level = 0.95)
plot_summs(model)
```
Similarly, you can repeat this for the continuous-variable model and compare
the coefficients and statistical values.
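
For instance, the same summaries can be produced for the continuous model; a
sketch, assuming `model_continuous` from the chunk above:

```{r}
#| label: sub-gtsummary-jtools-continuous

# Regression table for the continuous 'coverage' model
tbl_regression(model_continuous)

# Coefficient summary with 95% confidence intervals
summ(model_continuous, confint = TRUE, ci.level = 0.95)
```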

## Multiomics

### Basic exploration