-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
revise PCA lesson #1
Comments
New iteration of the lesson was trialled 2019-07-04 here. The lesson was revised to have detailed explanation of However, this was perhaps a bit too much the other way... There's too much wrangling syntax necessary to do everything "manually". Need to find a better compromise between syntax (preparing matrix for PCA and extract stuff from it) and concept (what is a PCA and how to interpret output). Some ideas from post-lesson debriefing and sticky feedback:
|
Possible simplification using # load packages
library(tidyverse)
# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
# Create a transposed matrix from our table of counts
pca_matrix <- trans_cts %>%
column_to_rownames("gene") %>%
t()
# Perform the PCA
sample_pca <- prcomp(pca_matrix, scale. = TRUE)
# Extract tidy output
library(broom)
pc_scores <- augment(sample_pca) %>%
rename(sample = .rownames)
pc_eigenvalues <- tidy(sample_pca, matrix = "pcs")
pc_loadings <- tidy(sample_pca, matrix = "variables")
# screeplot
pc_eigenvalues %>%
ggplot(aes(PC, percent)) +
geom_col()
# PC plot (with bonus ellipse?)
pc_scores %>%
full_join(sample_info, by = c("sample")) %>%
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(colour = factor(minute))) +
geom_polygon(stat = "ellipse",
aes(colour = factor(minute), fill = factor(minute)),
alpha = 0.1) Then, the loadings part could be simplified to just say that we can pull the top genes with highest loading if we want to know who they are: # alternatively could just say that for such a big matrix it's hard to interpret loadings
# but we could find out which genes have highest loading by:
pc_loadings %>%
filter(PC == "2") %>%
top_n(10, abs(value)) Finally, could show shortcut using library(factoextra)
fviz_pca_ind(sample_pca)
# illustrate why the biplot is useless in this case
fviz_pca_biplot(sample_pca) |
autoplot()
, which caused some confusionprcomp
object is and how to extract things from it "manually"top_n()
to get top PC loadingsThe text was updated successfully, but these errors were encountered: