diff --git a/session3_solutions.Rmd b/session3_solutions.Rmd index e37b46b..71dddc5 100644 --- a/session3_solutions.Rmd +++ b/session3_solutions.Rmd @@ -54,6 +54,8 @@ results_tgf %>% The tricky part of making a heatmap is deciding which genes to include. The gene names have to be in ENSEMBL format, as the count matrix we will be plotting has ENSEMBL names in the rows. +In the following example, we chose the genes with most extreme log2 fold-change + ```{r} # The pheatmap library is needed to make the heatmap @@ -64,42 +66,32 @@ library(pheatmap) vsd <- vst(dds) sampleInfo <- as.data.frame(colData(dds)[,c("condition","Treated")]) -# create a new table that includes gene variability - -res_with_var <- mutate(results_tgf, GeneVar = rowSds(assay(vsd))) %>% - filter(!is.na(padj)) - -# arrange by the new column and extract ENSEMBL ID of the first 100 rows +## Order results by log2FoldChange and get the most extreme -genes_to_plot <- arrange(res_with_var, desc(GeneVar)) %>% - dplyr::slice(1:100) %>% +top_genes_logFC <- results_tgf %>% + arrange(desc(abs(log2FoldChange))) %>% + slice(1:30) %>% pull(ENSEMBL) +top_genes_logFC_labels <- results_tgf %>% + arrange(desc(abs(log2FoldChange))) %>% + slice(1:30) %>% + pull(SYMBOL) -``` +## get the corresponding geen symbols -```{r fig.width=12} +pheatmap(assay(vsd)[top_genes_logFC,], + scale = "row", + labels_row = top_genes_logFC_labels) -pheatmap(assay(vsd)[genes_to_plot,], - annotation_col = sampleInfo, - scale="row") ``` -```{r fig.width=12} -gene_labels <- arrange(res_with_var, desc(GeneVar)) %>% - dplyr::slice(1:100) %>% - pull(SYMBOL) -pheatmap(assay(vsd)[genes_to_plot,], - annotation_col = sampleInfo, - labels_row = gene_labels, - scale="row") -``` + # Exercise: Pathways -**Need to fix, as currently this doesn't return anything!** In order to use the `enrichKEGG` function, we will have to define a set of enriched genes in terms of their `ENTREZID` (instead of `ENSEMBL`) diff --git a/session3_solutions.nb.html b/session3_solutions.nb.html index c527630..dee47c7 100644 --- a/session3_solutions.nb.html +++ b/session3_solutions.nb.html @@ -1511,7 +1511,7 @@ $(this).detach().appendTo(div); // add a show code button right above - var showCodeText = $('' + (showThis ? 'Hide' : 'Code') + ''); + var showCodeText = $('' + (showThis ? 'Hide' : 'Show') + ''); var showCodeButton = $(''); showCodeButton.append(showCodeText); showCodeButton @@ -1537,7 +1537,7 @@ // * Change text // * add a class for intermediate states styling div.on('hide.bs.collapse', function () { - showCodeText.text('Code'); + showCodeText.text('Show'); showCodeButton.addClass('btn-collapsing'); }); div.on('hidden.bs.collapse', function () { @@ -1755,10 +1755,12 @@
We start by loading our results from the previous week
+library(DESeq2)
-
+
+
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
@@ -1771,29 +1773,28 @@ R Notebook
The following objects are masked from ‘package:base’:
- anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated,
- eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
- match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
- rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min
+ anyDuplicated, aperm, append, as.data.frame, basename, cbind,
+ colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
+ get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
+ match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
+ Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
+ tapply, union, unique, unsplit, which.max, which.min
Attaching package: ‘S4Vectors’
+The following object is masked from ‘package:utils’:
+
+ findMatches
+
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
-
-Attaching package: ‘IRanges’
-
-The following object is masked from ‘package:grDevices’:
-
- windows
-
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
-Warning: package ‘GenomeInfoDb’ was built under R version 4.2.1Loading required package: SummarizedExperiment
+Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
@@ -1801,22 +1802,28 @@ R Notebook
The following objects are masked from ‘package:matrixStats’:
- colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins,
- colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads,
- colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges,
- colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
- colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs,
- rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
- rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs,
- rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
- rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads,
- rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars
+ colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+ colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+ colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+ colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+ colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+ colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+ colWeightedMeans, colWeightedMedians, colWeightedSds,
+ colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+ rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+ rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+ rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+ rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+ rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+ rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+ rowWeightedSds, rowWeightedVars
Loading required package: Biobase
Welcome to Bioconductor
- Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor,
- see 'citation("Biobase")', and for packages 'citation("pkgname")'.
+ Vignettes contain introductory material; view with
+ 'browseVignettes()'. To cite Bioconductor, see
+ 'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
@@ -1829,9 +1836,11 @@ R Notebook
anyMissing, rowMedians
+
library(dplyr)
+
Attaching package: ‘dplyr’
@@ -1872,12 +1881,14 @@ R Notebook
intersect, setdiff, setequal, union
+
## Read the annotated results
results_tgf <- readRDS("Robjects/results_TGF_vs_CTR_annotated_BACKUP.rds")
## Read the counts that we produced previously
dds <- readRDS("Robjects/dds_BACKUP.rds")
+
ggplot2
package
-
+
+
-library(ggplot2)
-
-
-Warning: package ‘ggplot2’ was built under R version 4.2.2
-
-
-results_tgf %>%
+library(ggplot2)
+results_tgf %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj))) + geom_point()
-
-
+
+
+
+
+
+
library(org.Hs.eg.db)
+
Loading required package: AnnotationDbi
@@ -1916,44 +1928,56 @@ Exercise: Volcano plot
select
+
ecm_anno <- AnnotationDbi::select(org.Hs.eg.db,
keys = "GO:0030198",
keytype = "GO",
columns=c("ENSEMBL","SYMBOL"))
+
'select()' returned 1:many mapping between keys and columns
+
ecm_anno
-
+
+
+
ecm_ids <- pull(ecm_anno, "ENSEMBL")
+
+
results_tgf %>%
mutate(ECM_Gene = ENSEMBL %in% ecm_ids) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj),col=ECM_Gene)) + geom_point()
-
-
+
+
+
+
+
The default colouring does not distinguish the ECM genes as much as @@ -1962,14 +1986,18 @@
alpha
) can also be adjusted. Below is a suggestion.
+
results_tgf %>%
mutate(ECM_Gene = ENSEMBL %in% ecm_ids) %>%
ggplot(aes(x = log2FoldChange, y = -log10(padj),col=ECM_Gene,alpha=ECM_Gene)) + geom_point() + scale_color_manual(values=c("black","red")) + scale_alpha_manual(values=c(0.2,1))
-
-
+
+
+
+
+
The tricky part of making a heatmap is deciding which genes to include. The gene names have to be in ENSEMBL format, as the count matrix we will be plotting has ENSEMBL names in the rows.
+In the following example, we chose the genes with most extreme log2 +fold-change
+# The pheatmap library is needed to make the heatmap
@@ -1989,120 +2020,209 @@ Exercise: Heatmap
vsd <- vst(dds)
+
using 'avgTxLength' from assays(dds), correcting for library size
-
+
+
sampleInfo <- as.data.frame(colData(dds)[,c("condition","Treated")])
-# create a new table that includes gene variability
-
-res_with_var <- mutate(results_tgf, GeneVar = rowSds(assay(vsd))) %>%
- filter(!is.na(padj))
+## Order results by log2FoldChange and get the most extreme
-# arrange by the new column and extract ENSEMBL ID of the first 100 rows
-
-genes_to_plot <- arrange(res_with_var, desc(GeneVar)) %>%
- dplyr::slice(1:100) %>%
+top_genes_logFC <- results_tgf %>%
+ arrange(desc(abs(log2FoldChange))) %>%
+ slice(1:30) %>%
pull(ENSEMBL)
-
-
-
-
-
-
-
-
-pheatmap(assay(vsd)[genes_to_plot,],
- annotation_col = sampleInfo,
- scale="row")
-
-
-
-
-
-
-
-
-
-gene_labels <- arrange(res_with_var, desc(GeneVar)) %>%
- dplyr::slice(1:100) %>%
+
+top_genes_logFC_labels <- results_tgf %>%
+ arrange(desc(abs(log2FoldChange))) %>%
+ slice(1:30) %>%
pull(SYMBOL)
-pheatmap(assay(vsd)[genes_to_plot,],
- annotation_col = sampleInfo,
- labels_row = gene_labels,
- scale="row")
+## get the corresponding geen symbols
+
+pheatmap(assay(vsd)[top_genes_logFC,],
+ scale = "row",
+ labels_row = top_genes_logFC_labels)
-
-
+
+
+
+
+
Need to fix, as currently this doesn’t return -anything!
In order to use the enrichKEGG
function, we will have to
define a set of enriched genes in terms of their ENTREZID
(instead of ENSEMBL
)
sigGenesEntrez <- results_tgf %>%
filter(padj < 0.05, !is.na(ENTREZID)) %>% pull(ENTREZID)
+
Make sure you check the help for enrichKEGG
, as it uses
different argument names
library(clusterProfiler)
+
+
+
+
+Registered S3 method overwritten by 'data.table':
+ method from
+ print.data.table
+clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
+
+Please cite:
+
+S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang,
+W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
+multiomics data. Nature Protocols. 2024, doi:10.1038/s41596-024-01020-z
+
+Attaching package: ‘clusterProfiler’
+
+The following object is masked from ‘package:AnnotationDbi’:
+
+ select
+
+The following object is masked from ‘package:IRanges’:
+
+ slice
+
+The following object is masked from ‘package:S4Vectors’:
+
+ rename
+
+The following object is masked from ‘package:stats’:
+
+ filter
+
+
+
+enrich_kegg <- enrichKEGG(gene = sigGenesEntrez,
+ organism = "hsa",)
+
+
+
+Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
+Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
+
+
+
+as.data.frame(enrich_kegg)
+
+
+
+
The same plots can be created from the results. Here is the dot plot.
+dotplot(enrich_kegg)
-
-Error in (function (classes, fdef, mtable) :
- unable to find an inherited method for function ‘dotplot’ for signature ‘"NULL"’
+
+
+
+
+
and the upset plot.
+ + +enrichplot::upsetplot(enrich_kegg)
+
+
+
+
+
+
+
For the GSEA analysis using KEGG we can use the same set of ranked statistics, but make sure to name them according to ENTREZID
+ + +ranked_genes <- results_tgf %>%
+ arrange(desc(stat)) %>%
+ filter(!is.na(stat))
+
+geneList <- pull(ranked_genes, stat)
+names(geneList) <- pull(ranked_genes, ENTREZID)
+
+
Again, make sure to check the arguments for gseKEGG
.
gse_kegg <- gseKEGG(geneList = geneList,
+ organism = "hsa")
+
+
+
+using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
+
+preparing geneSet collections...
+GSEA analysis...
+Warning: There are ties in the preranked stats (11.21% of the list).
+The order of those tied genes will be arbitrary, which may produce unexpected results.Warning: There are duplicate gene names, fgsea may produce unexpected results.Warning: For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.leading edge analysis...
+done...
+
+
+
+as.data.frame(gse_kegg)
+
+
+
+
+