2024_SISBID_Unsupervised_Lab.Rmd

---
title: "2024 SISBID Unsupervised Lab"
author: "Genevera I. Allen & Yufeng Liu"
output:
  html_document: default
  pdf_document: default
---

# Data Description

## gdat is Gene Expression Data,  n = 445 patients x p = 353 genes
- Only 353 genes with somatic mutations from COSMIC are retained

## Data is Level III TCGA BRCA RNA-Sequencing gene expression data that have already been pre-processed according to the following steps:
- Reads normalized by RPKM
- Corrected for overdispersion by a log-transformation (1 + data)
- Short gene name labels are given as the column names

## cdat is Clinical Data,  n = 445 patients x q = 6 clinical features
- Subtype - denotes 5 PAM50 subtypes including Basal-like, Luminal A, Luminal B, HER2-enriched, and Normal-like
- ER-Status - estrogen-receptor status
- PR-Status - progesterone-receptor status
- HER2-Status - human epidermal growth factor receptor 2 status
- Node - number of lymph nodes involved
- Metastasis - indicator for whether the cancer has metastasized

# Problems


## Problem 1 - Dimension reduction 

1a - Apply PCA, NMF, ICA and MDS, UMAP, and tSNE to this dataset. Compare and contrast the results using these methods.  
1b - Relate the dimension reduction results with the clinical data. Is any clinical information reflected in the lower dimensional spaces?

1c - Overall, which dimension reduction method do you recommend for this data set and why?


## Problem 2 - Clustering 

2a - Apply various clustering algorithms such as K-means (explore different K), hierarchical clustering (explore different linkages), NMF, and biclustering. Compare the clustering results using these methods.

2b - Relate the clustering results with the clinical data. Can the clustering algorithm recover some of the clinical information such as cancer subtypes?

2c - (Optional) Validate your cluster findings. 

2c - Overall, which clustering method(s) do you recommend for this data set and why?


## Problem 3 - Multiple comparisons 

3a - Identify important genes to differetiate ER postive and negative, PR postive and negative, HER2 postive and negative, metastasis status.

3b - Try different procedures to adjust for multiple comparisons.

3c - Examine the lists of genes identified using different methods for each clinical response.  Which method is best?  Why?


## Problem 5 - Graphical models

5a - Use graphical models to explore interactions among genes.  Are any of the well-connected genes related to patterns previously identified? 


## Problem 6 - Visulaization


6a - Visualize this data using multiple approaches.

6b - Prepare the "best" visual summary of this data.


## Problem 7 - Exploratory Data Analysis Summary


7a - What is the most interesting finding?

7b - Is this finding consistent and stable?

7c - Prepare a visual summary that best illustrates this interesting finding.


# R scripts to help out with the BRCA case study Lab
## Don't peek at this if you want to practice coding on your own!!

Load Data
```{r, echo = TRUE}
load("UnsupL_SISBID_2024.Rdata")
library(ggplot2)
library(kknn)
library(GGally)
library(umap)
library(Rtsne)
library(igraph)
library(huge)
```


Explore Data
```{r, echo = TRUE}
dim(gdat)
dim(cdat)

# clinical data
table(cdat$Subtype)
table(cdat$ER)
table(cdat$PR)
table(cdat$HER2)
table(cdat$Node)
table(cdat$Metastasis)
table(cdat$ER,cdat$PR)

```


## Cluster Heatmap - biclustering
```{r, echo = TRUE}
#cluster heatmap - biclustering
aa = grep("grey",colors())
bb = grep("green",colors())
cc = grep("red",colors())
gcol2 = colors()[c(aa[1:2],bb[1:25],cc[1:50])]

```

Without scaling
```{r, echo = TRUE}
heatmap(gdat,col=gcol2,hclustfun=function(x)hclust(x,method="ward.D"))
```

With scaling
```{r, echo = TRUE}

heatmap(scale(gdat),col=gcol2,hclustfun=function(x)hclust(x,method="ward.D"))

Cols=function(vec){cols=rainbow(length(unique(vec)))
                   return(cols[as.numeric(as.factor(vec))])}

heatmap(scale(gdat),col=gcol2,hclustfun=function(x)hclust(x,method="ward.D"),labRow=cdat$Subtype,RowSideColors=Cols(cdat$Subtype))
```

## Dimension Reduction

PCA
```{r, echo = TRUE}
Cols=function(vec){cols=rainbow(length(unique(vec)))
                   return(cols[as.numeric(as.factor(vec))])}

sv = svd(scale(gdat,center=TRUE,scale=FALSE))
V = sv$v
Z = gdat%*%V

K = 3
pclabs = c("PC1","PC2","PC3","PC4")
par(mfrow=c(1,K))
for(i in 1:K){
  j = i+1
  plot(Z[,i],Z[,j],pch=16,xlab=pclabs[i],ylab=pclabs[j],col=Cols(cdat$Subtype))
}
legend(-45,0,pch=16,col=rainbow(5),levels(cdat$Subtype))
```

Pairs Plot
```{r, echo = TRUE}
PC1<-as.matrix(Z[,1])
PC2<-as.matrix(Z[,2])
PC3<-as.matrix(Z[,3])
PC4<-as.matrix(Z[,4])
PC5<-as.matrix(Z[,5])

pc.df.cdat<-data.frame(Subtype = cdat$Subtype, PC1, PC2, PC3, PC4, PC5)
ggpairs(pc.df.cdat, mapping = aes(color = Subtype))
```

MDS
```{r, echo = TRUE}

Dmat = dist(gdat,method="maximum")
mdsres = cmdscale(Dmat,k=2)
plot(mdsres[,1],mdsres[,2],pch=16,col=Cols(cdat$Subtype), main = "Dimension Reduction MDS")
legend(-30,20,pch=16,col=rainbow(5),levels(cdat$Subtype))

```


ICA
```{r, echo = TRUE}

require("fastICA")

K = 4
icafit = fastICA(gdat,n.comp=K)

kk = 3
pclabs = c("ICA1","ICA2","ICA3","ICA4")
par(mfrow=c(1,kk))
for(i in 1:kk){
  j = i+1
  plot(icafit$A[i,],icafit$A[j,],pch=16,xlab=pclabs[i],ylab=pclabs[j],col=Cols(cdat$Subtype))
}
legend(-1,2.8,pch=16,col=rainbow(5),levels(cdat$Subtype))
```

UMAP
```{r, echo = TRUE}
gdat.umap = umap(gdat)
plot(gdat.umap$layout[,1], y =gdat.umap$layout[,2], type = "n", main = "UMAP", xlab = "UMAP1", ylab = "UMAP2")
text(gdat.umap$layout[,1], y =gdat.umap$layout[,2], type = "n", cdat$Subtype, col=Cols(cdat$Subtype), cex = .7 )

```


## Clustering

K-means
```{r, echo = TRUE}

K = 5
km = kmeans(gdat,centers=K,nstart=25)
table(km$cluster,cdat$Subtype)

```


Plot Kmeans with labels
```{r, echo = TRUE}
plot(Z[,1],Z[,2],col=km$cluster, main = "Plot Kmeans Clusters ", xlab = "PC1", ylab = "PC2")
text(Z[,1],Z[,2],cdat$Subtype,cex=.75,col=km$cluster)
cens = km$centers
points(cens%*%V[,1],cens%*%V[,2],col=1:K,pch=16,cex=3)
```


Hierarchical
```{r, echo = TRUE}
#which linakge is the best?
#which distance metric is the best?

Dmat = dist(gdat)
com.hc = hclust(Dmat,method="ward.D")


plot(com.hc,labels=cdat$Subtype,cex=.5)

res.com = cutree(com.hc,5)
table(res.com,cdat$Subtype)
```

Consensus Clustering with Hierarchical
```{r, echo = TRUE}
#Note that ConsensusClusterPlus not available for R version 4.0.2
#results = ConsensusClusterPlus(gdat,maxK=6,reps=500,pItem=0.8,pFeature=1,
#clusterAlg="hc",distance="pearson",plot="png")
```

Look at results for first 5 clusters
```{r, echo = TRUE}
 #heatmap(results[[2]][["consensusMatrix"]][1:5,1:5])
```


Spectral Clustering
```{r, echo = TRUE}
K = 5
s_gdat = specClust(gdat, centers=K, nn = 7, method = "symmetric", gmax=NULL)
```

Visualize
```{r, echo  = TRUE}
plot(Z[,1],Z[,2],col=s_gdat$cluster, main = "Visualize Spectral Clusters", xlab = "PC1", ylab = "PC2")
text(Z[,1],Z[,2],cdat$Subtype,cex=.75,col=s_gdat$cluster)
```


## Genes significantly associated with ER or PR Status, etc
```{r, echo = TRUE}

x = gdat[cdat$ER=="Positive" | cdat$ER=="Negative",]
y.er = cdat$ER[cdat$ER=="Positive" | cdat$ER=="Negative"]
y.label = rep(1, length(y.er))
y.label[y.er == "Positive"]=2

ps = NULL
for(i in 1:ncol(gdat)) ps = c(ps,
 t.test(x[y.label==1,i],x[y.label==2,i])$p.value)
fdrs.bh = p.adjust(ps, method="BH")

cat("Number of Tests significant with alpha=0.1 using Bonferroni correction:",
sum(ps<0.1/length(y.label)), fill=TRUE)

cat("Number of Tests with FDR below 0.1:",
sum(fdrs.bh<0.1), fill=TRUE)

plot(sort(ps,decreasing=FALSE),ylab="P-Values")
#BH procedure
abline(a=0, b=0.1/length(y.label),col="red")
#Bonferroni
abline(a=0.1/length(y.label), b=0,col="blue")
```


## Graphical models - how are genes related?

```{r}
# use huge package
neth = huge(gdat,method="mb")
plot(neth)
```

```{r}
## stability selection with huge
net.s <- huge.select(neth, criterion="stars")
net.s
plot(net.s)
```


```{r}
#larger lambda
mat <- neth$path[[2]]
neti <- as.undirected(graph_from_adjacency_matrix(mat))
plot(neti,vertex.label=colnames(gdat),vertex.size=2,vertex.label.cex=1.2,vertex.label.dist=1,layout=layout_with_kk)
```


```{r}
#smaller lambda
mat = neth$path[[6]]
neti = as.undirected(graph_from_adjacency_matrix(mat))
plot(neti,vertex.label=colnames(gdat),vertex.size=2,vertex.label.cex=1.2,vertex.label.dist=1,layout=layout_with_kk)
```