06-3-ex-sol-ADHD.Rmd

---
output:
  html_document: default
  pdf_document: default
---
 * **Importing and processing data**

```{r, message=FALSE}
# Defining file paths
biom_file_path <- "data/Aggregated_humanization2.biom"
sample_meta_file_path <- "data/Mapping_file_ADHD_aggregated.csv"
tree_file_path <- "data/Data_humanization_phylo_aggregation.tre"

library(mia)

# Imports the data
se <- loadFromBiom(biom_file_path)


names(rowData(se)) <- c("Kingdom", "Phylum", "Class", "Order", 
                        "Family", "Genus")

# Goes through the whole DataFrame. Removes '.*[kpcofg]__' from strings, where [kpcofg] 
# is any character from listed ones, and .* any character.
rowdata_modified <- BiocParallel::bplapply(rowData(se), 
                                           FUN = stringr::str_remove, 
                                           pattern = '.*[kpcofg]__')

# Genus level has additional '\"', so let's delete that also
rowdata_modified <- BiocParallel::bplapply(rowdata_modified, 
                                           FUN = stringr::str_remove, 
                                           pattern = '\"')

# rowdata_modified is a list, so it is converted back to DataFrame format. 
rowdata_modified <- DataFrame(rowdata_modified)

# And then assigned back to the SE object
rowData(se) <- rowdata_modified


# We use this to check what type of data it is
# read.table(sample_meta_file_path)

# It seems like a comma separated file and it does not include headers
# Let us read it and then convert from data.frame to DataFrame
# (required for our purposes)
sample_meta <- DataFrame(read.table(sample_meta_file_path, sep = ",", header = FALSE))

# Add sample names to rownames
rownames(sample_meta) <- sample_meta[,1]

# Delete column that included sample names
sample_meta[,1] <- NULL

# We can add headers
colnames(sample_meta) <- c("patient_status", "cohort", "patient_status_vs_cohort", "sample_name")

# Then it can be added to colData
colData(se) <- sample_meta

# Convert to tse format
tse <- as(se, "TreeSummarizedExperiment")

# Reads the tree file
tree <- ape::read.tree(tree_file_path)

# Add tree to rowTree
rowTree(tse) <- tree
```

 * **Abundance table** Retrieve the taxonomic abundance table from the
   example data set (TSE object).
   
```{r}
# We show only part of it
assays(tse)$counts[1:6,1:6]
```
 * How many different samples and genus-level groups this phyloseq
   object has?

```{r}
dim(rowData(tse))
```

 * What is the maximum abundance of Akkermansia in this data set?

```{r}
# Agglomerating to Genus level
tse_genus <- agglomerateByRank(tse,rank="Genus")
# Retrieving the count for Akkermansia
Akkermansia_abund <- assays(tse_genus)$count["Genus:Akkermansia",]
max(Akkermansia_abund)
```
 * Draw a histogram of library sizes (total number of reads per
   sample).

```{r}
library(scater)
tse <- addPerCellQC(tse) # adding a new column "sum" to colData, of total counts/sample.
hist(colData(tse)$sum, xlab = "Total number of reads per sample", main = "Histogram of library sizes")

```
 * **Taxonomy table** Retrieve the taxonomy table and print out the
   first few lines of it with the R command head(). Investigate how
   many different phylum-level groups this phyloseq object has?

```{r}
head(rowData(tse))
taxonomyRanks(tse) # The taxonomic ranks available
unique(rowData(tse)["Phylum"]) # phylum-level groups
length(unique(rowData(tse)["Phylum"])[,1]) # number of phylum-level groups
```

 * **Sample metadata** Retrieve sample metadata. How many patient
     groups this data set has? Draw a histogram of sample
     diversities.
     
```{r}
colData(tse) # samples metadata
unique(colData(tse)$patient_status) # patient groups
# Example of sample diversity using Shannon index
tse <- mia::estimateDiversity(tse, 
                             abund_values = "counts",
                             index = "shannon", 
                             name = "shannon")
hist(colData(tse)$shannon, xlab = "Shannon index", main = "Histogram of sample diversity")
```

 * **Subsetting** Pick a subset of the data object including only
     ADHD individuals from Cohort 1. How many there are?

```{r}
sub_cohort_1 <- tse[, colData(tse)$cohort=="Cohort_1"]
sub_cohort_1_ADHD <- sub_cohort_1[, colData(sub_cohort_1)$patient_status=="ADHD"]
colData(sub_cohort_1_ADHD)
```

 * **Transformations** The data contains read counts. We can convert
  these into relative abundances and other formats. Compare abundance
  of a given taxonomic group using the example data before and after
  the compositionality transformation (with a cross-plot, for
  instance). You can also compare the results to CLR-transformed data
  (see e.g. [Gloor et
  al. 2017](https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full))
  
```{r}
tse <- transformCounts(tse, method = "relabundance")
tse <- transformCounts(tse, method = "clr", abund_values = "counts",pseudocount = 1)
# Lets compare with taxa: A29
taxa <- "A29"
df <- as.data.frame(list(
                          counts=assays(tse)$counts[,taxa],
                          relabundance=assays(tse)$relabundance[,taxa],
                          clr=assays(tse)$clr[,taxa])
                        )
ggplot(df, aes(x=counts,y=relabundance))+
  geom_point()+
  geom_smooth()

ggplot(df, aes(x=counts,y=clr))+
  geom_point()+
  geom_smooth()

ggplot(df, aes(x=relabundance,y=clr))+
  geom_point()+
  geom_smooth()
```
 * **Visual exploration** Visualize the population distribution of
   abundances for certain taxonomic groups. Do the same for
   CLR-transformed abundances.
   
``` {r}
# Same taxa used as earlier
ggplot(df, aes(x=counts,colour="blue")) +
    geom_density(alpha=.2)+
  theme(legend.position = "none")+
  labs(title =paste("Distribution of counts for",taxa, collapse = ": "))
ggplot(df, aes(x=clr,colour="red")) + 
    geom_density(alpha=.2)+
  theme(legend.position = "none")+
  labs(title =paste("Distribution of clr for",taxa, collapse = ": "))
```