-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRNAseq_prj1.r
117 lines (101 loc) · 4.37 KB
/
RNAseq_prj1.r
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
title: "Homework1"
author: "XiaoyanWen"
date: "6/8/2021"
output:
word_document: default
html_document: default
pdf_document: default
---
Homework 1 DUE: 06/18/21
## Part 2 (10 pts)
# 1. Create a matrix (call it transcriptome) with the values below. The experiments are column names and genes are the row names. (3pts)
# Control Nitrogen Phosphate Potassium
#GeneA 89 78 77 56
#GeneB 90 99 85 97
#GeneC 78 94 99 87
#GeneD 81 83 80 79
#GeneE 62 51 99 88
```{r}
# create matrix
transcriptome <- matrix(c(89,90,78,81,62,78,99,94,83,51,77,85,99,80,99,56,97,87,79,88), nrow = 5)
# add rownames, colnames
rownames(transcriptome) <- paste(rep("Gene", times=5), c("A", "B", "C", "D", "E"), sep = "")
colnames(transcriptome) <- c("Control", "Nitrogen", "Phosphate", "Potassium")
# check result
transcriptome
```
# 2. Use R code to calculate the average expression for each gene across all experiments. Call the vector (call it expression_average). ( The vector should contain 5 values – one for each gene) (3pts)
```{r}
# compute row means using apply()
expression_avarage <- apply(transcriptome, 1, mean)
# check result
expression_avarage
```
# 3. Sort the matrix so that the gene with the highest average expression value is on top and save it into a new data frame (call it “sorted_genes”) (4pts)
```{r}
# using order() function to find out sorted index with the highest average on top
# and then sort the transcriptome rows using the order() output.
sorted_genes <- transcriptome[order(expression_avarage, decreasing = TRUE),]
# check result
sorted_genes
```
## Part 3- (20 pts) Fold change calculation:
#We will use the data file “expvalues.txt”. Remember, the first line defines the different experiments and after that every row is a different gene and its expression (RNA transcript abundance) values. The first column is the gene name, followed by the different experiments. First three experiments are replicates of the “control” experiments and the last three are replicates for the “treatments”.
# a. Load the file expvalues.txt into R. (2pts)
```{r}
# import data
expvalues <- read.csv("C:/Users/wen_x/Downloads/trantomicsData/expvalues.txt", sep="")
# check dimension
dim(expvalues)
```
# b. The first three columns are “control” and last three columns are “treatment” group. Use a loop or apply functions to calculate the mean of the control samples and the mean of the treatment samples for each gene. (5pts)
```{r}
# create factor for columns
expgroup <- factor(rep(c("control", "treatment"), each=3))
# apply() function to calculate group means for each row and transform the result to data frame
groupmean <- data.frame(t(apply(expvalues, 1, tapply, expgroup, mean)))
# check result wit dimension and print the top 6 rows
dim(groupmean)
head(groupmean)
```
# c. Calculate the ratio of average treatment/ average control for each gene. (2pts)
```{r}
# calculate ratio and add as a column
groupmean$trtc_ratio <- groupmean$treatment/groupmean$control
# check result by printing top 6 rows
head(groupmean)
```
# d. Take the log2 of the ratio. You have just calculated log fold change. (2pts)
```{r}
# calculate log2(ratio) and add as a column
groupmean$logfold <- log2(groupmean$trtc_ratio)
# check result
head(groupmean)
```
# e. How many genes have a log2 fold change > 1 OR < -1 ? (2pts)
```{r}
# if $logfold > 1 or < -1, assign 1; else assign 0.
# sum the number so we know how many genes meet the conditions
sum(ifelse((groupmean$logfold > 1 | groupmean$logfold < -1), 1, 0))
```
# f. Save the names of the genes that have a log2 fold change > 1 into a file called “Induced_genes.txt” (2pts)
```{r}
# use which() to find out the rows that meet the condition
# extract the subset from expvalues and assign to induced_g
induced_g <- expvalues[which(groupmean$logfold > 1),]
# check subset result
head(induced_g)
dim(induced_g)
# write rownames (gene names) to file
write.csv(rownames(induced_g), file = "Induced_genes.txt", quote = FALSE, row.names = FALSE)
```
# g. Using the same set of induced genes in the previous question, create a boxplot to show the distribution of values for each induced genes in each experiment. The x-axis should have all six experiments, and the y-axis is the expression level. Save the boxplot as a pdf file called “boxplot.pdf” (5pts)
```{r}
# open a pdf file using pdf() function
pdf(file = "boxplot.pdf")
# draw boxplot
boxplot(induced_g)
# close pdf()
dev.off()
```