Bioinformatics analysis: organizing TCGA gene expression data with R #35

Open
zemise opened this issue Aug 4, 2023 · 0 comments
# TODO Overall goal: use TCGA data for expression analysis and survival analysis

# The code in this section organizes TCGA gene expression data with R

setwd("/Volumes/SSD/Download/LUAD")
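# The JSON is parsed with jsonlite::fromJSON below, so jsonlite (not rjson) is the package needed
# install.packages("jsonlite")
library(jsonlite)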

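# Each *.tsv file under the gdc_download folder holds four kinds of data of interest (unstranded, tpm_unstranded, fpkm_unstranded, fpkm_uq_unstranded).
# The *.tsv files carry only row names (gene names) and no sample name for the columns; the JSON file holds the mapping between sample names and TSV file names.
# Look up each TSV file name in the JSON file to find its sample name, then attach that sample name to the data from the TSV file.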


json <- jsonlite::fromJSON("metadata.cart.2023-07-31.json")
# View(json)


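# Take the first column of each entry in json$associated_entities; it holds the sample name
sample_id <- sapply(json$associated_entities, function(x){x[,1]})
# Pair each sample_id with its file_name so the sample ID can be used later as the column name of the unstranded column from the matching TSV file
file_sample <- data.frame(sample_id,file_name =json$file_name)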

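# List all TSV expression files under the gdc_download folder. With recursive = TRUE and the default full.names = FALSE,
# list.files returns paths relative to the folder it is given. An absolute path is used here; a path relative to the working directory also works.
# count_file <- list.files('gdc_download_20230731_095103.8678500', pattern = '*.tsv')
# A more precise match. pattern is a regular expression, not a glob, so the suffix is anchored explicitly
count_file <- list.files('/Volumes/SSD/Download/LUAD/gdc_download_20230731_095103.867850', pattern = 'gene_counts\\.tsv$', recursive = TRUE)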

# Extract the file name from each path in count_file
count_file_name <- basename(count_file)

# 60660 is the number of gene rows in each TSV file; it is the same for every file
# Checked on 2023-08-04: still 60660 genes
matrix = data.frame(matrix(nrow = 60660, ncol = 0))

for (i in 1:length(count_file_name)) {
  # Build the path relative to the working directory. The author runs this on macOS and uses "/"; on Windows "/" also works in R, or use "\\"
  path = paste0('gdc_download_20230731_095103.867850/', count_file[i])
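  # header = FALSE keeps the leading comment line and the header line as ordinary data rows
  data <- read.delim(path, fill = TRUE, header = FALSE, row.names = 1)
  # Row 2 holds the column names
  colnames(data) <- data[2,]
  # Drop rows 1-6: the comment line, the header line, and the four N_* summary rows
  data <- data[-c(1:6),]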
  
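  # Take the unstranded column (the COUNT matrix). To extract fpkm_unstranded use data[7]; for fpkm_uq_unstranded use data[8]
  data <- data[3]
  # The column was read as text because the header row was part of the data; convert it to numbers
  data[[1]] <- as.numeric(as.character(data[[1]]))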
  colnames(data) <- file_sample$sample_id[which(file_sample$file_name == count_file_name[i])]
  matrix <- cbind(matrix, data)
}

# write.csv(matrix, '/Volumes/SSD/Download/LUAD/COUNT_matrix.csv',row.names = TRUE)
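# The resulting file is fairly large. Run the code above first, wait until the last line of the console shows a lone > prompt, then run the line below to write the file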
write.csv(matrix, 'COUNT_matrix.csv',row.names = TRUE)
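
The merge step above can be tried out on toy data before running it on the full download. The following is only a sketch under stated assumptions: the two miniature files, the folder names `uuid-1` and `uuid-2`, the sample IDs `SAMPLE-A` and `SAMPLE-B`, the gene IDs, and the helper names `write_toy` and `read_unstranded` are all made up for illustration, and the toy files keep just four of the real columns. It differs from the loop above in three ways: the `#` line is skipped with `comment.char`, the `N_*` summary rows are dropped by name rather than by position, and the columns are collected in a list and bound once instead of growing the data frame inside the loop.

```r
# Toy reproduction of the merge step; every file name, sample ID and count here is hypothetical.
tmp <- tempfile("gdc_toy_")
dir.create(tmp)

# Write a miniature file laid out like a GDC STAR count table:
# one "#" comment line, a header line, four N_* summary rows, then the gene rows.
write_toy <- function(uuid, file_name, counts) {
  dir.create(file.path(tmp, uuid))
  rows <- c(
    "# gene-model: toy",
    "gene_id\tgene_name\tgene_type\tunstranded",
    "N_unmapped\t\t\t10",
    "N_multimapping\t\t\t20",
    "N_noFeature\t\t\t30",
    "N_ambiguous\t\t\t40",
    sprintf("ENSG_TOY%d\tGENE%d\tprotein_coding\t%d", 1:3, 1:3, counts)
  )
  writeLines(rows, file.path(tmp, uuid, file_name))
}
write_toy("uuid-1", "a.rna_seq.augmented_star_gene_counts.tsv", c(5L, 0L, 12L))
write_toy("uuid-2", "b.rna_seq.augmented_star_gene_counts.tsv", c(7L, 3L, 9L))

# File-to-sample mapping, deliberately in a different order than the files on disk.
file_sample <- data.frame(
  sample_id = c("SAMPLE-B", "SAMPLE-A"),
  file_name = c("b.rna_seq.augmented_star_gene_counts.tsv",
                "a.rna_seq.augmented_star_gene_counts.tsv"),
  stringsAsFactors = FALSE
)

# Read one file and return its unstranded counts as a vector named by gene_id.
read_unstranded <- function(path) {
  tab <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  tab <- tab[!startsWith(tab$gene_id, "N_"), ]  # drop the summary rows by name
  setNames(tab$unstranded, tab$gene_id)
}

files <- list.files(tmp, pattern = "gene_counts\\.tsv$",
                    recursive = TRUE, full.names = TRUE)
columns <- lapply(files, read_unstranded)

# Bind all columns at once, then label each column with its sample ID.
count_matrix <- do.call(cbind, columns)
colnames(count_matrix) <- file_sample$sample_id[match(basename(files), file_sample$file_name)]

print(count_matrix)
```

Because `match` looks up each file name in the mapping, the column labels stay correct even when the mapping and the files are listed in different orders.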
