Error Downloading RNAseq Data with gdcRNADownload() #20

Open
Josuerinho opened this issue Apr 8, 2022 · 5 comments

Comments

@Josuerinho

Hi all!!

I've been trying to use the function gdcRNADownload() to download RNAseq data from TCGA but no matter what RNAseq type I try, I always get the same error:

Successfully downloaded: 0
Warning message:
In read.table(paste(url, "&return_type=manifest", sep = ""), header = TRUE, :
incomplete final line found by readTableHeader on 'https://api.gdc.cancer.gov/files?filters=%7B%22op%22:%22and%22,%22content%22:[%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.project_id%22,%22value%22:[%22TCGA-CHOL%22]%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%22Transcriptome%20Profiling%22%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_type%22,%22value%22:%22Gene%20Expression%20Quantification%22%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.analysis.workflow_type%22,%22value%22:%22HTSeq%20-%20Counts%22%7D%7D]%7D&pretty=true&format=JSON&size=10000&expand=analysis,analysis.input_files,associated_entities,cases,cases.diagnoses,cases.diagnoses.treatments,cases.demographic,cases.project,cases.samples,cases.samples.portions,cases.samples.portions.analytes,cases.samples.portions.analytes.aliquots,cases.samples.portions.slides&return_type=manifest'

This only happens with the RNAseq data type; I can download miRNA data without problems. Initially I was working on a MacBook Air with an M1 chip:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringr_1.4.0 readxl_1.4.0 tibble_3.1.6 oligo_1.56.0
[5] Biostrings_2.60.2 GenomeInfoDb_1.28.4 XVector_0.32.0 IRanges_2.26.0
[9] S4Vectors_0.30.2 oligoClasses_1.54.0 GEOquery_2.60.0 Biobase_2.52.0
[13] BiocGenerics_0.38.0 edgeR_3.34.1 limma_3.48.3 GDCRNATools_1.13.1

loaded via a namespace (and not attached):
[1] utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.12
[4] AnnotationDbi_1.54.1 htmlwidgets_1.5.4 grid_4.1.1
[7] BiocParallel_1.26.2 scatterpie_0.1.7 munsell_0.5.0
[10] codetools_0.2-18 preprocessCore_1.54.0 DT_0.22
[13] colorspace_2.0-3 GOSemSim_2.18.1 filelock_1.0.2
[16] knitr_1.38 rstudioapi_0.13 ggsignif_0.6.3
[19] DOSE_3.18.3 pathview_1.32.0 MatrixGenerics_1.4.3
[22] KEGGgraph_1.52.0 GenomeInfoDbData_1.2.6 KMsurv_0.1-5
[25] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0
[28] downloader_0.4 vctrs_0.4.0 treeio_1.16.2
[31] generics_0.1.2 xfun_0.30 BiocFileCache_2.0.0
[34] affxparser_1.64.1 R6_2.5.1 graphlayouts_0.8.0
[37] locfit_1.5-9.5 bitops_1.0-7 cachem_1.0.6
[40] fgsea_1.18.0 gridGraphics_0.5-1 DelayedArray_0.18.0
[43] assertthat_0.2.1 promises_1.2.0.1 scales_1.1.1
[46] ggraph_2.0.5 enrichplot_1.12.3 gtable_0.3.0
[49] tidygraph_1.2.1 rlang_1.0.2 genefilter_1.74.1
[52] splines_4.1.1 rstatix_0.7.0 lazyeval_0.2.2
[55] broom_0.7.12 BiocManager_1.30.16 reshape2_1.4.4
[58] abind_1.4-5 backports_1.4.1 httpuv_1.6.5
[61] qvalue_2.24.0 clusterProfiler_4.0.5 tools_4.1.1
[64] ggplotify_0.1.0 ggplot2_3.3.5 affyio_1.62.0
[67] ellipsis_0.3.2 gplots_3.1.1 ff_4.0.5
[70] RColorBrewer_1.1-3 Rcpp_1.0.8.3 plyr_1.8.7
[73] progress_1.2.2 zlibbioc_1.38.0 purrr_0.3.4
[76] RCurl_1.98-1.6 prettyunits_1.1.1 ggpubr_0.4.0
[79] viridis_0.6.2 cowplot_1.1.1 zoo_1.8-9
[82] SummarizedExperiment_1.22.0 ggrepel_0.9.1 magrittr_2.0.3
[85] data.table_1.14.2 DO.db_2.9 survminer_0.4.9
[88] matrixStats_0.61.0 hms_1.1.1 patchwork_1.1.1
[91] mime_0.12 xtable_1.8-4 XML_3.99-0.9
[94] gridExtra_2.3 compiler_4.1.1 biomaRt_2.48.3
[97] KernSmooth_2.23-20 crayon_1.5.1 shadowtext_0.1.1
[100] htmltools_0.5.2 ggfun_0.0.6 later_1.3.0
[103] tzdb_0.3.0 tidyr_1.2.0 geneplotter_1.70.0
[106] aplot_0.1.3 DBI_1.1.2 tweenr_1.0.2
[109] dbplyr_2.1.1 MASS_7.3-56 rappdirs_0.3.3
[112] Matrix_1.4-1 car_3.0-12 readr_2.1.2
[115] cli_3.2.0 igraph_1.3.0 km.ci_0.5-2
[118] GenomicRanges_1.44.0 pkgconfig_2.0.3 xml2_1.3.3
[121] foreach_1.5.2 ggtree_3.0.4 annotate_1.70.0
[124] yulab.utils_0.0.4 digest_0.6.29 graph_1.70.0
[127] cellranger_1.1.0 fastmatch_1.1-3 survMisc_0.5.5
[130] tidytree_0.3.9 curl_4.3.2 shiny_1.7.1
[133] gtools_3.9.2 rjson_0.2.21 lifecycle_1.0.1
[136] nlme_3.1-157 GenomicDataCommons_1.16.0 jsonlite_1.8.0
[139] carData_3.0-5 viridisLite_0.4.0 fansi_1.0.3
[142] pillar_1.7.0 lattice_0.20-45 KEGGREST_1.32.0
[145] fastmap_1.1.0 httr_1.4.2 survival_3.3-1
[148] GO.db_3.13.0 glue_1.6.2 png_0.1-7
[151] iterators_1.0.14 bit_4.0.4 Rgraphviz_2.36.0
[154] ggforce_0.3.3 stringi_1.7.6 blob_1.2.2
[157] DESeq2_1.32.0 org.Hs.eg.db_3.13.0 caTools_1.18.2
[160] memoise_2.0.1 dplyr_1.0.8 ape_5.6-2

But I also get the same issue when I run the same function on the cluster:

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Springdale Linux 7.9 (Verona)

Matrix products: default
BLAS/LAPACK: /ifs/data/fg2532_lab/jc5737/Conda_env/lib/libopenblasp-r0.3.18.so

locale:
[1] C

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] stringr_1.4.0 readxl_1.4.0 tibble_3.1.6
[4] oligo_1.58.0 Biostrings_2.62.0 GenomeInfoDb_1.30.1
[7] XVector_0.34.0 IRanges_2.28.0 S4Vectors_0.32.4
[10] oligoClasses_1.56.0 GEOquery_2.62.2 Biobase_2.54.0
[13] BiocGenerics_0.40.0 edgeR_3.36.0 limma_3.50.1
[16] GDCRNATools_1.14.0

loaded via a namespace (and not attached):
[1] utf8_1.2.2 tidyselect_1.1.2
[3] RSQLite_2.2.12 AnnotationDbi_1.56.2
[5] htmlwidgets_1.5.4 grid_4.1.3
[7] BiocParallel_1.28.3 scatterpie_0.1.7
[9] munsell_0.5.0 preprocessCore_1.56.0
[11] codetools_0.2-18 DT_0.22
[13] colorspace_2.0-3 GOSemSim_2.20.0
[15] filelock_1.0.2 knitr_1.38
[17] ggsignif_0.6.3 DOSE_3.20.1
[19] pathview_1.34.0 MatrixGenerics_1.6.0
[21] KEGGgraph_1.54.0 GenomeInfoDbData_1.2.7
[23] KMsurv_0.1-5 polyclip_1.10-0
[25] bit64_4.0.5 farver_2.1.0
[27] downloader_0.4 vctrs_0.4.0
[29] treeio_1.18.1 generics_0.1.2
[31] xfun_0.30 BiocFileCache_2.2.1
[33] affxparser_1.66.0 R6_2.5.1
[35] graphlayouts_0.8.0 locfit_1.5-9.5
[37] bitops_1.0-7 cachem_1.0.6
[39] fgsea_1.20.0 gridGraphics_0.5-1
[41] DelayedArray_0.20.0 assertthat_0.2.1
[43] promises_1.2.0.1 scales_1.1.1
[45] ggraph_2.0.5 enrichplot_1.14.2
[47] gtable_0.3.0 tidygraph_1.2.1
[49] rlang_1.0.2 genefilter_1.76.0
[51] splines_4.1.3 rstatix_0.7.0
[53] lazyeval_0.2.2 broom_0.7.12
[55] BiocManager_1.30.16 reshape2_1.4.4
[57] abind_1.4-5 backports_1.4.1
[59] httpuv_1.6.5 qvalue_2.26.0
[61] clusterProfiler_4.2.2 tools_4.1.3
[63] ggplotify_0.1.0 ggplot2_3.3.5
[65] affyio_1.64.0 ellipsis_0.3.2
[67] gplots_3.1.1 ff_4.0.5
[69] RColorBrewer_1.1-3 Rcpp_1.0.8.3
[71] plyr_1.8.7 progress_1.2.2
[73] zlibbioc_1.40.0 purrr_0.3.4
[75] RCurl_1.98-1.6 prettyunits_1.1.1
[77] ggpubr_0.4.0 viridis_0.6.2
[79] zoo_1.8-9 SummarizedExperiment_1.24.0
[81] ggrepel_0.9.1 magrittr_2.0.3
[83] data.table_1.14.2 DO.db_2.9
[85] survminer_0.4.9 matrixStats_0.61.0
[87] hms_1.1.1 patchwork_1.1.1
[89] mime_0.12 xtable_1.8-4
[91] XML_3.99-0.9 gridExtra_2.3
[93] compiler_4.1.3 biomaRt_2.50.3
[95] KernSmooth_2.23-20 crayon_1.5.1
[97] shadowtext_0.1.1 htmltools_0.5.2
[99] ggfun_0.0.6 later_1.3.0
[101] tzdb_0.3.0 tidyr_1.2.0
[103] geneplotter_1.72.0 aplot_0.1.3
[105] DBI_1.1.2 tweenr_1.0.2
[107] dbplyr_2.1.1 MASS_7.3-56
[109] rappdirs_0.3.3 Matrix_1.4-1
[111] car_3.0-12 readr_2.1.2
[113] cli_3.2.0 parallel_4.1.3
[115] igraph_1.3.0 GenomicRanges_1.46.1
[117] pkgconfig_2.0.3 km.ci_0.5-6
[119] xml2_1.3.3 foreach_1.5.2
[121] ggtree_3.2.1 annotate_1.72.0
[123] yulab.utils_0.0.4 digest_0.6.29
[125] graph_1.72.0 cellranger_1.1.0
[127] fastmatch_1.1-3 survMisc_0.5.6
[129] tidytree_0.3.9 curl_4.3.2
[131] shiny_1.7.1 gtools_3.9.2
[133] rjson_0.2.21 lifecycle_1.0.1
[135] nlme_3.1-157 GenomicDataCommons_1.18.0
[137] jsonlite_1.8.0 carData_3.0-5
[139] viridisLite_0.4.0 fansi_1.0.3
[141] pillar_1.7.0 lattice_0.20-45
[143] KEGGREST_1.34.0 fastmap_1.1.0
[145] httr_1.4.2 survival_3.3-1
[147] GO.db_3.14.0 glue_1.6.2
[149] png_0.1-7 iterators_1.0.14
[151] bit_4.0.4 Rgraphviz_2.38.0
[153] ggforce_0.3.3 stringi_1.7.6
[155] blob_1.2.2 DESeq2_1.34.0
[157] org.Hs.eg.db_3.14.0 caTools_1.18.2
[159] memoise_2.0.1 dplyr_1.0.8
[161] ape_5.6-2

So I don't know how to solve the problem: when I try to troubleshoot gdcRNADownload() by following its code line by line, R says that one of the inner functions, gdcGetURL(), is not found. I can't tell where the error comes from because I can't access the URL containing the RNAseq data. It might even be a format problem with the downloaded data. I know this issue was reported before, but since there was no follow-up, I thought a new thread might bring a bit more attention. Sorry, guys, and thanks a lot for your help!

Josu

@pamonlan

I got the same issue. It looks like the link used to obtain the manifest from the GDC API has changed, and now we get an empty table. The maintainers will have to change the URL query.
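
One way to confirm the empty-table diagnosis is to check the manifest before handing it to the downloader. This is only a sketch: `check_manifest()` is a hypothetical helper, not part of GDCRNATools, and it assumes the manifest carries the `id` and `filename` columns that the download code in this thread relies on.

```r
# Hypothetical helper (not part of GDCRNATools): fail early with a readable
# message when a GDC manifest has no usable rows.
check_manifest <- function(manifest) {
  required <- c("id", "filename")
  missing_cols <- setdiff(required, names(manifest))
  if (length(missing_cols) > 0) {
    stop("Manifest is missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(manifest) == 0) {
    stop("Manifest has no rows: the query may have matched no files.")
  }
  invisible(manifest)
}

# Toy manifests built in memory, so no network access is needed.
ok_manifest <- read.table(
  text = "id\tfilename\nuuid-1\tsample1.tsv",
  header = TRUE, sep = "\t", stringsAsFactors = FALSE
)
empty_manifest <- ok_manifest[0, ]

check_manifest(ok_manifest)  # passes silently
result <- try(check_manifest(empty_manifest), silent = TRUE)
inherits(result, "try-error")  # the empty manifest is rejected
```

Calling something like this right after `read.table()` would turn the silent "Successfully downloaded: 0" into an explicit error.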

@pranavkatariain

There is some issue with the "HTSeq - Counts" data on the GDC portal; I guess it is no longer available after the new update. So we need to change workflow.type to "STAR - Counts".
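
To see which workflow types a project currently offers, the GDC files endpoint can be asked for a facet on `analysis.workflow_type`. The sketch below only builds the query URL; `workflow_facet_url()` is a hypothetical helper, and the shape of the JSON response shown in the comments is my assumption and should be checked against the GDC API documentation.

```r
# Hypothetical helper: build a GDC query that asks for the available values
# of analysis.workflow_type for one project, without returning file records.
workflow_facet_url <- function(project.id) {
  filter <- sprintf(
    '{"op":"in","content":{"field":"cases.project.project_id","value":["%s"]}}',
    project.id
  )
  paste0(
    "https://api.gdc.cancer.gov/files?",
    "filters=", utils::URLencode(filter, reserved = TRUE),
    "&facets=analysis.workflow_type&size=0"
  )
}

url <- workflow_facet_url("TCGA-CHOL")

# Network call, left commented out. Assumes the jsonlite package and that the
# counts are returned under data$aggregations (an assumption, not verified here):
# res <- jsonlite::fromJSON(url)
# res$data$aggregations[["analysis.workflow_type"]]$buckets
```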

@Josuerinho

Josuerinho commented Apr 21, 2022

Hi all! I've finally been able to get access to the code of some of the functions involved. As @pranavkataria978 mentioned, the issue arises when downloading the "RNAseq" data type. The function gdcGetURL() looks for the workflow type "HTSeq - Counts", which no longer exists. After also checking the database, the closest workflow now seems to be "STAR - Counts", as he said. So it made sense to me that creating your own gdcGetURL() with this small change should work. The only problem is that a bunch of other functions are called along the way and, for some reason, they are no longer found once you are outside the original package function (no idea why this happens...). In the end, I had to rename a few more functions and save them in the current environment to make it all work again. After this step, all these functions can be found and called. So here is what I have right now to make the RNAseq download work:

gdcGetURL_2 <- function(project.id, data.type) {
    urlAPI <- "https://api.gdc.cancer.gov/files?"
    if (data.type == "RNAseq") {
        data.category <- "Transcriptome Profiling"
        data.type <- "Gene Expression Quantification"
        workflow.type <- "STAR - Counts" ## Before we had "HTSeq - Counts"
    }
    else if (data.type == "miRNAs") {
        data.category <- "Transcriptome Profiling"
        data.type <- "Isoform Expression Quantification"
        workflow.type <- "BCGSC miRNA Profiling"
    }
    else if (data.type == "Clinical") {
        data.category <- "Clinical"
        data.type <- "Clinical Supplement"
        workflow.type <- NA
    }
    else if (data.type == "pre-miRNAs") {
        data.category <- "Transcriptome Profiling"
        data.type <- "miRNA Expression Quantification"
        workflow.type <- "BCGSC miRNA Profiling"
    }
    ## JSON filter fragments: single-quoted R strings keep the inner double quotes intact
    project <- paste('{"op":"in","content":{"field":"cases.',
        'project.project_id","value":["', project.id, '"]}}',
        sep = "")
    dataCategory <- paste('{"op":"in","content":{"field":"files.',
        'data_category","value":"', data.category, '"}}',
        sep = "")
    dataType <- paste('{"op":"in","content":{"field":"files.data_type",',
        '"value":"', data.type, '"}}', sep = "")
    workflowType <- paste('{"op":"in","content":{"field":"files.',
        'analysis.workflow_type","value":"', workflow.type,
        '"}}', sep = "")
    if (is.na(workflow.type)) {
        ## Clinical data has no workflow type; filter on the data format instead
        dataFormat <- paste('{"op":"in","content":{"field":"files.',
            'data_format","value":"', "BCR XML", '"}}',
            sep = "")
        content <- paste(project, dataCategory, dataType, dataFormat,
            sep = ",")
    }
    else {
        content <- paste(project, dataCategory, dataType, workflowType,
            sep = ",")
    }
    filters <- paste("filters=", URLencode(paste('{"op":"and","content":[',
        content, ']}', sep = "")), sep = "")
    expand <- paste("analysis", "analysis.input_files", "associated_entities",
        "cases", "cases.diagnoses", "cases.diagnoses.treatments",
        "cases.demographic", "cases.project", "cases.samples",
        "cases.samples.portions", "cases.samples.portions.analytes",
        "cases.samples.portions.analytes.aliquots", "cases.samples.portions.slides",
        sep = ",")
    expand <- paste("expand=", expand, sep = "")
    payload <- paste(filters, "pretty=true", "format=JSON", "size=10000",
        expand, sep = "&")
    url <- paste(urlAPI, payload, sep = "")
    return(url)
}

#############
#############
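
The filter strings in gdcGetURL_2 are fragile because every quote has to survive copy and paste. As a hedged alternative, the same kind of JSON can be assembled with `sprintf()` from small helpers. `gdc_in_filter()` and `gdc_filters_param()` are hypothetical names, not package functions, and this sketch always sends the value as a one-element array and percent-encodes reserved characters, which differs slightly from the function above.

```r
# Hypothetical helper: one GDC "in" clause as a JSON string.
gdc_in_filter <- function(field, value) {
  sprintf('{"op":"in","content":{"field":"%s","value":["%s"]}}', field, value)
}

# Hypothetical helper: join clauses with "and" and percent-encode the result
# so it can be placed in a query string.
gdc_filters_param <- function(clauses) {
  json <- sprintf('{"op":"and","content":[%s]}', paste(clauses, collapse = ","))
  paste0("filters=", utils::URLencode(json, reserved = TRUE))
}

clauses <- c(
  gdc_in_filter("cases.project.project_id", "TCGA-CHOL"),
  gdc_in_filter("files.data_category", "Transcriptome Profiling"),
  gdc_in_filter("files.data_type", "Gene Expression Quantification"),
  gdc_in_filter("files.analysis.workflow_type", "STAR - Counts")
)
filters <- gdc_filters_param(clauses)

# Decoding the parameter gives back the readable JSON, which is handy for debugging.
decoded <- utils::URLdecode(sub("^filters=", "", filters))
```

Printing `decoded` before sending a request makes it easy to spot a missing quote or a wrong field name.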

And for the other functions, I just renamed them and saved them in my local environment because of the problems I mentioned before:

downloadClientFun_2 <- function (os) {
    ## Download and unzip the gdc-client binary that matches the operating system
    if (os == "Linux") {
        address <- paste("https://gdc.cancer.gov/files/public/file/",
            "gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip", sep = "")
        download.file(address, destfile = "./gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip")
        unzip("./gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip")
    }
    else if (os == "Windows") {
        address <- paste("https://gdc.cancer.gov/files/public/file/",
            "gdc-client_v1.6.0_Windows_x64-py3.7_0.zip", sep = "")
        download.file(address, destfile = "./gdc-client_v1.6.0_Windows_x64-py3.7_0.zip")
        unzip("./gdc-client_v1.6.0_Windows_x64-py3.7_0.zip")
    }
    else if (os == "Darwin") {
        address <- paste("https://gdc.cancer.gov/files/public/file/",
            "gdc-client_v1.6.0_OSX_x64_1.zip", sep = "")
        download.file(address, destfile = "./gdc-client_v1.6.0_OSX_x64_1.zip")
        unzip("./gdc-client_v1.6.0_OSX_x64_1.zip")
    }
}

#############
#############

file.move_2 <- function (files, directory)
{
    ## Copy the files to the target directory, then delete the originals
    file.copy(from = files, to = directory, recursive = TRUE)
    unlink(files, recursive = TRUE)
}

#############
#############

manifestDownloadFun_2 <- function (manifest, directory)
{
    ## Fetch the gdc-client binary if it is not in the working directory yet
    if (!file.exists("gdc-client") && !file.exists("gdc-client.exe")) {
        downloadClientFun_2(Sys.info()[["sysname"]])
    }
    Sys.chmod("gdc-client")
    manifestDa <- read.table(manifest, sep = "\t", header = TRUE,
        stringsAsFactors = FALSE)
    ## Files already present in the per-file subfolders of `directory`
    ex <- manifestDa$filename %in% dir(paste(directory, dir(directory),
        sep = "/"))
    nonex <- !ex
    numFiles <- sum(ex)
    if (numFiles > 0) {
        message(paste("Already exists", numFiles, "files !",
            sep = " "))
        if (sum(nonex) > 0) {
            message(paste("Download the other", sum(nonex), "files !",
                sep = " "))
            manifestDa <- manifestDa[nonex, ]
            manifest <- paste(manifestDa$id, collapse = " ")
            system(paste("./gdc-client download ", manifest,
                sep = ""))
        }
        else {
            return(invisible())
        }
    }
    else {
        system(paste("./gdc-client download -m ", manifest, sep = ""))
    }
    files <- manifestDa$id
    if (directory == "Data") {
        if (!dir.exists("Data")) {
            dir.create("Data")
        }
    }
    else {
        if (!dir.exists(directory)) {
            dir.create(directory, recursive = TRUE)
        }
    }
    file.move_2(files, directory)
}

#############
#############

gdcRNADownload_2 <- function (manifest = NULL, project.id, data.type, directory = "Data",
    write.manifest = FALSE, method = "gdc-client")
{
    if (!is.null(manifest)) {
        manifestDownloadFun_2(manifest = manifest, directory = directory)
    }
    else {
        ## Build the query URL and read the manifest returned by the GDC API
        url <- gdcGetURL_2(project.id = project.id, data.type = data.type)
        manifest <- read.table(paste(url, "&return_type=manifest",
            sep = ""), header = TRUE, stringsAsFactors = FALSE)
        systime <- gsub(" ", "T", Sys.time())
        systime <- gsub(":", "-", systime)
        manifile <- paste(project.id, data.type, "gdc_manifest",
            systime, "txt", sep = ".")
        write.table(manifest, file = manifile, row.names = FALSE,
            sep = "\t", quote = FALSE)
        if (method == "GenomicDataCommons") {
            ## gdcdata() comes from the GenomicDataCommons package, so it is
            ## namespace-qualified here to work outside GDCRNATools
            ex <- manifest$filename %in% dir(directory)
            nonex <- !ex
            numFiles <- sum(ex)
            if (numFiles > 0) {
                message(paste("Already exists", numFiles, "files !",
                    sep = " "))
                if (sum(nonex) > 0) {
                    message(paste("Download the other", sum(nonex),
                        "files !", sep = " "))
                    manifest <- manifest[nonex, ]
                    fnames <- lapply(manifest$id, GenomicDataCommons::gdcdata,
                        destination_dir = directory,
                        overwrite = TRUE, progress = TRUE)
                }
                else {
                    return(invisible())
                }
            }
            else {
                fnames <- lapply(manifest$id, GenomicDataCommons::gdcdata,
                    destination_dir = directory,
                    overwrite = TRUE, progress = TRUE)
            }
        }
        else if (method == "gdc-client") {
            manifestDownloadFun_2(manifest = manifile, directory = directory)
        }
        if (write.manifest == FALSE) {
            invisible(file.remove(manifile))
        }
    }
}

#############
#############
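
A possible reason the inner functions "aren't found" is that they are unexported helpers that live in the package namespace, and a function copied into the global environment no longer looks names up there. Assuming that is the cause here, a hedged alternative to copying every helper is to point the copied function's enclosing environment back at the namespace. `in_namespace()` is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: return a copy of `fun` whose enclosing environment is
# the namespace of `pkg`, so names used inside it are looked up there first.
in_namespace <- function(fun, pkg) {
  stopifnot(is.function(fun), requireNamespace(pkg, quietly = TRUE))
  environment(fun) <- asNamespace(pkg)
  fun
}

# Demonstration with 'stats', which ships with every R installation.
where_am_i <- in_namespace(
  function() environmentName(parent.env(environment())),
  "stats"
)
enclosure <- where_am_i()

# Assumed usage for this thread (requires GDCRNATools to be installed):
# my_download <- in_namespace(gdcRNADownload_2, "GDCRNATools")
# GDCRNATools:::gdcGetURL   # inspect an unexported function directly
```

Functions defined in the global environment are still reachable from a namespace, so a patched gdcGetURL_2 kept in the workspace would still be found by such a copy.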

I believe I haven't missed any of them. Now it should all work nicely. For example:

project <- 'TCGA-PRAD'
gdcRNADownload_2(project.id = project,
data.type = 'RNAseq',
write.manifest = FALSE,
method = 'gdc-client',
directory = "Your/Own/directory")

Let me know if I may have missed something!
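
After a download run, it can help to compare the manifest against what actually landed on disk. The following is a minimal sketch, assuming files may sit in subfolders of the target directory; `missing_downloads()` is a hypothetical helper and the demo data are made up.

```r
# Hypothetical helper: list manifest files that are not present anywhere
# under `directory` (searched recursively, since files may sit in subfolders).
missing_downloads <- function(manifest, directory) {
  present <- basename(list.files(directory, recursive = TRUE))
  manifest$filename[!manifest$filename %in% present]
}

# Self-contained demonstration using a temporary directory.
demo_dir <- file.path(tempdir(), "gdc_demo")
dir.create(file.path(demo_dir, "uuid-1"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "uuid-1", "sample1.tsv"))

demo_manifest <- data.frame(
  id = c("uuid-1", "uuid-2"),
  filename = c("sample1.tsv", "sample2.tsv"),
  stringsAsFactors = FALSE
)
still_missing <- missing_downloads(demo_manifest, demo_dir)
```

An empty result would mean every file named in the manifest was found under the directory.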

@benchsar

Hello @Josuerinho

I wanted to test your code, but I get this error:

Error in paste(filters, "pretty=true", "format=JSON", "size=10000", expand, : object 'filters' not found
> url <- paste(urlAPI, payload, sep = "")
Error in paste(urlAPI, payload, sep = "") : object 'urlAPI' not found
> return(url)
Error: no function to return from, jumping to top level
> }
Error: unexpected '}' in "}"

@Josuerinho

Hi @benchsar! Sorry for the late reply. The code I posted was just a little workaround to get the original functions working, but the problem has been solved and the original functions work as expected again. Try them and let me know if that's not the case.
