-
-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Casts to avoid integer overflow in matrix row/columns. #1281
Conversation
Would it be helpful to test this change in the context of the problem that led to the associated bug report? https://support.bioconductor.org/p/9154744/#9154759 We could create a Guix environment that replaces Rcpp with a variant that has this patch set applied, and run the same problematic code. |
Sure. You don't have to use Guix, |
Okay, we'll get back to you. We're already using Guix, so for us it's just a matter of doing this:
|
Whatever floats your boat, and thanks for additional testing. I am about half-way through the reverse dependencies check I am running for |
It seems like the problem persists. Running the same script as mentioned on the Bioconductor forum produces a segfault even with the patches applied. Running
in a new directory, On this Guix
With this patch fileFor compatability, the changelog patch was removed
And this `test.R` filelibrary("edgeR")
options("timeout" = 60 * 60)
download_raw <- F
if (download_raw) {
library("Matrix")
files_needed <- c(
cells = "count_matrix_barcodes.tsv",
genes = "count_matrix_genes.tsv",
matrix = "count_matrix_sparse.mtx"
)
dl_url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE176nnn/GSE176078/suppl/GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz"
dl_dest <- "wu_raw.tar"
unpacked_dir_name <- "Wu_etal_2021_BRCA_scRNASeq/"
needed_paths <- paste0(unpacked_dir_name, "/", files_needed)
names(needed_paths) <- names(files_needed)
if (!all(lapply(needed_paths, file.exists))) {
download.file(url = dl_url, destfile = dl_dest)
untar(dl_dest)
}
matrix <- Matrix::readMM(needed_paths["matrix"])
matrix@Dimnames <- list(
readLines(needed_paths["genes"]),
readLines(needed_paths["cells"])
)
count_mat <- as.matrix(matrix)
norm_factors <- edgeR::calcNormFactors(count_mat)
save_crash_objs <- FALSE
if (save_crash_objs) {
saveRDS(
list(
count_mat = count_mat,
norm_factors = norm_factors
),
"wu_raw_crash_objs.RDS"
)
}
} else {
dl_url <- "https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS"
dl_dest <- "wu_raw_crash_objs.RDS"
if (!file.exists(dl_dest)) {
download.file(url = dl_url, destfile = dl_dest)
}
crash_objs <- readRDS(dl_dest)
count_mat <- crash_objs$count_mat
norm_factors <- crash_objs$norm_factors
}
print(sessionInfo())
# This will generate a segmentation fault.
edgeR::cpm(count_mat, norm_factors * colSums(count_mat)) I still get this output:
|
Please sort that out with yourself. Rcpp does not explicitly support Guix. I can assert that a) it tests fine here and b) has by now been reverse-dependency tested against about 1500 of the 2700+ reverse dependencies. While I appreciate the input I am not sure it adds much to the issue at hand. (Also consider checking out @LTLA 's branch rather than just tossing a diff in. It;s what I do -- testing the main branch of the repo, and including the proposed PR changes to it.) |
Also, 1 TB is a lot of RAM. Maybe that is also beyond the scope of what the PR supports. I haven't done the napkin math. Can R itself assign a matrix in the dimensions you desire? |
Well the actual computation doesn't require the whole memory before it fails. I merely mentioned it to point out that it was unlikely that the process ran into an error due to a hard limitation on available memory in the system. From my (pretty limited) understanding of the C++ code that is being called in A generic double matrix of the same dimensions as the one used in the compuation takes about 22.2 GB of RAM. By the very rudimentary process of starting the script and looking at the used memory, it seems that about 30 GB of RAM are consumed before the segfault happens. To me that seems a reasonable amount to expect R not to collapse. But yeah, I can't dispute that the original issue in #1280 appears to be fixed by this and this discussion might better be continued downstream in edgeR or the forum. |
@eddelbuettel I ran @jonasfreimuth's test case in a machine with 126 GB of RAM and it
So I think we should take a closer look at this. @jonasfreimuth That said, we cannot work with a test case involving another package. |
In the meantime I ran the example laid out in #1280, for me it fails both with and without the patches (via Guix). I will add the runs with session info in an edit. EditWithout patches:
1280_example.Rlibrary("Rcpp")
sessionInfo()
b <- Rcpp::sourceCpp(code="
#include \"Rcpp.h\"
//[[Rcpp::export(rng=false)]]
double get_bottom_right(Rcpp::NumericMatrix x) {
auto row = x.row(x.nrow() - 1);
return row[x.ncol() - 1];
}
")
# Okay:
mat1 <- matrix(1:20000 + 0.0, nrow=20000, ncol=1e5)
get_bottom_right(mat1)
## [1] 20000
mat1[20000,1e5]
## [1] 20000
# Incorrect:
mat2 <- matrix(1:30000 + 0.0, nrow=30000, ncol=1e5)
get_bottom_right(mat2)
## [1] 13216
# Segfaults for me
mat2[30000,1e5] Output
packages.scm(define-module (packages)
#:use-module (gnu)
#:use-module (guix)
#:use-module (guix git-download)
#:use-module (guix hg-download)
#:use-module (guix build-system r)
#:use-module (guix licenses))
(define-public r-rcpp
(let ((commit "6707d64068abf4d603d0678e5bca7f002ede1c1d")
(revision "1"))
(package
(name "r-rcpp")
(version (git-version "1.0.11.2" revision commit))
(source
(origin
(method git-fetch)
(uri (git-reference
(url "https://github.com/LTLA/Rcpp/")
(commit commit)))
(file-name (git-file-name name version))
(sha256
(base32 "0ghr1n9jj9yi7cgc49532zk7n1i54wm5ahkmpj3r1bkxqkv93fc6"))))
(properties `((upstream-name . "Rcpp")))
(build-system r-build-system)
(home-page "https://github.com/LTLA/Rcpp/")
(synopsis "Seamless R and C++ Integration")
(description
"The Rcpp package provides R functions as well as C++ classes which offer a
seamless integration of R and C++. Many R data types and objects can be mapped
back and forth to C++ equivalents which facilitates both writing of new code as
well as easier integration of third-party libraries. Documentation about Rcpp
is provided by several vignettes included in this package, via the Rcpp Gallery
site at <https://gallery.rcpp.org>, the paper by Eddelbuettel and Francois
(2011, <doi:10.18637/jss.v040.i08>), the book by Eddelbuettel (2013,
<doi:10.1007/978-1-4614-6868-4>) and the paper by Eddelbuettel and Balamuta
(2018, <doi:10.1080/00031305.2017.1375990>); see citation(\"Rcpp\") for details.")
(license gpl2+))))
r-rcpp
Install output for the curious
Output
|
No improvement on my end on 54245b1... Updated packages.scm``` (define-module (packages) #:use-module (gnu) #:use-module (guix) #:use-module (guix git-download) #:use-module (guix hg-download) #:use-module (guix build-system r) #:use-module (guix licenses))(define-public r-rcpp r-rcpp
|
Still working on it, hold on. |
Sorry, didn't mean to rush 😅 |
I think that does it. The edgeR breakage was not (just) because of the overflow in |
Thank you! With the latest change I can no longer reproduce the segfault --- neither in the simplified example nor in the real-world use case. Guix invocation used
guix shell -C --network \
--with-git-url=r-rcpp=https://github.com/LTLA/Rcpp \
--with-commit=r-rcpp=338c5b8db93d4960b4c1f5c0df52e239b2ba890b \
--emulate-fhs \
bash make gcc-toolchain@11 r r-edger gdb
|
I'll run another full reverse-depends check once the current one finishes. |
Yes, thanks indeed! |
All good here. If the revdep check comes out clean, I think we're good to go. @LTLA Please don't forget to update the Changelog with the latest changes. |
Yes update from here: first run, once all dependencies were taken care of, had only one potentially suspicious segfault. Second run ongoing, looks very good, the one segfault returned 😢 but did not replicate on another machines. So I think this may be local to the test machine. Package in question is |
No new issues in second run -- so big big Thank You! to @LTLA for the PR. Will fold it in. |
Pull Request Template for Rcpp
See #1280 for more details.
Checklist
R CMD check
still passes all tests(Not sure how to add tests for this as the failure only occurs for very large matrices that aren't easy to create.)