Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpart has undefined behavior when a predictor has many categories. #22

Open
michaelquinn32 opened this issue Aug 28, 2020 · 3 comments
Open

Comments

@michaelquinn32
Copy link

Hi Beth!

Here's a little example:

library(mlbench)
library(rpart)

data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)

Here, town has 92 possible values.

If you run this code under an address sanitizer, you get the following error:

UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4

And here is a rough traceback:

rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
    #0 rpart/src/bsplit.c:79:22
    #1 rpart/src/partition.c:73:5
    #2 rpart/src/rpart.c:218:5
    #3 R/main/dotcode.c

I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values:

rpart/src/node.h

Lines 10 to 18 in 3980685

typedef struct split {
double improve;
double adj; /* for surrogates only, adjusted agreement */
double spoint; /* only used if it is continuous */
struct split *nextsplit;
int var_num;
int count;
int csplit[20]; /* the actual length depends on splitting rule */
} Split, *pSplit;

To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub.
https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html

Or with one of the docker images here:
https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base

Best wishes,
Michael

@bethatkinson
Copy link
Owner

bethatkinson commented Sep 3, 2020 via email

@bquistorff
Copy link
Contributor

bquistorff commented Aug 11, 2021

I was not able to reproduce this error when using docker run -ti rocker/r-devel-san (R 4.0.0). No error was shown. (I haven't used the sanitizer before, so let me know if there's something else I should look for.)

The size allocated for the structs is actually bigger than their size definition (where the fields only have size 20) and so the final field "grows". For the offending line

tsplit->csplit[k] = rp.csplit[k];
the struct on the left's size is (ncat is the number of levels for this variable),
int splitsize = sizeof(Split) + (ncat - 20) * sizeof(int);

and the field on the right is allocated here (maxcat is maximum number of levels across all variables),
rp.csplit = (int *) ALLOC(3 * maxcat, sizeof(int));

While a bit messy, it's not obvious the code is wrong.

@michaelquinn32
Copy link
Author

Thanks Brian! I can confirm that this is still an error on R 3.6.3

Here's a bit of a traceback:

 -- Test started at 2021-08-16 13:08:01 PDT --
-----------------------------------------------------------------------------
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
    #0 0x7fcb13397905 in bsplit rpart/src/bsplit.c:79:22
    #1 0x7fcb1339f96d in partition rpart/src/partition.c:73:5
    #2 0x7fcb133a619a in rpart rpart/src/rpart.c:218:5
    #3 0x555d3a2137bb in R_doDotCall R/src/main/dotcode.c
    #4 0x555d3a22103e in do_dotcall R/src/main/dotcode.c:1252:11
    #5 0x555d3a2b4171 in bcEval R/src/main/eval.c:6765:14
    #6 0x555d3a2a5e13 in Rf_eval R/src/main/eval.c:620:8
    #7 0x555d3a2e34f6 in R_execClosure R/src/main/eval.c
    #8 0x555d3a2dfafa in Rf_applyClosure R/src/main/eval.c:1706:16
    #9 0x555d3a2a6716 in Rf_eval R/src/main/eval.c:743:12
    #10 0x555d3a2ee049 in do_set R/src/main/eval.c:2807:8
    #11 0x555d3a2a6305 in Rf_eval R/src/main/eval.c:695:12
    #12 0x555d3a37293c in Rf_ReplIteration R/src/main/main.c:260:2
    #13 0x555d3a3769e0 in R_ReplConsole R/src/main/main.c:310:11
    #14 0x555d3a3767b8 in run_Rmainloop R/src/main/main.c:1103:5
    #15 0x555d3a049bbd in main R/src/main/Rmain.c:30:5

SUMMARY: UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4

My test script in R is:

library(mlbench)
library(rpart)

data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants