-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpart has undefined behavior when a predictor has many categories. #22
Comments
Thank you for your detailed analysis of the problem. When rpart was developed, I don't think levels of 92 were anticipated. I'll add this to the list.
Beth
…________________________________
From: Michael Quinn <[email protected]>
Sent: Friday, August 28, 2020 4:18 PM
To: bethatkinson/rpart <[email protected]>
Cc: Subscribed <[email protected]>
Subject: [EXTERNAL] [bethatkinson/rpart] rpart has undefined behavior when a predictor has many categories. (#22)
Hi Beth!
Here's a little example:
library(mlbench)
library(rpart)
data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)
Here, town has 92 possible values.
If you run this code under an address sanitizer, you get the following error:
UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4
And here is a rough traceback:
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
#0 rpart/src/bsplit.c:79:22
#1 rpart/src/partition.c:73:5
#2 rpart/src/rpart.c:218:5
#3 R/main/dotcode.c
I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values:
https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/node.h#L10-L18
To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub.
https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html
Or with one of the docker images here:
https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base
Best wishes,
Michael
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#22>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACWQG5Z6SDMHS6HHA3JPFODSDANI7ANCNFSM4QOQ3LZA>.
|
I was not able to reproduce this error when using The size allocated for the structs is actually bigger than their size definition (where the fields only have size 20) and so the final field "grows". For the offending line Line 79 in 3980685
ncat is the number of levels for this variable), Line 20 in 3980685
and the field on the right is allocated here ( maxcat is maximum number of levels across all variables), Line 182 in 3980685
While a bit messy, it's not obvious the code is wrong. |
Thanks Brian! I can confirm that this is still an error on R 3.6.3 Here's a bit of a traceback:
My test script in R is:
Thanks again! |
Hi Beth!
Here's a little example:
Here, town has 92 possible values.
If you run this code under an address sanitizer, you get the following error:
And here is a rough traceback:
I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values:
rpart/src/node.h
Lines 10 to 18 in 3980685
To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub.
https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html
Or with one of the docker images here:
https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base
Best wishes,
Michael
The text was updated successfully, but these errors were encountered: