suspected bug: rpart won't split unbalanced data even with infinitely small CP #70

ilchef · 2025-01-02T10:24:10Z

The rpart vignette implies that the fitted tree is the smallest subtree T for which the cost-complexity risk R_α(T) is minimised.

However, I have found that this doesn't happen for unbalanced data, which suggests to me that the loss/risk function used here is accuracy (misclassification) rather than the Gini index.

Reproducible example:

library(dplyr)
library(rpart)
# as.factor() must close before the smoking column is passed to data.frame()
df <- data.frame(lung_cancer = as.factor(c(1,1,0,0,0,0,0,0,0,0)), smoking = c(1,0,1,1,0,0,0,0,0,0))
tree <- rpart(lung_cancer ~ smoking, data = df, control = rpart.control(minbucket = 3, minsplit = 4, cp = 0.000000000001), parms = list(split = "gini"))
print(tree)

Eyeballing this, rpart should split on smoking at 0.5, giving one terminal node with one 1 and two 0's and another with one 1 and six 0's. That split has lower impurity (by either the Gini or the information index) than the root node of two 1's and eight 0's.
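To back up the eyeballing with arithmetic, here is a minimal base-R sketch that computes the Gini impurity of the root and the size-weighted impurity after splitting on smoking at 0.5. The helper `gini` is a name made up for this illustration and is not part of rpart; it only checks the hand calculation and says nothing about what rpart does internally.

```r
# Hypothetical helper (not an rpart function): Gini impurity of a label vector.
gini <- function(y) {
  p <- table(y) / length(y)  # class proportions in the node
  1 - sum(p^2)
}

lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

left  <- lung_cancer[smoking < 0.5]   # non-smokers: one 1, six 0's
right <- lung_cancer[smoking >= 0.5]  # smokers: one 1, two 0's

g_root  <- gini(lung_cancer)
# Impurity after the split, weighted by the number of rows in each child.
g_split <- (length(left) * gini(left) + length(right) * gini(right)) /
  length(lung_cancer)

print(c(root = g_root, split = g_split, decrease = g_root - g_split))
```

The root impurity is 0.32 and the weighted impurity after the split is 32/105 (about 0.305), so the Gini criterion does see a small improvement from this split.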

However, rpart produces 0 splits, even as cp -> 0:

n= 10

node), split, n, loss, yval, (yprob)
* denotes terminal node

root 10 2 0 (0.8000000 0.2000000) *

I've observed that this only seems to happen on unbalanced data (i.e. when the predicted class is the same in all terminal nodes), which leads me to believe the risk function R is accuracy, not an impurity measure as implied and presumably intended.
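For comparison, the same candidate split can be scored by misclassification count instead of impurity. This is only a sketch under my assumption that "accuracy" means the number of rows a node gets wrong when it predicts its majority class; `misclass` is a hypothetical helper, not an rpart function.

```r
# Hypothetical helper (not an rpart function): rows misclassified when a
# node predicts its majority class.
misclass <- function(y) length(y) - max(table(y))

lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

root_loss  <- misclass(lung_cancer)
# Sum of the losses in the two children after splitting on smoking at 0.5.
split_loss <- misclass(lung_cancer[smoking < 0.5]) +
  misclass(lung_cancer[smoking >= 0.5])

print(c(root = root_loss, split = split_loss))
```

Both values are 2, matching the loss of 2 printed for the root node above: the split leaves the misclassification count unchanged because both children still predict class 0. That is consistent with the hypothesis, though it does not by itself confirm how rpart computes R.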
