suspected bug: rpart won't split unbalanced data even with infinitely small CP #70

Open
ilchef opened this issue Jan 2, 2025 · 0 comments

ilchef commented Jan 2, 2025

The rpart vignette implies that the fitted tree is the smallest subtree T for which R_α(T) is minimised.

However, I have found that this doesn't happen for unbalanced data. This suggests to me that the loss/risk function used here is accuracy (misclassification rate) rather than the Gini index.

Reproducible example:

library(dplyr)
library(rpart)
# as.factor() must close before the smoking column is defined
df <- data.frame(lung_cancer = as.factor(c(1,1,0,0,0,0,0,0,0,0)), smoking = c(1,0,1,1,0,0,0,0,0,0))
tree <- rpart(lung_cancer ~ smoking, data = df, control = rpart.control(minbucket = 3, minsplit = 4, cp = 0.000000000001), parms = list(split = "gini"))
print(tree)

Eyeballing this, it should split at smoking = 0.5, giving one terminal node with one 1 and two 0s, and another with one 1 and six 0s. This has lower impurity (by either the Gini or the information index) than the root node of two 1s and eight 0s.
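To check the eyeballed numbers, here is a minimal base-R sketch that computes the impurity of the root node and the size-weighted impurity of the two candidate children. The helpers `gini`, `entropy` and `weighted_impurity` are hypothetical names written for this illustration, not rpart functions, and rpart may scale its internal "improve" value differently (for example, by using a different log base for the information index).

```r
# Class labels and predictor copied from the reproducible example.
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Entropy (information index) in bits: -sum(p_k * log2(p_k)).
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Hypothetical helper: size-weighted impurity of the two children of a split.
weighted_impurity <- function(y, left, impurity) {
  n <- length(y)
  (sum(left) / n) * impurity(y[left]) + (sum(!left) / n) * impurity(y[!left])
}

left <- smoking < 0.5  # the candidate split at 0.5

root_gini     <- gini(lung_cancer)                              # 0.32
split_gini    <- weighted_impurity(lung_cancer, left, gini)     # about 0.305
root_entropy  <- entropy(lung_cancer)
split_entropy <- weighted_impurity(lung_cancer, left, entropy)

print(c(root_gini = root_gini, split_gini = split_gini))
print(c(root_entropy = root_entropy, split_entropy = split_entropy))
```

Under both measures the weighted child impurity comes out below the root impurity, so the split is an improvement in the impurity sense, although a small one.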

However, rpart doesn't make this split: it produces 0 splits, even as cp -> 0.

n= 10

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 10 2 0 (0.8000000 0.2000000) *

I've observed that this only seems to happen on unbalanced data (i.e. when the class prediction is the same on all terminal nodes). This leads me to believe that the risk function R is accuracy, not an impurity measure as implied and presumably intended.
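The following base-R sketch illustrates why this suspicion fits the example. It assumes the risk compared against cp is the misclassification count, which matches the `loss` column in the printed output but is not confirmed here. The helper `misclassified` is a hypothetical name for this illustration.

```r
# Class labels and predictor copied from the reproducible example.
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Observations misclassified when a node predicts its majority class.
misclassified <- function(y) length(y) - max(table(y))

left <- smoking < 0.5  # the candidate split at 0.5

root_loss  <- misclassified(lung_cancer)             # 2
split_loss <- misclassified(lung_cancer[left]) +
              misclassified(lung_cancer[!left])      # 1 + 1 = 2

print(c(root_loss = root_loss, split_loss = split_loss))
```

Both children still predict class 0, so the misclassification count is 2 before and after the split. Under that assumed risk the split gains nothing, and any positive cp, however small, would be enough to reject it.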
