The rpart vignette implies that the fitted partition is the smallest subtree T for which R_α(T) is minimised.
However, I have found that this does not happen for unbalanced data, which suggests to me that the loss/risk function used here is accuracy rather than the Gini index.
Eyeballing the reproducible example below, rpart should split the data at smoking = 0.5, giving one terminal node with one 1 and two 0s and another with one 1 and six 0s. This has lower impurity (by either the Gini or the information index) than the initial node of two 1s and eight 0s.
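The impurity comparison can be checked by hand. The sketch below uses base R only; `gini` is a hypothetical helper introduced for illustration (it is not an rpart function), and it assumes the Gini index is defined as 1 - sum(p^2) over the class proportions in a node.

```r
# Data from the reproducible example (outcome kept numeric for simplicity).
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Hypothetical helper (not an rpart function): Gini index 1 - sum(p^2).
gini <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  1 - sum(p^2)
}

root  <- gini(lung_cancer)                  # two 1s, eight 0s
left  <- gini(lung_cancer[smoking < 0.5])   # one 1, six 0s
right <- gini(lung_cancer[smoking >= 0.5])  # one 1, two 0s

# Child impurities weighted by the share of observations in each child.
w_left   <- mean(smoking < 0.5)
children <- w_left * left + (1 - w_left) * right

print(c(root = root, children = children, decrease = root - children))
```

Under these assumptions the weighted child impurity is about 0.305 against 0.32 at the root, so the Gini decrease is small but positive.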
However, rpart does not make this split: it produces 0 splits, even as cp -> 0.
n= 10
node), split, n, loss, yval, (yprob)
* denotes terminal node
root 10 2 0 (0.8000000 0.2000000) *
I've observed that this only seems to happen on unbalanced data (i.e. when the class prediction is the same on all terminal nodes), which leads me to believe the risk function R is accuracy, not an impurity measure as implied and presumably intended.
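To illustrate the hypothesis that R behaves like a misclassification count here, the sketch below compares the number of misclassified observations before and after the candidate split. It uses base R only; `misclass` is a hypothetical helper, not an rpart function, and the calculation assumes each node predicts its majority class.

```r
# Data from the reproducible example (outcome kept numeric for simplicity).
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Hypothetical helper: observations outside the node's majority class.
misclass <- function(y) length(y) - max(table(y))

# Risk of the unsplit root versus the summed risk of the two children.
risk_root  <- misclass(lung_cancer)
risk_split <- misclass(lung_cancer[smoking < 0.5]) +
  misclass(lung_cancer[smoking >= 0.5])

print(c(root = risk_root, after_split = risk_split))
```

Both values are 2, because both children still predict class 0. A criterion based on this quantity sees no gain from the split, whereas an impurity-based criterion sees a positive gain, which is consistent with the behaviour reported here.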
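Reproducible example:
library(dplyr)
library(rpart)
# Unbalanced outcome: two 1s and eight 0s; smoking is the only predictor.
df <- data.frame(lung_cancer = as.factor(c(1,1,0,0,0,0,0,0,0,0)), smoking = c(1,0,1,1,0,0,0,0,0,0))
# cp is set close to 0 with the intent of keeping any split that improves the fit.
tree <- rpart(lung_cancer ~ smoking, data = df, control = rpart.control(minbucket = 3, minsplit = 4, cp = 0.000000000001), parms = list(split = "gini"))
print(tree)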