-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Gini impurity: formula and docstring fixed. #1495
Conversation
57c4b71
to
e518b73
Compare
Tests are now fixed as well. |
Current coverage is 88.86% (diff: 100%)@@ master #1495 diff @@
==========================================
Files 78 78
Lines 8099 8099
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
Hits 7197 7197
Misses 902 902
Partials 0 0
|
@@ -207,7 +207,7 @@ def _gini(D): | |||
"""Gini index of class-distribution matrix""" | |||
P = D / np.sum(D, axis=0) | |||
return sum((np.ones(1 if len(D.shape) == 1 else D.shape[1]) - np.sum(np.square(P), axis=0)) | |||
* 0.5 * np.sum(D, axis=0) / np.sum(D)) | |||
* np.sum(D, axis=0) / np.sum(D)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Among the several equivalent representations of the proposed equation, this one seems nicest:
return 1 - (P*P).sum()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it looks the nicest and most concise. Can maybe @lanzagar also comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance it looks to me like at least the second line is needed to get the weighted impurity we want.
It is possible that 1 can be used instead of the complications with np.ones (should be tested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is out of my league. Perhaps merge this PR to fix division by two and make another one with improved formula? Otherwise I'll just copy-paste whatever you'll tell me to. 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy paste:
P = np.asarray(D / np.sum(D, axis=0))
return np.sum((1 - np.sum(P**2, axis=0)) *
np.sum(D, axis=0) / np.sum(D))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
2b4a24e
to
81b4360
Compare
81b4360
to
c839687
Compare
Gini coefficient and Gini impurity are not the same. Docstring fixed to proper Wikipedia source.
As observed in many sources, the formula for Gini impurity never contains division by 2. @BlazZupan, please check whether this is the right formula or not.
Also, "Gini" was a misleading attribute name in Rank widget. Here we're actually measuring Gini decrease. So even perhaps "Gini Gain" might not be the right name for this. Comments? @lanzagar @astaric
Probably some fixes to names should be made elsewhere, too.