[FIX] Gini impurity: formula and docstring fixed. #1495

ajdapretnar · 2016-08-02T07:08:30Z

Gini coefficient and Gini impurity are not the same thing. The docstring now cites the correct Wikipedia source.

As observed in many sources, the formula for Gini impurity never contains division by 2. @BlazZupan, please check whether this is the right formula or not.
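For reference, a minimal sketch of the textbook definition being discussed, 1 - sum(p_i^2) over the class proportions p_i, with no factor of 1/2. The helper name `gini_impurity` and the sample counts are illustrative assumptions, not code from this PR.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity 1 - sum_i p_i**2 for a 1-D vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()          # class proportions
    return 1.0 - np.sum(p ** 2)

# Illustrative inputs (assumed, not taken from Orange's tests).
balanced = gini_impurity([5, 5])       # two equally likely classes
pure = gini_impurity([10, 0])          # a single class only
```

With two balanced classes the impurity is 0.5, the maximum for a binary problem; multiplying by 0.5 would cap it at 0.25 instead.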

Also, "Gini" was a misleading attribute name in the Rank widget: what we're actually measuring there is Gini decrease. So perhaps even "Gini Gain" is not the right name for it. Comments? @lanzagar @astaric
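To make the "Gini decrease" reading concrete, here is a rough sketch: the impurity of the overall class distribution minus the weighted impurity after splitting on the attribute. The function names are hypothetical, and the layout of `D` (rows are classes, columns are attribute values) is an assumption inferred from the `axis=0` normalization in `_gini`. How the Rank widget actually computes its score is not shown in this thread.

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity 1 - sum_i p_i**2 for a 1-D vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(D):
    """Impurity before the split minus the weighted impurity after it.

    Assumed layout: D[i, j] counts instances of class i that have
    attribute value j.
    """
    D = np.asarray(D, dtype=float)
    parent = gini_impurity(D.sum(axis=1))      # class totals, ignoring the attribute
    weights = D.sum(axis=0) / D.sum()          # share of instances per attribute value
    children = np.array([gini_impurity(col) for col in D.T])
    return parent - np.sum(weights * children)

perfect_split = gini_decrease([[4, 0], [0, 4]])   # attribute separates the classes
useless_split = gini_decrease([[2, 2], [2, 2]])   # attribute carries no information
```

Under these assumptions a perfectly separating attribute scores 0.5 and an uninformative one scores 0, so a higher value means a better attribute.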

Probably some fixes to names should be made elsewhere, too.

ajdapretnar · 2016-08-02T12:41:53Z

Tests are now fixed as well.

codecov-io · 2016-08-02T12:49:36Z

Current coverage is 88.86% (diff: 100%)

Merging #1495 into master will not change coverage

@@             master      #1495   diff @@
==========================================
  Files            78         78          
  Lines          8099       8099          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           7197       7197          
  Misses          902        902          
  Partials          0          0

Powered by Codecov. Last update abbf69f...c839687

kernc · 2016-09-15T23:52:05Z

Orange/preprocess/score.py

@@ -207,7 +207,7 @@ def _gini(D):
    """Gini index of class-distribution matrix"""
    P = D / np.sum(D, axis=0)
    return sum((np.ones(1 if len(D.shape) == 1 else D.shape[1]) - np.sum(np.square(P), axis=0))
-               * 0.5 * np.sum(D, axis=0) / np.sum(D))
+               * np.sum(D, axis=0) / np.sum(D))


Among the several equivalent representations of the proposed equation, this one seems nicest:

return 1 - (P*P).sum()

I agree it looks the nicest and most concise. Maybe @lanzagar can also comment?

At first glance it looks to me like at least the second line is needed to get the weighted impurity we want.
It is possible that 1 can be used instead of the complications with np.ones (should be tested).
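A quick way to check both points is sketched below. The helper names and sample arrays are made up for illustration. The first two helpers compare the `np.ones` term from the diff with a plain scalar `1`; the third is the concise form without the weighting line.

```python
import numpy as np

def impurity_np_ones(D):
    # Per-column impurity term as written in the diff, using np.ones.
    P = D / np.sum(D, axis=0)
    ones = np.ones(1 if len(D.shape) == 1 else D.shape[1])
    return ones - np.sum(np.square(P), axis=0)

def impurity_scalar_one(D):
    # Same term with a scalar 1; NumPy broadcasting supplies the shape.
    P = D / np.sum(D, axis=0)
    return 1 - np.sum(np.square(P), axis=0)

def short_form(D):
    # The concise form: one number, no per-column weighting.
    P = D / np.sum(D, axis=0)
    return 1 - (P * P).sum()

# Illustrative inputs (assumed): a 1-D distribution and a 2-D matrix.
vector = np.array([3.0, 1.0])
matrix = np.array([[4.0, 0.0], [0.0, 4.0]])

same_1d = np.allclose(impurity_np_ones(vector), impurity_scalar_one(vector))
same_2d = np.allclose(impurity_np_ones(matrix), impurity_scalar_one(matrix))
short_2d = short_form(matrix)   # sums squared proportions over every column at once
```

In this sketch the scalar `1` gives the same values as `np.ones` for both shapes, while the concise form agrees only for 1-D input: for the 2-D matrix it returns -1.0, because the squared proportions of all columns are summed together. One caveat with the scalar version: for 1-D input the term is a NumPy scalar, which the built-in `sum` cannot iterate over, so the outer reduction would need to be `np.sum`.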

This is out of my league. Perhaps merge this PR to fix the division by two and open another one with the improved formula? Otherwise I'll just copy-paste whatever you tell me to. 😀

Copy paste:

P = np.asarray(D / np.sum(D, axis=0))
return np.sum((1 - np.sum(P**2, axis=0)) * np.sum(D, axis=0) / np.sum(D))
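For anyone who wants to try the suggestion, here it is wrapped in the `_gini(D)` signature from the diff, next to a halved variant that mimics the old `* 0.5` behaviour. The `_gini_halved` helper and the sample matrix are assumptions for illustration only.

```python
import numpy as np

def _gini(D):
    """Weighted Gini impurity of a class-distribution matrix (or 1-D vector)."""
    P = np.asarray(D / np.sum(D, axis=0))
    return np.sum((1 - np.sum(P**2, axis=0)) * np.sum(D, axis=0) / np.sum(D))

def _gini_halved(D):
    # Hypothetical stand-in for the pre-fix formula with its extra 0.5 factor.
    return 0.5 * _gini(D)

# Assumed example: rows are classes, columns are attribute values.
D = np.array([[3.0, 1.0],
              [1.0, 3.0]])

fixed = _gini(D)            # each column has impurity 0.375, weighted equally
halved = _gini_halved(D)    # what the old scaling would have reported
flat = _gini(np.array([5.0, 5.0]))   # 1-D input also works
```

Here `fixed` is 0.375, `halved` is 0.1875, and the balanced 1-D vector gives 0.5.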

ajdapretnar assigned ajdapretnar and BlazZupan and unassigned ajdapretnar Aug 2, 2016

ajdapretnar force-pushed the gini-formula branch from 57c4b71 to e518b73 Compare August 2, 2016 12:41

ajdapretnar mentioned this pull request Sep 7, 2016

Rank Widget Improvements #1547

Closed

BlazZupan approved these changes Sep 15, 2016

kernc suggested changes Sep 15, 2016

ajdapretnar force-pushed the gini-formula branch 2 times, most recently from 2b4a24e to 81b4360 Compare September 16, 2016 10:10

lanzagar changed the title ~~[RFC] Gini impurity: formula and docstring fixed.~~ [FIX] Gini impurity: formula and docstring fixed. Sep 16, 2016

lanzagar approved these changes Sep 16, 2016

Gini impurity: formula and docstring fixed.

c839687

ajdapretnar force-pushed the gini-formula branch from 81b4360 to c839687 Compare September 16, 2016 10:24

lanzagar merged commit 3f3ebd4 into biolab:master Sep 16, 2016

ajdapretnar deleted the gini-formula branch October 14, 2016 10:04
