[FIX] Fix classification trees for data with repeated feature values #6488

markotoplak · 2023-06-21T11:50:11Z

Issue

I started seeing this failure in the GitHub CI tests (only on Ubuntu).

======================================================================
FAIL: test_full_tree (Orange.tests.test_orangetree.TestClassifier)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/orange3/orange3/.tox/orange-released/lib/python3.9/site-packages/Orange/tests/test_orangetree.py", line 37, in test_full_tree
    self.assertTrue(np.all(table.Y.flatten() == pred))
AssertionError: False is not true

Description of changes

The bug ran deeper than the failing test: find_threshold_entropy skipped computing too many entropies (see my last comment).

Includes

Code changes
Tests
Documentation

codecov · 2023-06-21T12:17:19Z

Codecov Report

Merging #6488 (cbe2e11) into master (ff152a9) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6488   +/-   ##
=======================================
  Coverage   87.66%   87.66%           
=======================================
  Files         321      321           
  Lines       69374    69374           
=======================================
  Hits        60817    60817           
  Misses       8557     8557

markotoplak · 2023-06-21T13:40:08Z

Hmm, the trees are built differently.

When the test fails, I see the following splitting process.

evaluating split
sepal length 0.5511234895788761 5.4
sepal width 0.2679113691892653 3.3
petal length 0.9182958340544887 1.9
petal width 0.9182958340544887 0.6
split petal length [50. 50. 50.] 1.9
ending 50 [50.  0.  0.]
SA 100
evaluating split
sepal length 0.1318407255951035 7.0
sepal width 0.049856777608735386 2.9
petal length 0.6573737500732166 4.7

At that point my debugging code crashed because the petal width split does not have a .threshold.
When the test does not fail, I see:

evaluating split
sepal length 0.445166742387752 5.3
sepal width 0.2679113691892653 3.3
petal length 0.9182958340544887 1.9
petal width 0.9182958340544887 0.6
split petal length [50. 50. 50.] 1.9
ending 50 [50.  0.  0.]
SA 100
evaluating split
sepal length 0.16049997364457932 6.1
sepal width 0.049856777608735386 2.9
petal length 0.6573737500732166 4.7
petal width 0.6901603707546751 1.7
split petal width [ 0. 50. 50.] 1.7
...

Furthermore, the scores of some other attributes differ between the two runs (sepal length in both the first and the second split).
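For readers following the logs: the score 0.9182958340544887 reported for the first petal length split is consistent with the information gain of separating class counts [50, 50, 50] into [50, 0, 0] and [0, 50, 50]. The sketch below only checks that arithmetic; the helper names `entropy` and `information_gain` are hypothetical and are not Orange's implementation.

```python
import numpy as np


def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    total = left + right
    n = total.sum()
    return (entropy(total)
            - left.sum() / n * entropy(left)
            - right.sum() / n * entropy(right))


# Class counts taken from the log: the first split isolates one class completely.
gain = information_gain([50, 0, 0], [0, 50, 50])
print(gain)  # approximately 0.9183
```

Here the parent entropy is log2(3) and the weighted child entropy is 2/3, which gives the value seen in the log up to floating-point rounding.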

markotoplak · 2023-06-22T13:04:24Z

During debugging, I saw that np.argsort (the default, unstable variant) returned different orderings across platforms. This would not have been a problem had there not been a bug in find_threshold_entropy, which tries to skip needless entropy computations.
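As a small illustration of the stability point (this is not the code from this PR): np.argsort accepts kind="stable", which guarantees that tied values keep their original relative order. The default kind makes no such guarantee, so the order of ties may vary between platforms or NumPy builds.

```python
import numpy as np

# Feature values with ties: 1.3 appears at indices 1 and 3, 1.5 at indices 0 and 2.
x = np.array([1.5, 1.3, 1.5, 1.3, 1.4])

# A stable sort keeps tied elements in their original relative order.
stable_order = np.argsort(x, kind="stable")
print(stable_order)  # [1 3 4 0 2]

# The default sort returns a valid ordering too, but the order of the tied
# indices (1 vs 3, 0 vs 2) is not guaranteed.
default_order = np.argsort(x)
print(x[default_order])  # sorted values are the same either way
```

Any code that looks at neighbouring rows after sorting therefore has to give the same result for every ordering of the ties.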

It skipped too many: it skipped a candidate if either the next class value or the next feature value was the same. That was a problem when both feature and class values could repeat, in any interleaving.
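A minimal sketch of why that skip rule is sensitive to the order of ties. This is a simplified, hypothetical `best_threshold` in plain Python, not Orange's find_threshold_entropy; the flag `skip_equal_class` and the choice of `x[i]` as the threshold are assumptions made for illustration.

```python
import numpy as np


def entropy(counts):
    """Shannon entropy (in bits) of a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def best_threshold(x, y, n_classes, skip_equal_class):
    """Return (gain, threshold) for x sorted ascending, or (0.0, None) if no cut is scored."""
    total = np.bincount(y, minlength=n_classes).astype(float)
    left = np.zeros(n_classes)
    n = len(x)
    best = (0.0, None)
    for i in range(n - 1):
        left[y[i]] += 1
        if x[i] == x[i + 1]:
            continue  # safe skip: no cut is possible between equal feature values
        if skip_equal_class and y[i] == y[i + 1]:
            continue  # over-eager skip: depends on how ties in x happen to be ordered
        right = total - left
        gain = (entropy(total)
                - (i + 1) / n * entropy(left)
                - (n - i - 1) / n * entropy(right))
        if gain > best[0]:
            best = (gain, float(x[i]))
    return best


# The same data in two equally valid sort orders: only the tied rows at x == 1.0 swap.
x = np.array([1.0, 1.0, 2.0, 2.0])
y_a = np.array([0, 1, 1, 1])
y_b = np.array([1, 0, 1, 1])

buggy_a = best_threshold(x, y_a, 2, skip_equal_class=True)   # no cut is scored
buggy_b = best_threshold(x, y_b, 2, skip_equal_class=True)   # the cut at 1.0 is scored
fixed_a = best_threshold(x, y_a, 2, skip_equal_class=False)
fixed_b = best_threshold(x, y_b, 2, skip_equal_class=False)
print(buggy_a, buggy_b, fixed_a, fixed_b)
```

With the extra class check, order A yields no threshold at all while order B finds the cut at 1.0 (gain of about 0.311). Without it, both orders give the same result, so an unstable sort no longer matters.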

markotoplak force-pushed the fix-ubuntu-tree branch 2 times, most recently from ce900b9 to ac93cb2 Compare June 22, 2023 12:49

markotoplak changed the title ~~Fix test_full_tree fail~~ [FIX] Fix classification trees for data with repeated feature values Jun 22, 2023

markotoplak force-pushed the fix-ubuntu-tree branch from ac93cb2 to d1c9e14 Compare June 22, 2023 12:56

markotoplak force-pushed the fix-ubuntu-tree branch from d1c9e14 to 9ef29b7 Compare June 22, 2023 13:07

markotoplak marked this pull request as ready for review June 22, 2023 13:07

markotoplak force-pushed the fix-ubuntu-tree branch from 9ef29b7 to 01a5e07 Compare June 22, 2023 13:54

janezd self-assigned this Jun 23, 2023

markotoplak added 2 commits July 11, 2023 15:01

test_orangetree: use numpy's asserts

1852c10

fix tree's find_threshold_entropy for repeated values

cbe2e11

markotoplak force-pushed the fix-ubuntu-tree branch from 01a5e07 to cbe2e11 Compare July 11, 2023 13:01

janezd merged commit d17f021 into biolab:master Jul 13, 2023

markotoplak deleted the fix-ubuntu-tree branch November 6, 2023 13:18
