[FIX] Fix classification trees for data with repeated feature values #6488

Merged
janezd merged 2 commits into biolab:master from the fix-ubuntu-tree branch
Jul 13, 2023

Conversation

markotoplak
Member

@markotoplak markotoplak commented Jun 21, 2023

Issue

I started seeing this on github tests (only on Ubuntu).

======================================================================
FAIL: test_full_tree (Orange.tests.test_orangetree.TestClassifier)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/orange3/orange3/.tox/orange-released/lib/python3.9/site-packages/Orange/tests/test_orangetree.py", line 37, in test_full_tree
    self.assertTrue(np.all(table.Y.flatten() == pred))
AssertionError: False is not true

Description of changes

The bug ran deeper. See my last comment: find_threshold_entropy skipped computing too many entropies.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov

codecov bot commented Jun 21, 2023

Codecov Report

Merging #6488 (cbe2e11) into master (ff152a9) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6488   +/-   ##
=======================================
  Coverage   87.66%   87.66%           
=======================================
  Files         321      321           
  Lines       69374    69374           
=======================================
  Hits        60817    60817           
  Misses       8557     8557           

@markotoplak
Member Author

markotoplak commented Jun 21, 2023

Hmm, the trees are built differently.

When the test fails, I see the following splitting process.

evaluating split
sepal length 0.5511234895788761 5.4
sepal width 0.2679113691892653 3.3
petal length 0.9182958340544887 1.9
petal width 0.9182958340544887 0.6
split petal length [50. 50. 50.] 1.9
ending 50 [50.  0.  0.]
SA 100
evaluating split
sepal length 0.1318407255951035 7.0
sepal width 0.049856777608735386 2.9
petal length 0.6573737500732166 4.7

And then my debugging code crashed because the petal width split does not have a .threshold.
When the test does not fail, I see:

evaluating split
sepal length 0.445166742387752 5.3
sepal width 0.2679113691892653 3.3
petal length 0.9182958340544887 1.9
petal width 0.9182958340544887 0.6
split petal length [50. 50. 50.] 1.9
ending 50 [50.  0.  0.]
SA 100
evaluating split
sepal length 0.16049997364457932 6.1
sepal width 0.049856777608735386 2.9
petal length 0.6573737500732166 4.7
petal width 0.6901603707546751 1.7
split petal width [ 0. 50. 50.] 1.7
...

Furthermore, I see different scores for some other attributes (sepal length for the first and the second split).

@markotoplak markotoplak force-pushed the fix-ubuntu-tree branch 2 times, most recently from ce900b9 to ac93cb2 Compare June 22, 2023 12:49
@markotoplak markotoplak changed the title Fix test_full_tree fail [FIX] Fix classification trees for data with repeated feature values Jun 22, 2023
@markotoplak
Member Author

markotoplak commented Jun 22, 2023

During debugging, I saw that np.argsort (the default, unstable variant) ordered tied values differently across platforms. This would not have been a problem had there not been a bug in find_threshold_entropy, which tries to skip needless entropy computations.
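To illustrate the point about sort stability: a minimal sketch, not taken from this PR, showing that np.argsort only guarantees the relative order of tied values when kind="stable" is requested. The array is made up for illustration. With the default kind the order among equal values is an implementation detail and may differ between platforms or NumPy builds, so no expected output is given for it.

```python
import numpy as np

# Feature column with repeated values (hypothetical data).
x = np.array([2.0, 1.0, 2.0, 1.0, 2.0])

# Default sort kind: x[order_default] is sorted, but the order of the
# indices within each group of equal values is not guaranteed.
order_default = np.argsort(x)

# Stable sort: equal values keep their original relative order.
order_stable = np.argsort(x, kind="stable")
print(order_stable)  # [1 3 0 2 4]
```

Both orderings sort x correctly; they can differ only in how tied values are arranged, which is exactly what exposes code that wrongly depends on the arrangement of ties.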

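It skipped too many: it skipped a candidate threshold if the next class value was the same or the next feature value was the same. That was a problem when feature and class values could both (interchangeably) repeat, because the outcome then depended on how tied feature values happened to be ordered.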
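The effect can be reproduced with a small sketch. This is not the code of find_threshold_entropy; best_cut, its skip_same_class flag, and the toy data are hypothetical, and the "skip only on equal feature values" variant is one conservative way to avoid the problem, not necessarily what this PR does. The two class vectors below describe the same data and differ only in the order of instances with the same feature value.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a vector of class counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_cut(x, y, n_classes, skip_same_class):
    """Return (gain, threshold) of the best cut "x <= threshold", or None.

    x must be sorted ascending, with y in the matching order.
    skip_same_class=True mimics the over-eager shortcut described above.
    """
    total = np.bincount(y, minlength=n_classes).astype(float)
    left = np.zeros(n_classes)
    base, n, best = entropy(total), len(x), None
    for i in range(n - 1):
        left[y[i]] += 1
        if x[i] == x[i + 1]:
            continue  # equal feature values cannot be separated
        if skip_same_class and y[i] == y[i + 1]:
            continue  # wrong when ties in x hide a class change
        right = total - left
        n_left = i + 1
        gain = base - (n_left * entropy(left) + (n - n_left) * entropy(right)) / n
        if best is None or gain > best[0]:
            best = (gain, x[i])
    return best

x = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])
y_a = np.array([0, 0, 1, 1, 1, 1])  # one ordering of the tied instances
y_b = np.array([1, 0, 0, 1, 1, 1])  # same data, ties ordered differently

buggy_a = best_cut(x, y_a, 2, skip_same_class=True)   # None: the only cut is skipped
buggy_b = best_cut(x, y_b, 2, skip_same_class=True)   # finds the cut at 1.0
safe_a = best_cut(x, y_a, 2, skip_same_class=False)
safe_b = best_cut(x, y_b, 2, skip_same_class=False)   # same result as safe_a
```

With the shortcut enabled, whether the cut between 1.0 and 2.0 is evaluated depends on which tied instance happens to sit next to the boundary, so an unstable sort can change the tree. Without it, both orderings give the same gain and threshold.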

@markotoplak markotoplak marked this pull request as ready for review June 22, 2023 13:07
@janezd janezd self-assigned this Jun 23, 2023
@janezd janezd merged commit d17f021 into biolab:master Jul 13, 2023
@markotoplak markotoplak deleted the fix-ubuntu-tree branch November 6, 2023 13:18