Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Statistics.util.stats: Fix negative #nans for sparse #1659

Merged
merged 1 commit into from
Oct 17, 2016

Conversation

nikicc
Copy link
Contributor

@nikicc nikicc commented Oct 13, 2016

Computing the number of nans for sparse matrices was broken and sometimes returned negative numbers.

Variable non_zero contains the number of defined values in each column. So to calculate the number of undefined (nans) for each column we have to substract non_zero from the number of values in the column i.e. X.shape[0] and not X.shape[1]. The bug was noticed when the number of defined values in some column was larger than the number of features and consequenlty the number of nans was negative. This caused density in OWTable to exceeded 100%.

screen shot 2016-10-13 at 11 55 57

Computing the number of nans for sparse matrices was broken and sometimes returned negative numbers.

Variable `non_zero` contains the number of defined values in each column. So to calculate the number of undefined (nans) for each column we have to substract `non_zero` from the number of values in the column i.e. `X.shape[0]` and not `X.shape[1]`. The bug was noticed when the number of defined values in some column was larger than the number of features and consequenlty the number of nans was negative. This caused density in OWTable to exceeded 100%.
@nikicc nikicc changed the title Statistics.util.stats: Fix negative #nans in sparse [FIX] Statistics.util.stats: Fix negative #nans for sparse Oct 13, 2016
@codecov-io
Copy link

codecov-io commented Oct 13, 2016

Current coverage is 88.73% (diff: 100%)

Merging #1659 into master will not change coverage

@@             master      #1659   diff @@
==========================================
  Files            78         78          
  Lines          8150       8150          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits           7232       7232          
  Misses          918        918          
  Partials          0          0          

Sunburst

Powered by Codecov. Last update 282b1b0...7b35830

Copy link
Contributor

@lanzagar lanzagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the wrong computation was unnoticed, that probably means there is no test for nan counting. Could you add a simple one?

@nikicc
Copy link
Contributor Author

nikicc commented Oct 17, 2016

@lanzagar IMHO we already have a test for this — the one that I corrected in this PR — just that it was incorrect.

For sparse data "nans" are actually values not present in sparse matrix and the test_stats_sparse already contain those since csr_matrix does not store zeros.

@lanzagar
Copy link
Contributor

You are right, the (not very clear) test that was already there (but broken) does test this too. I guess it is enough to cover the basics now that it is fixed.

@lanzagar lanzagar merged commit 20e7e66 into biolab:master Oct 17, 2016
@nikicc nikicc deleted the sparse-fix-nans-counting branch October 17, 2016 14:28
@astaric astaric modified the milestone: 3.3.9 Nov 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants