[ENH] Nomogram: Support for sparse data #2197

nikicc · 2017-04-06T16:02:33Z

Issue

Fixes #2165.

Description of changes

Statistics.util: nanmin, nanmax, average and unique, equivalents of NumPy's functions that support sparse or dense matrices.
SharedComputeValue: add variable attribute.
Support nomogram on sparse data.
Some speedups for nomogram. Times measured on a small bag-of-words (BoW) data set of shape (140, 3000):
- reconstruct_domain method: 3.0 s -> 0.03 s
- calculate_log_reg_coefficients method: TLDW (too long, didn't wait; minutes+) -> 1.5 s

Includes

Code changes
Tests
Documentation

codecov-io · 2017-04-06T16:42:34Z

Codecov Report

Merging #2197 into master will increase coverage by 0.03%.
The diff coverage is 98.43%.

@@            Coverage Diff             @@
##           master    #2197      +/-   ##
==========================================
+ Coverage   67.64%   67.67%   +0.03%     
==========================================
  Files         319      319              
  Lines       54871    54926      +55     
==========================================
+ Hits        37119    37173      +54     
- Misses      17752    17753       +1

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 445662d...41e7d44. Read the comment docs.

lanzagar · 2017-04-11T14:31:52Z

Orange/statistics/util.py

+    return np.prod(x.shape) != x.data.size
+
+
+def _nan_min_max(x, axis=0, func=None):


None is not callable (L245)...
The default could be one of min/max or this could be a required parameter.

lanzagar · 2017-04-11T14:35:07Z

Orange/statistics/util.py

+        if axis == 0:
+            x = x.T
+
+        # TODO check & transform to correct format


In what (incorrect) format is it now?

X is usually in CSR, so when this is called with axis=0, x becomes CSC (due to transposing), which isn't efficient for row slicing.
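A small sketch of the format change described above, assuming SciPy's csr_matrix. The matrix values are made up for illustration, and converting back with tocsr() is one possible fix, not necessarily the one applied in this PR.

```python
import numpy as np
import scipy.sparse as sp

x = sp.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))

xt = x.T                          # transposing a CSR matrix yields CSC
fmt_after_transpose = xt.format

xt_csr = xt.tocsr()               # convert back so row slicing is cheap again
fmt_after_convert = xt_csr.format

# Row 0 of the transpose is column 0 of the original matrix.
first_row = xt_csr[0].toarray().ravel()
```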

lanzagar · 2017-04-11T14:38:58Z

Orange/statistics/util.py

+        if n_nans:
+            return float('nan')
+        else:
+            n_values = np.prod(x.shape) - n_nans


Why - n_nans? Isn't it 0 in this else branch?
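To make the point concrete, a hedged sketch of a NaN-propagating mean over all entries of a sparse matrix. The function name sparse_mean is hypothetical and the code is a simplification, not the implementation under review: once the NaN case has returned, the divisor is simply the total number of entries.

```python
import numpy as np
import scipy.sparse as sp


def sparse_mean(x):
    """Hypothetical mean over all entries of a sparse matrix; NaN propagates."""
    if np.isnan(x.data).any():
        return float("nan")
    # No NaNs past this point, so every entry (stored or implicit zero) counts.
    return x.data.sum() / np.prod(x.shape)


x = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 3.0]]))
m = sparse_mean(x)

x_nan = sp.csr_matrix(np.array([[np.nan, 0.0], [0.0, 3.0]]))
m_nan = sparse_mean(x_nan)
```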

lanzagar · 2017-04-11T15:08:39Z

Orange/statistics/util.py

+        return np.unique(x, return_counts=return_counts)
+    else:
+        n_zeros = np.prod(x.shape) - x.data.size
+        r = np.unique(x.data, return_counts=return_counts)


x.data can contain explicit zeros, right? E.g. make a csr matrix and set a non-zero element to 0.
In that case you need to be careful about inserting another 0 below...
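A sketch of the pitfall and one way around it, assuming a CSR input without NaNs. The function name sparse_unique is hypothetical; dropping explicit zeros from a copy before counting is one option, not necessarily what the PR ended up doing.

```python
import numpy as np
import scipy.sparse as sp


def sparse_unique(x):
    """Hypothetical unique-with-counts for a sparse matrix with explicit zeros."""
    x = x.copy()
    x.eliminate_zeros()  # drop stored zeros so 0 is counted in one place only
    values, counts = np.unique(x.data, return_counts=True)
    n_zeros = np.prod(x.shape) - x.nnz
    if n_zeros:
        # Insert 0 at its sorted position together with its count.
        pos = np.searchsorted(values, 0)
        values = np.insert(values, pos, 0)
        counts = np.insert(counts, pos, n_zeros)
    return values, counts


x = sp.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]))
x.data[0] = 0.0          # overwrite a stored element: the zero stays in x.data
stored = x.nnz           # the explicit zero is still counted as stored

values, counts = sparse_unique(x)
dense_values, dense_counts = np.unique(x.toarray(), return_counts=True)
```

Comparing against np.unique on the dense array is a cheap check that 0 appears only once in the result.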

lanzagar · 2017-04-11T15:14:05Z

Orange/statistics/util.py

+    """ Equivalent of np.unique that supports sparse or dense matrices. """
+    if not sp.issparse(x):
+        return np.unique(x, return_counts=return_counts)
+    else:


else is unnecessary here and in the other functions that first check whether x is not sparse and return early.
It just adds an extra level of indentation to the whole function body.
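The suggested layout, sketched on a hypothetical helper (count_nonzero is an illustrative name, not a function from this PR): the dense case returns early and the sparse path continues at function level.

```python
import numpy as np
import scipy.sparse as sp


def count_nonzero(x):
    """Hypothetical helper illustrating the early-return layout."""
    if not sp.issparse(x):
        return int(np.count_nonzero(x))
    # Sparse path continues here, without an else block.
    return int(x.count_nonzero())


dense = np.array([[1.0, 0.0], [0.0, 3.0]])
n_dense = count_nonzero(dense)
n_sparse = count_nonzero(sp.csr_matrix(dense))
```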

lanzagar · 2017-04-11T15:25:05Z

Orange/statistics/util.py

+
+def _sparse_has_zeros(x):
+    """ Check if sparse matrix contains any implicit zeros. """
+    return np.prod(x.shape) != x.data.size


It is probably better to use x.nnz instead of x.data.size everywhere.
Looks like the spmatrix base class has nnz so every type should have it, while e.g. dok_matrix does not have .data

Corrected. Though the methods still won't work for dok_matrix, since we rely on x.data elsewhere.
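A short sketch of why nnz is the more portable count, using lil_matrix as an extra illustration (an assumption of this example, not a format discussed in the review): its data attribute holds one list per row, so data.size is the number of rows rather than the number of stored values.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
csr = sp.csr_matrix(dense)
lil = sp.lil_matrix(dense)

# For CSR both attributes count stored values.
csr_counts = (csr.nnz, csr.data.size)

# For LIL, data is an object array with one list per row.
lil_counts = (lil.nnz, lil.data.size)
```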

Compute values usually have a reference to the original variable so SharedComputeValue should have it too.

nikicc · 2017-04-14T08:17:11Z

@lanzagar I think all issues are addressed now. Please, check again.

nikicc force-pushed the nomogram-sparse branch from 810e3f9 to cb04b60 Compare April 6, 2017 16:09

astaric added this to the 3.4.2 milestone Apr 7, 2017

astaric assigned lanzagar Apr 7, 2017

nikicc force-pushed the nomogram-sparse branch 5 times, most recently from df7eb56 to f4fb3c1 Compare April 7, 2017 14:28

nikicc changed the title ~~[WiP] Nomogram Sparse Support~~ Nomogram Sparse Support Apr 7, 2017

lanzagar requested changes Apr 11, 2017

View reviewed changes

SharedComputeValue: Add variable attribute

b2e728e

Compute values usually have a reference to the original variable so SharedComputeValue should have it too.

nikicc force-pushed the nomogram-sparse branch from f4fb3c1 to 610cae2 Compare April 14, 2017 08:13

nikicc force-pushed the nomogram-sparse branch 5 times, most recently from 2deeeab to 47b4187 Compare April 14, 2017 13:27

nikicc added 3 commits April 14, 2017 15:34

Statistics: nanmin, nanmax, mean, nanmean, unique for dense/sparse

c0808ba

Nomogram: Support sparse data

bae1746

Nomogram: Speedups

41e7d44

nikicc force-pushed the nomogram-sparse branch from 47b4187 to 41e7d44 Compare April 14, 2017 13:34

lanzagar approved these changes Apr 14, 2017

View reviewed changes

lanzagar changed the title ~~Nomogram Sparse Support~~ [ENH] Nomogram: Support for sparse data Apr 14, 2017

lanzagar merged commit c9266f8 into biolab:master Apr 14, 2017

nikicc deleted the nomogram-sparse branch April 14, 2017 15:39

nikicc mentioned this pull request Apr 19, 2017

Nomogram fails on sparse #2165

Closed
