Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Nomogram: Support for sparse data #2197

Merged
merged 4 commits into from
Apr 14, 2017
Merged

Conversation

nikicc
Copy link
Contributor

@nikicc nikicc commented Apr 6, 2017

Issue

Fixes #2165.

Description of changes
  • Statistics.utils: nanmin, nanmax, average, unique equivalents of numpy's that support sparse or dense matrices.
  • SharedComputeValue: add variable attribute.
  • Support nomogram on sparse data.
  • Some speedups for nomogram. Time measured on a small BoW data set of shape (140, 3000):
    • reconstruct_domain method: 3.0s -> 0.03 s
    • calculate_log_reg_coefficients method: TLDW (minutes+) -> 1.5 s
Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io
Copy link

codecov-io commented Apr 6, 2017

Codecov Report

Merging #2197 into master will increase coverage by 0.03%.
The diff coverage is 98.43%.

@@            Coverage Diff             @@
##           master    #2197      +/-   ##
==========================================
+ Coverage   67.64%   67.67%   +0.03%     
==========================================
  Files         319      319              
  Lines       54871    54926      +55     
==========================================
+ Hits        37119    37173      +54     
- Misses      17752    17753       +1

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 445662d...41e7d44. Read the comment docs.

@astaric astaric added this to the 3.4.2 milestone Apr 7, 2017
@nikicc nikicc force-pushed the nomogram-sparse branch 5 times, most recently from df7eb56 to f4fb3c1 Compare April 7, 2017 14:28
@nikicc nikicc changed the title [WiP] Nomogram Sparse Support Nomogram Sparse Support Apr 7, 2017
return np.prod(x.shape) != x.data.size


def _nan_min_max(x, axis=0, func=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is not callable (L245)...
The default could be one of min/max or this could be a required parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if axis == 0:
x = x.T

# TODO check & transform to correct format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what (incorrect) format is it now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X is usually in csr and hence when one calls this with axis=0 x becomes csc (due to transposing), which is isn't efficient for row slicing.

if n_nans:
return float('nan')
else:
n_values = np.prod(x.shape) - n_nans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why - n_nans ? Isn't it 0 in this else part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is 🙈

return np.unique(x, return_counts=return_counts)
else:
n_zeros = np.prod(x.shape) - x.data.size
r = np.unique(x.data, return_counts=return_counts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x.data can contain explicit zeros right? E.g. make a csr matrix and set a non-zero element to 0.
In this case you need to be careful about inserting another 0 below...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

""" Equivalent of np.unique that supports sparse or dense matrices. """
if not sp.issparse(x):
return np.unique(x, return_counts=return_counts)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is unnecessary here and in other functions, which first check if x is not sparse and return something.
It just adds an extra indentation to all of the actual function body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


def _sparse_has_zeros(x):
""" Check if sparse matrix contains any implicit zeros. """
return np.prod(x.shape) != x.data.size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably better to use x.nnz instead of x.data.size everywhere.
Looks like the spmatrix base class has nnz so every type should have it, while e.g. dok_matrix does not have .data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected. Though, methods still won't work for for dox_matrix since we rely on x.data elsewhere.

Compute values usually have a reference to the original variable so SharedComputeValue should have it too.
@nikicc
Copy link
Contributor Author

nikicc commented Apr 14, 2017

@lanzagar I think all issues are addressed now. Please, check again.

@nikicc nikicc force-pushed the nomogram-sparse branch 5 times, most recently from 2deeeab to 47b4187 Compare April 14, 2017 13:27
@lanzagar lanzagar changed the title Nomogram Sparse Support [ENH] Nomogram: Support for sparse data Apr 14, 2017
@lanzagar lanzagar merged commit c9266f8 into biolab:master Apr 14, 2017
@nikicc nikicc deleted the nomogram-sparse branch April 14, 2017 15:39
@nikicc nikicc mentioned this pull request Apr 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants