[ENH] Nomogram: Support for sparse data #2197
Codecov Report
@@            Coverage Diff             @@
##           master    #2197      +/-   ##
==========================================
+ Coverage   67.64%   67.67%   +0.03%
==========================================
  Files         319      319
  Lines       54871    54926      +55
==========================================
+ Hits        37119    37173      +54
- Misses      17752    17753       +1
df7eb56 to f4fb3c1
Orange/statistics/util.py
Outdated
    return np.prod(x.shape) != x.data.size


def _nan_min_max(x, axis=0, func=None):
None is not callable (L245)...
The default could be one of min/max or this could be a required parameter.
Fixed.
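For illustration only, here is a minimal sketch of the "required parameter" option suggested above. The name `nan_min_max` and the whole-matrix behavior (no `axis` argument) are assumptions made for this example, not the code that was merged in this PR.

```python
import numpy as np
import scipy.sparse as sp


def nan_min_max(x, func):
    """Hypothetical helper: NaN-ignoring min/max over a whole matrix.

    `func` is required (no None default) and is expected to be
    np.nanmin or np.nanmax.
    """
    if not callable(func):
        raise TypeError("func must be callable, e.g. np.nanmin or np.nanmax")
    if not sp.issparse(x):
        return func(x)

    x = sp.csr_matrix(x)
    # Explicitly stored values, with NaNs dropped.
    candidates = x.data[~np.isnan(x.data)]
    # Implicit zeros are not in x.data, so add a single 0 to stand in for them.
    if x.nnz < np.prod(x.shape):
        candidates = np.append(candidates, 0.0)
    if candidates.size == 0:
        # Nothing but NaNs (or an empty matrix): report NaN.
        return float('nan')
    return func(candidates)
```

Calling `nan_min_max(x, np.nanmin)` or `nan_min_max(x, np.nanmax)` then makes the choice explicit at the call site, and passing `None` fails early with a clear message.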
Orange/statistics/util.py
Outdated
    if axis == 0:
        x = x.T

    # TODO check & transform to correct format
In what (incorrect) format is it now?
X is usually in csr format, so when one calls this with axis=0, x becomes csc (due to transposing), which isn't efficient for row slicing.
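A small sketch of the format change described here, using a made-up 2x3 matrix. Converting back with `tocsr()` is one possible way to handle the TODO, not necessarily what the PR ended up doing.

```python
import numpy as np
import scipy.sparse as sp

# Example data invented for this illustration.
x = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))

# Transposing a CSR matrix yields a CSC matrix.
xt = x.T
print(x.format, xt.format)

# Convert back to CSR before slicing rows of the transposed matrix.
xt_csr = xt.tocsr()
row = xt_csr[0]
print(row.toarray())
```

The conversion costs one pass over the stored values, which is usually cheaper than repeated row slicing on a CSC matrix.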
Orange/statistics/util.py
Outdated
    if n_nans:
        return float('nan')
    else:
        n_values = np.prod(x.shape) - n_nans
Why `- n_nans`? Isn't it 0 in this `else` part?
It is 🙈
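To make the point concrete, here is a simplified sketch of that branch with the redundant subtraction removed. The function name `sparse_mean` and the whole-matrix scope are assumptions for this example; the actual function in `Orange/statistics/util.py` may differ.

```python
import numpy as np
import scipy.sparse as sp


def sparse_mean(x):
    """Hypothetical helper: mean over all elements, propagating NaN."""
    if not sp.issparse(x):
        return np.mean(x)

    x = sp.csr_matrix(x)
    n_nans = np.isnan(x.data).sum()
    if n_nans:
        return float('nan')
    # No NaNs on this path, so every element (stored or implicit) counts.
    n_values = np.prod(x.shape)
    return x.data.sum() / n_values
```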
Orange/statistics/util.py
Outdated
        return np.unique(x, return_counts=return_counts)
    else:
        n_zeros = np.prod(x.shape) - x.data.size
        r = np.unique(x.data, return_counts=return_counts)
`x.data` can contain explicit zeros, right? E.g. make a csr matrix and set a non-zero element to 0.
In this case you need to be careful about inserting another 0 below...
Fixed.
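One way the explicit-zero case could be handled is sketched below. `sparse_unique` is a hypothetical name and this is not necessarily the fix that was applied; the idea is to merge the implicit-zero count into an existing 0 entry instead of inserting a second one.

```python
import numpy as np
import scipy.sparse as sp


def sparse_unique(x, return_counts=False):
    """Hypothetical np.unique equivalent for sparse or dense matrices."""
    if not sp.issparse(x):
        return np.unique(x, return_counts=return_counts)

    x = sp.csr_matrix(x)
    # nnz counts stored entries, including explicitly stored zeros.
    n_implicit = int(np.prod(x.shape)) - x.nnz
    values, counts = np.unique(x.data, return_counts=True)
    if n_implicit:
        zero_pos = np.searchsorted(values, 0)
        if zero_pos < values.size and values[zero_pos] == 0:
            # An explicit zero is already present: just add to its count.
            counts[zero_pos] += n_implicit
        else:
            values = np.insert(values, zero_pos, 0)
            counts = np.insert(counts, zero_pos, n_implicit)
    return (values, counts) if return_counts else values
```

A matrix built directly from `(data, indices, indptr)` with a 0 in `data` is a convenient test case, since that constructor keeps explicit zeros as stored entries.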
Orange/statistics/util.py
Outdated
    """ Equivalent of np.unique that supports sparse or dense matrices. """
    if not sp.issparse(x):
        return np.unique(x, return_counts=return_counts)
    else:
`else` is unnecessary here and in other functions that first check if x is not sparse and return something.
It just adds an extra level of indentation to the whole function body.
Fixed.
Orange/statistics/util.py
Outdated
def _sparse_has_zeros(x):
    """ Check if sparse matrix contains any implicit zeros. """
    return np.prod(x.shape) != x.data.size
It is probably better to use `x.nnz` instead of `x.data.size` everywhere.
Looks like the `spmatrix` base class has `nnz`, so every type should have it, while e.g. `dok_matrix` does not have `.data`.
Corrected. Though, the methods still won't work for `dok_matrix` since we rely on `x.data` elsewhere.
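As a rough sketch of the `nnz`-based check, the snippet below uses a hypothetical public name, `sparse_has_implicit_zeros`, in place of the private helper. Because it only touches `shape` and `nnz`, it works for `dok_matrix` as well.

```python
import numpy as np
import scipy.sparse as sp


def sparse_has_implicit_zeros(x):
    """Hypothetical check: does the sparse matrix have unstored elements?"""
    return bool(np.prod(x.shape) != x.nnz)


# A DOK matrix with a single stored entry out of four.
dok = sp.dok_matrix((2, 2))
dok[0, 0] = 1.0

# A CSR matrix where every element is stored.
full = sp.csr_matrix(np.ones((2, 2)))

print(sparse_has_implicit_zeros(dok), sparse_has_implicit_zeros(full))
```

Note that `nnz` counts stored entries, so explicitly stored zeros are included in it.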
Compute values usually have a reference to the original variable so SharedComputeValue should have it too.
@lanzagar I think all issues are addressed now. Please check again.
2deeeab to 47b4187
Issue
Fixes #2165.
Description of changes
- nanmin, nanmax, average, unique equivalents of numpy's that support sparse or dense matrices.
- reconstruct_domain method: 3.0 s -> 0.03 s
- calculate_log_reg_coefficients method: TLDW (minutes+) -> 1.5 s
Includes