-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Scatter Plot: dealing with scipy sparse matrix #2152
Conversation
Orange/widgets/utils/scaling.py
Outdated
@@ -47,7 +47,9 @@ def _compute_scaled_data(self): | |||
return | |||
|
|||
Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T | |||
self.original_data = np.hstack((data.X, Y)).T | |||
self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X)) | |||
else data.X.toarray(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't transform sparse matrix to dense here but should rather adapt the method to work on sparse matrices.
If the matrix is sparse, we can use scipy's hstack instead of numpy's and then whoever uses self.original_data
should expect that it can also be sparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, firstly I used scipy's hstack. But as you warned me, then there are problems with self.original_data
.
Orange/widgets/utils/scaling.py
Outdated
@@ -47,7 +47,9 @@ def _compute_scaled_data(self): | |||
return | |||
|
|||
Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T | |||
self.original_data = np.hstack((data.X, Y)).T | |||
self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check the type it's better to use issinstance
than checking if string is inside type's string representation. However, in this case checking for csr_matrix
is not sufficient since this doesn't work for other sparse matrix formats (like csc
or dok
).
The best way to check for sparsity is to use scipy's issparse
For example:
import scipy as sp
sp.issparse(data.X)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refactor to that sparse matrices wont'd be transformed to dense.
Codecov Report
@@ Coverage Diff @@
## master #2152 +/- ##
=========================================
+ Coverage 72.37% 72.4% +0.02%
=========================================
Files 319 319
Lines 55014 55055 +41
=========================================
+ Hits 39819 39862 +43
+ Misses 15195 15193 -2 |
@jerneju I haven't checked much of the code but this still fails for me on sparse: |
@nikicc: I will write the code again and that time I am planning to use new methods added in util.py. |
self.cb_class_density.setEnabled(self.graph.can_draw_density()) | ||
self.cb_reg_line.setEnabled(self.graph.can_draw_regresssion_line()) | ||
self.unconditional_commit() | ||
self.corpus_to_table() | ||
|
||
def corpus_to_table(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you rename this to something else? In theory we could have sparse table also in Orange, nut just in Text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things still don't work for me:
- colouring doesn't work if I want to colour by a discrete variable stored in metas.
- giving a subset of the data points doesn't colour the selection on the graph.
@@ -394,18 +395,40 @@ def set_subset_data(self, subset_data): | |||
|
|||
# called when all signals are received, so the graph is updated only once | |||
def handleNewSignals(self): | |||
self.graph.new_data(self.data_metas_X, self.subset_data) | |||
if self.data is None or (self.data and not sp.issparse(self.data.X)):# table.issparse() !!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this simply be if self.data is None or not sp.issparse(self.data.X)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe even better:
if self.data is not None and self.data.is_sparse():
self.sparse_to_dense()
else:
self.graph.new_data(self.data_metas_X, self.subset_data)
Doing so you also don't need that if at the beginning of sparse_to_dense
method.
if self.attribute_selection_list and \ | ||
all(attr in self.graph.domain | ||
for attr in self.attribute_selection_list): | ||
self.attr_x = self.attribute_selection_list[0] | ||
self.attr_y = self.attribute_selection_list[1] | ||
self.attribute_selection_list = None | ||
#if not sp.issparse(self.data.X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably leftover from some debugging. Can you please remove it?
return | ||
keys = [] | ||
for i, attr in enumerate(self.data.domain): | ||
if attr in set([self.attr_x, self.attr_y, self.graph.attr_color]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably nitpicking here, but could you move this outside the loop into a variable so you do not create a new set in each iteration?
if self.data is None or not self.data.is_sparse(): | ||
return | ||
keys = [] | ||
for i, attr in enumerate(self.data.domain): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only iterates over features and class attributes, but not metas. What happens if we color by discrete variable inside metas?
dmx = Table.from_table(new_domain, self.data) | ||
dmx = self.move_primitive_metas_to_X(dmx) | ||
dmx.X = dmx.X.toarray() | ||
if sp.issparse(dmx.Y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a todo here? Something like:
#TODO: remove once we make sure Y is always dense.
dmx.X = dmx.X.toarray() | ||
if sp.issparse(dmx.Y): | ||
dmx.Y = dmx.Y.toarray() | ||
self.subset_data = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone provide sparse subset data?
Code updated. |
self.update_graph() | ||
|
||
def sparse_to_dense(self, input_data=None): | ||
if input_data is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to:
if input_data is None or not input_data.is_sparse():
return input_data
self.graph.new_data(self.data_metas_X, self.subset_data) | ||
self.graph.new_data(self.sparse_to_dense(self.data_metas_X), | ||
self.sparse_to_dense(self.subset_data) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this )
at the end of previous line?
self.graph.new_data(self.sparse_to_dense(self.data_metas_X), | ||
self.sparse_to_dense(self.subset_data), | ||
new=False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this )
at the end of previous line?
self.sparse_to_dense(self.subset_data), | ||
new=False | ||
) | ||
self.update_graph() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this update graph really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not.
return input_data | ||
self.vizrank_button.setVisible(False) # not for subset data? | ||
keys = [] | ||
attrs = set([self.attr_x, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My PyCharm suggests: function call can be replaces with set literal
.
return None | ||
if not input_data.is_sparse(): | ||
return input_data | ||
self.vizrank_button.setVisible(False) # not for subset data? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to use setEnabled
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, once this gets disabled it is not re-enabled if dense data comes in. I suggest we add something like:
self.vizrank_button.setEnabled(not (self.data and self.data.is_sparse()))
@@ -972,6 +979,7 @@ def compute_symbols(self): | |||
return shape_data | |||
|
|||
def update_shapes(self): | |||
self.master.prepare_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also add something like if self.attr_shape not in self.data.domain
?
table.Y = sp.csr_matrix(table._Y) # pylint: disable=protected-access | ||
self.assertTrue(sp.issparse(table.Y)) | ||
self.send_signal("Data", table) | ||
self.widget.set_subset_data(table[0:30]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be just table[:30]
Fixed. |
Issue
Orange add-on "Text mining" sends "Corpus" instead of "Table". "Corpus" uses sparse matrices and numpy hstack does not know how to handle them. That is why scipy sparse matrix is converted into numpy array.
https://sentry.io/biolab/orange3/issues/243775748/
Fixes #2157.
Description of changes
Includes