Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Scatter Plot: dealing with scipy sparse matrix #2152

Merged
merged 1 commit into from
Apr 21, 2017
Merged

[FIX] Scatter Plot: dealing with scipy sparse matrix #2152

merged 1 commit into from
Apr 21, 2017

Conversation

jerneju
Copy link
Contributor

@jerneju jerneju commented Mar 29, 2017

Issue

Orange add-on "Text mining" sends "Corpus" instead of "Table". "Corpus" uses sparse matrices and numpy hstack does not know how to handle them. That is why scipy sparse matrix is converted into numpy array.
https://sentry.io/biolab/orange3/issues/243775748/
Fixes #2157.

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

@jerneju
Copy link
Contributor Author

jerneju commented Mar 29, 2017

@@ -47,7 +47,9 @@ def _compute_scaled_data(self):
return

Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T
self.original_data = np.hstack((data.X, Y)).T
self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X))
else data.X.toarray(),
Copy link
Contributor

@nikicc nikicc Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't transform sparse matrix to dense here but should rather adapt the method to work on sparse matrices.

If the matrix is sparse, we can use scipy's hstack instead of numpy's and then whoever uses self.original_data should expect that it can also be sparse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, firstly I used scipy's hstack. But as you warned me, then there are problems with self.original_data.

@@ -47,7 +47,9 @@ def _compute_scaled_data(self):
return

Y = data.Y if data.Y.ndim == 2 else np.atleast_2d(data.Y).T
self.original_data = np.hstack((data.X, Y)).T
self.original_data = np.hstack((data.X if "csr_matrix" not in str(type(data.X))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check the type it's better to use issinstance than checking if string is inside type's string representation. However, in this case checking for csr_matrix is not sufficient since this doesn't work for other sparse matrix formats (like csc or dok).

The best way to check for sparsity is to use scipy's issparse For example:

import scipy as sp
sp.issparse(data.X)

Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refactor to that sparse matrices wont'd be transformed to dense.

@codecov-io
Copy link

codecov-io commented Mar 30, 2017

Codecov Report

Merging #2152 into master will increase coverage by 0.02%.
The diff coverage is 92.59%.

@@            Coverage Diff            @@
##           master   #2152      +/-   ##
=========================================
+ Coverage   72.37%   72.4%   +0.02%     
=========================================
  Files         319     319              
  Lines       55014   55055      +41     
=========================================
+ Hits        39819   39862      +43     
+ Misses      15195   15193       -2

@astaric astaric added this to the 3.4.2 milestone Apr 7, 2017
@nikicc
Copy link
Contributor

nikicc commented Apr 7, 2017

@jerneju I haven't checked much of the code but this still fails for me on sparse:
screen shot 2017-04-07 at 10 19 42

@jerneju
Copy link
Contributor Author

jerneju commented Apr 7, 2017

@nikicc: I will write the code again and that time I am planning to use new methods added in util.py.

self.cb_class_density.setEnabled(self.graph.can_draw_density())
self.cb_reg_line.setEnabled(self.graph.can_draw_regresssion_line())
self.unconditional_commit()
self.corpus_to_table()

def corpus_to_table(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rename this to something else? In theory we could have sparse table also in Orange, nut just in Text.

@nikicc nikicc modified the milestones: future, 3.4.2 Apr 7, 2017
Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things still don't work for me:

  • colouring doesn't work if I want to colour by a discrete variable stored in metas.
  • giving a subset of the data points doesn't colour the selection on the graph.

@@ -394,18 +395,40 @@ def set_subset_data(self, subset_data):

# called when all signals are received, so the graph is updated only once
def handleNewSignals(self):
self.graph.new_data(self.data_metas_X, self.subset_data)
if self.data is None or (self.data and not sp.issparse(self.data.X)):# table.issparse() !!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this simply be if self.data is None or not sp.issparse(self.data.X)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe even better:

if self.data is not None and self.data.is_sparse():
    self.sparse_to_dense()
else:
    self.graph.new_data(self.data_metas_X, self.subset_data)

Doing so you also don't need that if at the beginning of sparse_to_dense method.

if self.attribute_selection_list and \
all(attr in self.graph.domain
for attr in self.attribute_selection_list):
self.attr_x = self.attribute_selection_list[0]
self.attr_y = self.attribute_selection_list[1]
self.attribute_selection_list = None
#if not sp.issparse(self.data.X):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably leftover from some debugging. Can you please remove it?

return
keys = []
for i, attr in enumerate(self.data.domain):
if attr in set([self.attr_x, self.attr_y, self.graph.attr_color]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably nitpicking here, but could you move this outside the loop into a variable so you do not create a new set in each iteration?

if self.data is None or not self.data.is_sparse():
return
keys = []
for i, attr in enumerate(self.data.domain):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only iterates over features and class attributes, but not metas. What happens if we color by discrete variable inside metas?

dmx = Table.from_table(new_domain, self.data)
dmx = self.move_primitive_metas_to_X(dmx)
dmx.X = dmx.X.toarray()
if sp.issparse(dmx.Y):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a todo here? Something like:
#TODO: remove once we make sure Y is always dense.

dmx.X = dmx.X.toarray()
if sp.issparse(dmx.Y):
dmx.Y = dmx.Y.toarray()
self.subset_data = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone provide sparse subset data?

@jerneju jerneju changed the title [FIX] Scatter Plot: dealing with scipy sparse matrix [WIP][FIX] Scatter Plot: dealing with scipy sparse matrix Apr 14, 2017
@astaric astaric modified the milestone: future Apr 19, 2017
@jerneju
Copy link
Contributor Author

jerneju commented Apr 21, 2017

Code updated.

self.update_graph()

def sparse_to_dense(self, input_data=None):
if input_data is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to:

if input_data is None or not input_data.is_sparse():
    return input_data

self.graph.new_data(self.data_metas_X, self.subset_data)
self.graph.new_data(self.sparse_to_dense(self.data_metas_X),
self.sparse_to_dense(self.subset_data)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this ) at the end of previous line?

self.graph.new_data(self.sparse_to_dense(self.data_metas_X),
self.sparse_to_dense(self.subset_data),
new=False
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this ) at the end of previous line?

self.sparse_to_dense(self.subset_data),
new=False
)
self.update_graph()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this update graph really needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not.

return input_data
self.vizrank_button.setVisible(False) # not for subset data?
keys = []
attrs = set([self.attr_x,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PyCharm suggests: function call can be replaces with set literal.

return None
if not input_data.is_sparse():
return input_data
self.vizrank_button.setVisible(False) # not for subset data?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to use setEnabled here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, once this gets disabled it is not re-enabled if dense data comes in. I suggest we add something like:

self.vizrank_button.setEnabled(not (self.data and self.data.is_sparse()))

@@ -972,6 +979,7 @@ def compute_symbols(self):
return shape_data

def update_shapes(self):
self.master.prepare_data()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add something like if self.attr_shape not in self.data.domain?

table.Y = sp.csr_matrix(table._Y) # pylint: disable=protected-access
self.assertTrue(sp.issparse(table.Y))
self.send_signal("Data", table)
self.widget.set_subset_data(table[0:30])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be just table[:30]

@jerneju
Copy link
Contributor Author

jerneju commented Apr 21, 2017

Fixed.

@nikicc nikicc changed the title [WIP][FIX] Scatter Plot: dealing with scipy sparse matrix [FIX] Scatter Plot: dealing with scipy sparse matrix Apr 21, 2017
@nikicc nikicc merged commit 9274a20 into biolab:master Apr 21, 2017
@jerneju jerneju deleted the value-scatterplot branch April 24, 2017 07:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants