Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Table: Fix printing data with sparse Y #2457

Merged
merged 3 commits into from
Jul 25, 2017

Conversation

pavlin-policar
Copy link
Collaborator

Issue

Table __str__ method crashed when the Y was sparse.

Steps to reproduce:

data = Table('iris')
data.X = sp.csr_matrix(data.X)
# Transpose is needed due to the way sparse converts 1d arrays into sparse matrices
data.Y = sp.csr_matrix(data.Y).T
print(data)

Also, printing any Table would be missing the final closing ] e.g.

[[0, 1],
 [1, 0]

I've fixed this to

[[0, 1],
 [1, 0]]
Description of changes

The __str__ method no longer crashes on a sparse Y. Also, I've made it so var_name=value is printed for any non-zero value and all discrete values. The var_name is now also hidden for any target variable (the reasoning being that there are usually not that many target variables, and showing the variable name wouldn't add any clarity to the output).

@nikicc could you please check if this is ok?

Includes
  • Code changes
  • Tests
  • Documentation

return s

table = self.table
domain = table.domain
row = self.row_index
s = "[" + sp_values(table.X, domain.attributes)
if domain.class_vars:
s += " | " + sp_values(table._Y, domain.class_vars)
s += " | " + sp_values(table._Y, domain.class_vars, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this isn't OK if there are multiple variables. For example, pass some text data through BoW and move all features to class variables. Then in the print, I only get values, and there is no way to know which value corresponds to which variable.

screen shot 2017-07-10 at 16 21 58

I agree that we can probably skip the label if there is only one sparse variable set as class, but we cannot skip it when there are multiple.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to show all the class labels, even if they are 0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For continuous variables probably not; Table also doesn't show those.

@@ -2195,6 +2195,12 @@ class _ExtendedTable(data.Table):
self.assertIsInstance(data_file, _ExtendedTable)
self.assertIsInstance(data_url, _ExtendedTable)

def test_str(self):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this better than simply str(self.table) without the try-except block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't.

Copy link
Member

@astaric astaric Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I add this test (without changing the problematic code) to master, all tests pass. Could you add a test that fails unless the error is fixed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Tbh I have no idea what's the real issue with the old code, it seems to be some complicated indexing issue back from when sparse matrices didn't support numpy style indexing.

("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and
show_labels else "")
+ var.str_val(matrix[row, idx])
for var, idx in zip(variables, range(max_idx)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about for idx, var in enumerate(variable[:max_idx])?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better, thanks!

@codecov-io
Copy link

codecov-io commented Jul 11, 2017

Codecov Report

Merging #2457 into master will increase coverage by 0.03%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master    #2457      +/-   ##
==========================================
+ Coverage   74.49%   74.53%   +0.03%     
==========================================
  Files         321      321              
  Lines       56086    56095       +9     
==========================================
+ Hits        41780    41809      +29     
+ Misses      14306    14286      -20

Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you check if anything can be done about PyLint issues?

"{}={}".format(var.name, var.str_val(val))
for var, val in zip(variables, matrix.data[begptr:rendptr]))
if limit and rendptr != endptr:
("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how I only spotted this now, but this only conditions the variable name on the value matrix[row, idx] being non-zero; while the value itself is still printed. Check this:
screen shot 2017-07-11 at 14 15 05

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this undesirable? Couldn't the values be different from 1, and we'd want to show that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably misunderstood. I wasn't objecting to the interface=1.000 part. The problem is in the following 0.000 part for which I do not know which feature it represents. Also, we should probably skip showing zeros in sparse data altogether (which we do not in the above example).

I suggest we stick to the format that is used in Table. That is, for all values that are non-zero we show <feature>=<value>. We skip printing features with zero values altogether, except for discrete features when we print <feature>=<value> pair regardless of the value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok now?

@pavlin-policar
Copy link
Collaborator Author

I've now also changed the Table.Y property setter to properly handle a sparse row. This means we can now do

iris.Y = sp.csr_matrix(iris.Y)

instead of having to do

iris.Y = sp.csr_matrix(iris.Y).T

The reason this was happening is that converting a 1d numpy array to a sparse matrix would produce a matrix with shape (1, 150) for iris, whereas we really want a shape of (150, 1).

This is also consistent with the way a (150, 2) array is cast to sparse, which does produce a sparse matrix with shape (150, 2).

Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavlin-policar before this will be done, could you also edit the commit history a bit? Currently, some commits contain changes that the following ones either revert or further modify.

max_idx, has_more_columns = min(5, columns), columns > 5

row_entries = []
for idx, var in enumerate(variables[:max_idx]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, crap. There is one more thing we forgot about. When dealing with sparse data, just taking first five features won't do since we are not printing zero values. What we actually need, is first five non-zero features.

Look at this example, where I have at least some features defined in each row, but print doesn't show any of them for the third and fifth example.
screen shot 2017-07-13 at 13 16 08


s = ", ".join(row_entries)

if limit and has_more_columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, has_more_columns will have to only be True once there are more non-zero values.

return s

table = self.table
domain = table.domain
row = self.row_index
s = "[" + sp_values(table.X, domain.attributes)
if domain.class_vars:
s += " | " + sp_values(table._Y, domain.class_vars)
n_targets = table.Y.shape[-1]
s += " | " + sp_values(table.Y, domain.class_vars, n_targets > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still considering the option to always show labels — even if we only have one sparse target. I know it is a bit redundant but is consistent with what we show in Table and more importantly, IMO it reminds the user that this column is in sparse and that's why only some values are present.

What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -178,6 +192,8 @@ def Y(self):
def Y(self, value):
if len(value.shape) == 1:
value = value[:, None]
if sp.issparse(value) and value.shape[0] == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, this might be a bit more complicated. Currently, this doesn't work for tables with only one instance but multiple targets.

>>> data = Table('iris')[:1]

>>> data.X = sp.csr_matrix(data.X)
>>> multiple_targets = np.hstack((data.Y[:, None], data.Y[:, None]))
>>> data.Y = sp.csr_matrix(multiple_targets)

>>> print(data.X.shape)
(1, 4)
>>> print(multiple_targets.shape)
(1, 2)
>>> print(data.Y.shape)
(2, 1)

Maybe something like this will do?
sp.issparse(value) and len(self) != value.shape[0] and value.shape[1] == len(self)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sp.issparse(value) and len(self) != value.shape[0] appears to cover it.

if var.is_discrete or matrix[row, idx]:
if show_labels:
s += "%s=" % var.name
s += var.str_val(matrix[row, idx])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s = "%s=" % var.name if show_labels else ''? Though this might change due to comments above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only makes it less readable and also if we don't need to display labels, we skip a slow string concatenation.

@pavlin-policar pavlin-policar force-pushed the table-sparse-target branch 3 times, most recently from 64b60af to 4ef28b1 Compare July 14, 2017 10:07
Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the failing tests, rebase to the latest master should fix the issue.

_, columns = matrix.shape

row_entries, idx = [], 0
while len(row_entries) < 5 and idx < len(variables):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't consider the limit argument. I.e. only 5 values are printed regardless whether the limit is set to either true or false.

screen shot 2017-07-15 at 17 19 19

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to fixing this, but this should be alright now.


def test_Y_setter_2d_single_instance(self):
iris = Table('iris')[:1]
# Convert iris.Y to (150, 1) shape
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Convert iris.Y to (1, 1) shape

new_y = iris.Y[:, np.newaxis]
iris.Y = np.hstack((new_y, new_y))
iris.Y = csr_matrix(iris.Y)
# We expect the Y shape to match the X shape, which is (150, 4) in iris
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# We expect the Y shape to match the X shape, which is (1, 4)

@@ -1254,7 +1254,7 @@ def test_is_sparse(self):
self.assertTrue(table.is_sparse())

def test_repr_sparse_with_one_row(self):
table = data.Table("iris")[:1]
table = data.Table("iris")[::50]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? If we change this, then the name test_repr_sparse_with_one_row does not fit anymore since we have three instances.

nikicc added a commit to nikicc/orange3-text that referenced this pull request Jul 18, 2017
After biolab/orange3#2457 using matas for __len__ can causes problems with from_table. Since setter of Y calls len, and in from_table Y is set before metas the setter for Y can be given wronl length.
nikicc added a commit to nikicc/orange3-text that referenced this pull request Jul 18, 2017
After biolab/orange3#2457 using metas for __len__ can cause problems with from_table. Since setter of Y calls len, and in from_table Y is set before metas, the setter for Y can be given wrong length.
@pavlin-policar pavlin-policar force-pushed the table-sparse-target branch 2 times, most recently from bb354df to 2b4b818 Compare July 23, 2017 12:00
Copy link
Contributor

@nikicc nikicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks OK to me now 👍

Before merging, would you be willing to edit commit history a bit? I suggest we put all changes regarding sp_values method in one commit. Changes regarding setter of Y in another, and changes regarding PyLint in a separate third one.


# Sparse matrices can't handle slices where an index is out of
# bounds like numpy arrays can, so we must determine largest index
_, columns = matrix.shape
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns isn't used anymore, right?

@nikicc nikicc changed the title [FIX] Table: Fix crash on sparse Y [FIX] Table: Fix printing data with sparse Y Jul 25, 2017
@nikicc nikicc merged commit 755426e into biolab:master Jul 25, 2017
@pavlin-policar pavlin-policar deleted the table-sparse-target branch July 29, 2017 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants