[FIX] Table: Fix printing data with sparse Y #2457

pavlin-policar · 2017-07-08T10:43:56Z

Issue

Table __str__ method crashed when the Y was sparse.

Steps to reproduce:

data = Table('iris')
data.X = sp.csr_matrix(data.X)
# Transpose is needed due to the way sparse converts 1d arrays into sparse matrices
data.Y = sp.csr_matrix(data.Y).T
print(data)

Also, printing any Table would be missing the final closing ] e.g.

[[0, 1],
 [1, 0]

I've fixed this to

[[0, 1],
 [1, 0]]

Description of changes

The __str__ method no longer crashes on a sparse Y. Also, I've made it so var_name=value is printed for any non-zero value and all discrete values. The var_name is now also hidden for any target variable (the reasoning being that there are usually not that many target variables, and showing the variable name wouldn't add any clarity to the output).

@nikicc could you please check if this is ok?

Includes

Code changes
Tests
Documentation

nikicc · 2017-07-10T14:26:09Z

Orange/data/table.py

            return s

        table = self.table
        domain = table.domain
        row = self.row_index
        s = "[" + sp_values(table.X, domain.attributes)
        if domain.class_vars:
-            s += " | " + sp_values(table._Y, domain.class_vars)
+            s += " | " + sp_values(table._Y, domain.class_vars, False)


IMO this isn't OK if there are multiple variables. For example, pass some text data through BoW and move all features to class variables. Then in the print, I only get values, and there is no way to know which value corresponds to which variable.

I agree that we can probably skip the label if there is only one sparse variable set as class, but we cannot skip it when there are multiple.

Do we want to show all the class labels, even if they are 0?

For continuous variables probably not; Table also doesn't show those.

nikicc · 2017-07-10T14:27:17Z

Orange/tests/test_table.py

@@ -2195,6 +2195,12 @@ class _ExtendedTable(data.Table):
        self.assertIsInstance(data_file, _ExtendedTable)
        self.assertIsInstance(data_url, _ExtendedTable)

+    def test_str(self):
+        try:


Why is this better than simply str(self.table) without the try-except block?

If I add this test (without changing the problematic code) to master, all tests pass. Could you add a test that fails unless the error is fixed?

Fixed. Tbh I have no idea what's the real issue with the old code, it seems to be some complicated indexing issue back from when sparse matrices didn't support numpy style indexing.

nikicc · 2017-07-10T14:29:18Z

Orange/data/table.py

+                ("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and
+                                     show_labels else "")
+                + var.str_val(matrix[row, idx])
+                for var, idx in zip(variables, range(max_idx)))


What about for idx, var in enumerate(variables[:max_idx])?

Much better, thanks!

codecov-io · 2017-07-11T07:59:15Z

Codecov Report

Merging #2457 into master will increase coverage by 0.03%.
The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master    #2457      +/-   ##
==========================================
+ Coverage   74.49%   74.53%   +0.03%     
==========================================
  Files         321      321              
  Lines       56086    56095       +9     
==========================================
+ Hits        41780    41809      +29     
+ Misses      14306    14286      -20

nikicc

Also, can you check if anything can be done about PyLint issues?

nikicc · 2017-07-11T12:20:19Z

Orange/data/table.py

-                "{}={}".format(var.name, var.str_val(val))
-                for var, val in zip(variables, matrix.data[begptr:rendptr]))
-            if limit and rendptr != endptr:
+                ("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and


Not sure how I only spotted this now, but this only conditions the variable name on the value matrix[row, idx] being non-zero; while the value itself is still printed. Check this:

Is this undesirable? Couldn't the values be different from 1, and we'd want to show that?

We probably misunderstood each other. I wasn't objecting to the interface=1.000 part. The problem is the 0.000 that follows it: I can't tell which feature it represents. Also, we should probably skip showing zeros in sparse data altogether (which we do not in the above example).

I suggest we stick to the format that is used in Table. That is, for all values that are non-zero we show <feature>=<value>. We skip printing features with zero values altogether, except for discrete features when we print <feature>=<value> pair regardless of the value.
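The rule proposed above can be sketched in a few lines. This is only an illustration, not the code from the PR: `Var`, `str_val`, and `format_sparse_row` are hypothetical stand-ins, and the `%.3f` formatting is assumed from the `interface=1.000` output mentioned in the thread.

```python
from collections import namedtuple

# Hypothetical stand-in for an Orange variable; only the fields used here.
Var = namedtuple("Var", ["name", "is_discrete", "values"])


def str_val(var, val):
    """Simplified value formatting (an assumption, not Orange's str_val)."""
    if var.is_discrete:
        return var.values[int(val)]
    return "%.3f" % val


def format_sparse_row(row_values, variables):
    """Build 'name=value' entries for one row of sparse data."""
    entries = []
    for var, val in zip(variables, row_values):
        # Discrete features are always printed; continuous ones only
        # when the stored value is non-zero.
        if var.is_discrete or val:
            entries.append("%s=%s" % (var.name, str_val(var, val)))
    return ", ".join(entries)


variables = [
    Var("interface", False, ()),
    Var("user", False, ()),
    Var("kind", True, ("a", "b")),
]
print(format_sparse_row([1.0, 0.0, 0.0], variables))
```

Here the zero-valued continuous feature `user` is dropped, while the discrete feature `kind` is printed even though its stored value is 0.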

Is it ok now?

pavlin-policar · 2017-07-12T14:58:40Z

I've now also changed the Table.Y property setter to properly handle a sparse row. This means we can now do

iris.Y = sp.csr_matrix(iris.Y)

instead of having to do

iris.Y = sp.csr_matrix(iris.Y).T

The reason this was happening is that converting a 1d numpy array to a sparse matrix would produce a matrix with shape (1, 150) for iris, whereas we really want a shape of (150, 1).

This is also consistent with the way a (150, 2) array is cast to sparse, which does produce a sparse matrix with shape (150, 2).
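The shape behaviour described above is easy to check directly with NumPy and SciPy. The snippet below is a small sketch that uses a made-up 150-element array in place of iris.Y.

```python
import numpy as np
import scipy.sparse as sp

# Stand-in for a 1d target column with 150 instances (like iris.Y).
y = np.arange(150, dtype=float)

as_row = sp.csr_matrix(y)        # a 1d array becomes a single sparse row
as_column = sp.csr_matrix(y).T   # transposing yields one column per target
two_targets = sp.csr_matrix(np.column_stack((y, y)))  # 2d input keeps its shape

print(as_row.shape, as_column.shape, two_targets.shape)
```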

nikicc

@pavlin-policar before this will be done, could you also edit the commit history a bit? Currently, some commits contain changes that the following ones either revert or further modify.

nikicc · 2017-07-13T11:20:13Z

Orange/data/table.py

+            max_idx, has_more_columns = min(5, columns), columns > 5
+
+            row_entries = []
+            for idx, var in enumerate(variables[:max_idx]):


Ahhh, crap. There is one more thing we forgot about. When dealing with sparse data, just taking first five features won't do since we are not printing zero values. What we actually need, is first five non-zero features.

Look at this example, where I have at least some features defined in each row, but print doesn't show any of them for the third and fifth example.

nikicc · 2017-07-13T11:22:05Z

Orange/data/table.py

+
+            s = ", ".join(row_entries)
+
+            if limit and has_more_columns:


Similarly here, has_more_columns will have to only be True once there are more non-zero values.
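A minimal sketch of what these two comments ask for, assuming a row is given as plain (name, value) pairs of continuous features. The helper name `visible_entries`, the `max_shown` parameter, and the trailing ", ..." marker are illustrative assumptions, not the PR's code.

```python
def visible_entries(row, limit=True, max_shown=5):
    """Format a sparse row, counting only non-zero values toward the limit."""
    # Zeros are never printed, so they must not use up the five slots.
    nonzero = [(name, val) for name, val in row if val]
    shown = nonzero[:max_shown] if limit else nonzero
    text = ", ".join("%s=%.3f" % pair for pair in shown)
    # Signal truncation only when non-zero values were actually left out.
    if limit and len(nonzero) > len(shown):
        text += ", ..."
    return text


# Every other feature is zero, so the first five columns hold only two values.
row = [("f%d" % i, float(i % 2)) for i in range(20)]
print(visible_entries(row))
print(visible_entries(row, limit=False))
```

Filtering before slicing is what keeps a row from printing as empty when its first five columns happen to be zero.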

nikicc · 2017-07-13T11:28:40Z

Orange/data/table.py

            return s

        table = self.table
        domain = table.domain
        row = self.row_index
        s = "[" + sp_values(table.X, domain.attributes)
        if domain.class_vars:
-            s += " | " + sp_values(table._Y, domain.class_vars)
+            n_targets = table.Y.shape[-1]
+            s += " | " + sp_values(table.Y, domain.class_vars, n_targets > 1)


I'm still considering the option to always show labels, even if we only have one sparse target. I know it is a bit redundant, but it is consistent with what we show in Table and, more importantly, IMO it reminds the user that this column is sparse and that's why only some values are present.

What do you think?

Sounds good.

nikicc · 2017-07-13T12:19:37Z

Orange/data/table.py

@@ -178,6 +192,8 @@ def Y(self):
    def Y(self, value):
        if len(value.shape) == 1:
            value = value[:, None]
+        if sp.issparse(value) and value.shape[0] == 1:


Hmmm, this might be a bit more complicated. Currently, this doesn't work for tables with only one instance but multiple targets.

>>> data = Table('iris')[:1]
>>> data.X = sp.csr_matrix(data.X)
>>> multiple_targets = np.hstack((data.Y[:, None], data.Y[:, None]))
>>> data.Y = sp.csr_matrix(multiple_targets)
>>> print(data.X.shape)
(1, 4)
>>> print(multiple_targets.shape)
(1, 2)
>>> print(data.Y.shape)
(2, 1)

Maybe something like this will do?
sp.issparse(value) and len(self) != value.shape[0] and value.shape[1] == len(self)?

sp.issparse(value) and len(self) != value.shape[0] appears to cover it.
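The agreed condition can be tried outside of Table with a standalone function. This is a sketch only: `orient_y` and its `n_rows` argument (playing the role of `len(self)`) are hypothetical names.

```python
import numpy as np
import scipy.sparse as sp


def orient_y(value, n_rows):
    """Return value with one row per instance, mirroring the setter logic."""
    if len(value.shape) == 1:
        # Dense 1d array: make it a single column.
        value = value[:, None]
    if sp.issparse(value) and n_rows != value.shape[0]:
        # A sparse matrix built from a 1d array arrives as (1, n_rows).
        value = value.T
    return value


# 150 instances, one target, converted from a 1d array: gets transposed.
print(orient_y(sp.csr_matrix(np.zeros(150)), 150).shape)
# One instance with two targets: already (1, 2), so it is left alone.
print(orient_y(sp.csr_matrix(np.ones((1, 2))), 1).shape)
```

Comparing against the number of instances, rather than testing `value.shape[0] == 1`, is what keeps the single-instance, multi-target case from being transposed by mistake.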

nikicc · 2017-07-13T12:49:41Z

Orange/data/table.py

+                if var.is_discrete or matrix[row, idx]:
+                    if show_labels:
+                        s += "%s=" % var.name
+                    s += var.str_val(matrix[row, idx])


s = "%s=" % var.name if show_labels else ''? Though this might change due to comments above.

I think this only makes it less readable. Also, as written, we skip a slow string concatenation when we don't need to display labels.

nikicc

Regarding the failing tests, rebase to the latest master should fix the issue.

nikicc · 2017-07-15T15:21:39Z

Orange/data/table.py

+            _, columns = matrix.shape
+
+            row_entries, idx = [], 0
+            while len(row_entries) < 5 and idx < len(variables):


This doesn't consider the limit argument, i.e. only 5 values are printed regardless of whether limit is True or False.

Sorry for taking so long to fix this, but it should be alright now.

nikicc · 2017-07-15T15:38:30Z

Orange/tests/test_sparse_table.py

+
+    def test_Y_setter_2d_single_instance(self):
+        iris = Table('iris')[:1]
+        # Convert iris.Y to (150, 1) shape


# Convert iris.Y to (1, 1) shape

nikicc · 2017-07-15T15:39:19Z

Orange/tests/test_sparse_table.py

+        new_y = iris.Y[:, np.newaxis]
+        iris.Y = np.hstack((new_y, new_y))
+        iris.Y = csr_matrix(iris.Y)
+        # We expect the Y shape to match the X shape, which is (150, 4) in iris


# We expect the Y shape to match the X shape, which is (1, 4)

nikicc · 2017-07-15T15:41:17Z

Orange/tests/test_table.py

@@ -1254,7 +1254,7 @@ def test_is_sparse(self):
        self.assertTrue(table.is_sparse())

    def test_repr_sparse_with_one_row(self):
-        table = data.Table("iris")[:1]
+        table = data.Table("iris")[::50]


Why this change? If we change this, then the name test_repr_sparse_with_one_row does not fit anymore since we have three instances.

After biolab/orange3#2457, using metas for __len__ can cause problems with from_table. Since the setter of Y calls len, and in from_table Y is set before metas, the setter for Y can be given the wrong length.

nikicc

All looks OK to me now 👍

Before merging, would you be willing to edit the commit history a bit? I suggest we put all changes regarding the sp_values method in one commit, changes regarding the setter of Y in another, and changes regarding PyLint in a separate third one.

nikicc · 2017-07-25T08:57:36Z

Orange/data/table.py

+
+            # Sparse matrices can't handle slices where an index is out of
+            # bounds like numpy arrays can, so we must determine largest index
+            _, columns = matrix.shape


Columns isn't used anymore, right?

pavlin-policar force-pushed the table-sparse-target branch from 4d09824 to f075928 Compare July 8, 2017 10:44

nikicc suggested changes Jul 10, 2017

nikicc suggested changes Jul 11, 2017

nikicc suggested changes Jul 13, 2017

pavlin-policar force-pushed the table-sparse-target branch 3 times, most recently from 64b60af to 4ef28b1 Compare July 14, 2017 10:07

nikicc suggested changes Jul 15, 2017

nikicc mentioned this pull request Jul 18, 2017

Corpus: Remove __len__, use Table's __len__ instead biolab/orange3-text#280

Merged

3 tasks

pavlin-policar force-pushed the table-sparse-target branch 2 times, most recently from bb354df to 2b4b818 Compare July 23, 2017 12:00

nikicc approved these changes Jul 25, 2017

pavlin-policar added 3 commits July 25, 2017 18:29

Table: Fix print and repr on sparse Y

afaa2f6

Table: Fix five pylint issues

d9063e1

Table: Transpose Y setter when getting a sparse 1d array

830e53a

nikicc force-pushed the table-sparse-target branch from 2b4b818 to 830e53a Compare July 25, 2017 16:36

nikicc changed the title from "[FIX] Table: Fix crash on sparse Y" to "[FIX] Table: Fix printing data with sparse Y" Jul 25, 2017

nikicc merged commit 755426e into biolab:master Jul 25, 2017

pavlin-policar deleted the table-sparse-target branch July 29, 2017 13:54
