-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Table: Fix printing data with sparse Y #2457
Conversation
4d09824
to
f075928
Compare
Orange/data/table.py
Outdated
return s | ||
|
||
table = self.table | ||
domain = table.domain | ||
row = self.row_index | ||
s = "[" + sp_values(table.X, domain.attributes) | ||
if domain.class_vars: | ||
s += " | " + sp_values(table._Y, domain.class_vars) | ||
s += " | " + sp_values(table._Y, domain.class_vars, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this isn't OK if there are multiple variables. For example, pass some text data through BoW and move all features to class variables. Then in the print, I only get values, and there is no way to know which value corresponds to which variable.
I agree that we can probably skip the label if there is only one sparse variable set as class, but we cannot skip it when there are multiple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to show all the class labels, even if they are 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For continuous variables probably not; Table also doesn't show those.
Orange/tests/test_table.py
Outdated
@@ -2195,6 +2195,12 @@ class _ExtendedTable(data.Table): | |||
self.assertIsInstance(data_file, _ExtendedTable) | |||
self.assertIsInstance(data_url, _ExtendedTable) | |||
|
|||
def test_str(self): | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this better than simply str(self.table)
without the try-except block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I add this test (without changing the problematic code) to master, all tests pass. Could you add a test that fails unless the error is fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Tbh I have no idea what's the real issue with the old code, it seems to be some complicated indexing issue back from when sparse matrices didn't support numpy style indexing.
Orange/data/table.py
Outdated
("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and | ||
show_labels else "") | ||
+ var.str_val(matrix[row, idx]) | ||
for var, idx in zip(variables, range(max_idx))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about for idx, var in enumerate(variable[:max_idx])
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better, thanks!
Codecov Report
@@ Coverage Diff @@
## master #2457 +/- ##
==========================================
+ Coverage 74.49% 74.53% +0.03%
==========================================
Files 321 321
Lines 56086 56095 +9
==========================================
+ Hits 41780 41809 +29
+ Misses 14306 14286 -20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you check if anything can be done about PyLint issues?
Orange/data/table.py
Outdated
"{}={}".format(var.name, var.str_val(val)) | ||
for var, val in zip(variables, matrix.data[begptr:rendptr])) | ||
if limit and rendptr != endptr: | ||
("%s=" % var.name if (var.is_discrete or matrix[row, idx]) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this undesirable? Couldn't the values be different from 1, and we'd want to show that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably misunderstood. I wasn't objecting to the interface=1.000
part. The problem is in the following 0.000
part for which I do not know which feature it represents. Also, we should probably skip showing zeros in sparse data altogether (which we do not in the above example).
I suggest we stick to the format that is used in Table. That is, for all values that are non-zero we show <feature>=<value>
. We skip printing features with zero values altogether, except for discrete features when we print <feature>=<value>
pair regardless of the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok now?
I've now also changed the Table.Y property setter to properly handle a sparse row. This means we can now do
instead of having to do
The reason this was happening is that converting a 1d numpy array to a sparse matrix would produce a matrix with shape (1, 150) for iris, whereas we really want a shape of (150, 1). This is also consistent with the way a (150, 2) array is cast to sparse, which does produce a sparse matrix with shape (150, 2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavlin-policar before this will be done, could you also edit the commit history a bit? Currently, some commits contain changes that the following ones either revert or further modify.
Orange/data/table.py
Outdated
max_idx, has_more_columns = min(5, columns), columns > 5 | ||
|
||
row_entries = [] | ||
for idx, var in enumerate(variables[:max_idx]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, crap. There is one more thing we forgot about. When dealing with sparse data, just taking first five features won't do since we are not printing zero values. What we actually need, is first five non-zero features.
Look at this example, where I have at least some features defined in each row, but print doesn't show any of them for the third and fifth example.
Orange/data/table.py
Outdated
|
||
s = ", ".join(row_entries) | ||
|
||
if limit and has_more_columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, has_more_columns
will have to only be True once there are more non-zero values.
Orange/data/table.py
Outdated
return s | ||
|
||
table = self.table | ||
domain = table.domain | ||
row = self.row_index | ||
s = "[" + sp_values(table.X, domain.attributes) | ||
if domain.class_vars: | ||
s += " | " + sp_values(table._Y, domain.class_vars) | ||
n_targets = table.Y.shape[-1] | ||
s += " | " + sp_values(table.Y, domain.class_vars, n_targets > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still considering the option to always show labels — even if we only have one sparse target. I know it is a bit redundant but is consistent with what we show in Table and more importantly, IMO it reminds the user that this column is in sparse and that's why only some values are present.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Orange/data/table.py
Outdated
@@ -178,6 +192,8 @@ def Y(self): | |||
def Y(self, value): | |||
if len(value.shape) == 1: | |||
value = value[:, None] | |||
if sp.issparse(value) and value.shape[0] == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, this might be a bit more complicated. Currently, this doesn't work for tables with only one instance but multiple targets.
>>> data = Table('iris')[:1]
>>> data.X = sp.csr_matrix(data.X)
>>> multiple_targets = np.hstack((data.Y[:, None], data.Y[:, None]))
>>> data.Y = sp.csr_matrix(multiple_targets)
>>> print(data.X.shape)
(1, 4)
>>> print(multiple_targets.shape)
(1, 2)
>>> print(data.Y.shape)
(2, 1)
Maybe something like this will do?
sp.issparse(value) and len(self) != value.shape[0] and value.shape[1] == len(self)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sp.issparse(value) and len(self) != value.shape[0]
appears to cover it.
Orange/data/table.py
Outdated
if var.is_discrete or matrix[row, idx]: | ||
if show_labels: | ||
s += "%s=" % var.name | ||
s += var.str_val(matrix[row, idx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s = "%s=" % var.name if show_labels else ''
? Though this might change due to comments above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this only makes it less readable and also if we don't need to display labels, we skip a slow string concatenation.
64b60af
to
4ef28b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the failing tests, rebase to the latest master should fix the issue.
Orange/data/table.py
Outdated
_, columns = matrix.shape | ||
|
||
row_entries, idx = [], 0 | ||
while len(row_entries) < 5 and idx < len(variables): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for taking so long to fixing this, but this should be alright now.
Orange/tests/test_sparse_table.py
Outdated
|
||
def test_Y_setter_2d_single_instance(self): | ||
iris = Table('iris')[:1] | ||
# Convert iris.Y to (150, 1) shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Convert iris.Y to (1, 1) shape
new_y = iris.Y[:, np.newaxis] | ||
iris.Y = np.hstack((new_y, new_y)) | ||
iris.Y = csr_matrix(iris.Y) | ||
# We expect the Y shape to match the X shape, which is (150, 4) in iris |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# We expect the Y shape to match the X shape, which is (1, 4)
Orange/tests/test_table.py
Outdated
@@ -1254,7 +1254,7 @@ def test_is_sparse(self): | |||
self.assertTrue(table.is_sparse()) | |||
|
|||
def test_repr_sparse_with_one_row(self): | |||
table = data.Table("iris")[:1] | |||
table = data.Table("iris")[::50] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? If we change this, then the name test_repr_sparse_with_one_row
does not fit anymore since we have three instances.
After biolab/orange3#2457 using matas for __len__ can causes problems with from_table. Since setter of Y calls len, and in from_table Y is set before metas the setter for Y can be given wronl length.
After biolab/orange3#2457 using metas for __len__ can cause problems with from_table. Since setter of Y calls len, and in from_table Y is set before metas, the setter for Y can be given wrong length.
bb354df
to
2b4b818
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks OK to me now 👍
Before merging, would you be willing to edit commit history a bit? I suggest we put all changes regarding sp_values
method in one commit. Changes regarding setter of Y in another, and changes regarding PyLint in a separate third one.
Orange/data/table.py
Outdated
|
||
# Sparse matrices can't handle slices where an index is out of | ||
# bounds like numpy arrays can, so we must determine largest index | ||
_, columns = matrix.shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns
isn't used anymore, right?
2b4b818
to
830e53a
Compare
Issue
Table
__str__
method crashed when the Y was sparse.Steps to reproduce:
Also, printing any Table would be missing the final closing
]
e.g.I've fixed this to
Description of changes
The
__str__
method no longer crashes on a sparse Y. Also, I've made it sovar_name=value
is printed for any non-zero value and all discrete values. Thevar_name
is now also hidden for any target variable (the reasoning being that there are usually not that many target variables, and showing the variable name wouldn't add any clarity to the output).@nikicc could you please check if this is ok?
Includes