[ENH] Table: Add methods get_column and set_column #6058

janezd · 2022-07-12T10:14:27Z

Issue

There are two differences between data.get_column_view(x)[0] and data.transform(Domain([x]).X[:, 0].

The former is simple and short.
The latter works.

If using Text to Columns used get_column_view, Apply Domain could not apply the second transformation because the required attribute (discretized petal length) is not in the domain.

Using transform works because it invokes compute_value.

Description of changes

Add Table.get_column which should be preferred over Table.get_column_view, and must be used in places where the variable is not guaranteed to exist (in particular, in compute_value functions).

If the column exists, it calls get_column_view, otherwise it uses compute_value.

Includes

Code changes
Tests
Documentation

markotoplak · 2022-07-15T09:00:28Z

As you say, using get_column_view on an unknown table is wrong.

For me, this is a confusing mixture of two very different behaviors. It implies a fast operation. So I do not like the name. I do not have any better ideas though.

ajdapretnar · 2022-07-15T09:07:45Z

Would it be possible to have something similar to pandas' data["sepal length"]? This would not require to name the method, but it would likely require too much messing with the existing Table structure.

markotoplak · 2022-07-15T10:11:07Z

@ajdapretnar, that would be more messing with the data structure. :) Especially in this case because the actual variable (with its compute value) is important, not the name.

Now I think get_column is fine, especially since then programmers who do not completely understand what is going on will find it first. The function you wrote is still as fast as get_column_view when the latter would be appropriate, so I see no loss.

I'd deprecate get_column_view.

Slightly related: do we ever enforce what shape functions doing compute_value return? I remember we have some code that reshapes compute_value outputs within table transforms, but that might be something stale? Is the shape of the output array always the same?

markotoplak · 2022-07-15T12:21:52Z

I checked what should the compute_value do. We never enforce what its output should look like. Instead, match_density assures it is a (1D array for dense and 2D for sparse) after calling it in _ArrayConversion.get_columns. So I was running tests with assert statements:

@@ -299,10 +299,14 @@ class _ArrayConversion:
                     if shared is None:
                         shared = col.compute_shared(sourceri)
                         _idcache_save(shared_cache, (col.compute_shared, source), shared)
+                    ttt = col(sourceri, shared_data=shared)
                     col_array = match_density(
-                        col(sourceri, shared_data=shared))
+                            ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
                 else:
-                    col_array = match_density(col(sourceri))
+                    ttt = col(sourceri)
+                    col_array = match_density(ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
             elif col < 0:

For dense arrays, compute_value usually returns 1D numpy arrays. There are exceptions. PCA tests fail because there is a column array at one spot, then there are problems with t-SNE, and for the Feature Constructor, compute_value can be also a constant or even a python list.

Slicing on sparse matrices, on the other hand, always returns something that is 2D, because they do not have 1D slices at all.

For now, the output shape of this get_columns is thus undefined. :) Should we reshape it somehow? Also, if we want the same output shape for sparse and dense matrices, a (rows, 1) shaped vector is currently the only way to go. What do you think, @janezd?

janezd · 2022-07-16T09:37:16Z

You're right, the result of compute_value should be reshaped. compute_value is currently called in domain conversion, which (for dense arrays) goes through assure_column_dense, which calls reshape(-1).

What about sparse data?

We could also request compute_value to return either a vector or a sparse matrix. For a few years, returning anything else would trigger a warning.

I'd deprecate get_column_view.

Me too, but its everywhere and we'll never remove it. We can just recommend not using it.

One argument is that we (in particular: I) use(d) it where we shouldn't.
Another: get_column_view does not even always return a view! In '19, that @janezd guy added code that maps between different ordering of categorical values!

There may be cases when one needs a view, e.g. because you just created and unlocked a table. As current function does not really always give a view, but it has a good name that we can't reuse: the function get_column that I propose here has an argument copy that, if set to True, forces a copy. We could add view=False; if set to True, it would return a view and fail if it can't.

(Setting both view and copy to True would always return a sparse matrix, because the joker who calls it this way deserves it. 🤪)

codecov · 2022-10-20T12:10:02Z

Codecov Report

Merging #6058 (3a549a2) into master (036e96d) will increase coverage by 0.00%.
The diff coverage is 94.59%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6058   +/-   ##
=======================================
  Coverage   86.65%   86.65%           
=======================================
  Files         315      315           
  Lines       67957    67974   +17     
=======================================
+ Hits        58890    58905   +15     
- Misses       9067     9069    +2

Orange/data/table.py

markotoplak · 2022-10-24T11:57:30Z

@janezd, thanks for this. All the new usages look much simpler, especially when the values needed to be set. Merging as soon as the tests pass.

janezd added the needs discussion Core developers need to discuss the issue label Sep 8, 2022

janezd self-assigned this Sep 23, 2022

janezd force-pushed the table-get-column branch from 7d5be06 to 58a7280 Compare September 29, 2022 20:46

janezd assigned janezd and markotoplak and unassigned janezd Sep 30, 2022

janezd force-pushed the table-get-column branch 3 times, most recently from 6e09a84 to 123ad30 Compare October 1, 2022 14:10

janezd force-pushed the table-get-column branch from 123ad30 to 4b8aa55 Compare October 14, 2022 09:27

janezd removed their assignment Oct 14, 2022

janezd force-pushed the table-get-column branch from 8cb7f58 to ee199e0 Compare October 21, 2022 11:08

janezd mentioned this pull request Oct 21, 2022

[FIX] Hierarchical Clustering: Color by metas #6178

Merged

3 tasks

janezd force-pushed the table-get-column branch from ee199e0 to d347de5 Compare October 21, 2022 11:37

janezd added 3 commits October 21, 2022 13:51

Table: Add method get_column

fd9bd41

Replace Table.get_column_view with Table.get_column

582ce2e

Goodbye Table.get_column(..., view=True), hello Table.set_column

7137d0d

janezd force-pushed the table-get-column branch from d347de5 to 7137d0d Compare October 21, 2022 11:53

janezd removed the needs discussion Core developers need to discuss the issue label Oct 21, 2022

markotoplak reviewed Oct 24, 2022

View reviewed changes

Orange/data/table.py Outdated Show resolved Hide resolved

Rename compute_column to _compute_column

3a549a2

markotoplak changed the title ~~Table: Add method get_column~~ [ENH] Table: Add methods get_column and set_column Oct 24, 2022

markotoplak merged commit 0f5025a into biolab:master Oct 24, 2022

markotoplak mentioned this pull request Nov 8, 2022

Replace deprecated get_column_view Quasars/orange-spectroscopy#645

Merged

janezd mentioned this pull request Nov 21, 2023

Split: Refactor for discrete values, add tests, rename to "Text to Columns" biolab/orange3-prototypes#253

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Table: Add methods get_column and set_column #6058

[ENH] Table: Add methods get_column and set_column #6058

janezd commented Jul 12, 2022

markotoplak commented Jul 15, 2022

ajdapretnar commented Jul 15, 2022

markotoplak commented Jul 15, 2022

markotoplak commented Jul 15, 2022

janezd commented Jul 16, 2022

codecov bot commented Oct 20, 2022 •

edited

Loading

markotoplak commented Oct 24, 2022

[ENH] Table: Add methods get_column and set_column #6058

[ENH] Table: Add methods get_column and set_column #6058

Conversation

janezd commented Jul 12, 2022

Issue

Description of changes

Includes

markotoplak commented Jul 15, 2022

ajdapretnar commented Jul 15, 2022

markotoplak commented Jul 15, 2022

markotoplak commented Jul 15, 2022

janezd commented Jul 16, 2022

codecov bot commented Oct 20, 2022 • edited Loading

Codecov Report

markotoplak commented Oct 24, 2022

codecov bot commented Oct 20, 2022 •

edited

Loading