Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Table: Add methods get_column and set_column #6058

Merged
merged 4 commits into from
Oct 24, 2022

Conversation

janezd
Copy link
Contributor

@janezd janezd commented Jul 12, 2022

Issue

There are two differences between data.get_column_view(x)[0] and data.transform(Domain([x]).X[:, 0].

  1. The former is simple and short.
  2. The latter works.

Screen Shot 2022-07-12 at 11 55 39

If using Text to Columns used get_column_view, Apply Domain could not apply the second transformation because the required attribute (discretized petal length) is not in the domain.

Using transform works because it invokes compute_value.

Description of changes

Add Table.get_column which should be preferred over Table.get_column_view, and must be used in places where the variable is not guaranteed to exist (in particular, in compute_value functions).

If the column exists, it calls get_column_view, otherwise it uses compute_value.

Includes
  • Code changes
  • Tests
  • Documentation

@markotoplak
Copy link
Member

As you say, using get_column_view on an unknown table is wrong.

For me, this is a confusing mixture of two very different behaviors. It implies a fast operation. So I do not like the name. I do not have any better ideas though.

@ajdapretnar
Copy link
Contributor

Would it be possible to have something similar to pandas' data["sepal length"]? This would not require to name the method, but it would likely require too much messing with the existing Table structure.

@markotoplak
Copy link
Member

@ajdapretnar, that would be more messing with the data structure. :) Especially in this case because the actual variable (with its compute value) is important, not the name.

Now I think get_column is fine, especially since then programmers who do not completely understand what is going on will find it first. The function you wrote is still as fast as get_column_view when the latter would be appropriate, so I see no loss.

I'd deprecate get_column_view.

Slightly related: do we ever enforce what shape functions doing compute_value return? I remember we have some code that reshapes compute_value outputs within table transforms, but that might be something stale? Is the shape of the output array always the same?

@markotoplak
Copy link
Member

I checked what should the compute_value do. We never enforce what its output should look like. Instead, match_density assures it is a (1D array for dense and 2D for sparse) after calling it in _ArrayConversion.get_columns. So I was running tests with assert statements:

@@ -299,10 +299,14 @@ class _ArrayConversion:
                     if shared is None:
                         shared = col.compute_shared(sourceri)
                         _idcache_save(shared_cache, (col.compute_shared, source), shared)
+                    ttt = col(sourceri, shared_data=shared)
                     col_array = match_density(
-                        col(sourceri, shared_data=shared))
+                            ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
                 else:
-                    col_array = match_density(col(sourceri))
+                    ttt = col(sourceri)
+                    col_array = match_density(ttt)
+                    assert issparse(ttt) or ttt.ndim == 1, "Not 1D " + str(ttt.shape)
             elif col < 0:

For dense arrays, compute_value usually returns 1D numpy arrays. There are exceptions. PCA tests fail because there is a column array at one spot, then there are problems with t-SNE, and for the Feature Constructor, compute_value can be also a constant or even a python list.

Slicing on sparse matrices, on the other hand, always returns something that is 2D, because they do not have 1D slices at all.

For now, the output shape of this get_columns is thus undefined. :) Should we reshape it somehow? Also, if we want the same output shape for sparse and dense matrices, a (rows, 1) shaped vector is currently the only way to go. What do you think, @janezd?

@janezd
Copy link
Contributor Author

janezd commented Jul 16, 2022

You're right, the result of compute_value should be reshaped. compute_value is currently called in domain conversion, which (for dense arrays) goes through assure_column_dense, which calls reshape(-1).

What about sparse data?

We could also request compute_value to return either a vector or a sparse matrix. For a few years, returning anything else would trigger a warning.

I'd deprecate get_column_view.

Me too, but its everywhere and we'll never remove it. We can just recommend not using it.

  • One argument is that we (in particular: I) use(d) it where we shouldn't.
  • Another: get_column_view does not even always return a view! In '19, that @janezd guy added code that maps between different ordering of categorical values!

There may be cases when one needs a view, e.g. because you just created and unlocked a table. As current function does not really always give a view, but it has a good name that we can't reuse: the function get_column that I propose here has an argument copy that, if set to True, forces a copy. We could add view=False; if set to True, it would return a view and fail if it can't.

(Setting both view and copy to True would always return a sparse matrix, because the joker who calls it this way deserves it. 🤪)

@janezd janezd added the needs discussion Core developers need to discuss the issue label Sep 8, 2022
@janezd janezd self-assigned this Sep 23, 2022
@janezd janezd assigned janezd and markotoplak and unassigned janezd Sep 30, 2022
@janezd janezd force-pushed the table-get-column branch 3 times, most recently from 6e09a84 to 123ad30 Compare October 1, 2022 14:10
@janezd janezd removed their assignment Oct 14, 2022
@codecov
Copy link

codecov bot commented Oct 20, 2022

Codecov Report

Merging #6058 (3a549a2) into master (036e96d) will increase coverage by 0.00%.
The diff coverage is 94.59%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6058   +/-   ##
=======================================
  Coverage   86.65%   86.65%           
=======================================
  Files         315      315           
  Lines       67957    67974   +17     
=======================================
+ Hits        58890    58905   +15     
- Misses       9067     9069    +2     

@janezd janezd removed the needs discussion Core developers need to discuss the issue label Oct 21, 2022
Orange/data/table.py Outdated Show resolved Hide resolved
@markotoplak
Copy link
Member

@janezd, thanks for this. All the new usages look much simpler, especially when the values needed to be set. Merging as soon as the tests pass.

@markotoplak markotoplak changed the title Table: Add method get_column [ENH] Table: Add methods get_column and set_column Oct 24, 2022
@markotoplak markotoplak merged commit 0f5025a into biolab:master Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants