[ENH] New widget: Group By #5541

PrimozGodec · 2021-08-02T17:41:24Z

Issue

Orange is missing a group-by/aggregate widget that behaves like Pandas' groupby and aggregate.

Description of changes

Aggregate module and a groupby function on Table, inspired by Pandas. One can call groupby on the table:
```
gb =  table.groupby([list of variables])
gb.aggregate(aggregations)
```
It is implemented on Table so that alternative table implementations can override it and do extra work, e.g. Corpus would like to keep tokens.
New Group By widget
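For readers unfamiliar with the Pandas pattern that the description refers to, here is a minimal sketch of groupby followed by aggregate in Pandas itself. The column names and data are made up for illustration, and the Orange `Table.groupby` signature may differ from what is shown here.

```python
import pandas as pd

# Made-up data; "species" and "length" are illustrative column names.
df = pd.DataFrame(
    {"species": ["a", "a", "b"], "length": [1.0, 3.0, 5.0]}
)

# Group rows by one or more columns, then request aggregations per column.
gb = df.groupby(["species"])
result = gb.aggregate({"length": ["mean", "max"]})

print(result)
```

The result has one row per group and one column per (variable, aggregation) pair, which is the shape the new widget is meant to output.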

Documentation will be added when we agree on the widget's functionality.

Includes

Code changes
Tests
Documentation

codecov · 2021-08-02T17:55:00Z

Codecov Report

Merging #5541 (02d205e) into master (67b6cef) will increase coverage by 0.05%.
The diff coverage is 94.25%.

@@            Coverage Diff             @@
##           master    #5541      +/-   ##
==========================================
+ Coverage   85.96%   86.01%   +0.05%     
==========================================
  Files         313      315       +2     
  Lines       65471    65936     +465     
==========================================
+ Hits        56280    56714     +434     
- Misses       9191     9222      +31

ajdapretnar · 2021-08-04T14:34:34Z

When the data set contains TimeVariable, I get this error (try with Banking Crises from Datasets):

Traceback (most recent call last):
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 1024, in __process_next
    if self.__process_next_helper(use_max_active=True):
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 1062, in __process_next_helper
    self.process_node(selected_node)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 690, in process_node
    self.send_to_node(node, signals_in)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 806, in send_to_node
    self.process_signals_for_widget(node, widget, signals)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 820, in process_signals_for_widget
    process_signals_for_widget(widget, signals, workflow)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 921, in process_signals_for_widget
    process_signal_input(input_meta, widget, signal, workflow)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 884, in process_signal_input_default
    notify_input_helper(
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/utils/signals.py", line 643, in set_input_helper
    handler(*args)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/utils/signals.py", line 198, in summarize_wrapper
    method(widget, value)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owgroupby.py", line 507, in set_data
    {
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owgroupby.py", line 508, in <dictcomp>
    attr: DEFAULT_AGGREGATIONS[type(attr)].copy()
KeyError: <class 'Orange.data.variable.TimeVariable'>
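The traceback shows a plain dictionary lookup keyed on `type(attr)`, which fails for any variable class that is not itself a key, even when a parent class is. The sketch below reproduces that failure mode with stand-in classes (they are not Orange's real classes, and the dictionary contents are invented) and shows one possible fallback that walks the class hierarchy. It only illustrates the problem and is not necessarily the fix adopted in this PR.

```python
# Stand-in classes mimicking the hierarchy; NOT the real Orange classes.
class Variable: ...
class ContinuousVariable(Variable): ...
class DiscreteVariable(Variable): ...
class TimeVariable(ContinuousVariable): ...

# Invented defaults, keyed by exact class as in the traceback.
DEFAULT_AGGREGATIONS = {
    ContinuousVariable: {"Mean"},
    DiscreteVariable: {"Mode"},
}


def default_aggregations(attr):
    """Hypothetical helper: use the nearest ancestor class that has defaults."""
    for cls in type(attr).__mro__:
        if cls in DEFAULT_AGGREGATIONS:
            return DEFAULT_AGGREGATIONS[cls].copy()
    return set()  # no defaults known for this kind of variable


time_var = TimeVariable()

# The exact-type lookup fails, as in the reported error.
try:
    DEFAULT_AGGREGATIONS[type(time_var)]
    exact_lookup_failed = False
except KeyError:
    exact_lookup_failed = True

fallback = default_aggregations(time_var)
```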

ajdapretnar · 2021-08-04T14:35:03Z

Also, wouldn't it make sense to put the grouped by variables in metas for the output data table since they are unique?

ajdapretnar · 2021-08-04T14:37:00Z

Filter doesn't work properly. Also when connecting new data, the filter value should probably be reset (unless perhaps the query matches certain variables). Dunno, just an idea.

ajdapretnar · 2021-08-04T14:41:09Z

Corpus instances are not retained (i.e. a grouped Corpus is not a Corpus on the output). The problem is (possibly) how the self.text_variable is treated in the new data set. We could perhaps just set it as it was if the variable is present in the new data set, else we output a Table?

PrimozGodec · 2021-08-10T13:44:00Z

This PR needs #5547 to work properly for TimeVariables.

PrimozGodec · 2021-08-31T10:32:38Z

> Filter doesn't work properly. Also when connecting new data, the filter value should probably be reset (unless perhaps the query matches certain variables). Dunno, just an idea.

It is a bug in ListViewSearch. It will be addressed in a separate PR since the issue also appears elsewhere.

janezd

Nicely done.

I wrote some minor comments. I'll have more in person. :)

janezd · 2021-09-24T07:46:34Z

Orange/widgets/data/owgroupby.py

+    return " ".join(str(v) for v in x if not pd.isnull(v) and len(str(v)) > 0)
+
+
+AGGREGATIONS = {


The meaning of "Delta" is not obvious, and its place doesn't help; I guess it would be better to put it after min and max.

Concatenate is easier to guess (at least for me), but maybe "List of values" would be more informative. For my taste, it is closer to sum than to delta.

Consider renaming and reordering the aggregations, maybe like this:

Mean, Median, Mode, Standard deviation, Variance,
Sum, List of all values, Min. value, Max. value, Span
First value, Last Value, Random Value, Count defined, Count

"List of values" doesn't really sound right when talking about strings I would say.

This is not a list of possible values but space-separated values that appear in the column (with repetitions).

I agree "List of values" is ambiguous. Concatenate may be better after all, unless we come up with something better.

I also don't have a better idea for now. I will leave Concatenate, and we can change it later.
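To make the behaviour discussed in this thread concrete, the sketch below wraps the `return` line from the diff in a function and applies it to a small column. The function name `concatenate` and the sample values are assumptions; only the body comes from the diff.

```python
import pandas as pd


def concatenate(x):
    # Body taken from the diff; the function name is assumed.
    return " ".join(str(v) for v in x if not pd.isnull(v) and len(str(v)) > 0)


# Made-up column with a missing value, an empty string and a repeated value.
values = pd.Series(["a", None, "b", "", "a"])
result = concatenate(values)
print(result)
```

Missing values and empty strings are skipped, while repeated values are kept, so the output is a space-separated sequence of the values that appear rather than a list of distinct values.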

janezd · 2021-09-24T07:54:22Z

Orange/widgets/data/owgroupby.py

+        """
+        Reset the aggregation values in the table for the attribute
+        """
+        index = self.index(self.domain.index(attribute), 1)


>>> Orange.data.Table("zoo").domain.index("name")
-1

I'm guilty as charged, but that's what we inherited from Orange 2. :(

janezd · 2021-09-24T07:55:24Z

Orange/widgets/data/owgroupby.py

+    def rowCount(self, parent=None) -> int:  # pylint: disable=unused-argument
+        return (
+            0
+            if self.domain is None


or parent.isValid(), I guess.

janezd · 2021-09-24T07:58:39Z

Orange/widgets/data/owgroupby.py

+                aggs = sorted(
+                    self.parent.aggregations.get(val, []), key=AGGREGATIONS_ORD.index
+                )
+                n_more = "" if len(aggs) <= 2 else f" and {len(aggs) - 2} more"


If there are <= 3, show all; otherwise show the first two and add and ... more. Showing "This, that and 1 more" takes about the same width as "This, that and that" :)
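The suggested rule could be sketched as follows. The helper name `summarize` and the comma-separated formatting are assumptions; the widget may join the names differently.

```python
def summarize(aggs):
    """Hypothetical helper: show up to three names, otherwise two plus a count."""
    if len(aggs) <= 3:
        return ", ".join(aggs)
    return ", ".join(aggs[:2]) + f" and {len(aggs) - 2} more"


three = summarize(["Mean", "Median", "Mode"])
four = summarize(["Mean", "Median", "Mode", "Sum"])
```

With this rule the "and 1 more" case never occurs, because three items are always shown in full.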

janezd · 2021-09-24T08:02:45Z

Orange/widgets/data/owgroupby.py

+            agg = self.text()
+            selected_attrs = self.parent.get_selected_attributes()
+            types_ = set(type(attr) for attr in selected_attrs)
+            can_be_applied_all = (AGGREGATIONS[agg].types & types_) == types_


Isn't this the same as types_ <= AGGREGATIONS[agg].types?
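The equivalence can be checked with plain Python sets. In this sketch, strings stand in for the variable classes and `supported` plays the role of `AGGREGATIONS[agg].types`; both are illustrative placeholders.

```python
# Placeholder for AGGREGATIONS[agg].types; strings stand in for classes.
supported = {"continuous", "time"}

selections = [
    set(),
    {"continuous"},
    {"continuous", "discrete"},
    {"discrete"},
]

results = []
for types_ in selections:
    original = (supported & types_) == types_  # form used in the diff
    simplified = types_ <= supported           # subset test
    assert original == simplified
    results.append(simplified)
```

Intersecting with `types_` and comparing back to `types_` is true exactly when every element of `types_` is in `supported`, which is the definition of the subset operator.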

janezd · 2021-09-24T08:18:14Z

Orange/widgets/data/owgroupby.py

+        """Callback for table selection change; update checkboxes"""
+        selected_attrs = self.get_selected_attributes()
+
+        types_ = set(type(attr) for attr in selected_attrs)


It is interesting that we never write list(type(attr) for attr in selected_attrs) instead of [type(attr) for attr in selected_attrs], but we often write set(type(attr) for attr in selected_attrs) instead of {type(attr) for attr in selected_attrs}.

In my case, the reason for this is the fact that {} produces dict and not set and for an empty set, one must write set(). Then sometimes I automatically use set(...) in comprehensions instead of {...}

Yes, that's probably the reason for all of us.

janezd · 2021-09-24T08:19:05Z

Orange/widgets/data/owgroupby.py

+        types_ = set(type(attr) for attr in selected_attrs)
+        active_aggregations = [self.aggregations[attr] for attr in selected_attrs]
+        for agg, cb in self.agg_checkboxes.items():
+            cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0)


Perhaps not types_ & AGGREGATIONS[agg].types?

janezd · 2021-09-24T08:23:03Z

Orange/widgets/data/owgroupby.py

+        for agg, cb in self.agg_checkboxes.items():
+            cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0)
+
+            activated = [agg in a for a in active_aggregations]


Just an observation: it would be fun to use a set here, activated = {agg in a for a in active_aggregations}, because the condition would then be Qt.Checked if activated == {True} else (Qt.Unchecked if activated == {False} else Qt.PartiallyChecked). :)
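A Qt-free sketch of this idea, with strings standing in for the Qt check-state constants and a hypothetical helper name:

```python
def check_state(agg, active_aggregations):
    """Hypothetical helper; strings stand in for Qt's check-state constants."""
    # The set collapses the per-attribute booleans to {True}, {False},
    # {True, False}, or the empty set when nothing is selected.
    activated = {agg in a for a in active_aggregations}
    if activated == {True}:
        return "Checked"
    if activated == {False}:
        return "Unchecked"
    return "PartiallyChecked"


all_on = check_state("Mean", [{"Mean"}, {"Mean", "Sum"}])
all_off = check_state("Mean", [{"Sum"}, set()])
mixed = check_state("Mean", [{"Mean"}, {"Sum"}])
```

Note that an empty selection produces an empty set, which falls through to the partially checked branch in this form, so real code would probably want to handle that case separately.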

janezd · 2021-09-24T08:27:44Z

Orange/widgets/data/owgroupby.py

+        self.openContext(self.data)
+
+        # update selections in widgets and re-plot
+        self.agg_table_model.set_domain(data.domain if data else None)


Domain is already set above, before opening the context (as it should be). Is there any reason to repeat this?

Above, we set the domain on gb_attrs_model (the attribute model for the group-by list). Here, we set the domain on agg_table_model, which is used for the table. It is set after the context is opened so that the table is drawn with the settings (aggregations) restored by the context. Initializing it before opening the context would require another call to redraw the table once the context is open.

janezd · 2021-09-24T08:30:48Z

Orange/data/aggregate.py

+
+class OrangeTableGroupBy:
+    """
+    The class which object is the result of the groupby operation on Orange's


A class representing the result of ...

mw25 · 2021-09-27T07:50:34Z

Thank you for this widget! I needed this functionality too and was about to start trying an implementation until I saw this pull request.

I like it very much, but if I may, I would like to add some more ideas:

How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

I work with time series data, and when I group and aggregate I also always have to select the time range of the group to aggregate over. So from each group or stage, only e.g. the last 10 values or the last 10 minutes are selected and aggregated, because a settling process has to finish first. Do you understand what I mean? I think this is a relatively common approach.

For this the function pandas.core.groupby.GroupBy.tail would be good, where you can specify the last n values.
For time series data, it's a bit more complicated (The last x seconds/minutes/... of each group). I think you would have to take a combination of pandas.core.groupby.GroupBy.apply and pandas.DataFrame.last, where you can pass arbitrary time periods. In a python script I do it like this:
df.groupby("B").apply(lambda df: df.last('2min').mean())

For each of these aggregations, an additional parameter is required. Also, you might need to use two groupby widgets in a row, the first for grouping and cutting the range, the second for a subsequent aggregation like mean.
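The two-step idea (first cut each group down to its last n rows, then aggregate) can be sketched in Pandas with `GroupBy.tail` followed by a second groupby. The data and column names below are invented for illustration.

```python
import pandas as pd

# Made-up measurements: three readings per stage "s1" and "s2".
df = pd.DataFrame(
    {
        "B": ["s1", "s1", "s1", "s2", "s2", "s2"],
        "value": [10.0, 2.0, 4.0, 20.0, 6.0, 8.0],
    }
)

# Step 1: keep only the last two rows of each group (the settled part).
last_two = df.groupby("B").tail(2)

# Step 2: aggregate the remaining rows per group.
means = last_two.groupby("B")["value"].mean()
print(means)
```

Step 1 returns several rows per group, and only step 2 reduces each group to one value.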

The apply function could perhaps be called "Custom" and allow text input to use other pandas functions.
See screenshot. (I only extended the GUI, there is no function behind it).

What do you guys think about this?
I hope that was understandable and not too much.
Thanks again for the status so far.

janezd · 2021-09-27T08:07:10Z

> How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

mw25 · 2021-09-27T08:56:06Z

> How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?
>
> A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

Thanks for the hint.
To be honest, I can't remember which use case I had in mind with the "None" option...

mw25 · 2021-09-27T08:57:21Z

> For each of these aggregations, an additional parameter is required. Also, you might need to use two groupby widgets in a row, the first for grouping and cutting the range, the second for a subsequent aggregation like mean.

There could be a problem here: the normal functions return one value per group, while the ones I suggested usually return several. What happens to the output table if "Mean" is selected for one signal and "Tail" for another?
Maybe the functions I suggested (if they are wanted at all) should not be considered as further aggregate functions, but as optional preprocessing.

When selecting a time range (with e.g. last('10s')) the selection of the time channel to be used would also be necessary.

PrimozGodec · 2021-09-27T09:26:52Z

> How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?
>
> A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

What the Unique widget does can already be done with this widget (except for the option Discard non-unique instances) by using First value, Last value, ...

I think @mw25 had in mind a case where we just group and do not attach any aggregation to it. For example, if we have a table with attributes a, b, c and we group by a and b, the table would include only columns a and b (with the unique combinations of their values). Do I understand it right? It cannot be done with the Unique widget.

I actually think that the Group By widget should return this kind of "aggregation" when all aggregations are turned off for all attributes. Currently, Pandas raises an error. I will fix that.
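The "group without aggregating" output described above can be sketched in Pandas as the unique combinations of the group-by columns. The data is invented, and this is not necessarily how the widget implements it.

```python
import pandas as pd

# Made-up table with attributes a, b and c.
df = pd.DataFrame(
    {
        "a": [1, 1, 2, 2],
        "b": ["x", "x", "y", "z"],
        "c": [0.1, 0.2, 0.3, 0.4],
    }
)

# Keep only the group-by columns and drop repeated combinations.
unique_groups = df[["a", "b"]].drop_duplicates().reset_index(drop=True)
print(unique_groups)
```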

mw25 · 2021-09-27T10:17:00Z

> I think @mw25 had in mind a case where we just group and do not attach any aggregation to it. For example, if we have a table with attributes a, b, c and we group by a and b the table would include only columns a and b (with unique values combinations). Do I understand it right? It cannot be done with the unique widget.

I tried hard again to remember what I originally wanted =D
If I group my table (columns A, B, C) by A and B I wanted the following result:
The signals stay as they are, so the number of rows does not change. But a new column is added with the grouping. So for example a new numbering (that would be the result of pandas.core.groupby.GroupBy.ngroup) or the values of the columns A and B as text.
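A small Pandas sketch of the numbering variant, using `GroupBy.ngroup` as mentioned above. The data and the name of the new column are made up.

```python
import pandas as pd

# Made-up table; column C is carried along unchanged.
df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2, 1],
        "B": ["x", "y", "x", "x", "x"],
        "C": [0.5, 0.6, 0.7, 0.8, 0.9],
    }
)

# The number of rows stays the same; each row gets the number of its group.
df["group"] = df.groupby(["A", "B"]).ngroup()
print(df)
```

With the default sorting of group keys, the groups (1, x), (1, y) and (2, x) are numbered 0, 1 and 2.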

ajdapretnar · 2021-09-27T10:26:51Z

You can already achieve the second option with Feature Constructor and str(A)+"-"+str(B) option.

PrimozGodec · 2021-09-27T11:18:52Z

> I work with time series data and when I group and aggregate I also always have to select the time range of the group to aggregate over. So from a group or stage always only e.g. the last 10 values or 10 minutes or similar are selected and aggregated, because first a settling process must be waited for. Do you understand what I mean? I think this is a relatively common approach.
>
> For this the function pandas.core.groupby.GroupBy.tail would be good, where you can specify the last n values.
> For time series data, it's a bit more complicated (The last x seconds/minutes/... of each group). I think you would have to take a combination of pandas.core.groupby.GroupBy.apply and pandas.DataFrame.last, where you can pass arbitrary time periods. In a python script I do it like this:
> df.groupby("B").apply(lambda df: df.last('2min').mean())

Regarding the other two aggregations:

tail (and possibly head) would be confusing. What happens when the user selects tail (with 10 values) and mean? Mean returns one value per group, while tail returns several. What should the output be then? It would actually be a completely new widget.
A custom function is possible, but the question is whether it is too complicated for users. What do you think, @janezd and @ajdapretnar?

mw25 · 2021-09-27T11:31:59Z

> You can already achieve the second option with Feature Constructor and str(A)+"-"+str(B) option.

Oh yes that's right. I knew I missed something...

> tail (and possibly the head) would be confusing. What happens when the user select tail (with 10 values) and mean? Mean returns only one value per group tail multiple. What to output then. It would be actually the complete new widget.

@PrimozGodec Yes, I noticed that afterwards, too. Did you see my comment:

> Maybe the functions I suggested (if they are wanted at all) should not be considered as further aggregate functions, but as optional preprocessing.

These preprocessing options would then of course apply to all signals and be listed separately in the widget.

ajdapretnar · 2021-09-27T11:40:31Z

Personally, I'd have a separate widget in Timeseries to handle such use cases. I'd prefer to keep the core Group by widget simple and straightforward.

janezd · 2021-09-27T12:34:12Z

> custom function is possible but the question is if is not too complicated for users? What do you think @janezd and @ajdapretnar.

I didn't want to comment on it, but if you ask me directly, I will: I don't like it. For those who want to code, there are Feature Constructor and Python Script. If we add it here, we'll want to add it to the next widget next week, and soon many widgets will have inputs that only programmers will understand.

mw25 · 2021-09-28T12:22:59Z

Thank you for your feedback.

I think we agree that a custom option is not an option.

But what about another box (e.g. in the controlArea under the signal list) with a checkable option "Preprocessing" or "Range selection" or something similar? There could be a dropdown with the functions tail (n values), head (n values), tail (n seconds), head (n seconds), ... and another input field for n (and maybe another one for the time channel?).

Would this still be possible in the groupby widget? Otherwise I think @ajdapretnar's idea to implement this in Timeseries is good.
But what would the procedure be? Should the new widget be derived from the groupby widget and simply extended with these elements?
(Of course I would also participate.)

mw25 · 2021-10-06T07:05:44Z

I have extended OrangeTableGroupBy and Table.groupby() a bit to allow preprocessing / range selection for each group.

Regardless of whether the preprocessing is placed in the groupby widget or a timeseries widget, or is only available via scripting (or only in a private widget), I would like to propose my extension.

How can I commit or propose my changes?

ajdapretnar · 2021-10-06T07:10:56Z

@mw25 As I think your preprocessing is very much related to timeseries, I would propose you make a PR on the Timeseries repository, where we can review it and merge it. The PR should include at least tests and some documentation. Ideally, this would be a widget for timeseries preprocessing called Preprocess Timeseries. That's how I see it.

janezd · 2021-10-08T09:57:22Z

I would propose one additional change (and to demonstrate my commitment, I already made a commit).

I would rearrange the columns, that are currently like this:

to be like this:

which, although visually a bit unusual because the single item is in the last, not the first column, better groups items into categories (statistical moments, sums and extremes, single values and counts). @ajdapretnar (who otherwise visually prefers the original, but agrees that the organization is better in the reformatted version) also commented that this looks better for discrete attributes, because it doesn't disable a randomly scattered collection of options but basically the first two columns (except Concatenate).

PrimozGodec · 2021-10-08T10:17:16Z

@janezd thank you for the changes. It was one of the modifications that I wanted to make today. It looked a bit bad to me that two items were missing in the last column, so I had the idea to rearrange them so that one item is missing in each of the last two columns. But what you proposed actually looks better (OK, it looks a bit off at first, but it makes more sense). :) So I agree with the changes.

PrimozGodec · 2021-10-08T11:22:53Z

I fixed the tests. The implementation part of the widget is now ready to merge. The documentation is still missing. I will work on it but will probably not have time today. You can merge it if you think it would be good to have the widget in Orange soon, and I can then add the documentation in the next PR. If you think we can wait, I can add the documentation to this PR at the beginning of next week.

markotoplak changed the title ~~New widget: Group By~~ [ENH] New widget: Group By Aug 3, 2021

PrimozGodec commented Aug 4, 2021

PrimozGodec force-pushed the aggregate branch 7 times, most recently from b7274a3 to 75524c7 on September 2, 2021 10:42

PrimozGodec changed the title ~~[ENH] New widget: Group By~~ [RFC][ENH] New widget: Group By Sep 2, 2021

PrimozGodec marked this pull request as ready for review September 2, 2021 12:54

janezd self-assigned this Sep 3, 2021

janezd approved these changes Sep 24, 2021

janezd removed their assignment Sep 24, 2021

PrimozGodec force-pushed the aggregate branch from 433ada6 to b532270 on September 27, 2021 09:15

PrimozGodec force-pushed the aggregate branch 2 times, most recently from edcc500 to cc574b7 on September 28, 2021 10:57

janezd self-assigned this Oct 8, 2021

PrimozGodec added 5 commits October 8, 2021 11:51

Aggregate module for Table

71661ca

New widget: group by

b72ea51

Groupby: temp icon

8ae2447

Fixes suggested by janezd

022e41e

New aggregation and handle empty aggregation selection

7735e26

janezd force-pushed the aggregate branch from cc574b7 to 9bc7bbf on October 8, 2021 09:51

PrimozGodec force-pushed the aggregate branch from 9bc7bbf to 378d631 on October 8, 2021 11:19

PrimozGodec changed the title ~~[RFC][ENH] New widget: Group By~~ [ENH] New widget: Group By Oct 15, 2021

PrimozGodec force-pushed the aggregate branch from cd6308a to 3212dd3 on October 15, 2021 14:16

mw25 reviewed Oct 20, 2021

janezd and others added 2 commits October 20, 2021 16:28

Groupby: Rearrange columns

e790758

Group by: documentation

02d205e

PrimozGodec force-pushed the aggregate branch from 3212dd3 to 02d205e on October 20, 2021 14:28

janezd merged commit 67e6a71 into biolab:master Oct 22, 2021

mw25 mentioned this pull request Nov 2, 2021

Table: extend aggregate module by preprocessing #5678

Closed

3 tasks

PrimozGodec deleted the aggregate branch November 19, 2021 10:01
