Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] New widget: Group By #5541

Merged
merged 7 commits into from
Oct 22, 2021
Merged

[ENH] New widget: Group By #5541

merged 7 commits into from
Oct 22, 2021

Conversation

PrimozGodec
Copy link
Contributor

@PrimozGodec PrimozGodec commented Aug 2, 2021

Issue

Orange is missing group by - aggregate widget with similar behaviour than Panda groupby - aggregate

Description of changes
  • Aggregate module and function on the Table inspired by Pandas. One can call gropuby on the table:
    gb =  table.groupby([list of variables])
    gb.aggregate(aggregations)
    
    It is implemented on Table such that alternative implementation of the table can override it and do some extra stuff, e.g. Corpus would like to keep tokens.
  • New groupby widget

Documentation will be added when we agree on the widget's functionality.

Includes
  • Code changes
  • Tests
  • Documentation

@codecov
Copy link

codecov bot commented Aug 2, 2021

Codecov Report

Merging #5541 (02d205e) into master (67b6cef) will increase coverage by 0.05%.
The diff coverage is 94.25%.

@@            Coverage Diff             @@
##           master    #5541      +/-   ##
==========================================
+ Coverage   85.96%   86.01%   +0.05%     
==========================================
  Files         313      315       +2     
  Lines       65471    65936     +465     
==========================================
+ Hits        56280    56714     +434     
- Misses       9191     9222      +31     

@markotoplak markotoplak changed the title New widget: Group By [ENH] New widget: Group By Aug 3, 2021
pylintrc Show resolved Hide resolved
@ajdapretnar
Copy link
Contributor

When the data set contains TimeVariable, I get this error (try with Banking Crises from Datasets):

Traceback (most recent call last):
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 1024, in __process_next
    if self.__process_next_helper(use_max_active=True):
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 1062, in __process_next_helper
    self.process_node(selected_node)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/site-packages/orangecanvas/scheme/signalmanager.py", line 690, in process_node
    self.send_to_node(node, signals_in)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 806, in send_to_node
    self.process_signals_for_widget(node, widget, signals)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 820, in process_signals_for_widget
    process_signals_for_widget(widget, signals, workflow)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 921, in process_signals_for_widget
    process_signal_input(input_meta, widget, signal, workflow)
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/workflow/widgetsscheme.py", line 884, in process_signal_input_default
    notify_input_helper(
  File "/Users/ajda/opt/miniconda3/envs/o3/lib/python3.8/functools.py", line 875, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/utils/signals.py", line 643, in set_input_helper
    handler(*args)
  File "/Users/ajda/orange/orange-widget-base/orangewidget/utils/signals.py", line 198, in summarize_wrapper
    method(widget, value)
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owgroupby.py", line 507, in set_data
    {
  File "/Users/ajda/orange/orange3/Orange/widgets/data/owgroupby.py", line 508, in <dictcomp>
    attr: DEFAULT_AGGREGATIONS[type(attr)].copy()
KeyError: <class 'Orange.data.variable.TimeVariable'>

@ajdapretnar
Copy link
Contributor

Also, wouldn't it make sense to put the grouped by variables in metas for the output data table since they are unique?

@ajdapretnar
Copy link
Contributor

ajdapretnar commented Aug 4, 2021

Filter doesn't work properly. Also when connecting new data, the filter value should probably be reset (unless perhaps the query matches certain variables). Dunno, just an idea.

Screen Shot 2021-08-04 at 16 35 47

@ajdapretnar
Copy link
Contributor

Corpus instances are not retained (i.e. a grouped Corpus is not a Corpus on the output). The problem is (possibly) how the self.text_variable is treated in the new data set. We could perhaps just set it as it was if the variable is present in the new data set, else we output a Table?

@PrimozGodec
Copy link
Contributor Author

This PR need #5547 to work properly for TimeVariables

@PrimozGodec
Copy link
Contributor Author

Filter doesn't work properly. Also when connecting new data, the filter value should probably be reset (unless perhaps the query matches certain variables). Dunno, just an idea.

It is a bug in ListViewSearch. Will be addressed in the separate PR since it is an issue that also appears elsewhere.

@PrimozGodec PrimozGodec force-pushed the aggregate branch 7 times, most recently from b7274a3 to 75524c7 Compare September 2, 2021 10:42
@PrimozGodec PrimozGodec changed the title [ENH] New widget: Group By [RFC][ENH] New widget: Group By Sep 2, 2021
@PrimozGodec PrimozGodec marked this pull request as ready for review September 2, 2021 12:54
@janezd janezd self-assigned this Sep 3, 2021
Copy link
Contributor

@janezd janezd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done.

I wrote some minor comments. I'll have more in person. :)

return " ".join(str(v) for v in x if not pd.isnull(v) and len(str(v)) > 0)


AGGREGATIONS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of "Delta" is not obvious, and it's place doesn't help; I guess it would be better to put it after min and max.

Concatenate is easier to guess (at least for me), but maybe "List of values" would be more informative. For my taste, it is closer to sum than to delta.

Consider renaming and reordering the aggreagations, maybe like this:

Mean, Median, Mode, Standard deviation, Variance,
Sum, List of all values, Min. value, Max. value, Span
First value, Last Value, Random Value, Count defined, Count

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"List of values" doesn't really sound right when talking about strings I would say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a list of possible values but space-separated values that appear in the column (with repetitions).

I agree "List of values" is ambiguous. Concatenate may be better after all, unless we come up with something better.

Copy link
Contributor Author

@PrimozGodec PrimozGodec Sep 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't have a better idea for now. I Will leave concatenate and we can change later

"""
Reset the aggregation values in the table for the attribute
"""
index = self.index(self.domain.index(attribute), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> Orange.data.Table("zoo").domain.index("name")
-1

I'm guilty as charged, but that's what we inherited from Orange 2. :(

def rowCount(self, parent=None) -> int: # pylint: disable=unused-argument
return (
0
if self.domain is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or parent.isValid(), I guess.

aggs = sorted(
self.parent.aggregations.get(val, []), key=AGGREGATIONS_ORD.index
)
n_more = "" if len(aggs) <= 2 else f" and {len(aggs) - 2} more"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are <= 3, show all; otherwise show the first two and add and ... more. Showing "This, that and 1 more" takes about the same width as "This, that and that" :)

agg = self.text()
selected_attrs = self.parent.get_selected_attributes()
types_ = set(type(attr) for attr in selected_attrs)
can_be_applied_all = (AGGREGATIONS[agg].types & types_) == types_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same as types_ <= AGGREGATIONS[agg].types?

"""Callback for table selection change; update checkboxes"""
selected_attrs = self.get_selected_attributes()

types_ = set(type(attr) for attr in selected_attrs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting that we never write list(type(attr) for attr in selected_attrs) instead of [type(attr) for attr in selected_attrs], but we often write set(type(attr) for attr in selected_attrs) instead of {type(attr) for attr in selected_attrs}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case, the reason for this is the fact that {} produces dict and not set and for an empty set, one must write set(). Then sometimes I automatically use set(...) in comprehensions instead of {...}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's probably the reason for all of us.

types_ = set(type(attr) for attr in selected_attrs)
active_aggregations = [self.aggregations[attr] for attr in selected_attrs]
for agg, cb in self.agg_checkboxes.items():
cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prehaps not types_ & AGGREGATIONS[agg].types?

for agg, cb in self.agg_checkboxes.items():
cb.setDisabled(len(types_ & AGGREGATIONS[agg].types) == 0)

activated = [agg in a for a in active_aggregations]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation: it would be fun to use a set here, activated = {agg in a for a in active_aggregations}, because the condition would then be Qt.checked if activated == {True} else (Qt.Unchecked if activated == {False} else Qt.PartiallyChecked). :)

self.openContext(self.data)

# update selections in widgets and re-plot
self.agg_table_model.set_domain(data.domain if data else None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Domain is already set above, before opening the context (as it should be). Is there any reason to repeat this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we set the domain to gb_attrs_model (attribute model for the group by list). Here we set the domain to agg_table_model which is used for the table. It is set after the context is open to plot the table with settings opened by the context (aggreagtions). Initializing it before the context is open would require another call to the table to replot after the context is open.


class OrangeTableGroupBy:
"""
The class which object is the result of the groupby operation on Orange's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A class representing the result of ...

@janezd janezd removed their assignment Sep 24, 2021
@mw25
Copy link
Contributor

mw25 commented Sep 27, 2021

Thank you for this widget! I needed this functionality too and was about to start trying an implementation until I saw this pull request.

I like it very much, but if I may, I would like to add some more ideas:

  • How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

I work with time series data and when I group and aggregate I also always have to select the time range of the group to aggregate over. So from a group or stage always only e.g. the last 10 values or 10 minutes or similar are selected and aggregated, because first a settling process must be waited for. Do you understand what I mean? I think this is a relatively common approach.

  • For this the function pandas.core.groupby.GroupBy.tail would be good, where you can specify the last n values.
  • For time series data, it's a bit more complicated (The last x seconds/minutes/... of each group). I think you would have to take a combination of pandas.core.groupby.GroupBy.apply and pandas.DataFrame.last, where you can pass arbitrary time periods. In a python script I do it like this:
    df.groupby("B").apply(lambda df: df.last('2min').mean())

For each of these aggregations, an additional parameter is required. Also, you might need to use two groupby widgets in a row, the first for grouping and cutting the range, the second for a subsequent aggregation like mean.

The apply function could perhaps be called "Custom" and allow text input to use other pandas functions.
See screenshot. (I only extended the GUI, there is no function behind it).

What do you guys think about this?
I hope that was understandable and not too much.
Thanks again for the status so far.

groupby-ideas

@janezd
Copy link
Contributor

janezd commented Sep 27, 2021

How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

@mw25
Copy link
Contributor

mw25 commented Sep 27, 2021

How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

Thanks for the hint.
To be honest, I can't remember which use case I had in mind with the "None" option...

@mw25
Copy link
Contributor

mw25 commented Sep 27, 2021

For each of these aggregations, an additional parameter is required. Also, you might need to use two groupby widgets in a row, the first for grouping and cutting the range, the second for a subsequent aggregation like mean.

Here could be a problem: The normal functions return one value per group, mine suggested usually several. What happens to the output table if "Mean" is selected for one signal and "Tail" for another?
Maybe the functions I suggested (if they are wanted at all) should not be considered as further aggregate functions, but as optional preprocessing.

When selecting a time range (with e.g. last('10s')) the selection of the time channel to be used would also be necessary.

@PrimozGodec
Copy link
Contributor Author

How about an aggregation "None" where only grouping is done, but no aggregation. Or am I missing something and this is already easier to do with another widget?

A widget called Unique already does it. Also, if you are not interested in any aggregation, you can use whichever you want and ignore it. :) Though your suggestion is still reasonable.

What unique widget do can be already done with this widget (except the option Discard non-unique instances) with using Frist value, Last value, ...

I think @mw25 had in mind a case where we just group and do not attach any aggregation to it. For example, if we have a table with attributes a, b, c and we group by a and b the table would include only columns a and b (with unique values combinations). Do I understand it right? It cannot be done with the unique widget.

I actually think that Group By widget should return this kind of "aggregation" when turning off all aggregations on all attributes. Currently, the error is raised by Pandas. I will fix that.

@mw25
Copy link
Contributor

mw25 commented Sep 27, 2021

I think @mw25 had in mind a case where we just group and do not attach any aggregation to it. For example, if we have a table with attributes a, b, c and we group by a and b the table would include only columns a and b (with unique values combinations). Do I understand it right? It cannot be done with the unique widget.

I tried hard again to remember what I originally wanted =D
If I group my table (columns A, B, C) by A and B I wanted the following result:
The signals stay as they are, so the number of rows does not change. But a new column is added with the grouping. So for example a new numbering (that would be the result of pandas.core.groupby.GroupBy.ngroup) or the values of the columns A and B as text.

groupby

@ajdapretnar
Copy link
Contributor

You can already achieve the second option with Feature Constructor and str(A)+"-"+str(B) option.

@PrimozGodec
Copy link
Contributor Author

I work with time series data and when I group and aggregate I also always have to select the time range of the group to aggregate over. So from a group or stage always only e.g. the last 10 values or 10 minutes or similar are selected and aggregated, because first a settling process must be waited for. Do you understand what I mean? I think this is a relatively common approach.

For this the function pandas.core.groupby.GroupBy.tail would be good, where you can specify the last n values.
For time series data, it's a bit more complicated (The last x seconds/minutes/... of each group). I think you would have to take a combination of pandas.core.groupby.GroupBy.apply and pandas.DataFrame.last, where you can pass arbitrary time periods. In a python script I do it like this:
df.groupby("B").apply(lambda df: df.last('2min').mean())

Regarding the other two aggregations:

  • tail (and possibly the head) would be confusing. What happens when the user select tail (with 10 values) and mean? Mean returns only one value per group tail multiple. What to output then. It would be actually the complete new widget.
  • custom function is possible but the question is if is not too complicated for users? What do you think @janezd and @ajdapretnar.

@mw25
Copy link
Contributor

mw25 commented Sep 27, 2021

You can already achieve the second option with Feature Constructor and str(A)+"-"+str(B) option.

Oh yes that's right. I knew I missed something...

tail (and possibly the head) would be confusing. What happens when the user select tail (with 10 values) and mean? Mean returns only one value per group tail multiple. What to output then. It would be actually the complete new widget.

@PrimozGodec Yes, I noticed that afterwards, too. Did you see my comment:

Maybe the functions I suggested (if they are wanted at all) should not be considered as further aggregate functions, but as optional preprocessing.

These preprocessing options would then of course apply to all signals and be listed separately in the widget.

@ajdapretnar
Copy link
Contributor

Personally, I'd have a separate widget in Timeseries to handle such use cases. I'd prefer to keep the core Group by widget simple and straightforward.

@janezd
Copy link
Contributor

janezd commented Sep 27, 2021

custom function is possible but the question is if is not too complicated for users? What do you think @janezd and @ajdapretnar.

I didn't want to comment it, but if you ask me directly, I will: I don't like it. For those who want to code, there are Feature Constructor and Python Script. If we add it here, we'll want to add it to the next widget next week and soon many widgets will have inputs that will only be understood by programmers.

@PrimozGodec PrimozGodec force-pushed the aggregate branch 2 times, most recently from edcc500 to cc574b7 Compare September 28, 2021 10:57
@mw25
Copy link
Contributor

mw25 commented Sep 28, 2021

Thank you for your feedback.

I think we agree that a custom option is not an option.

But what about another box (e.g. in the controlArea under the signal list) with an checkable option "Preprocessing" or "Range selection" or something? There could be a dropdown with the functions tail (n values), head (n values), tail (n seconds), head (n seconds), ... and another input field for n (and maybe another one for the time channel?).

Would this still be possible in the groupby widget? Otherwise I think @ajdapretnar idea to implement this in timeseries is good.
But what would be the procedure? Should the groupby widget be derived and simply extended with these elements?
(Of course I would also participate).

@mw25
Copy link
Contributor

mw25 commented Oct 6, 2021

I have extended OrangeTableGroupBy and Table.groupby() a bit to allow preprocessing / range selection for each group.

Regardless of whether the preprocessing is placed in the groupby widget or a timeseries widget, or is only available via scripting (or only in a private widget), I would like to propose my extension.

How can I commit or propose my changes?

@ajdapretnar
Copy link
Contributor

@mw25 As I think your preprocessing is very much related to timeseries, I would propose you make a PR on the Timeseries repository, where we can review it and merge it. The PR should include at least tests and some documentation. Ideally, this would be a widget for timeseries preprocessing called Preprocess Timeseries. That's how I see it.

@janezd janezd self-assigned this Oct 8, 2021
@janezd
Copy link
Contributor

janezd commented Oct 8, 2021

I would propose one additional change (and to demonstrate my commitment, I already made a commit).

I would rearrange the columns, that are currently like this:

Screenshot 2021-10-08 at 11 23 21

to be like this:

Screenshot 2021-10-08 at 11 30 53

which, although visually a bit unusual because the single item is in the last, not the first column, better groups items into categories (statistical moments, sums and extremes, single values and counts). @ajdapretnar (who otherwise visually prefers the original, but agrees that the organization is better in the reformatted version) also commented that this looks better for discrete attributes, because it doesn't disable a randomly scattered collection of options but basically the first two columns (except Concatenate).

Screenshot 2021-10-08 at 11 30 56

@PrimozGodec
Copy link
Contributor Author

@janezd thank you for the changes made. It was one of the modifications that I wanted to make today. It looked a bit bad to me that in the last column two items were missing, so I had an idea to rearrange such that one is missing in the last two columns. But what you proposed looks actually better (ok it looks a bit off in the beginning but it makes more sense). :) So I agree with changes.

@PrimozGodec
Copy link
Contributor Author

I fixed the tests. Now it is ready to merge from the implementation part of the widget. Still, the documentation is missing. I will work on it but will probably not have time today. You can merge it if you think it would be good to have a widget in Orange soon and I can then add the documentation in the next PR. If you think we can wait I can add the documentation to this PR in the beginning of next week.

@PrimozGodec PrimozGodec changed the title [RFC][ENH] New widget: Group By [ENH] New widget: Group By Oct 15, 2021
@janezd janezd merged commit 67e6a71 into biolab:master Oct 22, 2021
@PrimozGodec PrimozGodec deleted the aggregate branch November 19, 2021 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants