Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dask,pandas): make groupby null behavior consistent with other backends #9526

Closed

Conversation

hottwaj
Copy link

@hottwaj hottwaj commented Jul 5, 2024

This is for consistency with other backends (duck, postgres), NULL should be included in output groups

Please point me to where unit tests should be added e.g. maybe somewhere in here?

Description of changes

adds dropna=False to .groupby() call for pandas, otherwise if any rows contain NULL in the groupby columns, they are dropped from the output

Issues closed

-->

Copy link
Contributor

github-actions bot commented Jul 5, 2024

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@cpcloud
Copy link
Member

cpcloud commented Jul 5, 2024

@hottwaj It looks like you may need to remove the xfail markers on the tests that are now XPASSing.

@cpcloud
Copy link
Member

cpcloud commented Jul 5, 2024

Which is pretty nice, it means there were already a couple of tests that hit this code path :)

@cpcloud cpcloud added this to the 10.0 milestone Jul 6, 2024
@cpcloud cpcloud added breaking change Changes that introduce an API break at any level pandas The pandas backend dask The Dask backend labels Jul 6, 2024
@cpcloud cpcloud changed the title fixes pandas backend for groupby on column containing null values fix(dask,pandas): make groupby null behavior consistent with other backends Jul 6, 2024
@cpcloud cpcloud force-pushed the fix-groupby-pandas-col-with-null branch from 67e3d34 to f883286 Compare July 6, 2024 13:13
@cpcloud cpcloud force-pushed the fix-groupby-pandas-col-with-null branch from f883286 to ce5a7fd Compare July 6, 2024 13:14
@cpcloud cpcloud force-pushed the fix-groupby-pandas-col-with-null branch from ce5a7fd to 84781c5 Compare July 6, 2024 13:14
@cpcloud
Copy link
Member

cpcloud commented Jul 6, 2024

@hottwaj Who is @jc-5s? Ideally the people that submit PRs are the same people who are making commits.

@hottwaj
Copy link
Author

hottwaj commented Jul 6, 2024

@hottwaj Who is @jc-5s? Ideally the people that submit PRs are the same people who are making commits.

Ah they are both my accounts, one personal and one work related. Didn't realise I had switched between the two sorry :)

@hottwaj
Copy link
Author

hottwaj commented Jul 8, 2024

Thanks for this, great that there are tests in place already :)

It seems like this test doesn't specifically check on a groupby where NULL is one of several values on which to group - should I add a test that better covers that case?
https://github.com/ibis-project/ibis/pull/9526/files#diff-db87b69de7d36308b9695fb95ec797a308009badb8d76f8be92a369c1796edf7L1578

Or maybe I've misunderstood as I can't work out the source of arg "df" in the test

@hottwaj
Copy link
Author

hottwaj commented Jul 11, 2024

Hi there, sorry for bugging: is there anything I can do to help progress this, or is this ready to go?

I still think a more targeted unit test might be needed, but maybe I've misunderstood what the existing tests cover...

Thanks!

@cpcloud
Copy link
Member

cpcloud commented Jul 12, 2024

Hey @hottwaj!

Since this is a breaking change to output behavior, we'll merge it into 10.0. In the meantime, if you want to add a more targeted unit test that'd be great!

@cpcloud
Copy link
Member

cpcloud commented Sep 14, 2024

This PR is no longer viable since we removed the dask and pandas backends!

@cpcloud cpcloud closed this Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that introduce an API break at any level dask The Dask backend pandas The pandas backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: pandas does not preserve the "null" group in group_by on a column containing null values
3 participants