Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(dask): remove the dask backend #9772

Merged
merged 2 commits into from
Sep 13, 2024

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Aug 5, 2024

Description of changes

This PR removes the dask backend from Ibis.

The dask backend has been in pure maintenance mode since the introduction of
dask-expr, which overall seems like a net improvement to dask.

It's not clear what value Ibis continues to add on top of dask.

Dask is a useful library on its own, and I think Dask + Ibis is adding unjustified
complexity for very little gain.

The work to get a version of Dask with dask-expr working doesn't seem worth it.

I tried to do it a few weeks ago and there were numerous tests of ours that
started failing. Going through the whole cycle of upstream bug reporting,
fixing and waiting just doesn't seem worth the trouble.

BREAKING CHANGE: The dask backend is removed. Please use one of the
other backends that Ibis supports.

@cpcloud cpcloud added this to the 10.0 milestone Aug 5, 2024
@cpcloud
Copy link
Member Author

cpcloud commented Aug 5, 2024

This PR is open for discussion, just to have something to anchor discussions. Just because there's a PR doesn't mean it will be merged.

@cpcloud
Copy link
Member Author

cpcloud commented Aug 5, 2024

Lint failures are related to the BREAKING CHANGE bit. I will fix those up if we decide to merge this.

@cpcloud cpcloud force-pushed the remove-dask-backend branch from ec634f7 to da701b5 Compare August 5, 2024 20:30
@jcrist
Copy link
Member

jcrist commented Aug 5, 2024

+1 to all of that. Of the two non-SQL backends, the dask backend is the one that has some amount of value add since it provides access to a distributed compute engine (while the pandas backend is worse in every metric than the duckdb backend, while providing no added functionality). That said, the maintenance cost and current architecture make maintaining it without known users asking for it not worth it IMO.

I think we should drop both the dask and pandas backends as a breaking change in 10.0.

@cpcloud cpcloud force-pushed the remove-dask-backend branch from da701b5 to d871dd3 Compare August 6, 2024 13:14
@gforsyth
Copy link
Member

gforsyth commented Aug 6, 2024

I agree with everything above.

I agree the dask backend could be of use if we rewrote it, but nothing about its current state would inform that rewrite.

I think the pandas backend is a net-negative for the project -- that's not a strike on pandas itself, but trying to shoehorn an eager evaluation engine into Ibis leads to a very unpleasant experience, and new users often try the pandas backend because they are looking for something familiar / think they need to because they already have a dataframe in-memory.

Rip it all out.

@cpcloud cpcloud force-pushed the remove-dask-backend branch from d871dd3 to c830a29 Compare August 10, 2024 10:28
@kszucs
Copy link
Member

kszucs commented Aug 16, 2024

My preference would be:

  • keep the pandas backend because it is much simpler now than it used to be and the pandas rewrites are used by the polars backend too, the core logic in the pandas backend can also be used to support other eagerly evaluated compute engines
  • rewrite the dask backend to use dask-expr instead of removing it entirely

@cpcloud
Copy link
Member Author

cpcloud commented Aug 16, 2024

rewrite the dask backend to use dask-expr instead of removing it entirely

This is purely additional work, without any benefit. Concretely, who is using the dask backend and would actually benefit from this?

Why should we do this?

keep the pandas backend because it is much simpler now than it used to be and the pandas rewrites are used by the polars backend too, the core logic in the pandas backend can also be used to support other eagerly evaluated compute engines

We shouldn't keep a backend around because we might use its infrastructure someday. Also, we can just move whatever polars is using into the polars backend.

The pandas backend at this point just causes confusion and incorrect perceptions about Ibis's performance.

Should we really put person hours of effort in to address those performance issues just because some random person decided to try it out and didn't have a good experience?

@cpcloud cpcloud force-pushed the remove-dask-backend branch from c830a29 to 885a4eb Compare August 17, 2024 18:11
@cpcloud
Copy link
Member Author

cpcloud commented Aug 19, 2024

@gforsyth @jcrist @lostmygithubaccount @kszucs @ncclementi I would really like to avoid this PR stalling.

I think we should include this as a breaking change for 10.0.

@jcrist
Copy link
Member

jcrist commented Aug 19, 2024

I agree (as I said above). I also think we should do the same for the pandas backend.

@lostmygithubaccount
Copy link
Member

can we go through a deprecation cycle with a warning instead of removing entirely in a release? I'm worried on:

who is using the dask backend and would actually benefit from this?

that there are actually people using it in production that we don't know about. I vaguely remember at least one person mentioning they were on GH (though I could not find this), and there could be many who don't interact in the community

instead of removing the backends, could we start off with a warning if you're using them in 10.0 and look to remove in 11.0? ideally I think we should ensure users are aware that this change is coming and have an opportunity to weigh in (without needing to closely follow the project on GitHub)

@cpcloud
Copy link
Member Author

cpcloud commented Aug 19, 2024

Is there any reason to wait?

Here are the scenarios:

plan to remove in 10.0

  1. someone complains after we remove it and we revert (requires a bump to 11.0)
  2. no one complains

plan to remove in 11.0

  1. someone complains before the release and we do nothing
  2. no one complains before the release

I don't really see much benefit in waiting given that there's only a single scenario that requires a revert.

gforsyth
gforsyth previously approved these changes Aug 19, 2024
Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rip it out

@lostmygithubaccount
Copy link
Member

Is there any reason to wait?

as an outsider, what I would see is a backend being removed without warning because the maintainers did not want to maintain it. this makes me worry that the backend I use or would want to use could be removed at any moment without warning. of course in this case, there are good reasons for the removal, but I may not be aware of those or care enough to dig through GitHub to understand the context

if, instead, I see that the backend was deprecated for 1-6 months with a warning message and an announcement, and then finally gets removed, my takeaway is very different

I suspect:

someone complains after we remove it and we revert (requires a bump to 11.0)

is fairly likely, and would prefer a longer deprecation cycle with proper warning

@cpcloud
Copy link
Member Author

cpcloud commented Aug 20, 2024

After talking with @ianmcook I think we should give a small buffer for folks to move away from Dask/Pandas.

Next steps are:

  1. Write a blog stating we're removing both in 11.0 and why we're doing it. (I can take that on)
  2. Post the blog on Zulip, X, and LinkedIn.
  3. Deprecate in 9.4 (the next release).
  4. Release 10.0 at least a month later, if not longer, containing the removal of both the pandas and dask backends.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2024

-1 on pandas removal here
the cost is pretty low and it's likely highly used

@cpcloud cpcloud added the breaking change Changes that introduce an API break at any level label Aug 24, 2024
@cpcloud cpcloud force-pushed the remove-dask-backend branch 2 times, most recently from fe71e80 to 1db5d0e Compare August 24, 2024 14:59
@cpcloud cpcloud force-pushed the remove-dask-backend branch 7 times, most recently from 74a3bd9 to 306b3d4 Compare September 12, 2024 09:47
@cpcloud cpcloud added the dask The Dask backend label Sep 12, 2024
@cpcloud cpcloud force-pushed the remove-dask-backend branch from 306b3d4 to 4e5b949 Compare September 12, 2024 09:58
@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2024

Started going through and cleaning up deprecations.

I'm going to submit a separate PR for each removal.

In the first PR I'll disable the deprecation check in the release verification script and then once we're done removing stuff, I'll re-enable it.

@gforsyth gforsyth dismissed their stale review September 12, 2024 15:35

out of date

@cpcloud
Copy link
Member Author

cpcloud commented Sep 12, 2024

ok I'm going to back out all the changes and just keep the backend removal

i will disable the check for deprecation in the release dry run script for now, and we can enable it again once we get the breakages in

BREAKING CHANGE: The `dask` backend is removed. Please use one of the
other backends that Ibis supports.
@cpcloud cpcloud force-pushed the remove-dask-backend branch from e10b105 to 8f9e213 Compare September 12, 2024 17:17
Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty comprehensive to me.

We may want to update docs/backends_sankey.py after we've also removed pandas and then update the sankey diagram to reflect the new backend landscape.

I figure this should have multiple approvals before merging.

# move back to 3.12 when dask-expr is supported or the dask backend is
# removed
default = ibis310;
default = ibis312;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@cpcloud cpcloud merged commit 6faead2 into ibis-project:main Sep 13, 2024
92 checks passed
@cpcloud cpcloud deleted the remove-dask-backend branch September 13, 2024 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that introduce an API break at any level dask The Dask backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants