Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][WIP] Add backend exec protocol #711

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

rjzamora
Copy link
Member

Note that this is not a high-priority, but I explored the idea a few months ago and wanted to share the branch in case others had interest.

The idea here is that the expression system used in dask-expr also makes it pretty easy to directly execute the query using the backend library (rather than constructing a task graph and scheduling the tasks). I'd expect this to be useful for both debugging and for smallish-data applications. For the latter case, this effectively allows the user to apply query optimization to a "serial" pandas or cudf query.

This is only a half-baked idea for now. Only a small subset of expression types are supported (e.g. FromPandas, Blockwise, SortValues, SetIndex, Merge, GroupbyAggregation).

@mrocklin
Copy link
Member

Two thoughts:

  1. Probably the per-class methods should only have to do their own operation, and not ask their children to exec themselves. There should probably be some other thing that's walking the tree, similar to how we handle simplify. It's important, I think, to make the per-class stuff as simple as possible, and take as much burden as we can onto a centralized traversal system.
  2. I'm curious about the protocol name. Something like __exec__ could make sense. I could also imagine being library specific, like __exec_pandas__. I don't have confidence here one way or the other though.

@mrocklin
Copy link
Member

Also, I agree that this isn't high priority. I'd be fine personally if people ignored this until after the dask migration is done (which is high priority I think).

@rjzamora
Copy link
Member Author

Probably the per-class methods should only have to do their own operation, and not ask their children to exec themselves.

Yeah, I agree that something else should probably walk the tree to keep the protocol code simple. The current implementation is just a very simple way to demonstrate the concept.

I'm curious about the protocol name

I'm personally open to anything. I ended up using "exec" instead of "apply", since apply already has a meaning in pandas. My only hesitation from using "pandas" in the name is that it would be nice to use the same language for array expressions in the future.

I'd be fine personally if people ignored this until after the dask migration is done

Yup - Dask-expr + cudf is pretty much completely broken at the moment, so this proposal is pretty low on my list as well. Just want to make sure the "exec" idea is visible, and that there is a space to discuss it.

@mrocklin
Copy link
Member

Also cc'ing @TomNicholas and @tomwhite who have expressed interest in using dask-expr for things other than Dask. This PR isn't mature, but it's a good example of feasibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants