Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register read_parquet and read_csv as "dispatchable" #1114

Merged
merged 3 commits into from
Aug 12, 2024
Merged

Conversation

rjzamora
Copy link
Member

Registers the read_parquet and read_csv definitions in dask-expr for the "pandas" backend.

Some Notes:

  • I am doing this now, while RAPIDS is temporarily pinned to dask-2024.7.1 anyway
  • I don't have specific plans to register anything "fancy" for the "cudf" backend right away. However, it probably makes sense to move away from the parquet Engine approach eventually (it probably makes more sense to lean into the Expr class structure to implement backend-specific logic).

@rjzamora rjzamora added the enhancement New feature or request label Jul 31, 2024
@rjzamora rjzamora self-assigned this Jul 31, 2024
@@ -5174,6 +5175,7 @@ def read_fwf(
)


@dataframe_creation_dispatch.register_inplace("pandas")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the cudf backend, dask-cudf can register a simple function that calls back into this function with engine=CudfEngine (already confirmed that this will work fine). The dask-cudf logic can also check the filesystem argument and do something like read_parquet(..., filesystem="arrow").to_backend("cudf") (since filesystem="arrow" is currently broken in dask-expr when the backend is "cudf")

def read_csv(
path,
*args,
header="infer",
dtype_backend=None,
storage_options=None,
_legacy_dataframe_backend="pandas",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to other ideas. We can copy this entire function into dask-cudf if necessary, but it would be nice to have a simple hook like this while the read_csv logic is still so simple.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not having a top-level argument for uses like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Removed it (Dask cuDF can just copy over the boilerplate code).

@rjzamora rjzamora requested a review from phofl August 12, 2024 13:27
dask_expr/_collection.py Outdated Show resolved Hide resolved
dask_expr/_collection.py Outdated Show resolved Hide resolved
@phofl phofl merged commit 39d09eb into main Aug 12, 2024
6 checks passed
@phofl
Copy link
Collaborator

phofl commented Aug 12, 2024

thx

@phofl phofl deleted the dispatch-csv-parquet branch August 12, 2024 16:31
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Aug 13, 2024
After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default).

**This PR is required for CI to pass with dask>2024.8.0**

**NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2014.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and  `read_csv` will fail without this change...

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #16535
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants