Register `read_parquet` and `read_csv` as "dispatchable" #1114

Conversation
@@ -5174,6 +5175,7 @@ def read_fwf(
    )

@dataframe_creation_dispatch.register_inplace("pandas")
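To make the decorator in the diff concrete, here is a minimal sketch of how a creation-dispatch registry like this could work. The class and its internals are illustrative only and do not reproduce dask's actual implementation; only the `register_inplace("pandas")` usage pattern is taken from the diff above.

```python
# Minimal sketch of a creation-dispatch registry (illustrative, not
# dask's actual internals). Backends register a creation function under
# a name, and callers dispatch on that name at runtime.

class CreationDispatch:
    def __init__(self):
        self._lookup = {}

    def register_inplace(self, backend):
        """Decorator: register the wrapped function for ``backend``."""
        def decorator(func):
            self._lookup[backend] = func
            return func
        return decorator

    def __call__(self, backend, *args, **kwargs):
        try:
            func = self._lookup[backend]
        except KeyError:
            raise ValueError(f"No creation function registered for {backend!r}")
        return func(*args, **kwargs)


dataframe_creation_dispatch = CreationDispatch()


@dataframe_creation_dispatch.register_inplace("pandas")
def read_csv(path, *args, **kwargs):
    # Stand-in body; the real function builds a dask-expr collection.
    return f"pandas-read: {path}"


# Dispatch picks the registered implementation for the requested backend.
print(dataframe_creation_dispatch("pandas", "data.csv"))
```

The point of the pattern is that a downstream library (e.g. dask-cudf) can later call `register_inplace("cudf")` on the same registry without modifying the `"pandas"` definition.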
For the cudf backend, dask-cudf can register a simple function that calls back into this function with `engine=CudfEngine` (already confirmed that this will work fine). The dask-cudf logic can also check the `filesystem` argument and do something like `read_parquet(..., filesystem="arrow").to_backend("cudf")` (since `filesystem="arrow"` is currently broken in dask-expr when the backend is `"cudf"`).
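A hedged sketch of the registration described in this comment. `CudfEngine`, the shared `read_parquet`, and the returned collection are stand-in stubs here so the example is self-contained; this is not the actual dask-cudf code.

```python
# Sketch of a dask-cudf-style registered function (stubs, not real code).

class CudfEngine:
    """Stand-in for dask_cudf's parquet engine class."""


class _FakeFrame:
    """Stand-in collection, just to make the sketch runnable."""
    def __init__(self, backend):
        self.backend = backend

    def to_backend(self, backend):
        return _FakeFrame(backend)


def read_parquet(path, *args, engine=None, filesystem=None, **kwargs):
    # Stand-in for the shared (pandas-registered) definition: the engine
    # argument decides which backend produces the data.
    return _FakeFrame("cudf" if engine is CudfEngine else "pandas")


def cudf_read_parquet(path, *args, filesystem=None, **kwargs):
    if filesystem == "arrow":
        # filesystem="arrow" is currently broken when the backend is
        # "cudf", so read with the pandas backend and convert afterwards.
        frame = read_parquet(path, *args, filesystem="arrow", **kwargs)
        return frame.to_backend("cudf")
    # Otherwise call back into the shared definition with the cudf engine.
    return read_parquet(path, *args, engine=CudfEngine, **kwargs)
```

Either path ends with a cudf-backed collection; the `filesystem="arrow"` branch just takes the detour through the pandas backend described above.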
dask_expr/_collection.py (Outdated)
def read_csv(
    path,
    *args,
    header="infer",
    dtype_backend=None,
    storage_options=None,
    _legacy_dataframe_backend="pandas",
I'm open to other ideas. We can copy this entire function into dask-cudf if necessary, but it would be nice to have a simple hook like this while the `read_csv` logic is still so simple.
I'd prefer not to have a top-level argument for uses like this.
Fair enough. Removed it (Dask cuDF can just copy over the boilerplate code).

thx
After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default).

**This PR is required for CI to pass with dask>2024.8.0**

**NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2024.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and `read_csv` will fail without this change...

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #16535
Registers the `read_parquet` and `read_csv` definitions in dask-expr for the `"pandas"` backend.

Some notes:
- Dask cuDF will need to register its own functions for the `"cudf"` backend right away. However, it probably makes sense to move away from the parquet `Engine` approach eventually (it probably makes more sense to lean into the `Expr` class structure to implement backend-specific logic).