Register `read_parquet` and `read_csv` as "dispatchable" #1114

Conversation
@@ -5174,6 +5175,7 @@ def read_fwf(
    )

@dataframe_creation_dispatch.register_inplace("pandas")
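To make the decorator in the diff concrete, here is a minimal sketch of how a creation-dispatch registry like this could work. The class and its internals are illustrative only and do not reproduce dask's actual implementation; only the `register_inplace("pandas")` usage pattern is taken from the diff above.

```python
# Minimal sketch of a creation-dispatch registry (illustrative, not
# dask's actual internals). Backends register a creation function under
# a name, and callers dispatch on that name at runtime.

class CreationDispatch:
    def __init__(self):
        self._lookup = {}

    def register_inplace(self, backend):
        """Decorator: register the wrapped function for ``backend``."""
        def decorator(func):
            self._lookup[backend] = func
            return func
        return decorator

    def __call__(self, backend, *args, **kwargs):
        try:
            func = self._lookup[backend]
        except KeyError:
            raise ValueError(f"No creation function registered for {backend!r}")
        return func(*args, **kwargs)


dataframe_creation_dispatch = CreationDispatch()


@dataframe_creation_dispatch.register_inplace("pandas")
def read_csv(path, *args, **kwargs):
    # Stand-in body; the real function builds a dask-expr collection.
    return f"pandas-read: {path}"


# Dispatch picks the registered implementation for the requested backend.
print(dataframe_creation_dispatch("pandas", "data.csv"))
```

The point of the pattern is that a downstream library (e.g. dask-cudf) can later call `register_inplace("cudf")` on the same registry without modifying the `"pandas"` definition.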
For the cudf backend, dask-cudf can register a simple function that calls back into this function with `engine=CudfEngine` (already confirmed that this will work fine). The dask-cudf logic can also check the `filesystem` argument and do something like `read_parquet(..., filesystem="arrow").to_backend("cudf")` (since `filesystem="arrow"` is currently broken in dask-expr when the backend is `"cudf"`).
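A hedged sketch of the registration described in this comment. `CudfEngine`, the shared `read_parquet`, and the returned collection are stand-in stubs here so the example is self-contained; this is not the actual dask-cudf code.

```python
# Sketch of a dask-cudf-style registered function (stubs, not real code).

class CudfEngine:
    """Stand-in for dask_cudf's parquet engine class."""


class _FakeFrame:
    """Stand-in collection, just to make the sketch runnable."""
    def __init__(self, backend):
        self.backend = backend

    def to_backend(self, backend):
        return _FakeFrame(backend)


def read_parquet(path, *args, engine=None, filesystem=None, **kwargs):
    # Stand-in for the shared (pandas-registered) definition: the engine
    # argument decides which backend produces the data.
    return _FakeFrame("cudf" if engine is CudfEngine else "pandas")


def cudf_read_parquet(path, *args, filesystem=None, **kwargs):
    if filesystem == "arrow":
        # filesystem="arrow" is currently broken when the backend is
        # "cudf", so read with the pandas backend and convert afterwards.
        frame = read_parquet(path, *args, filesystem="arrow", **kwargs)
        return frame.to_backend("cudf")
    # Otherwise call back into the shared definition with the cudf engine.
    return read_parquet(path, *args, engine=CudfEngine, **kwargs)
```

Either path ends with a cudf-backed collection; the `filesystem="arrow"` branch just takes the detour through the pandas backend described above.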
dask_expr/_collection.py (Outdated)
def read_csv(
    path,
    *args,
    header="infer",
    dtype_backend=None,
    storage_options=None,
    _legacy_dataframe_backend="pandas",
I'm open to other ideas. We can copy this entire function into dask-cudf if necessary, but it would be nice to have a simple hook like this while the `read_csv` logic is still so simple.
I'd prefer not to have a top-level argument for uses like this.
Fair enough. Removed it (Dask cuDF can just copy over the boilerplate code).

thx
After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default).

**This PR is required for CI to pass with dask>2024.8.0**

**NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2024.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and `read_csv` will fail without this change...

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #16535
Registers the `read_parquet` and `read_csv` definitions in dask-expr for the `"pandas"` backend.

Some notes:
- Dask cuDF will need to register its own functions for the `"cudf"` backend right away. However, it probably makes sense to move away from the parquet `Engine` approach eventually (it probably makes more sense to lean into the `Expr` class structure to implement backend-specific logic).