Enable basic p2p shuffle for dask-cudf #7743

rjzamora · 2023-04-04T03:02:01Z

These changes are meant to enable basic "p2p" functionality for dask_cudf.DataFrame collections by addressing some minor inconsistencies between pandas and cudf for DataFrame<->pa.Table conversion. This PR does not attempt to optimize the p2p shuffle algorithm for GPU-backed data.

TODO (in follow-up work):

Support null-index round-tripping?
Clean up type hints?

rjzamora · 2023-04-04T03:12:52Z

@wence- @quasiben - I know we want to implement a more-efficient version of p2p for GPU-backed data, but I'm thinking basic compatibility is worth pursuing in the short-to-medium term. Let me know what you guys think.

github-actions · 2023-04-04T04:08:01Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0 · 21 suites ±0 · duration 11h 25m 49s (-52s)
3 771 tests (+1): 3 655 passed (-6), 108 skipped (+1), 8 failed (+6)
36 465 runs (+8): 34 648 passed (-5), 1 809 skipped (+7), 8 failed (+6)

For more details on these failures, see this check.

Results for commit 50beefc. ± Comparison against base commit 3082d27.

♻️ This comment has been updated with latest results.

fjetter · 2023-04-06T12:21:01Z

I think most of the CI failures are not related and should've been fixed on main already.

fjetter · 2023-05-17T10:19:50Z

Do you want us to have a closer look? Is this intended to be merged at some point or is this just for show and tell?

rjzamora · 2023-05-17T15:37:29Z

> Do you want us to have a closer look?

I don't think this is ready for a detailed review. However, I would welcome thoughts at any level. The immediate goal of this PR is to figure out a simple way to get the "p2p" shuffle working for cudf-backed data. That is, performance can be a secondary concern for now. Therefore, it would be good to know your thoughts on ways to relax the pandas-specific logic currently used in "p2p".

> Is this intended to be merged at some point or is this just for show and tell?

This started off as an experiment, but it would be extremely useful on the RAPIDS side to not have to worry about adding workarounds to avoid the "p2p" shuffle default.

quasiben · 2023-06-02T19:16:12Z

Would probably be good to add a test for this

rjzamora · 2023-06-09T13:27:49Z

distributed/shuffle/_arrow.py

@@ -45,6 +45,14 @@ def check_minimal_arrow_version() -> None:
        )


+def _arrow_to_df(table: pa.Table, like: Any) -> pd.DataFrame:


The p2p code is good about using type hints. Unfortunately, it uses pd.DataFrame in many places where we want something like pd.DataFrame | cudf.DataFrame. The obvious problem is that I cannot just add the | cudf.DataFrame, because there is no guarantee that cudf is installed (and we don't want to eagerly import it, even if it is installed).

What is the established type-hinting convention for cases like this? Do we need something like a run-time checkable BackendDataFrame protocol? cc @jrbourbeau

wence- · 2023-06-09

You could do this:

    from typing import TYPE_CHECKING, Union
    from typing_extensions import TypeAlias
    import pandas as pd

    DataFrameT: TypeAlias = Union[pd.DataFrame, "cudf.DataFrame"]

    if TYPE_CHECKING:
        import cudf

I think?

wence- · 2023-06-09

Ah, but I guess mypy still complains because it can't find the import at type-checking time.

wence- · 2023-06-09

I suppose you could also make this generic and say:

    T = TypeVar("T")

    def _arrow_to_df(table: pa.Table, like: T) -> T: ...

wence- · 2023-06-09

> Do we need something like a run-time checkable BackendDataFrame protocol?

The static-typing approach to this in Python is PEP 544.

rjzamora · 2023-06-09

> The static-typing approach to this in python is PEP 544

Right exactly. This is why I asked if we should add a BackendDataFrame protocol. Perhaps it makes sense to (eventually) replace is_dataframe_like with a DataFrameLike protocol in dask/dask?
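For readers following the protocol idea, here is a minimal sketch of what a run-time checkable protocol along these lines could look like, using only the standard library. The name `DataFrameLike` comes from the comment above, but the members chosen here (`columns`, `dtypes`, `index`) and the `FakeFrame` stand-in are assumptions for illustration; this is not code from dask or distributed.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical structural type; the member list is an assumption."""

    @property
    def columns(self) -> Any: ...

    @property
    def dtypes(self) -> Any: ...

    @property
    def index(self) -> Any: ...


class FakeFrame:
    """Stand-in for a backend DataFrame, so neither pandas nor cudf is imported."""

    def __init__(self) -> None:
        self.columns = ["a", "b"]
        self.dtypes = {"a": "int64", "b": "float64"}
        self.index = range(3)


def describe(df: DataFrameLike) -> str:
    # Annotated with the protocol, so no concrete backend type is named.
    return f"{type(df).__name__} with columns {list(df.columns)}"


# isinstance() only checks that the attributes exist, not their types.
print(isinstance(FakeFrame(), DataFrameLike))
print(describe(FakeFrame()))
```

Because the check is structural, annotating a function with such a protocol would not require importing cudf, which is the constraint raised at the start of this thread. The trade-off is that `isinstance` against a run-time checkable protocol verifies attribute presence only.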

jorisvandenbossche · 2023-07-10T16:01:51Z

FWIW, this would also be useful for dask-geopandas, to allow letting us have control over how a geopandas.GeoDataFrame gets converted to a pyarrow table and back (xref geopandas/dask-geopandas#256)

rjzamora · 2023-07-11T13:47:40Z

> FWIW, this would also be useful for dask-geopandas, to allow letting us have control over how a geopandas.GeoDataFrame gets converted to a pyarrow table and back (xref geopandas/dask-geopandas#256)

Dask main now offers to/from arrow-Table dispatch functions. For example, here is the to_pyarrow_table_dispatch implementation in dask_cudf. With that said, I'd still expect the supported kwargs to change a bit.
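As a rough illustration of the dispatch idea (not dask's actual implementation), the sketch below uses `functools.singledispatch` from the standard library to pick a conversion by type. Everything in it is hypothetical: `Table` is a plain dict standing in for `pyarrow.Table`, and `RowFrame` and `ColumnFrame` stand in for two DataFrame backends. The "from" direction dispatches on a `meta` object of the target type, similar in spirit to the `like` argument of `_arrow_to_df` in the diff above.

```python
from functools import singledispatch
from typing import Any, Dict, List

# Toy stand-in for pyarrow.Table: column name -> list of values.
Table = Dict[str, List[Any]]


class RowFrame:
    """Hypothetical backend A: data stored as a list of row dicts."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows


class ColumnFrame:
    """Hypothetical backend B: data stored column-wise."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns


@singledispatch
def to_table(obj: Any) -> Table:
    raise TypeError(f"no to_table implementation for {type(obj).__name__}")


@to_table.register(RowFrame)
def _(obj: RowFrame) -> Table:
    names = list(obj.rows[0]) if obj.rows else []
    return {name: [row[name] for row in obj.rows] for name in names}


@to_table.register(ColumnFrame)
def _(obj: ColumnFrame) -> Table:
    return {name: list(values) for name, values in obj.columns.items()}


@singledispatch
def from_table(meta: Any, table: Table) -> Any:
    # Dispatches on the type of `meta`, an (empty) object of the target backend.
    raise TypeError(f"no from_table implementation for {type(meta).__name__}")


@from_table.register(RowFrame)
def _(meta: RowFrame, table: Table) -> RowFrame:
    names = list(table)
    n_rows = len(table[names[0]]) if names else 0
    return RowFrame([{name: table[name][i] for name in names} for i in range(n_rows)])


@from_table.register(ColumnFrame)
def _(meta: ColumnFrame, table: Table) -> ColumnFrame:
    return ColumnFrame({name: list(values) for name, values in table.items()})
```

With a registry like this, a third-party backend can register its own conversions without the shuffle code importing that backend, which is the kind of control the dask-geopandas comment asks for.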

hendrikmakait

Thanks, @rjzamora!

rjzamora · 2023-08-16T17:29:09Z

Thanks for the review @hendrikmakait !

Let me know if there is anything left to do to get this merged (I don't think any of the CI failures are related to the changes in any way).

hendrikmakait · 2023-08-16T17:34:33Z

Ready to merge from my perspective, when I reviewed it, this still had a [WIP] flag.

rjzamora · 2023-08-16T17:41:32Z

> Ready to merge from my perspective, when I reviewed it, this still had a [WIP] flag.

Sorry, forgot about the "[WIP]" label - makes sense. Note that I don't have write permissions in distributed :)

hendrikmakait · 2023-08-16T17:42:57Z

> Note that I don't have write permissions in distributed

I was not aware of that, thanks for pointing it out!

This PR allows explicit `shuffle="p2p"` usage within the dask-cudf API now that dask/distributed#7743 is in.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- Ray Douglass (https://github.com/raydouglass)
- gpuCI (https://github.com/GPUtester)
- Mike Wendt (https://github.com/mike-wendt)
- AJ Schmidt (https://github.com/ajschmidt8)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #13893

rjzamora added 4 commits April 3, 2023 15:54

basic support for cudf-backed collection in p2p shuffle

7210c38

use schema metadata

b03c132

avoid leaving worker_for as dict

0e86785

avoid updating metadata for pandas

ce23c47

rjzamora requested a review from fjetter as a code owner April 4, 2023 03:02

rjzamora marked this pull request as draft April 4, 2023 03:02

rjzamora added 2 commits April 6, 2023 06:32

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

bdbf05a

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

b60a1da

rjzamora added 3 commits May 24, 2023 06:42

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

cc0f0cc

use new get_meta_library utility

ad57395

linting

964b57a

rjzamora mentioned this pull request May 24, 2023

Add dispatching mechanisms for pyarrow.Table conversion dask/dask#10312

Merged

rjzamora added 2 commits May 31, 2023 10:31

save state

b088b25

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

0a57f59

hendrikmakait self-requested a review June 1, 2023 15:08

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

f8123fb

rjzamora mentioned this pull request Jun 8, 2023

Avoid meta roundtrip in P2P shuffle #7895

Merged

2 tasks

rjzamora added 4 commits June 8, 2023 11:04

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

dd61398

add test

20616bb

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

d5141d6

leverage meta instead of custom pyarrow metadata

d39a774

rjzamora commented Jun 9, 2023


Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

def693a

rjzamora mentioned this pull request Jun 13, 2023

Add dispatch for cudf.Dataframe to/from pyarrow.Table conversion rapidsai/cudf#13558

Merged

3 tasks

rjzamora added 10 commits June 14, 2023 07:35

check attr

d87db60

catch importerror in test

d171dc2

assume latest version of dask

cbd3026

back to original imports

dac7291

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

cb6fce1

ignore dispatch warnings

aae562a

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

af14903

move imports

338ecff

use _constructor_sliced

3da3a05

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

f48a8ca

jorisvandenbossche mentioned this pull request Jul 10, 2023

Spatial_shuffle() can result in ArrowTypeError when using pyarrow 12 geopandas/dask-geopandas#256

Open

rjzamora added 4 commits July 24, 2023 13:45

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

b049a84

DataFrame constructor bugfix

b221af5

mypy workaround

99232e7

Merge remote-tracking branch 'upstream/main' into p2p-cudf-support

50beefc

rjzamora marked this pull request as ready for review August 16, 2023 13:27

hendrikmakait mentioned this pull request Aug 16, 2023

[Tracking] Advancements for P2P #8043

Open

15 tasks

hendrikmakait approved these changes Aug 16, 2023


rjzamora changed the title ~~[WIP] Enable basic p2p shuffle for dask-cudf~~ Enable basic p2p shuffle for dask-cudf Aug 16, 2023

hendrikmakait merged commit 7df32af into dask:main Aug 16, 2023
18 of 26 checks passed

rjzamora deleted the p2p-cudf-support branch August 16, 2023 17:44

hendrikmakait mentioned this pull request Aug 16, 2023

Add committers.yaml to all repositories dask/community#338

Open

rjzamora mentioned this pull request Aug 16, 2023

Allow explicit shuffle="p2p" within dask-cudf API rapidsai/cudf#13893

Merged
