Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable basic p2p shuffle for dask-cudf #7743

Merged
merged 33 commits into from
Aug 16, 2023

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Apr 4, 2023

These changes are meant to enable basic "p2p" functionality for dask_cudf.DataFrame collections by addressing some minor inconsistencies between pandas and cudf for DataFrame<->pa.Table conversion. This PR does not attempt to optimize the the p2p shuffle algorithm for GPU-backed data.

TODO (in follow-up work):

  • Support null-index round-tripping?
  • Clean up type hints?

@rjzamora rjzamora requested a review from fjetter as a code owner April 4, 2023 03:02
@rjzamora rjzamora marked this pull request as draft April 4, 2023 03:02
@rjzamora
Copy link
Member Author

rjzamora commented Apr 4, 2023

@wence- @quasiben - I know we want to implement a more-efficient version of p2p for GPU-backed data, but I'm thinking basic compatibility is worth pursuing in the short-to-medium term. Let me know what you guys think.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 4, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       21 files  ±0         21 suites  ±0   11h 25m 49s ⏱️ -52s
  3 771 tests +1    3 655 ✔️  - 6     108 💤 +1  8 +6 
36 465 runs  +8  34 648 ✔️  - 5  1 809 💤 +7  8 +6 

For more details on these failures, see this check.

Results for commit 50beefc. ± Comparison against base commit 3082d27.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member

fjetter commented Apr 6, 2023

I think most of the CI failures are not related and should've been fixed on main already.

@fjetter
Copy link
Member

fjetter commented May 17, 2023

Do you want us to have a closer look? Is this intended to be merged at some point or is this just for show and tell?

@rjzamora
Copy link
Member Author

Do you want us to have a closer look?

I don't think this is ready for a detailed review. However, I would welcome thoughts at any level. The immediate goal of this PR is to figure out a simple way to get the "p2p" shuffle working for cudf-backed data. That is, performance can be a secondary concern for now. Therefore, it would be good to know your thoughts on ways to relax the pandas-specific logic currently used in "p2p".

Is this intended to be merged at some point or is this just for show and tell?

This started off as an experiment, but it would be extremely useful on the RAPIDs side to not have to worry about adding workarounds to avoid the "p2p" shuffle default.

@hendrikmakait hendrikmakait self-requested a review June 1, 2023 15:08
@quasiben
Copy link
Member

quasiben commented Jun 2, 2023

Would probably be good to add a test for this

@rjzamora rjzamora mentioned this pull request Jun 8, 2023
2 tasks
@@ -45,6 +45,14 @@ def check_minimal_arrow_version() -> None:
)


def _arrow_to_df(table: pa.Table, like: Any) -> pd.DataFrame:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The p2p code is good about using type hints. Unfortunately, it uses pd.DataFrame in many places where we want something like pd.DataFrame | cudf.DataFrame. The obvious problem is that I cannot just add the | cudf.DataFrame, because there is no guarantee that cudf is installed (and we don't want to eagerly import it, even if it is installed).

What is the established type-hinting convention for cases like this? Do we need something like a run-time checkable BackendDataFrame protocol? cc @jrbourbeau

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do this:

from typing import TYPE_CHECKING, Union
from typing_extensions import TypeAlias
import pandas as pd

DataFrameT: TypeAlias = Union[pd.DataFrame, "cudf.DataFrame"]

if TYPE_CHECKING:
    import cudf

I think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but I guess mypy still complains because it can't find the import at type-checking time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you could also make this generic and say:

T = TypeVar("T")
def _arrow_to_df(table: pa.Table, like: T) -> T:
   ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need something like a run-time checkable BackendDataFrame protocol?

The static-typing approach to this in python is PEP 544

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static-typing approach to this in python is PEP 544

Right exactly. This is why I asked if we should add a BackendDataFrame protocol. Perhaps it makes sense to (eventually) replace is_dataframe_like with a DataFrameLike protocol in dask/dask?

@jorisvandenbossche
Copy link
Member

FWIW, this would also be useful for dask-geopandas, to allow letting us have control over how a geopandas.GeoDataFrame gets converted to a pyarrow table and back (xref geopandas/dask-geopandas#256)

@rjzamora
Copy link
Member Author

FWIW, this would also be useful for dask-geopandas, to allow letting us have control over how a geopandas.GeoDataFrame gets converted to a pyarrow table and back (xref geopandas/dask-geopandas#256)

Dask main now offers to/from arrow-Table dispatch functions. For example, here is the to_pyarrow_table_dispatch implementation in dask_cudf. With that said, I'd still expect the supported kwargs to change a bit.

@rjzamora rjzamora marked this pull request as ready for review August 16, 2023 13:27
Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rjzamora!

@rjzamora rjzamora changed the title [WIP] Enable basic p2p shuffle for dask-cudf Enable basic p2p shuffle for dask-cudf Aug 16, 2023
@rjzamora
Copy link
Member Author

Thanks for the review @hendrikmakait !

Let me know if there is anything left to do to get this merged (I don't think any of the CI failures are related to the changes in any way).

@hendrikmakait
Copy link
Member

Ready to merge from my perspective, when I reviewed it, this still had a [WIP] flag.

@rjzamora
Copy link
Member Author

Ready to merge from my perspective, when I reviewed it, this still had a [WIP] flag.

Sorry, forgot about the "[WIP]" label - makes sense. Note that I don't have write permissions in distributed :)

@hendrikmakait hendrikmakait merged commit 7df32af into dask:main Aug 16, 2023
18 of 26 checks passed
@hendrikmakait
Copy link
Member

Note that I don't have write permissions in distributed

I was not aware of that, thanks for pointing it out!

@rjzamora rjzamora deleted the p2p-cudf-support branch August 16, 2023 17:44
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Sep 25, 2023
This PR allows explicit `shuffle="p2p"` usage within the dask-cudf API now that dask/distributed#7743 is in.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Ray Douglass (https://github.com/raydouglass)
  - gpuCI (https://github.com/GPUtester)
  - Mike Wendt (https://github.com/mike-wendt)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13893
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants