Allow explicit shuffle="p2p" within dask-cudf API #13893

Merged 119 commits on Sep 25, 2023
Commits
19f5174
Merge pull request #4714 from rapidsai/branch-0.13
raydouglass Mar 30, 2020
a2804c3
REL v0.13.0 release
GPUtester Mar 31, 2020
fef2a2b
REL v0.13.0 CHANGELOG Updates
mike-wendt Apr 1, 2020
ab00eb0
Merge pull request #5310 from rapidsai/branch-0.14
raydouglass Jun 3, 2020
b34b838
REL v0.14.0 release
GPUtester Jun 3, 2020
9ff9cdb
update master references
ajschmidt8 Jul 14, 2020
789d19b
REL DOC Updates for main branch switch
mike-wendt Jul 16, 2020
819f514
Merge pull request #6079 from rapidsai/branch-0.15
raydouglass Aug 26, 2020
3a0f214
REL v0.15.0 release
GPUtester Aug 26, 2020
f947393
Merge pull request #6101 from rapidsai/branch-0.15
raydouglass Aug 27, 2020
71cb8c0
REL v0.15.0 release
GPUtester Aug 27, 2020
7ef8174
Merge pull request #6547 from rapidsai/branch-0.16
raydouglass Oct 21, 2020
2b8298f
REL v0.16.0 release
GPUtester Oct 21, 2020
d72b1eb
Merge pull request #6935 from rapidsai/branch-0.17
ajschmidt8 Dec 10, 2020
f56ef85
REL v0.17.0 release
GPUtester Dec 10, 2020
b7e1a85
Merge pull request #7405 from rapidsai/branch-0.18
raydouglass Feb 24, 2021
20778e5
REL v0.18.0 release
GPUtester Feb 24, 2021
042c20f
Merge pull request #7585 from rapidsai/branch-0.18
raydouglass Mar 15, 2021
999be56
REL v0.18.1 release
raydouglass Mar 15, 2021
2391864
Merge pull request #7969 from rapidsai/branch-0.18
raydouglass Apr 15, 2021
3341561
REL v0.18.2 release
raydouglass Apr 15, 2021
6573759
Merge pull request #7626 from rapidsai/branch-0.19
raydouglass Apr 21, 2021
f07b251
REL v0.19.0 release
GPUtester Apr 21, 2021
61e5a20
REL Changelog update
ajschmidt8 Apr 21, 2021
a13e8dc
Merge pull request #8037 from rapidsai/branch-0.19
raydouglass Apr 22, 2021
a9f3453
REL v0.19.1 release
GPUtester Apr 22, 2021
2089fc9
Merge pull request #8100 from rapidsai/branch-0.19
raydouglass Apr 28, 2021
ab3b3f6
REL v0.19.2 release
GPUtester Apr 28, 2021
f9d5e2e
Merge pull request #8418 from rapidsai/branch-21.06
raydouglass Jun 9, 2021
ae44046
REL v21.06.00 release
GPUtester Jun 9, 2021
3b831c3
Merge pull request #8488 from rapidsai/branch-21.06
ajschmidt8 Jun 10, 2021
d56ac1d
Merge pull request #8542 from rapidsai/branch-21.06
raydouglass Jun 17, 2021
cddc64f
REL v21.06.01 release
GPUtester Jun 17, 2021
101fc0f
REL Merge pull request #8544 from rapidsai/branch-21.06
raydouglass Jun 17, 2021
e9dabf8
Merge pull request #8840 from rapidsai/branch-21.08
raydouglass Aug 4, 2021
106039c
REL v21.08.00 release
GPUtester Aug 4, 2021
8055721
Merge pull request #8986 from rapidsai/branch-21.08
raydouglass Aug 6, 2021
e0a8114
REL v21.08.01 release
GPUtester Aug 6, 2021
a7391e6
Merge pull request #8990 from rapidsai/branch-21.08
raydouglass Aug 6, 2021
f6d31fa
REL v21.08.02 release
GPUtester Aug 6, 2021
dff45e5
Merge pull request #9116 from rapidsai/branch-21.08
ajschmidt8 Sep 16, 2021
e4313b6
REL v21.08.03 release
GPUtester Sep 16, 2021
5638329
Merge pull request #9301 from rapidsai/branch-21.10
ajschmidt8 Oct 6, 2021
072fd86
REL v21.10.00 release
GPUtester Oct 6, 2021
8cfb8e5
Merge pull request #9420 from rapidsai/branch-21.10
raydouglass Oct 12, 2021
a1d2d13
REL v21.10.01 release
GPUtester Oct 12, 2021
3ceb0c0
Merge pull request #9689 from rapidsai/branch-21.12
raydouglass Dec 3, 2021
f1ef2d2
REL v21.12.00 release
GPUtester Dec 3, 2021
fd04831
Merge pull request #9880 from rapidsai/branch-21.12
raydouglass Dec 9, 2021
a0a0a3a
REL v21.12.01 release
GPUtester Dec 9, 2021
c74e24f
Merge pull request #9924 from rapidsai/branch-21.12
raydouglass Dec 16, 2021
06540b9
REL v21.12.02 release
GPUtester Dec 16, 2021
f39f559
Merge pull request #10101 from rapidsai/branch-22.02
raydouglass Feb 2, 2022
774d859
REL v22.02.00 release
GPUtester Feb 2, 2022
803c42a
Merge pull request #10512 from rapidsai/branch-22.04
raydouglass Apr 6, 2022
8bf0520
REL v22.04.00 release
GPUtester Apr 6, 2022
0363197
REL Merge pull request #10633 from rapidsai/branch-22.04
raydouglass Apr 11, 2022
89c7736
Merge pull request #10969 from rapidsai/branch-22.06
raydouglass Jun 7, 2022
5658c5b
REL v22.06.00 release
GPUtester Jun 7, 2022
a1fe591
Merge pull request #11208 from rapidsai/branch-22.06
raydouglass Jul 6, 2022
0dab0f8
REL v22.06.01 release
GPUtester Jul 6, 2022
a7f8de5
Merge pull request #11444 from rapidsai/branch-22.08
raydouglass Aug 17, 2022
b71873c
REL v22.08.00 release
GPUtester Aug 17, 2022
aa58765
pin numpy version (#11824)
galipremsagar Sep 29, 2022
78d3655
Merge pull request #11826 from rapidsai/branch-22.08
raydouglass Sep 29, 2022
31337c9
REL v22.08.01 release
GPUtester Sep 29, 2022
b466b6a
Merge pull request #11858 from rapidsai/branch-22.10
raydouglass Oct 12, 2022
8ffe375
REL v22.10.00 release
GPUtester Oct 12, 2022
432fb37
Merge pull request #12061 from rapidsai/branch-22.10
raydouglass Nov 3, 2022
d90f7e9
REL v22.10.01 release
GPUtester Nov 3, 2022
ca9a422
REL Merge pull request #12069 from rapidsai/branch-22.10
raydouglass Nov 4, 2022
a7dcfdf
Merge pull request #12200 from rapidsai/branch-22.12
raydouglass Dec 8, 2022
baae3a6
REL v22.12.00 release
GPUtester Dec 8, 2022
b2dfcdf
Merge pull request #12346 from rapidsai/branch-22.12
raydouglass Dec 8, 2022
f700408
REL v22.12.01 release
GPUtester Dec 8, 2022
93c5b34
Merge pull request #12660 from rapidsai/branch-23.02
raydouglass Feb 9, 2023
d5b59a2
Merge pull request #12746 from rapidsai/branch-23.02
raydouglass Feb 9, 2023
5ad4a85
REL v23.02.00 release
raydouglass Feb 9, 2023
471fa64
Merge pull request #13038 from rapidsai/branch-23.04
raydouglass Apr 12, 2023
cd71208
REL v23.04.00 release
raydouglass Apr 12, 2023
4d31a6f
REL v23.04.00 release
raydouglass Apr 12, 2023
d023acc
Merge pull request #13197 from rapidsai/branch-23.04
raydouglass Apr 21, 2023
7e070fc
REL v23.04.01 release
raydouglass Apr 21, 2023
88cb6db
REL Merge pull request #13280 from rapidsai/branch-23.04
raydouglass May 3, 2023
4548010
Merge remote-tracking branch 'upstream/branch-23.06'
raydouglass Jun 7, 2023
f881d40
REL v23.06.00 release
raydouglass Jun 7, 2023
7d33d20
Merge pull request #13640 from rapidsai/branch-23.06
raydouglass Jun 29, 2023
6a548b0
REL v23.06.01 release
raydouglass Jun 29, 2023
d9589b7
Merge pull request #13781 from rapidsai/branch-23.08
raydouglass Aug 9, 2023
8150d38
REL v23.08.00 release
raydouglass Aug 9, 2023
b4c82d2
enable shuffle='p2p' and add test
rjzamora Aug 16, 2023
17a47e3
add comment
rjzamora Aug 16, 2023
5664a17
update date
rjzamora Aug 16, 2023
281783d
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Aug 16, 2023
6159a37
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Aug 18, 2023
e83b0b2
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Aug 24, 2023
593f495
Merge remote-tracking branch 'upstream/branch-23.10' into unlock-p2p
rjzamora Aug 25, 2023
62dd5ef
formatting
rjzamora Aug 25, 2023
25458f6
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 5, 2023
db10de0
enable p2p, but keep default as tasks for now
rjzamora Sep 13, 2023
5ca814e
Merge branch 'unlock-p2p' of https://github.com/rjzamora/cudf into un…
rjzamora Sep 13, 2023
b6e94b6
Merge remote-tracking branch 'upstream/branch-23.10' into unlock-p2p
rjzamora Sep 13, 2023
4891269
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 13, 2023
d9dc84c
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 18, 2023
7c6886c
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 20, 2023
2195ef7
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 20, 2023
c122553
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 20, 2023
ad43c91
Merge remote-tracking branch 'upstream/main' into unlock-p2p
rjzamora Sep 21, 2023
bd91580
Merge remote-tracking branch 'upstream/branch-23.10' into unlock-p2p
rjzamora Sep 21, 2023
eb24445
add band-aid for 14159
rjzamora Sep 21, 2023
562c240
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 21, 2023
2e1db4f
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 22, 2023
57fb092
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 22, 2023
46dd00b
update test
rjzamora Sep 22, 2023
5f14ae5
Apply suggestions from code review
rjzamora Sep 22, 2023
e3d664e
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 22, 2023
39c9e1d
Simplify index conversion for RangeIndex
wence- Sep 25, 2023
3fa2154
Merge branch 'branch-23.10' into unlock-p2p
rjzamora Sep 25, 2023
c600e6f
formatting
rjzamora Sep 25, 2023
31 changes: 27 additions & 4 deletions python/dask_cudf/dask_cudf/backends.py
@@ -373,22 +373,37 @@ def percentile_cudf(a, q, interpolation="linear"):


 @pyarrow_schema_dispatch.register((cudf.DataFrame,))
-def _get_pyarrow_schema_cudf(obj, preserve_index=True, **kwargs):
+def _get_pyarrow_schema_cudf(obj, preserve_index=None, **kwargs):
     if kwargs:
         warnings.warn(
             "Ignoring the following arguments to "
             f"`pyarrow_schema_dispatch`: {list(kwargs)}"
         )
-    return meta_nonempty(obj).to_arrow(preserve_index=preserve_index).schema
+
+    return _cudf_to_table(
+        meta_nonempty(obj), preserve_index=preserve_index
+    ).schema


 @to_pyarrow_table_dispatch.register(cudf.DataFrame)
-def _cudf_to_table(obj, preserve_index=True, **kwargs):
+def _cudf_to_table(obj, preserve_index=None, **kwargs):
     if kwargs:
         warnings.warn(
             "Ignoring the following arguments to "
             f"`to_pyarrow_table_dispatch`: {list(kwargs)}"
         )
+
+    # TODO: Remove this logic when cudf#14159 is resolved
+    # (see: https://github.com/rapidsai/cudf/issues/14159)
+    if preserve_index and isinstance(obj.index, cudf.RangeIndex):
+        obj = obj.copy()
+        obj.index.name = (
+            obj.index.name
+            if obj.index.name is not None
+            else "__index_level_0__"
+        )
+        obj.index = obj.index._as_int_index()
+
     return obj.to_arrow(preserve_index=preserve_index)


@@ -401,7 +416,15 @@ def _table_to_cudf(obj, table, self_destruct=None, **kwargs):
             f"Ignoring the following arguments to "
             f"`from_pyarrow_table_dispatch`: {list(kwargs)}"
         )
-    return obj.from_arrow(table)
+    result = obj.from_arrow(table)
+
+    # TODO: Remove this logic when cudf#14159 is resolved
+    # (see: https://github.com/rapidsai/cudf/issues/14159)
+    if "__index_level_0__" in result.index.names:
+        assert len(result.index.names) == 1
+        result.index.name = None
+
+    return result


 @union_categoricals_dispatch.register((cudf.Series, cudf.BaseIndex))
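The two band-aid hunks in backends.py work as a pair: on the way to Arrow an unnamed RangeIndex is materialized and given the placeholder name "__index_level_0__", and on the way back that placeholder is stripped again. The sketch below is a toy model of that round trip using plain dicts. It does not use cudf or pyarrow, and the frame layout and the helper names `to_table` and `from_table` are hypothetical stand-ins chosen for illustration only.

```python
SENTINEL = "__index_level_0__"  # placeholder name used by the band-aid


def to_table(frame, preserve_index=None):
    """Toy stand-in for `_cudf_to_table`.

    A "frame" is a dict with "index_name", "index" and "columns" keys
    (a hypothetical layout, not a cudf structure).
    """
    table = {"columns": dict(frame["columns"])}
    if preserve_index:
        # An unnamed index gets the sentinel name so it survives as data
        name = frame["index_name"]
        table["index_name"] = name if name is not None else SENTINEL
        # Materialize the index (e.g. a range) into explicit values
        table["index"] = list(frame["index"])
    return table


def from_table(table):
    """Toy stand-in for `_table_to_cudf`: drop the sentinel name again."""
    if "index" in table:
        index, name = table["index"], table["index_name"]
    else:
        # No index was stored, so rebuild a default 0..n-1 index
        n = len(next(iter(table["columns"].values()), []))
        index, name = list(range(n)), None
    if name == SENTINEL:
        name = None
    return {"index_name": name, "index": index, "columns": dict(table["columns"])}


if __name__ == "__main__":
    frame = {"index_name": None, "index": range(3), "columns": {"a": [1, 2, 3]}}
    print(from_table(to_table(frame, preserve_index=True)))
```

In this model, as in the diff, a user-supplied index name passes through untouched; only the placeholder is removed on the return trip.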
26 changes: 19 additions & 7 deletions python/dask_cudf/dask_cudf/sorting.py
@@ -6,7 +6,7 @@
 import numpy as np
 import tlz as toolz

-import dask
+from dask import config
 from dask.base import tokenize
 from dask.dataframe import methods
 from dask.dataframe.core import DataFrame, Index, Series
@@ -18,6 +18,8 @@
 from cudf.api.types import is_categorical_dtype
 from cudf.utils.utils import _dask_cudf_nvtx_annotate

+_SHUFFLE_SUPPORT = ("tasks", "p2p")  # "disk" not supported
+

 @_dask_cudf_nvtx_annotate
 def set_index_post(df, index_name, drop, column_dtype):
@@ -307,15 +309,25 @@ def sort_values(
     return df4


+def get_default_shuffle_method():
+    # Note that `dask.utils.get_default_shuffle_method`
+    # will return "p2p" by default when a distributed
+    # client is present. Dask-cudf supports "p2p", but
+    # will not use it by default (yet)
+    default = config.get("dataframe.shuffle.method", "tasks")
+    if default not in _SHUFFLE_SUPPORT:
+        default = "tasks"
+    return default
+
+
 def _get_shuffle_type(shuffle):
     # Utility to set the shuffle-kwarg default
-    # and to validate user-specified options.
-    # The only supported options is currently "tasks"
-    shuffle = shuffle or dask.config.get("shuffle", "tasks")
-    if shuffle != "tasks":
+    # and to validate user-specified options
+    shuffle = shuffle or get_default_shuffle_method()
+    if shuffle not in _SHUFFLE_SUPPORT:
         raise ValueError(
-            f"Dask-cudf only supports in-memory shuffling with "
-            f"'tasks'. Got shuffle={shuffle}"
+            "Dask-cudf only supports the following shuffle "
+            f"methods: {_SHUFFLE_SUPPORT}. Got shuffle={shuffle}"
         )

     return shuffle
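The resolution order introduced in sorting.py can be exercised without a GPU. The sketch below mirrors the two functions from the diff, with one assumption made for the sake of a standalone example: a plain dict passed as `config` stands in for dask's global configuration, so the extra `config` parameter is hypothetical and not part of the dask-cudf API.

```python
SHUFFLE_SUPPORT = ("tasks", "p2p")  # "disk" is not supported


def get_default_shuffle_method(config):
    """Return the configured shuffle method, falling back to "tasks".

    `config` is a plain dict standing in for dask's config (assumption).
    """
    default = config.get("dataframe.shuffle.method", "tasks")
    if default not in SHUFFLE_SUPPORT:
        # Unsupported configured values are silently replaced
        default = "tasks"
    return default


def get_shuffle_type(shuffle, config):
    """Use an explicit `shuffle` argument if given, else the default."""
    shuffle = shuffle or get_default_shuffle_method(config)
    if shuffle not in SHUFFLE_SUPPORT:
        # Unsupported explicit values are rejected
        raise ValueError(
            "Dask-cudf only supports the following shuffle "
            f"methods: {SHUFFLE_SUPPORT}. Got shuffle={shuffle}"
        )
    return shuffle


if __name__ == "__main__":
    print(get_shuffle_type(None, {}))
    print(get_shuffle_type("p2p", {}))
    print(get_shuffle_type(None, {"dataframe.shuffle.method": "disk"}))
```

Note the asymmetry this logic implies: an unsupported value coming from configuration falls back to "tasks" without an error, while the same value passed explicitly raises a ValueError.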
11 changes: 9 additions & 2 deletions python/dask_cudf/dask_cudf/tests/test_dispatch.py
@@ -22,18 +22,25 @@ def test_is_categorical_dispatch():
     assert is_categorical_dtype(cudf.Index([1, 2, 3], dtype="category"))


-def test_pyarrow_conversion_dispatch():
+@pytest.mark.parametrize("preserve_index", [True, False])
+def test_pyarrow_conversion_dispatch(preserve_index):
     from dask.dataframe.dispatch import (
         from_pyarrow_table_dispatch,
         to_pyarrow_table_dispatch,
     )

     df1 = cudf.DataFrame(np.random.randn(10, 3), columns=list("abc"))
-    df2 = from_pyarrow_table_dispatch(df1, to_pyarrow_table_dispatch(df1))
+    df2 = from_pyarrow_table_dispatch(
+        df1, to_pyarrow_table_dispatch(df1, preserve_index=preserve_index)
+    )

     assert type(df1) == type(df2)
     assert_eq(df1, df2)

+    # Check that preserve_index does not produce a RangeIndex
+    if preserve_index:
+        assert not isinstance(df2.index, cudf.RangeIndex)
+

 @pytest.mark.parametrize("index", [None, [1, 2] * 5])
 def test_deterministic_tokenize(index):
22 changes: 21 additions & 1 deletion python/dask_cudf/dask_cudf/tests/test_distributed.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020-2022, NVIDIA CORPORATION.
+# Copyright (c) 2020-2023, NVIDIA CORPORATION.

 import numba.cuda
 import pytest
@@ -77,3 +77,23 @@ def test_str_series_roundtrip():

             actual = dask_series.compute()
             assert_eq(actual, expected)
+
+
+def test_p2p_shuffle():
+    # Check that we can use `shuffle="p2p"`
+    with dask_cuda.LocalCUDACluster(n_workers=1) as cluster:
+        with Client(cluster):
+            ddf = (
+                dask.datasets.timeseries(
+                    start="2000-01-01",
+                    end="2000-01-08",
+                    dtypes={"x": int},
+                )
+                .reset_index(drop=True)
+                .to_backend("cudf")
+            )
+            dd.assert_eq(
+                ddf.sort_values("x", shuffle="p2p").compute(),
+                ddf.compute().sort_values("x"),
+                check_index=False,
+            )