
[Data] optimize dataset.unique() #49296

Merged (3 commits) into ray-project:master on Jan 3, 2025

Conversation

@wingkitlee0 (Contributor) commented Dec 17, 2024

Why are these changes needed?

The current implementation uses `groupby(column).count()`, which triggers a full sort. The new implementation uses an `AggregateFn` that relies on `groupby(None)` and `set()` to aggregate unique values.

According to the `ds.aggregate()` documentation, the time complexity should be O(N / parallelism).

It's about 10x faster in my local test.

Part of `test_unique` is removed because it was written specifically for the original implementation.
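The set-based aggregation described above can be sketched in plain Python. This is a minimal illustration of the init / accumulate / merge / finalize contract that an aggregation like Ray Data's `AggregateFn` follows; the function names and driver here are illustrative, not the actual Ray API:

```python
from functools import reduce
from typing import Any, Iterable, List, Set

# Sketch of a set-based "unique" aggregation. Each block (partition)
# accumulates distinct values into its own set in O(rows); the per-block
# sets are then merged. No global sort is needed, unlike
# groupby(column).count().

def init() -> Set[Any]:
    # One empty accumulator per block.
    return set()

def accumulate_block(acc: Set[Any], block: Iterable[Any]) -> Set[Any]:
    # Add every value in this block to the accumulator.
    acc.update(block)
    return acc

def merge(left: Set[Any], right: Set[Any]) -> Set[Any]:
    # Combine accumulators produced by different blocks.
    return left | right

def finalize(acc: Set[Any]) -> List[Any]:
    # Emit the distinct values (unordered, matching set semantics).
    return list(acc)

def unique(blocks: List[List[Any]]) -> List[Any]:
    # Drive the aggregation sequentially; Ray would run the accumulate
    # step per block in parallel and then merge the partial results.
    accs = [accumulate_block(init(), block) for block in blocks]
    return finalize(reduce(merge, accs, set()))

if __name__ == "__main__":
    blocks = [[1, 2, 2, 3], [3, 4], [4, 5, 1]]
    print(sorted(unique(blocks)))  # [1, 2, 3, 4, 5]
```

Because each block's set only grows with the number of distinct values it contains, the work is roughly O(N / parallelism) across blocks, plus a merge over the (usually much smaller) per-block sets.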

Related issue number

Closes #49298

Checks

  • I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wingkitlee0 wingkitlee0 force-pushed the optimize-dataset-unique branch 2 times, most recently from f21dbeb to a2270a7 Compare December 17, 2024 03:42
@wingkitlee0 wingkitlee0 force-pushed the optimize-dataset-unique branch 4 times, most recently from 91c0e5a to 3802155 Compare December 19, 2024 03:09
@wingkitlee0 wingkitlee0 marked this pull request as ready for review December 19, 2024 03:10
@wingkitlee0 wingkitlee0 requested a review from a team as a code owner December 19, 2024 03:10
@wingkitlee0 wingkitlee0 force-pushed the optimize-dataset-unique branch from 2882ed5 to 5c5cc7f Compare December 19, 2024 03:47
@wingkitlee0 (Contributor, Author): @raulchen this should be ready for review. Thanks!

@raulchen (Contributor) left a comment:


Thanks for submitting this PR. LGTM overall, just a few small comments.

@raulchen raulchen added the go add ONLY when ready to merge, run all tests label Dec 30, 2024
@wingkitlee0 wingkitlee0 force-pushed the optimize-dataset-unique branch 2 times, most recently from 706c2ae to 80f6cb2 Compare December 31, 2024 00:26
- Small clean-up to private functions in `ds.aggregate`

Signed-off-by: Kit Lee <[email protected]>
@wingkitlee0 wingkitlee0 force-pushed the optimize-dataset-unique branch from 80f6cb2 to 4001a64 Compare December 31, 2024 00:47
@raulchen raulchen enabled auto-merge (squash) December 31, 2024 18:14
@github-actions github-actions bot disabled auto-merge January 1, 2025 15:41
@wingkitlee0 (Contributor, Author): @raulchen Happy new year! It looks like something failed in premerge (after the docstring commit), but I don't have access to see what's going on.

@raulchen (Contributor) commented Jan 2, 2025:

Looks unrelated. Retrying failed jobs.

@raulchen raulchen merged commit acec4fe into ray-project:master Jan 3, 2025
5 checks passed
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 7, 2025
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Successfully merging this pull request may close these issues.

[Data] Use AggregateFn instead of groupby.count for unique()