-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] optimize dataset.unique() #49296
[Data] optimize dataset.unique() #49296
Conversation
f21dbeb
to
a2270a7
Compare
91c0e5a
to
3802155
Compare
2882ed5
to
5c5cc7f
Compare
@raulchen this should be ready for review. thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for submitting this PR.
LGTM overall. Just a few small comments.
706c2ae
to
80f6cb2
Compare
- small clean up to private functions in ds.aggregate Signed-off-by: Kit Lee <[email protected]>
80f6cb2
to
4001a64
Compare
Signed-off-by: Hao Chen <[email protected]>
@raulchen Happy new year! it looks like something failed in the premerge (after the docstring commit), but I don't have access to see what's going on. |
Looks unrelated. Retrying failed jobs. |
## Why are these changes needed? The current implementation uses `groupby(column).count()` that causes a full sort. The new implementation uses `AggregateFn` which uses `groupby(None)` and `set()` to aggregate unique values. The time complexity should be O(N / parallelism) according to `ds.aggregate()`. It's about 10x faster in my local test. Some part of `test_unique` is removed because it was designed for the original implementation. ## Related issue number Closes ray-project#49298 --------- Signed-off-by: Kit Lee <[email protected]> Signed-off-by: Hao Chen <[email protected]> Co-authored-by: Hao Chen <[email protected]>
## Why are these changes needed? The current implementation uses `groupby(column).count()` that causes a full sort. The new implementation uses `AggregateFn` which uses `groupby(None)` and `set()` to aggregate unique values. The time complexity should be O(N / parallelism) according to `ds.aggregate()`. It's about 10x faster in my local test. Some part of `test_unique` is removed because it was designed for the original implementation. ## Related issue number Closes ray-project#49298 --------- Signed-off-by: Kit Lee <[email protected]> Signed-off-by: Hao Chen <[email protected]> Co-authored-by: Hao Chen <[email protected]> Signed-off-by: Roshan Kathawate <[email protected]>
## Why are these changes needed? The current implementation uses `groupby(column).count()` that causes a full sort. The new implementation uses `AggregateFn` which uses `groupby(None)` and `set()` to aggregate unique values. The time complexity should be O(N / parallelism) according to `ds.aggregate()`. It's about 10x faster in my local test. Some part of `test_unique` is removed because it was designed for the original implementation. ## Related issue number Closes ray-project#49298 --------- Signed-off-by: Kit Lee <[email protected]> Signed-off-by: Hao Chen <[email protected]> Co-authored-by: Hao Chen <[email protected]> Signed-off-by: Puyuan Yao <[email protected]>
Why are these changes needed?
The current implementation uses
groupby(column).count()
that causes a full sort. The new implementation usesAggregateFn
which usesgroupby(None)
andset()
to aggregate unique values.The time complexity should be O(N / parallelism) according to
ds.aggregate()
.It's about 10x faster in my local test.
Some part of
test_unique
is removed because it was designed for the original implementation.Related issue number
Closes #49298
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.