
Update filter documentation for expressions #49309

Open · wants to merge 12 commits into master

Conversation

srinathk10 (Contributor)

Why are these changes needed?

Update filter documentation for expressions

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10 srinathk10 requested a review from a team as a code owner December 17, 2024 19:17
Signed-off-by: Srinath Krishnamachari <[email protected]>
If you can represent your filter as an expression that leverages Arrow
Dataset Expressions, filtering is performed through highly optimized
native Arrow interfaces.
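A minimal sketch of the expression path, reusing the doctest style of this docstring (the output shown assumes a dataset built with ray.data.range(10)):

>>> import ray
>>> ds = ray.data.range(10)  # rows: {"id": 0} ... {"id": 9}
>>> # The expression is evaluated with native Arrow compute, so rows are
>>> # filtered without being deserialized into Python objects.
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]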

Member

Let's remove the two tips below?

If you can represent your predicate with NumPy or pandas operations,
:meth:`Dataset.map_batches` might be faster. You can implement filter by
dropping rows, as in the sketch below.
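For reference, a sketch of that pattern (the keep_even helper is a hypothetical name for illustration):

>>> import pandas as pd
>>> import ray
>>> ds = ray.data.range(10)
>>> def keep_even(batch: pd.DataFrame) -> pd.DataFrame:
...     # Drop the rows that fail the predicate; what remains is the "filter".
...     return batch[batch["id"] % 2 == 0]
>>> ds.map_batches(keep_even, batch_format="pandas").take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}]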

If you're reading Parquet files with :meth:`ray.data.read_parquet`,
and the filter is a simple predicate, you might
be able to speed it up by using filter pushdown; see
:ref:`Parquet row pruning <parquet_row_pruning>` for details.
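And a sketch of the pushdown tip that section describes (example://iris.parquet is Ray's bundled sample dataset; treat the exact path and column name as assumptions):

>>> import pyarrow.dataset as pds
>>> import ray
>>> # The predicate is pushed into the Parquet scan, so row groups whose
>>> # statistics can't match it are pruned before any rows reach Ray.
>>> ds = ray.data.read_parquet(
...     "example://iris.parquet",
...     filter=pds.field("sepal.length") > 5.0,
... )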

Member

And for "Parquet row pruning", let's remove the corresponding section from the performance tips user guide?

Contributor Author

That tip seemed useful to me when I first saw it, and it still looks valid. Let me remove it and upload the change here; that will help us discuss whether it's still needed.

python/ray/data/dataset.py: four outdated review threads (resolved)
Comment on lines 1203 to 1204
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
Member

Maybe remove this since we don't want people using the fn parameter?

Contributor Author

Given that expr is limited, I thought we could retain it. But let me remove this one.

srinathk10 and others added 3 commits December 18, 2024 14:16
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]

Time complexity: O(dataset size / parallelism)
Contributor

I think it's worth showing both the UDF-based and the expr-based versions, and clearly calling out that the expr-based one has very clear performance advantages (skipping deserialization, etc.).
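A hedged sketch of what that side-by-side example could look like (the deserialization point comes from this comment, not a measured benchmark):

>>> import ray
>>> ds = ray.data.range(100)
>>> # UDF path: every row is deserialized into a Python dict for the lambda.
>>> udf_ds = ds.filter(lambda row: row["id"] <= 4)
>>> # Expression path: evaluated through native Arrow interfaces,
>>> # skipping per-row deserialization.
>>> expr_ds = ds.filter(expr="id <= 4")
>>> udf_ds.take_all() == expr_ds.take_all()
True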
