
Update filter documentation for expressions #49309

Open · wants to merge 12 commits into master

Conversation

srinathk10 (Contributor)

Why are these changes needed?

Update filter documentation for expressions

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Srinath Krishnamachari <[email protected]>
@srinathk10 srinathk10 requested a review from a team as a code owner December 17, 2024 19:17
Signed-off-by: Srinath Krishnamachari <[email protected]>
If you can represent your filter as an expression that leverages Arrow
Dataset Expressions, filtering is performed through highly optimized
native Arrow interfaces.
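A minimal sketch of the expression path, reusing the doctest style of this docstring (the output shown assumes a dataset built with ray.data.range(10)):

>>> import ray
>>> ds = ray.data.range(10)  # rows: {"id": 0} ... {"id": 9}
>>> # The expression is evaluated with native Arrow compute, so rows are
>>> # filtered without being deserialized into Python objects.
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]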

Member

Let's remove the two tips below?

If you can represent your predicate with NumPy or pandas operations,
:meth:`Dataset.map_batches` might be faster. You can implement filter by
dropping rows, as in the sketch below.
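For reference, a sketch of that pattern (the keep_even helper is a hypothetical name for illustration):

>>> import pandas as pd
>>> import ray
>>> ds = ray.data.range(10)
>>> def keep_even(batch: pd.DataFrame) -> pd.DataFrame:
...     # Drop the rows that fail the predicate; what remains is the "filter".
...     return batch[batch["id"] % 2 == 0]
>>> ds.map_batches(keep_even, batch_format="pandas").take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}]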

If you're reading Parquet files with :meth:`ray.data.read_parquet`,
and the filter is a simple predicate, you might
be able to speed it up by using filter pushdown; see
:ref:`Parquet row pruning <parquet_row_pruning>` for details.
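And a sketch of the pushdown tip that section describes (example://iris.parquet is Ray's bundled sample dataset; treat the exact path and column name as assumptions):

>>> import pyarrow.dataset as pds
>>> import ray
>>> # The predicate is pushed into the Parquet scan, so row groups whose
>>> # statistics can't match it are pruned before any rows reach Ray.
>>> ds = ray.data.read_parquet(
...     "example://iris.parquet",
...     filter=pds.field("sepal.length") > 5.0,
... )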

Member

And for "Parquet row pruning", let's remove the corresponding section from the performance tips user guide?

Contributor Author

That tip seemed useful to me when I first saw it, and it still looks valid. Let me remove it and upload the change here; that will help us discuss whether it's still needed.

python/ray/data/dataset.py: four outdated review threads (resolved)
Comment on lines 1203 to 1204
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
Member

Maybe remove this since we don't want people using the fn parameter?

Contributor Author

Given that expr is limited, I thought we could retain it. But let me remove this one.

srinathk10 and others added 3 commits December 18, 2024 14:16
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: srinathk10 <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
>>> ds.filter(lambda row: row["id"] % 2 == 0).take_all()
[{'id': 0}, {'id': 2}, {'id': 4}, ...]
>>> ds.filter(expr="id <= 4").take_all()
[{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]

Time complexity: O(dataset size / parallelism)
Contributor

I think it's worth showing both the UDF-based and the expr-based versions, and clearly calling out that the expr-based one has very clear performance advantages (skipping deserialization, etc.).
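A hedged sketch of what that side-by-side example could look like (the deserialization point comes from this comment, not a measured benchmark):

>>> import ray
>>> ds = ray.data.range(100)
>>> # UDF path: every row is deserialized into a Python dict for the lambda.
>>> udf_ds = ds.filter(lambda row: row["id"] <= 4)
>>> # Expression path: evaluated through native Arrow interfaces,
>>> # skipping per-row deserialization.
>>> expr_ds = ds.filter(expr="id <= 4")
>>> udf_ds.take_all() == expr_ds.take_all()
True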
