-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-43187 [C++] Support is_in predicates for SimplifyWithGuarantee #43256
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that in #43187 you mentioned:
Our current approach adds a new field to SetLookupOptions which allows the user to declare whether the value set is pre-sorted and deduplicated.
I think this makes sense as it makes the is_in
expression more self-explaining. Why is it not in this PR? (Instead, an extra argument is_in_value_set_sorted
is added.)
I was unsure if people would have liked that approach so I went with something simpler. But it sounds like we're OK with it, so I'll update this PR with the SetLookupOptions change. |
Hi guys, @mapleFU @felipecrv what's your opinions on the current interface design? I.e., adding flag I can proceed with reviewing the implementation when people agree on the interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea that using SimplifyIsIn
to remove the elements in "is_in" until it's whole pruned. But I think sorted_and_deduped
can put into separate patch since it's a bit weird to define a "ordering" here, and can easily break the invariant.
What about removing sorted_and_deduped
firstly and sort during simplify?
cpp/src/arrow/compute/expression.cc
Outdated
if (!value_set) return expr; | ||
if (value_set->length() == 0) return literal(false); | ||
|
||
if (!options->sorted_and_deduped) return expr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why without sorted_and_deduped
the underlying data isn't get checked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the value set isn't sorted, then we'd have to do a linear scan. For large value sets (the use case that I care about), this cost is prohibitive. Our goal after all is support fast parquet filtering on is_in
.
However, I could see this being useful for small value sets. Do you think we should add another option, either to SetLookupOption
or SimplifyWithGuarantee
, to enable simplifying is_in
via linear scan? Or can we punt until someone has a use case and actually wants this feature?
IIUC, an |
I agree, but a flag in planning phase is so weird, and considering something like nan in float, I think we'd at least separating it? "In" list might be large, sort is important for performance, however still I think adding a "planning flag" is so hacking, at least in the first time, the "is_in" doesn't need any flag? |
I'd rather not consider it a "flag in planning phase". It's an execution flag, being set in the planning phase, like any However I can kind of get your point about the weirdness. Maybe it tastes weird in the sense that it is a "hint" (about another field |
Yes, this solving is pretty good but I think it would be better when we encapsulate all "is_in" and sorted checking within |
I just catch that this flag would introduce a new flag, and avoid previous user from filtering "is_in" and might get worse performance... |
Thanks for the review. To give some context, we are trying to optimize parquet filtering on high cardinality columns. You can think of this column as an ID on the row, so a typical use case might be to read 100K rows out of a file with 10MM rows. We express the 100K rows that we want to read as an
I agree that having some sort of planning or preprocessing step is what we ultimately want, I'm just unsure of what the right interface is. If we put the option inside |
cpp/src/arrow/compute/expression.cc
Outdated
@@ -1242,33 +1242,104 @@ struct Inequality { | |||
/*insert_implicit_casts=*/false, &exec_context); | |||
} | |||
|
|||
/// Simplify an is_in predicate against this inequality as a guarantee. | |||
Result<Expression> SimplifyIsIn(Expression expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expr
should probably be named is_in_call_expr
to make it clear that this function assumes pre-conditions that should be ensured by the caller and it can't be called with any generic Expression
.
In a sense, sorting the arrow/cpp/src/arrow/compute/expression.h Lines 53 to 59 in c3ebdf5
You can encode the "pre-sorted" fact in the
|
Ah I didn't realize that binding initializes the kernel state and constructs the lookup table. Turns out there was a "bug" in my proof of concept where I copy over the post-bind fields with a simplified value set instead of re-binding. But this happens to DTRT for parquet predicate pushdown since the simplified expression is simply thrown away after we check that it is satisfiable, and the original, unsimplified expression is actually used to filter the data. To fix this, I think we'd also need to delay binding until the end of Otherwise, thanks for the suggestion! I like this approach, but just to check my understanding of how this would work:
@felipecrv Does this sound right? |
It's probably a bad idea to mutate You don't have to do it in
Yeah, don't mutate
I don't understand this.
Yeah. You should be able to look at the |
cpp/src/arrow/compute/expression.cc
Outdated
if (cmp & Comparison::LESS_EQUAL) { | ||
lo = mid; | ||
} else { | ||
hi = mid; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this binary-search doesn't necessarily work for all possible comparison operators. I don't have time to think carefully about this yet though.
@bkietz since you're familiar with this code, can you review it? I don't think I can make progress with the review without checking out the changes and trying the suggestions myself. And that would take a lot of my time at the moment. |
cpp/src/arrow/compute/expression.cc
Outdated
} | ||
|
||
switch (cmp_type.id()) { | ||
CASE(UInt8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a naive question, would decimal could be checked here? ( or just not support it yet?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like supporting decimal types would be easy - the basic decimal C types support comparison operators and the is_in
function has kernels for decimal types.
But here you mention that we should leave out floating point types, so should we support decimals but not floats? Or just support both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this needs more tests and should at least initially be simplified to avoid sorting/uniquing sets altogether. Initially winnowing of value sets should be accomplished by filtering with the guarantee.
After this PR merges with a complete set of correctness tests, a follow up PR can add the performance enhancement of slicing sorted value sets (ideally with a benchmark to demonstrate the improvement for large value sets/many guarantees/...)
cpp/src/arrow/compute/expression.cc
Outdated
} | ||
if (!state->sorted_and_unique_value_set->type()->Equals(*type)) { | ||
ARROW_ASSIGN_OR_RAISE( | ||
state->sorted_and_unique_value_set, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a race condition; a kernel state may be shared across multiple threads its Expression was bound in the main thread and then used in a filter on multiple worker threads. In general it is not safe to mutate kernel state in any way unless you are constructing it as part of a new Expression::Call
or otherwise know that only a single thread accesses this kernel state.
Since this is a property which we expect to be a pure function of value_set
, it would also be possible to do this with a memoizing accessor like Result<Datum> SetLookupState::SortedAndUniqueValueSet() const;
(would need atomics).
However, I don't see how this adds value over just sorting and unique-ing all value sets as part of SetLookupState::Init()
. To avoid paying for sorting and unique-ing when the arrays are already sorted and uniqued, bool SetLookupOptions::value_set_is_sorted_and_unique
could be added (but it'd be better if we could add optional<bool> is_sorted
to ArrayStatistics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are ok with requiring the user to set SetLookupOptions::value_set_is_sorted_and_unique
correctly, what do you think about simplifying is_in
only if this option is set? (This was the original approach, but we later switched to attempting to have SimplifyWithGuarantee
automatically sort/unique the value set without requiring the user to do anything).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should at least initially be simplified to avoid sorting/uniquing sets altogether. Initially winnowing of value sets should be accomplished by filtering with the guarantee.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that SetLookupOptions::value_set_is_sorted_and_unique
is a bit odd since it's not used directly by the kernel, which is why I'd prefer that if we add the optimization that we wait and do so through ArrayStatistics instead since that doesn't need weird modification of the options for just one function
auto is_in = [](Expression field, std::shared_ptr<DataType> value_set_type, | ||
std::string json_array) { | ||
SetLookupOptions options{ArrayFromJSON(value_set_type, json_array), | ||
SetLookupOptions::MATCH}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR currently simplifies every is_in
call but you only test for null_matching_behavior=MATCH
. Please either restrict the simplification to MATCH
or test for all null matching behaviors
Simplify{is_in(field_ref("i32"), int32(), "[null]")} | ||
.WithGuarantee(greater(field_ref("i32"), literal(2))) | ||
.Expect(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect; the is_in call is simplified to false but if null_matching_behavior=MATCH then i32 IS IN [null]
should produce a true
in every slot where i32
was null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is subtle enough to benefit from testing evaluation against real data: for some input data (maybe random, just filter by the guarantee to produce an array which satisfies the guarantee), evaluate the original expression and also the simplified expression then assert that the two results are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I follow, the guarantee i32 > 2
itself guarantees that i32
is non-null.
cpp/src/arrow/compute/expression.cc
Outdated
value_set = value_set->Slice(0, pivot); | ||
break; | ||
case Comparison::LESS_EQUAL: | ||
value_set = value_set->Slice(0, pivot + (found ? 1 : 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will always slice null out of the value set, which is not correct for all null handling behaviors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nulls are already removed in PrepareIsInValueSet
. But as you pointed out above this is incorrect if the null matching behavior is INCONCLUSIVE. I'll handle this by skipping simplification in this case.
I opened #43761 which contains the basic implementation. |
@bkietz before I rebase these changes on top of #43761, do you have any thoughts on the high level approach? Specifically
|
### Rationale for this change Prior to #43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of `is_in` predicate simplification from the binary search performance optimization. ### What changes are included in this PR? `SimplifyWithGuarantee` now handles `is_in` expressions. ### Are these changes tested? A new unit test was added to arrow-compute-expression-test testing this change. ### Are there any user-facing changes? No. * GitHub Issue: #43187 Lead-authored-by: Larry Wang <[email protected]> Co-authored-by: larry98 <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
My main note when this PR is rebased is I'd like to see benchmarks comparing the sort-and-slice simplification to the basic filtering one. My intuition is that we'll only prefer sort-and-slice for larger value sets and it'd be best to measure the threshold.
The issue is less about where state should be stored and more about the potential for race conditions when updating KernelState. I'm not opposed to adding memoizations to KernelState, but I don't think it will be necessary here.
A user can set the flag explicitly, but mostly: after the first time a value set is simplified by this optimization, the optimization itself can set that flag (and then subsequent simplifications of the same value set will be able to skip sorting).
This does add complexity and I'd prefer to avoid it if we can. One approach to reducing the cost of binding I'd prefer is pattern matching against the guarantee to extract as much of it as is usable by SimplifyIsIn. For example parquet statistics frequently produces guarantee expressions like |
…pache#43761) ### Rationale for this change Prior to apache#43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of `is_in` predicate simplification from the binary search performance optimization. ### What changes are included in this PR? `SimplifyWithGuarantee` now handles `is_in` expressions. ### Are these changes tested? A new unit test was added to arrow-compute-expression-test testing this change. ### Are there any user-facing changes? No. * GitHub Issue: apache#43187 Lead-authored-by: Larry Wang <[email protected]> Co-authored-by: larry98 <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…pache#43761) ### Rationale for this change Prior to apache#43256, this PR adds a basic implementation that does a linear scan filter over the value set on each guarantee. This isolates the correctness/semantics of `is_in` predicate simplification from the binary search performance optimization. ### What changes are included in this PR? `SimplifyWithGuarantee` now handles `is_in` expressions. ### Are these changes tested? A new unit test was added to arrow-compute-expression-test testing this change. ### Are there any user-facing changes? No. * GitHub Issue: apache#43187 Lead-authored-by: Larry Wang <[email protected]> Co-authored-by: larry98 <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
I haven't gotten around to writing the benchmark yet, but here is what the rest of the code looks like. lmk if you have any preliminary thoughts. |
Rationale for this change
We would like to enable parquet predicate pushdown on
is_in
predicates. The first step is to add support foris_in
predicates inSimplifyWithGuarantee
.What changes are included in this PR?
We add a option to
SimplifyWithGuarantee
that users can set if they can ensure that allis_in
value sets in the expression are sorted and contain no duplicates. When this option is set,SimplifyWithGuarantee
will attempt to simplify anyis_in
predicates that it finds in the expression.Are these changes tested?
A new unit test was added to
arrow-compute-expression-test
testing this change.Are there any user-facing changes?
No.