Harmonized predicate eval #420
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   80.20%   80.22%   +0.02%
==========================================
  Files          58       61       +3
  Lines       12994    13327     +333
  Branches    12994    13327     +333
==========================================
+ Hits        10422    10692     +270
- Misses       2033     2090      +57
- Partials      539      545       +6
```
```diff
 (NotEqual, 1, vec![&batch2, &batch1]),
 (NotEqual, 3, vec![&batch2, &batch1]),
 (NotEqual, 4, vec![&batch2, &batch1]),
-(NotEqual, 5, vec![&batch1]),
-(NotEqual, 7, vec![&batch1]),
+(NotEqual, 5, vec![&batch2, &batch1]),
+(NotEqual, 7, vec![&batch2, &batch1]),
```
bug fix...
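For context, the skipping rule in play can be sketched as follows. This is an illustrative model of stats-based `NotEqual` skipping, with hypothetical names rather than the kernel's actual code: a file may be skipped for `col != value` only when every row provably equals `value`.

```rust
// Illustrative sketch (hypothetical names, not the kernel's API): a file can
// be skipped for `col != value` only when min == max == value, i.e. when
// every row provably equals value. Any wider range might contain a row that
// differs from value, so the predicate could still match.
fn can_skip_ne(min: i64, max: i64, value: i64) -> bool {
    min == value && max == value
}

fn main() {
    assert!(can_skip_ne(7, 7, 7)); // all rows are exactly 7; `col != 7` never holds
    assert!(!can_skip_ne(5, 7, 7)); // rows in [5, 7] may differ from 7
    assert!(!can_skip_ne(5, 7, 6)); // rows in [5, 7] may differ from 6
    println!("ok");
}
```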
```diff
 (
     Expression::literal(3i64),
     table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
 ),
 (
     column_expr!("number").distinct(3i64),
-    table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
+    table_for_numbers(vec![1, 2, 4, 5, 6]),
```
bug fix...
The only changes in this file are a couple new uses of `column_name!` where needed, and a LOT of noise due to renaming `get_XXX_stat_value` as `get_XXX_stat` (since the method may return an expression for some trait impls).
wow, that's a lot of stuff :)
I've taken a pass and had a few comments. Will go over again now that I have more of the shape of it in my head.
LGTM!!!! this was a marathon but i think now i both grok the changes and believe them to be correct. left a handful of questions/comments but nothing major
nit: I think where the PR comment says `DefaultPredicateEvaluator` trait, it is supposed to say `ResolveColumnAsScalar` trait?
```rust
/// Literal NULL values almost always produce cascading changes in the predicate's structure, so we
/// represent them by `Option::None` rather than `Scalar::Null`. This allows e.g. `A < NULL` to be
/// rewritten as `NULL`, or `AND(NULL, FALSE)` to be rewritten as `FALSE`.
```
so we are really saying that `AND(NULL, FALSE)` is represented as `AND(None, FALSE)`, which is simplified to `FALSE`. I guess there is some nuance with what is said below, since if that `NULL` were actually a null scalar then it would evaluate/simplify to `NULL`?

This feels a bit weird to have a distinction between `None` and `Scalar::Null` - i certainly see the utility but I find this semantic to be confusing. Unfortunately I don't have any better idea but just flagging as an area for future investigation/improvement?
We can't generically use `Scalar::Null` because `Output` isn't always an `Expression` -- the parquet skipping implementation uses `Output = bool`. In theory, the code shouldn't ever produce `Scalar::Null`, but I should make another pass to be sure of that.
Even if something does produce `Scalar::Null` tho, it should only risk giving up some of those "structural" plan simplifications. It would still evaluate to a correct result. For example, `eval_scalar` returns NULL for anything except `Scalar::Boolean`, and `partial_cmp_scalars` falls through to `impl PartialOrd for Scalar`, which has a specific case for `Scalar::Null`.
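The NULL-as-`None` simplification being discussed follows SQL's three-valued logic, where a known-FALSE input collapses a conjunction regardless of NULLs. A minimal sketch (hypothetical `and_kleene`, not the kernel's API):

```rust
// Hypothetical sketch (not the kernel's code): SQL three-valued logic for
// AND, with NULL modeled as Option::None. A known-FALSE input short-circuits
// the whole conjunction, which is exactly why AND(NULL, FALSE) -> FALSE.
fn and_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // FALSE dominates: the conjunction is FALSE no matter what NULL hides.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Any remaining NULL makes the result NULL (unknown).
        _ => None,
    }
}

fn main() {
    assert_eq!(and_kleene(None, Some(false)), Some(false)); // AND(NULL, FALSE) = FALSE
    assert_eq!(and_kleene(None, Some(true)), None);         // AND(NULL, TRUE) = NULL
    assert_eq!(and_kleene(Some(true), Some(true)), Some(true));
    println!("ok");
}
```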
kernel/src/predicates/mod.rs
Outdated
```rust
/// A less-than-or-equal comparison, e.g. `<col> <= <value>`
///
/// NOTE: Caller is responsible to commute and/or invert the operation if needed,
/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
```
```diff
-/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
+/// e.g. `NOT(<value> <= <col>)` becomes `<col> < <value>`.
```
also to me at least, it feels more intuitive to keep the ordering of the arguments and just flip the sign (this is in many doc comments above/below):

```diff
-/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
+/// e.g. `NOT(<value> <= <col>)` becomes `<value> > <col>`.
```
Commuting (reordering) is needed because the eval logic that follows is based on `col` being on the left and `value` being on the right. Otherwise there are twice as many cases to implement.
ahhh got it :) thanks!
disregard then! (though i think the top one still applies)
Nice catch on the `>` vs. `<` btw!
kernel/src/predicates/mod.rs
Outdated
```rust
/// A less-than comparison, e.g. `<col> < <value>`.
///
/// NOTE: Caller is responsible to commute and/or invert the operation if needed,
/// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.
```
(and lots of times below)

```diff
-/// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.
+/// e.g. `NOT(<value> < <col>)` becomes `<col> <= <value>`.
```
meta-question: with github's new fancy 'suggestions', is it easier if I go ahead and fix these all myself with suggestions in the github UI? or is that too noisy, and doing it just once with the "(and lots of times below)" message better?
I usually find it easier since I just commit the suggestion here and then rebase it back locally for anything else.
FWIW I almost never commit changes from github side, but suggestions don't bother me in the slightest. It probably saves you a lot of time to just flag that it's a multiple-instance issue instead of manually fixing it a bunch of times.
kernel/src/predicates/mod.rs
Outdated
```rust
/// A predicate evaluator that directly evaluates the predicate to produce an `Option<bool>`
/// result. Column resolution is handled by an embedded [`ResolveColumnAsScalar`] instance.
pub(crate) struct DefaultPredicateEvaluator {
    resolver: Box<dyn ResolveColumnAsScalar>,
```
out of curiosity: why trait object instead of generic?
Good question. Let me dig into that a bit.
You mean, why not this instead?

```rust
pub(crate) struct DefaultPredicateEvaluator<R: ResolveColumnAsScalar> {
    resolver: R,
}
```

Maybe it's a knee jerk from C++ days, but generic means every method the class defines (including trait methods) has to be monomorphized (= replicated) for every different `R` we use. That includes ten trait methods we directly implement, plus all the provided methods we "inherit" from `PredicateEvaluator`. That seemed a bit excessive when `ResolveColumnAsScalar` only provides a single method.

In rust tho, everything is super-aggressively inlined, so using a `Box<dyn _>` might cause more bloat than it prevents? Maybe @nicklan or @hntd187 has a good idea here?
I mean, I think it's just the trade-off you've stated. Code size vs. runtime overhead. In general I'd say generics are more idiomatic, because rust loves "zero-cost" abstractions.
ah that's interesting - I actually didn't consider monomorphization a con until you reminded me that it does increase our code size :)
but yea generally agree with nick given that tradeoff - it seems like generics would be my vote here!
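For reference, the two designs under discussion can be put side by side in a minimal sketch. The names are illustrative (the real trait resolves a `ColumnName` to a `Scalar`; here it is simplified to `&str` to `i64`): the generic form monomorphizes per resolver type, while the boxed form compiles once and dispatches dynamically.

```rust
// Illustrative sketch of generics vs. trait objects (not the kernel's API).
use std::collections::HashMap;

trait ResolveColumnAsScalar {
    fn resolve(&self, col: &str) -> Option<i64>;
}

// A simple resolver backed by a map of column values.
impl ResolveColumnAsScalar for HashMap<String, i64> {
    fn resolve(&self, col: &str) -> Option<i64> {
        self.get(col).copied()
    }
}

// Generic version: zero runtime overhead, but every method is
// monomorphized (replicated) for each concrete R used.
struct GenericEvaluator<R: ResolveColumnAsScalar> {
    resolver: R,
}

// Trait-object version: one compiled copy of the code,
// at the cost of dynamic dispatch on each resolve call.
struct BoxedEvaluator {
    resolver: Box<dyn ResolveColumnAsScalar>,
}

fn main() {
    let mut cols = HashMap::new();
    cols.insert("number".to_string(), 3i64);

    let generic = GenericEvaluator { resolver: cols.clone() };
    let boxed = BoxedEvaluator { resolver: Box::new(cols) };

    assert_eq!(generic.resolver.resolve("number"), Some(3));
    assert_eq!(boxed.resolver.resolve("number"), Some(3));
    println!("ok");
}
```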
```rust
let null = &Scalar::Null(DataType::INTEGER);

let expressions = [
    Expr::lt(col.clone(), ten.clone()),
```
i guess doing this to share a ref to the same scalar? otherwise could we do something like `5.into()`?
It was mostly to avoid magic constants... otherwise `Expr::lt(column_name!("foo"), 5)` would work Just Fine. And maybe that's better?
hm.. for simple numbers and esp in testing I would maybe advocate for the `Expr::lt(column_name!("foo"), 5)`? not a huge opinion :)
I revisited this, and there are several challenges (tho I can still improve the code over what we have today):

- Some operations do not take `impl Into<_>`, and so passing a primitive literal like `1` does not compile.
- Some operations need to take `Scalar` so they can sometimes pass NULL values.
- `ColumnName` is not `const` and so can't be made into a constant; and there are format strings relying on the current `col` that would be "not fun" to change.
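The first two bullets can be illustrated with a standalone sketch (hypothetical `Scalar` and `lt`, not the kernel's types): an `impl Into<_>` bound is what lets a bare literal compile, while an explicit `Scalar` still works for NULL via the blanket `impl<T> From<T> for T`.

```rust
// Illustrative sketch (not the kernel's types) of why an `impl Into<_>`
// parameter matters for ergonomics: bare literals convert automatically.
#[derive(Debug, PartialEq)]
enum Scalar {
    Integer(i64),
    Null,
}

impl From<i64> for Scalar {
    fn from(v: i64) -> Self {
        Scalar::Integer(v)
    }
}

impl From<i32> for Scalar {
    fn from(v: i32) -> Self {
        Scalar::Integer(v as i64)
    }
}

// The Into bound accepts anything convertible to Scalar.
fn lt(col: &str, value: impl Into<Scalar>) -> (String, Scalar) {
    (format!("{col} < _"), value.into())
}

fn main() {
    // A bare literal works because of the Into bound...
    assert_eq!(lt("number", 5).1, Scalar::Integer(5));
    // ...and an explicit Scalar still works, e.g. to pass NULL.
    assert_eq!(lt("number", Scalar::Null).1, Scalar::Null);
    println!("ok");
}
```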
if we think this is worth tackling then yea maybe just make a follow-up issue to improve some of these ergonomics?
I already made some improvements, hopefully it's enough for now.
```rust
/// Retrieves the minimum value of a column, if it exists and has the requested type.
fn get_min_stat(&self, col: &ColumnName, _data_type: &DataType) -> Option<Expr> {
    Some(joined_column_expr!("minValues", col))
```
nit: should probably pull out `"minValues"` etc. as constants?
If we make them constants, we can't use the macro any more and would end up with e.g.

```diff
-Some(joined_column_expr!("minValues", col))
+Some(ColumnName::new([MIN_VALUES_COL_NAME]).join(col))
```

... which is certainly doable, but also a bit of a mouthful.
Also, there used to be more uses of these magic constants, but this PR leaves only two prod uses of each of those constants: one for its `get_xxx_stat` method here, and the `stats_schema` created internally by `DataSkippingFilter::new`.

Thoughts?
hm fair point. this seems like not worth spending time on so let's leave as-is. next time someone wants to play with macros we could probably expand `column_name!` etc. to accept a `path` in addition to a string literal? but again doesn't really feel worth the time right now :)
How would a `path` help? The problem is, macros run before const evaluates (indeed, before we even know what type the const evaluates to), so the macro couldn't verify that the content of a const string meets the safety conditions.
```rust
impl DataSkippingPredicateEvaluator for DataSkippingPredicateCreator {
    type Output = Expr;
    type TypedStat = Expr;
    type IntStat = Expr;
```
impressed how well this API generalizes across data skipping and parquet stats :)
lgtm! Thanks, this is a big improvement. Just had a couple of small things
```rust
}

impl DefaultPredicateEvaluator {
    // Convenient thin wrapper
    fn resolve_column(&self, col: &ColumnName) -> Option<Scalar> {
```
nice, yeah, this is more clear thanks
kernel/src/predicates/mod.rs
Outdated
```rust
/// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
/// variadic expression, in some implementation-defined way (this method).
```
```diff
-/// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
-/// variadic expression, in some implementation-defined way (this method).
+/// always the same, provided by [`eval_variadic`]). The results are then combined into the
+/// output type in some implementation-defined way (this method).
```
```rust
if inverted {
    op = op.invert();
}
```
could avoid the `mut` with:

```diff
-if inverted {
-    op = op.invert();
-}
+let op = if inverted {
+    op.invert()
+} else {
+    op
+};
```
Is `mut` somehow bad? I was using `mut` there specifically to avoid the `else`...
nope, it's fine. it's just a little less "functional" :)
Today, we have two completely independent data skipping predicate mechanisms:

Besides the duplication, there is also the problem of under-tested Delta stats code, and ~~at least one~~ several lurking bugs (**). The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but can also evaluate any predicate over scalar data, given a way to resolve column names into `Scalar` values (the `DefaultPredicateEvaluator` trait). The default predicate evaluator allows for much easier testing of Delta data skipping predicates, all while reusing significant code to further reduce the chances of divergence and lurking bugs.

(**) Bugs found (and fixed) so far:

- The `NotEqual` implementation was unsound, due to swapping `<` with `>` (could wrongly skip files).
- `IS [NOT] NULL` was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
- `NULL` handling in `AND` and `OR` clauses was too conservative, preventing files from being skipped in several cases.

We considered hoisting `eval_sql_where` from the parquet skipping code up to the main predicate evaluator. Decided not to for now. If we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators, we can always do it later.