Harmonized predicate eval #420

Merged
merged 19 commits into from
Nov 16, 2024

Conversation

scovich
Collaborator

@scovich scovich commented Oct 23, 2024

Today, we have two completely independent data skipping predicate mechanisms:

  1. Delta stats -- takes an expression as input and produces a rewritten expression as output. Difficult to test because you have to create and query a Delta table in order to see what data skipping resulted.
  2. Parquet footer stats -- takes an expression as input and produces an optional boolean as output. Tests can easily hook into it, and we have very thorough test coverage.

Besides the duplication, there is also the problem of under-tested Delta stats code, and several lurking bugs (**). The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but can also evaluate any predicate over scalar data, given a way to resolve column names into Scalar values (the DefaultPredicateEvaluator trait). The default predicate evaluator allows for much easier testing of Delta data skipping predicates, all while reusing significant code to further reduce the chances of divergence and lurking bugs.

(**) Bugs found (and fixed) so far:

  • NotEqual implementation was unsound, due to swapping < with > (could wrongly skip files).
  • IS [NOT] NULL was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
  • NULL handling in AND and OR clauses was too conservative, preventing files from being skipped in several cases.
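To make the NotEqual fix concrete, here is a minimal sketch of a sound skipping rule over min/max stats. It is not the kernel's code: `MinMax` and `may_contain_not_equal` are hypothetical names, the stats are plain `i64`, and NULLs are ignored. A file can only be skipped for `col != value` when the stats prove every row equals `value`, i.e. `min == max == value`.

```rust
/// Min/max statistics for one file. Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Returns `false` only when the stats prove that no row can satisfy
/// `col != value`. Assuming `min <= max`, that happens exactly when
/// `min == max == value`; in every other case the file must be kept.
fn may_contain_not_equal(stats: MinMax, value: i64) -> bool {
    stats.min < value || value < stats.max
}
```

With this rule a file with stats `[1, 10]` is kept for `col != 5`, while a file with stats `[5, 5]` is skipped. Reversing both comparisons would instead keep only files whose range excludes the value, which wrongly skips files that do contain matching rows.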

We considered hoisting eval_sql_where from the parquet skipping code up to the main predicate evaluator. Decided not to for now. If we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators, we can always do it later.


codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 88.54489% with 148 lines in your changes missing coverage. Please review.

Project coverage is 80.22%. Comparing base (3e7ad45) to head (b937f0f).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/predicates/tests.rs 84.81% 63 Missing ⚠️
kernel/src/engine/parquet_stats_skipping/tests.rs 80.60% 30 Missing and 2 partials ⚠️
kernel/src/predicates/mod.rs 90.32% 19 Missing and 11 partials ⚠️
kernel/src/scan/data_skipping.rs 81.42% 11 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping/tests.rs 96.51% 6 Missing and 1 partial ⚠️
kernel/src/engine/arrow_expression.rs 0.00% 1 Missing ⚠️
kernel/src/engine/parquet_stats_skipping.rs 98.30% 0 Missing and 1 partial ⚠️
kernel/src/expressions/mod.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   80.20%   80.22%   +0.02%     
==========================================
  Files          58       61       +3     
  Lines       12994    13327     +333     
  Branches    12994    13327     +333     
==========================================
+ Hits        10422    10692     +270     
- Misses       2033     2090      +57     
- Partials      539      545       +6     


@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 23, 2024
Comment on lines +334 to +338
(NotEqual, 1, vec![&batch2, &batch1]),
(NotEqual, 3, vec![&batch2, &batch1]),
(NotEqual, 4, vec![&batch2, &batch1]),
(NotEqual, 5, vec![&batch1]),
(NotEqual, 7, vec![&batch1]),
(NotEqual, 5, vec![&batch2, &batch1]),
(NotEqual, 7, vec![&batch2, &batch1]),
Collaborator Author

bug fix...

(
Expression::literal(3i64),
table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
),
(
column_expr!("number").distinct(3i64),
table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
table_for_numbers(vec![1, 2, 4, 5, 6]),
Collaborator Author

bug fix...

Collaborator Author

The only changes in this file are a couple new uses of column_name! where needed, and a LOT of noise due to renaming get_XXX_stat_value as get_XXX_stat (since the method may return an expression for some trait impl).

@scovich scovich marked this pull request as ready for review November 7, 2024 04:02
@scovich scovich changed the title [WIP] Harmonized predicate eval Harmonized predicate eval Nov 7, 2024
Collaborator

@nicklan nicklan left a comment

wow, that's a lot of stuff :)

I've taken a pass and had a few comments. Will go over again now that I have more of the shape of it in my head.

Collaborator

@zachschuermann zachschuermann left a comment

LGTM!!!! this was a marathon but i think now i both grok the changes and believe them to be correct. left a handful of questions/comments but nothing major

nit: I think in PR comment DefaultPredicateEvaluator trait is supposed to say ResolveColumnAsScalar trait?

Comment on lines +24 to +26
/// Literal NULL values almost always produce cascading changes in the predicate's structure, so we
/// represent them by `Option::None` rather than `Scalar::Null`. This allows e.g. `A < NULL` to be
/// rewritten as `NULL`, or `AND(NULL, FALSE)` to be rewritten as `FALSE`.
Collaborator

so we are really saying that AND(NULL, False) is represented as AND(None, False) which is simplified to False. I guess there is some nuance with what is said below since if that NULL were actually a null scalar then it would evaluate/simplify to NULL?

This feels a bit weird to have a distinction between None and Scalar::Null - i certainly see the utility but I find this semantic to be confusing. Unfortunately I don't have any better idea but just flagging as an area for future investigation/improvement?

Collaborator Author

We can't generically use Scalar::Null because Output isn't always an Expression -- the parquet skipping implementation uses Output = bool. In theory, the code shouldn't ever produce Scalar::Null but I should make another pass to be sure of that.

Collaborator Author

@scovich scovich Nov 15, 2024

Even if something does produce Scalar::Null tho, it should only risk giving up some of those "structural" plan simplifications. It would still evaluate to a correct result. For example, eval_scalar returns NULL for anything except Scalar::Boolean, and partial_cmp_scalars falls through to impl PartialOrd for Scalar, which has a specific case for Scalar::Null.
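The `Option::None`-as-NULL convention discussed in this thread can be sketched with plain `Option<bool>` values. This is only an illustration of SQL three-valued logic, not the kernel's evaluator; `and3`, `or3` and `lt3` are hypothetical helper names.

```rust
/// Three-valued AND, where `None` stands for SQL NULL (unknown).
/// A single FALSE operand decides the result even if the other is NULL.
fn and3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Three-valued OR: a single TRUE operand decides the result.
fn or3(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// A comparison with a NULL operand is itself NULL, so `A < NULL` collapses
/// to `None` without inspecting `A`.
fn lt3(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    Some(a? < b?)
}
```

Here `and3(None, Some(false))` yields `Some(false)`, matching the `AND(NULL, FALSE)` rewrite in the doc comment, while `and3(None, Some(true))` stays `None`.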

/// A less-than-or-equal comparison, e.g. `<col> <= <value>`
///
/// NOTE: Caller is responsible to commute and/or invert the operation if needed,
/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
Collaborator

Suggested change
- /// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
+ /// e.g. `NOT(<value> <= <col>)` becomes `<col> < <value>`.

Collaborator

also to me at least, it feels more intuitive to keep the ordering of the arguments and just flip the sign
(this is in many doc comments above/below)

Suggested change
- /// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.
+ /// e.g. `NOT(<value> <= <col>)` becomes `<value> > <col>`.

Collaborator Author

Commuting (reordering) is needed because the eval logic that follows is based on col being on the left and value being on the right. Otherwise there are twice as many cases to implement.

Collaborator

ahhh got it :) thanks!

Collaborator

disregard then! (though i think the top one still applies)

Collaborator Author

Nice catch on the > vs. < btw!
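The commute-and-invert step discussed in this thread can be sketched as follows. This is an assumption-laden illustration, not the kernel's types: `CmpOp` and `normalize_value_op_col` are hypothetical names. The idea is to rewrite `[NOT] (<value> op <col>)` so that the column always ends up on the left, which is what lets the evaluation logic handle half as many cases.

```rust
/// Hypothetical comparison operator enum, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CmpOp {
    Lt,
    Le,
    Gt,
    Ge,
    Equal,
    NotEqual,
}

impl CmpOp {
    /// Swaps the operands: `a op b` is equivalent to `b op.commute() a`.
    fn commute(self) -> Self {
        match self {
            Self::Lt => Self::Gt,
            Self::Le => Self::Ge,
            Self::Gt => Self::Lt,
            Self::Ge => Self::Le,
            Self::Equal => Self::Equal,
            Self::NotEqual => Self::NotEqual,
        }
    }

    /// Negates the comparison: `NOT(a op b)` is equivalent to `a op.invert() b`.
    fn invert(self) -> Self {
        match self {
            Self::Lt => Self::Ge,
            Self::Le => Self::Gt,
            Self::Gt => Self::Le,
            Self::Ge => Self::Lt,
            Self::Equal => Self::NotEqual,
            Self::NotEqual => Self::Equal,
        }
    }
}

/// Rewrites `[NOT] (<value> op <col>)` as `<col> op' <value>` and returns `op'`.
fn normalize_value_op_col(op: CmpOp, inverted: bool) -> CmpOp {
    let op = op.commute();
    if inverted { op.invert() } else { op }
}
```

For example, `normalize_value_op_col(CmpOp::Le, true)` gives `CmpOp::Lt`, which is the corrected doc comment: `NOT(<value> <= <col>)` becomes `<col> < <value>`.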

/// A less-than comparison, e.g. `<col> < <value>`.
///
/// NOTE: Caller is responsible to commute and/or invert the operation if needed,
/// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.
Collaborator

(and lots of times below)

Suggested change
- /// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.
+ /// e.g. `NOT(<value> < <col>)` becomes `<col> <= <value>`.

Collaborator

meta-question: with github's new fancy 'suggestions' is it easier if I go ahead and fix these all myself with suggestions in github UI? or is that too noisy and just doing once with the (lots of times below) message better?

Collaborator

I usually find it easier since I just commit the suggestion here and then rebase it back locally for anything else.

Collaborator Author

@scovich scovich Nov 15, 2024

FWIW I almost never commit changes from github side, but suggestions don't bother me in the slightest. It probably saves you a lot of time to just flag that it's a multiple-instance issue instead of manually fixing it a bunch of times.

/// A predicate evaluator that directly evaluates the predicate to produce an `Option<bool>`
/// result. Column resolution is handled by an embedded [`ResolveColumnAsScalar`] instance.
pub(crate) struct DefaultPredicateEvaluator {
    resolver: Box<dyn ResolveColumnAsScalar>,
Collaborator

out of curiosity: why trait object instead of generic?

Collaborator Author

Good question. Let me dig into that a bit.

Collaborator Author

@scovich scovich Nov 15, 2024

You mean, why not this instead?

pub(crate) struct DefaultPredicateEvaluator<R: ResolveColumnAsScalar> {
    resolver: R,

Maybe it's a knee jerk from C++ days, but generic means every method the class defines (including trait methods) has to be monomorphized (= replicated) for every different R we use. That includes ten trait methods we directly implement, plus all the provided methods we "inherit" from PredicateEvaluator. That seemed a bit excessive when ResolveColumnAsScalar only provides a single method.

In rust tho, everything is super-aggressively inlined, so using a Box<dyn _> might cause more bloat than it prevents? Maybe @nicklan or @hntd187 has a good idea here?

Collaborator

I mean, I think it's just the trade-off you've stated. Code size vs. runtime overhead. In general I'd say generics are more idiomatic, because rust loves "zero-cost" abstractions.

Collaborator

ah that's interesting - I actually didn't consider monomorphization a con until you reminded me that it does increase our code size :)

but yea generally agree with nick given that tradeoff - it seems like generics would be my vote here!
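For readers following the trade-off in this thread, here is a side-by-side sketch of the two designs. Everything in it is a simplified assumption: `ResolveColumn` is a hypothetical stand-in for `ResolveColumnAsScalar` that resolves to `i64` instead of `Scalar`, and `DynEvaluator`, `GenericEvaluator` and `sample_row` are illustrative names only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the resolver trait: a single method that maps a
/// column name to a value (plain `i64` here rather than a `Scalar`).
trait ResolveColumn {
    fn resolve(&self, col: &str) -> Option<i64>;
}

impl ResolveColumn for HashMap<String, i64> {
    fn resolve(&self, col: &str) -> Option<i64> {
        self.get(col).copied()
    }
}

/// Trait-object version: the evaluator's methods are compiled once, and each
/// column lookup goes through dynamic dispatch.
struct DynEvaluator {
    resolver: Box<dyn ResolveColumn>,
}

impl DynEvaluator {
    fn eval_lt(&self, col: &str, value: i64) -> Option<bool> {
        Some(self.resolver.resolve(col)? < value)
    }
}

/// Generic version: every method is monomorphized for each concrete `R`, and
/// the lookup is a statically dispatched call.
struct GenericEvaluator<R: ResolveColumn> {
    resolver: R,
}

impl<R: ResolveColumn> GenericEvaluator<R> {
    fn eval_lt(&self, col: &str, value: i64) -> Option<bool> {
        Some(self.resolver.resolve(col)? < value)
    }
}

/// A one-column "row" used to exercise both evaluators.
fn sample_row() -> HashMap<String, i64> {
    HashMap::from([("x".to_string(), 5)])
}
```

Both versions expose the same `eval_lt` behavior; the difference is only where the cost lands (binary size for the generic form, an indirect call for the boxed form).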

let null = &Scalar::Null(DataType::INTEGER);

let expressions = [
Expr::lt(col.clone(), ten.clone()),
Collaborator

i guess doing this to share ref to the same scalar? otherwise could we do something like 5.into()?

Collaborator Author

It was mostly to avoid magic constants... otherwise Expr::lt(column_name!("foo"), 5) would work Just Fine. And maybe that's better?

Collaborator

hm.. for simple numbers and esp in testing I would maybe advocate for the Expr::lt(column_name!("foo"), 5)? not a huge opinion :)

Collaborator Author

I revisited this, and there are several challenges (tho I can still improve the code over what we have today):

  1. Some operations do not take impl Into<_> and so passing a primitive literal like 1 does not compile.
  2. Some operations need to take Scalar so they can sometimes pass NULL values
  3. ColumnName is not const and so can't be made into a constant; and there are format strings relying on the current col that would be "not fun" to change.
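Points 1 and 2 can be illustrated with a small sketch of an `impl Into<_>` constructor. The `Scalar` and `Expr` types below are drastically simplified, hypothetical stand-ins, not the kernel's definitions; the sketch only shows how one signature can accept both a primitive literal and an explicit NULL scalar.

```rust
/// Hypothetical, simplified scalar type.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Long(i64),
    Null,
}

impl From<i64> for Scalar {
    fn from(v: i64) -> Self {
        Scalar::Long(v)
    }
}

/// Hypothetical, simplified expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(Scalar),
    Lt(Box<Expr>, Box<Expr>),
}

impl From<Scalar> for Expr {
    fn from(s: Scalar) -> Self {
        Expr::Literal(s)
    }
}

impl From<i64> for Expr {
    fn from(v: i64) -> Self {
        Expr::Literal(Scalar::from(v))
    }
}

impl Expr {
    fn column(name: &str) -> Self {
        Expr::Column(name.to_string())
    }

    /// Accepts anything convertible into an `Expr`, so callers can pass a
    /// column, a primitive literal, or an explicit `Scalar` (including NULL).
    fn lt(a: impl Into<Expr>, b: impl Into<Expr>) -> Self {
        Expr::Lt(Box::new(a.into()), Box::new(b.into()))
    }
}
```

With this shape, `Expr::lt(Expr::column("foo"), 5i64)` and `Expr::lt(Expr::column("foo"), Scalar::Null)` both compile. An operation that takes a concrete type instead of `impl Into<_>` would reject the first call, which is the friction described in point 1.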

Collaborator

if we think this is worth tackling then yea maybe just make a follow-up issue to improve some of these ergonomics?

Collaborator Author

I already made some improvements, hopefully it's enough for now.


/// Retrieves the minimum value of a column, if it exists and has the requested type.
fn get_min_stat(&self, col: &ColumnName, _data_type: &DataType) -> Option<Expr> {
    Some(joined_column_expr!("minValues", col))
Collaborator

nit: should probably pull out "minValues" etc. as constants?

Collaborator Author

If we make them constants, we can't use the macro any more and would end up with e.g.

Suggested change
- Some(joined_column_expr!("minValues", col))
+ Some(ColumnName::new([MIN_VALUES_COL_NAME]).join(col))

... which is certainly doable, but also a bit of a mouthful.

Also, there used to be more uses of these magic constants, but this PR leaves only two prod uses of each of those constants: One for its get_xxx_stat method here, and the stats_schema created internally by DataSkippingFilter::new.

Thoughts?

Collaborator

hm fair point. this seems like not worth spending time on so let's leave as-is. next time someone wants to play with macros we could probably expand the column_name! etc. to accept a path in addition to a string literal? but again doesn't really feel worth the time right now :)

Collaborator Author

How would a path help? The problem is, macros run before const evaluates (indeed, before we even know what type the const evaluates to), so the macro couldn't verify the content of a const string meets the safety conditions.

Comment on lines +187 to +190
impl DataSkippingPredicateEvaluator for DataSkippingPredicateCreator {
    type Output = Expr;
    type TypedStat = Expr;
    type IntStat = Expr;
Collaborator

impressed how well this API generalizes across data skipping and parquet stats :)

@scovich scovich requested a review from nicklan November 15, 2024 21:32
@scovich scovich requested a review from hntd187 November 15, 2024 21:32
Collaborator

@nicklan nicklan left a comment

lgtm! Thanks, this is a big improvement. Just had a couple of small things

}
impl DefaultPredicateEvaluator {
    // Convenient thin wrapper
    fn resolve_column(&self, col: &ColumnName) -> Option<Scalar> {
Collaborator

nice, yeah, this is more clear thanks

Comment on lines 101 to 102
/// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
/// variadic expression, in some implementation-defined way (this method).
Collaborator

Suggested change
- /// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
- /// variadic expression, in some implementation-defined way (this method).
+ /// always the same, provided by [`eval_variadic`]). The results are then combined into the
+ /// output type in some implementation-defined way (this method).

Comment on lines +258 to 260
if inverted {
    op = op.invert();
}
Collaborator

could avoid the mut with:

Suggested change
- if inverted {
-     op = op.invert();
- }
+ let op = if inverted {
+     op.invert()
+ } else {
+     op
+ };

Collaborator Author

Is mut somehow bad? I was using mut there specifically to avoid the else...

Collaborator

nope, it's fine. it's just a little less "functional" :)

@scovich scovich merged commit a8ed99f into delta-io:main Nov 16, 2024
23 checks passed
@zachschuermann zachschuermann removed the breaking-change Change that will require a version bump label Nov 26, 2024
4 participants