Harmonized predicate eval #420

scovich · 2024-10-23T04:07:34Z

Today, we have two completely independent data skipping predicate mechanisms:

- Delta stats -- takes an expression as input and produces a rewritten expression as output. Difficult to test because you have to create and query a Delta table in order to see what data skipping resulted.
- Parquet footer stats -- takes an expression as input and produces an optional boolean as output. Tests can easily hook into it, and we have very thorough test coverage.

Besides the duplication, there is also the problem of under-tested Delta stats code, and ~~at least one~~ several lurking bugs (**). The solution is to define a common predicate evaluation framework that can express not just Delta stats expression rewriting and direct evaluation over parquet footer stats, but also can evaluate any predicate over scalar data, given a way to resolve column names into Scalar values (the DefaultPredicateEvaluator trait). The default predicate evaluator allows for much easier testing of Delta data skipping predicates. All while reusing significant code to further reduce the chances of divergence and lurking bugs.

(**) Bugs found (and fixed) so far:

- NotEqual implementation was unsound, due to swapping < with > (could wrongly skip files).
- IS [NOT] NULL was flat out broken, trying to do some black magic involving tightBounds. The correct solution is vastly simpler.
- NULL handling in AND and OR clauses was too conservative, preventing files from being skipped in several cases.
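
To illustrate why a `<` / `>` mix-up in NotEqual is unsound, here is a minimal sketch of min/max-based skipping for `col != value`. It is not the kernel's code: `FileStats` and `may_contain_not_equal` are hypothetical names, and the stats are assumed to be exact integer bounds. The general reasoning is that a file can be skipped only when every row equals the value, i.e. when `min == max == value`.

```rust
/// Hypothetical per-file statistics for a single integer column.
struct FileStats {
    min: i64,
    max: i64,
}

/// Returns true if the file might contain a row with `col != value`,
/// meaning the file must be kept rather than skipped.
fn may_contain_not_equal(stats: &FileStats, value: i64) -> bool {
    // Some row differs from `value` unless min == max == value.
    stats.min < value || stats.max > value
}
```

With the comparisons swapped (`min > value || max < value`), a file with bounds `[1, 5]` and `value = 3` would be reported as skippable even though it may hold rows other than 3.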

We considered hoisting eval_sql_where from the parquet skipping code up to the main predicate evaluator. Decided not to for now. If we think it's generally useful to have, and worth the trouble to implement for the other two expression evaluators, we can always do it later.

codecov · 2024-10-23T04:13:08Z

Codecov Report

Attention: Patch coverage is 88.54489% with 148 lines in your changes missing coverage. Please review.

Project coverage is 80.22%. Comparing base (3e7ad45) to head (b937f0f).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/predicates/tests.rs	84.81%	63 Missing ⚠️
kernel/src/engine/parquet_stats_skipping/tests.rs	80.60%	30 Missing and 2 partials ⚠️
kernel/src/predicates/mod.rs	90.32%	19 Missing and 11 partials ⚠️
kernel/src/scan/data_skipping.rs	81.42%	11 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping/tests.rs	96.51%	6 Missing and 1 partial ⚠️
kernel/src/engine/arrow_expression.rs	0.00%	1 Missing ⚠️
kernel/src/engine/parquet_stats_skipping.rs	98.30%	0 Missing and 1 partial ⚠️
kernel/src/expressions/mod.rs	92.85%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   80.20%   80.22%   +0.02%     
==========================================
  Files          58       61       +3     
  Lines       12994    13327     +333     
  Branches    12994    13327     +333     
==========================================
+ Hits        10422    10692     +270     
- Misses       2033     2090      +57     
- Partials      539      545       +6

scovich · 2024-11-07T03:34:36Z

kernel/tests/read.rs

+        (NotEqual, 1, vec![&batch2, &batch1]),
+        (NotEqual, 3, vec![&batch2, &batch1]),
        (NotEqual, 4, vec![&batch2, &batch1]),
-        (NotEqual, 5, vec![&batch1]),
-        (NotEqual, 7, vec![&batch1]),
+        (NotEqual, 5, vec![&batch2, &batch1]),
+        (NotEqual, 7, vec![&batch2, &batch1]),


scovich · 2024-11-07T03:35:09Z

kernel/tests/read.rs

        (
            Expression::literal(3i64),
            table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
        ),
        (
            column_expr!("number").distinct(3i64),
-            table_for_numbers(vec![1, 2, 3, 4, 5, 6]),
+            table_for_numbers(vec![1, 2, 4, 5, 6]),


scovich · 2024-11-07T04:01:54Z

kernel/src/engine/parquet_row_group_skipping/tests.rs

The only changes in this file are a couple new uses of column_name! where needed, and a LOT of noise due to renaming get_XXX_stat_value as get_XXX_stat (since the method may return an expression for some trait impl).

nicklan

wow, that's a lot of stuff :)

I've taken a pass and had a few comments. Will go over again now that I have more of the shape of it in my head.

zachschuermann

LGTM!!!! this was a marathon but i think now i both grok the changes and believe them to be correct. left a handful of questions/comments but nothing major

nit: I think in PR comment DefaultPredicateEvaluator trait is supposed to say ResolveColumnAsScalar trait?

zachschuermann · 2024-11-14T22:09:09Z

kernel/src/predicates/mod.rs

+/// Literal NULL values almost always produce cascading changes in the predicate's structure, so we
+/// represent them by `Option::None` rather than `Scalar::Null`. This allows e.g. `A < NULL` to be
+/// rewritten as `NULL`, or `AND(NULL, FALSE)` to be rewritten as `FALSE`.


so we are really saying that AND(NULL, False) is represented as AND(None, False) which is simplified to False. I guess there is some nuance with what is said below since if that NULL were actually a null scalar then it would evaluate/simplify to NULL?

This feels a bit weird to have a distinction between None and Scalar::Null - i certainly see the utility but I find this semantic to be confusing. Unfortunately I don't have any better idea but just flagging as an area for future investigation/improvement?

We can't generically use Scalar::Null because Output isn't always an Expression -- the parquet skipping implementation uses Output = bool. In theory, the code shouldn't ever produce Scalar::Null but I should make another pass to be sure of that.

Even if something does produce Scalar::Null tho, it should only risk giving up some of those "structural" plan simplifications. It would still evaluate to a correct result. For example, eval_scalar returns NULL for anything except Scalar::Boolean, and partial_cmp_scalars falls through to impl PartialOrd for Scalar, which has a specific case for Scalar::Null.
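
For readers following the `None`-as-NULL discussion, a minimal sketch of three-valued AND/OR over `Option<bool>` may help. The function names `and3` and `or3` are made up for this illustration and are not taken from the PR; the logic is plain SQL-style NULL handling, where a dominant value (FALSE for AND, TRUE for OR) wins even when a NULL input is present.

```rust
/// Three-valued AND, where `None` plays the role of SQL NULL.
fn and3(inputs: &[Option<bool>]) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(false) => return Some(false), // FALSE dominates AND
            Some(true) => {}
            None => saw_null = true,
        }
    }
    // No FALSE was seen: the result is NULL if any input was NULL, else TRUE.
    if saw_null { None } else { Some(true) }
}

/// Three-valued OR, the mirror image of `and3`.
fn or3(inputs: &[Option<bool>]) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(true) => return Some(true), // TRUE dominates OR
            Some(false) => {}
            None => saw_null = true,
        }
    }
    if saw_null { None } else { Some(false) }
}
```

Under these rules `AND(NULL, FALSE)` is `FALSE`, while `AND(NULL, TRUE)` stays `NULL`. Treating every NULL input as "unknown result" would be correct but overly conservative.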

zachschuermann · 2024-11-14T22:15:53Z

kernel/src/predicates/mod.rs

+    /// A less-than-or-equal comparison, e.g. `<col> <= <value>`
+    ///
+    /// NOTE: Caller is responsible to commute and/or invert the operation if needed,
+    /// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.


Suggested change

/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.

/// e.g. `NOT(<value> <= <col>)` becomes `<col> < <value>`.

also to me at least, it feels more intuitive to keep the ordering of the arguments and just flip the sign
(this is in many doc comments above/below)

Suggested change

/// e.g. `NOT(<value> <= <col>` becomes `<col> > <value>`.

/// e.g. `NOT(<value> <= <col>)` becomes `<value> > <col>`.

Commuting (reordering) is needed because the eval logic that follows is based on col being on the left and value being on the right. Otherwise there are twice as many cases to implement.

ahhh got it :) thanks!

disregard then! (though i think the top one still applies)

Nice catch on the > vs. < btw!
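
The commute-then-invert bookkeeping discussed here can be sketched as follows. `CmpOp` and `normalize_value_op_col` are hypothetical names for this illustration, not the kernel's types, and only the four inequality operators are covered.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum CmpOp {
    Lt,
    Le,
    Gt,
    Ge,
}

impl CmpOp {
    /// Swaps the operands: `a op b` is equivalent to `b op.commute() a`.
    fn commute(self) -> Self {
        match self {
            CmpOp::Lt => CmpOp::Gt,
            CmpOp::Le => CmpOp::Ge,
            CmpOp::Gt => CmpOp::Lt,
            CmpOp::Ge => CmpOp::Le,
        }
    }

    /// Negates the comparison: `NOT(a op b)` is equivalent to `a op.invert() b`.
    fn invert(self) -> Self {
        match self {
            CmpOp::Lt => CmpOp::Ge,
            CmpOp::Le => CmpOp::Gt,
            CmpOp::Gt => CmpOp::Le,
            CmpOp::Ge => CmpOp::Lt,
        }
    }
}

/// Rewrites `[NOT] (<value> op <col>)` into the operator for `<col> op' <value>`.
fn normalize_value_op_col(op: CmpOp, inverted: bool) -> CmpOp {
    let op = if inverted { op.invert() } else { op };
    op.commute()
}
```

For example, `NOT(<value> <= <col>)` inverts to `<value> > <col>` and then commutes to `<col> < <value>`, which matches the corrected doc comment.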

zachschuermann · 2024-11-14T22:16:03Z

kernel/src/predicates/mod.rs

+    /// A less-than comparison, e.g. `<col> < <value>`.
+    ///
+    /// NOTE: Caller is responsible to commute and/or invert the operation if needed,
+    /// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.


(and lots of times below)

Suggested change

/// e.g. `NOT(<value> < <col>` becomes `<col> <= <value>`.

/// e.g. `NOT(<value> < <col>)` becomes `<col> <= <value>`.

meta-question: with github's new fancy 'suggestions' is it easier if I go ahead and fix these all myself with suggestions in github UI? or is that too noisy and just doing once with the (lots of times below) message better?

I usually find it easier since I just commit the suggestion here and then rebase it back locally for anything else.

FWIW I almost never commit changes from github side, but suggestions don't bother me in the slightest. It probably saves you a lot of time to just flag that it's a multiple-instance issue instead of manually fixing it a bunch of times.

zachschuermann · 2024-11-14T22:30:45Z

kernel/src/predicates/mod.rs

+/// A predicate evaluator that directly evaluates the predicate to produce an `Option<bool>`
+/// result. Column resolution is handled by an embedded [`ResolveColumnAsScalar`] instance.
+pub(crate) struct DefaultPredicateEvaluator {
+    resolver: Box<dyn ResolveColumnAsScalar>,


out of curiosity: why trait object instead of generic?

Good question. Let me dig into that a bit.

You mean, why not this instead?

pub(crate) struct DefaultPredicateEvaluator<R: ResolveColumnAsScalar> { resolver: R,

Maybe it's a knee jerk from C++ days, but generic means every method the class defines (including trait methods) has to be monomorphized (= replicated) for every different R we use. That includes ten trait methods we directly implement, plus all the provided methods we "inherit" from PredicateEvaluator. That seemed a bit excessive when ResolveColumnAsScalar only provides a single method.

In rust tho, everything is super-aggressively inlined, so using a Box<dyn _> might cause more bloat than it prevents? Maybe @nicklan or @hntd187 has a good idea here?

I mean, I think it's just the trade-off you've stated. Code size vs. runtime overhead. In general I'd say generics are more idiomatic, because rust loves "zero-cost" abstractions.

ah that's interesting - I actually didn't consider monomorphization a con until you reminded it does increase our code size :)

but yea generally agree with nick given that tradeoff - it seems like generics would be my vote here!
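
A side-by-side sketch of the two options under discussion. The trait `ResolveColumn`, its `resolve` signature, and both evaluator structs are simplified stand-ins invented for this illustration; the real `ResolveColumnAsScalar` signature is not shown in this thread.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a column resolver (assumed shape, not the real trait).
trait ResolveColumn {
    fn resolve(&self, name: &str) -> Option<i64>;
}

impl ResolveColumn for HashMap<String, i64> {
    fn resolve(&self, name: &str) -> Option<i64> {
        self.get(name).copied()
    }
}

/// Trait-object flavor: one compiled copy of each method, dynamic dispatch on `resolve`.
struct DynEvaluator {
    resolver: Box<dyn ResolveColumn>,
}

impl DynEvaluator {
    fn less_than(&self, col: &str, value: i64) -> Option<bool> {
        self.resolver.resolve(col).map(|v| v < value)
    }
}

/// Generic flavor: methods are monomorphized per resolver type `R`, static dispatch.
struct GenericEvaluator<R: ResolveColumn> {
    resolver: R,
}

impl<R: ResolveColumn> GenericEvaluator<R> {
    fn less_than(&self, col: &str, value: i64) -> Option<bool> {
        self.resolver.resolve(col).map(|v| v < value)
    }
}

/// Small fixture used to exercise both flavors.
fn sample_columns() -> HashMap<String, i64> {
    HashMap::from([("x".to_string(), 5)])
}
```

Both flavors behave identically from the caller's side; the difference is only where the cost lands (a vtable call versus one copy of the code per resolver type).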

zachschuermann · 2024-11-14T23:15:07Z

kernel/src/engine/parquet_stats_skipping/tests.rs

+    let null = &Scalar::Null(DataType::INTEGER);
+
+    let expressions = [
+        Expr::lt(col.clone(), ten.clone()),


i guess doing this to share ref to the same scalar? otherwise could we do something like 5.into()?

It was mostly to avoid magic constants... otherwise Expr::lt(column_name!("foo"), 5) would work Just Fine. And maybe that's better?

hm.. for simple numbers and esp in testing I would maybe advocate for the Expr::lt(column_name!("foo"), 5)? not a huge opinion :)

I revisited this, and there are several challenges (tho I can still improve the code over what we have today):

- Some operations do not take impl Into<_> and so passing a primitive literal like 1 does not compile.
- Some operations need to take Scalar so they can sometimes pass NULL values.
- ColumnName is not const and so can't be made into a constant; and there are format strings relying on the current col that would be "not fun" to change.

if we think this is worth tackling then yea maybe just make a follow-up issue to improve some of these ergonomics?

I already made some improvements, hopefully it's enough for now.
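
As a rough illustration of the ergonomics being weighed, here is a sketch of a constructor that takes `impl Into<Scalar>`. The `Scalar` enum, `LessThan` struct, and `lt` function below are toy stand-ins made up for this example, not the kernel's definitions.

```rust
/// Toy scalar type with just enough variants for the example.
#[derive(Clone, Debug, PartialEq)]
enum Scalar {
    Long(i64),
    Null,
}

impl From<i64> for Scalar {
    fn from(v: i64) -> Self {
        Scalar::Long(v)
    }
}

/// Toy `<col> < <value>` predicate.
#[derive(Debug, PartialEq)]
struct LessThan {
    column: String,
    value: Scalar,
}

/// Accepts either a primitive literal or an explicit `Scalar` (including NULL).
fn lt(column: &str, value: impl Into<Scalar>) -> LessThan {
    LessThan {
        column: column.to_string(),
        value: value.into(),
    }
}
```

With this shape, a test can write `lt("foo", 5i64)` for the common case and still pass `Scalar::Null` where a NULL operand is needed, since every type converts into itself.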

zachschuermann · 2024-11-14T23:26:35Z

kernel/src/scan/data_skipping.rs

+
+    /// Retrieves the minimum value of a column, if it exists and has the requested type.
+    fn get_min_stat(&self, col: &ColumnName, _data_type: &DataType) -> Option<Expr> {
+        Some(joined_column_expr!("minValues", col))


nit: should probably pull out "minValues" etc. as constants?

If we make them constants, we can't use the macro any more and would end up with e.g.

Suggested change

Some(joined_column_expr!("minValues", col))

Some(ColumnName::new([MIN_VALUES_COL_NAME]).join(col))

... which is certainly doable, but also a bit of a mouthful.

Also, there used to be more uses of these magic constants, but this PR leaves only two prod uses of each of those constants: One for its get_xxx_stat method here, and the stats_schema created internally by DataSkippingFilter::new.

Thoughts?

hm fair point. this seems like not worth spending time on so let's leave as-is. next time someone wants to play with macros we could probably expand the column_name! etc. to accept a path in addition to a string literal? but again doesn't really feel worth the time right now :)

How would a path help? The problem is, macros run before const evaluates (indeed, before we even know what type the const evaluates to), so the macro couldn't verify the content of a const string meets the safety conditions.

zachschuermann · 2024-11-14T23:27:17Z

kernel/src/scan/data_skipping.rs

+impl DataSkippingPredicateEvaluator for DataSkippingPredicateCreator {
+    type Output = Expr;
+    type TypedStat = Expr;
+    type IntStat = Expr;


impressed how well this API generalizes across data skipping and parquet stats :)
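
To make the generalization concrete, here is a heavily simplified sketch of one trait with an associated `Output` type serving both direct evaluation and expression rewriting. `LessThanEvaluator`, `DirectEval`, and `RewriteEval` are hypothetical names, and a `String` stands in for a real expression type; the actual trait in this PR has many more methods.

```rust
/// Hypothetical shared interface: each implementation chooses its own output type.
trait LessThanEvaluator {
    type Output;

    /// Evaluates or rewrites the predicate `<col> < <value>`.
    fn eval_lt(&self, col: &str, value: i64) -> Option<Self::Output>;
}

/// Direct evaluation against a known min stat, in the style of footer-stats skipping.
struct DirectEval {
    min: i64,
}

impl LessThanEvaluator for DirectEval {
    type Output = bool;

    fn eval_lt(&self, _col: &str, value: i64) -> Option<bool> {
        // Some row can satisfy `col < value` only if the minimum is below `value`.
        Some(self.min < value)
    }
}

/// Rewrites the predicate into an expression over a stats column.
struct RewriteEval;

impl LessThanEvaluator for RewriteEval {
    type Output = String;

    fn eval_lt(&self, col: &str, value: i64) -> Option<String> {
        Some(format!("minValues.{col} < {value}"))
    }
}
```

The skipping rule (`col < value` needs `min < value`) is written once per operator, and only the representation of the result differs between implementations.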

nicklan

lgtm! Thanks, this is a big improvement. Just had a couple of small things

nicklan · 2024-11-15T21:40:50Z

kernel/src/predicates/mod.rs

+/// A predicate evaluator that directly evaluates the predicate to produce an `Option<bool>`
+/// result. Column resolution is handled by an embedded [`ResolveColumnAsScalar`] instance.
+pub(crate) struct DefaultPredicateEvaluator {
+    resolver: Box<dyn ResolveColumnAsScalar>,


I mean, I think it's just the trade-off you've stated. Code size vs. runtime overhead. In general I'd say generics are more idiomatic, because rust loves "zero-cost" abstractions.

nicklan · 2024-11-15T21:41:31Z

kernel/src/predicates/mod.rs

+}
+impl DefaultPredicateEvaluator {
+    // Convenient thin wrapper
+    fn resolve_column(&self, col: &ColumnName) -> Option<Scalar> {


nice, yeah, this is more clear thanks

nicklan · 2024-11-15T21:58:17Z

kernel/src/predicates/mod.rs

+    /// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
+    /// variadic expression, in some implementation-defined way (this method).


Suggested change

/// always the same, provided by [`eval_variadic`]). The results are then assembled back into a
/// variadic expression, in some implementation-defined way (this method).

/// always the same, provided by [`eval_variadic`]). The results are then combined into the
/// output type in some implementation-defined way (this method).

nicklan · 2024-11-15T22:07:00Z

kernel/src/scan/data_skipping.rs

+        if inverted {
+            op = op.invert();
        }


could avoid the mut with:

Suggested change

if inverted {
    op = op.invert();
}

let op = if inverted {
    op.invert()
} else {
    op
}

Is mut somehow bad? I was using mut there specifically to avoid the else...

nope, it's fine. it's just a little less "functional" :)

scovich added 5 commits October 21, 2024 21:27

simplify and clean up data skipping logic a bit

cfb9cb3

checkpoint - one trait captures data skipping, parquet stats skipping…

264ad5f

…, and generic expression eval

Delete redundant code, Delta data skipping passes tests now

7b24f90

it works now, all tests passing

3ed4526

code comment

da16ba7

github-actions bot added the breaking-change Change that will require a version bump label Oct 23, 2024

scovich mentioned this pull request Oct 23, 2024

Simplify and clean up data skipping logic a bit #415

Closed

add doc comments

5e6dc11

hntd187 reviewed Oct 23, 2024

kernel/src/expressions/mod.rs

hntd187 reviewed Oct 23, 2024

kernel/src/predicates/mod.rs

scovich added 4 commits October 23, 2024 20:15

add default eval tests, fix distinct

1c26d98

more cleanups and doc comments

dadd719

add more tests, fix broken data skipping null checks, AND/OR weirdness

9ec825d

Cleanup and remove redundant parquet stats skipping tests

1ccaa6d

scovich mentioned this pull request Oct 29, 2024

ColumnName tracks a path of field names instead of a simple string #445

Merged

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

107bc5f

scovich commented Nov 7, 2024

scovich marked this pull request as ready for review November 7, 2024 04:02

scovich changed the title ~~[WIP] Harmonized predicate eval~~ Harmonized predicate eval Nov 7, 2024

scovich requested review from nicklan and zachschuermann November 7, 2024 04:03

scovich added 2 commits November 6, 2024 20:33

cleanup

8f5b726

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

b89019e

nicklan reviewed Nov 14, 2024

kernel/src/predicates/mod.rs

zachschuermann approved these changes Nov 14, 2024

scovich added 3 commits November 15, 2024 07:00

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

2451733

address feedback

6da9f15

fmt + missed reviewer feedback

69e1565

scovich requested a review from nicklan November 15, 2024 21:32

scovich requested a review from hntd187 November 15, 2024 21:32

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

1ec3bae

nicklan approved these changes Nov 15, 2024

scovich added 2 commits November 15, 2024 15:40

last feedback

23d794c

Merge remote-tracking branch 'oss/main' into hamonized-predicate-eval

b937f0f

scovich merged commit a8ed99f into delta-io:main Nov 16, 2024
23 checks passed

zachschuermann removed the breaking-change Change that will require a version bump label Nov 26, 2024
