Expression::Column references a ColumnName object instead of a String #400

scovich · 2024-10-15T23:14:19Z

Kernel only partly supports nested column names today. One of the biggest barriers to closing the gap is that Expression tracks column names as simple strings, so there is no reliable way to deal with nesting.

This PR takes a first step of defining a proper ColumnName struct for hosting column names. So far, it behaves the same as before -- internally it's just a simple string, occasionally interpreted as nested by splitting at periods. Changing the code to use this new construct is noisy, so we make the changes as a pre-factor. A follow-up PR will actually add nesting support.

Resolves #422

codecov · 2024-10-15T23:17:27Z

Codecov Report

Attention: Patch coverage is 93.12169% with 13 lines in your changes missing coverage. Please review.

Project coverage is 78.94%. Comparing base (b09078d) to head (28ac979).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
derive-macros/src/lib.rs	58.33%	4 Missing and 1 partial ⚠️
kernel/src/expressions/column_names.rs	96.62%	3 Missing ⚠️
kernel/src/expressions/mod.rs	75.00%	3 Missing ⚠️
ffi/src/expressions.rs	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
+ Coverage   78.82%   78.94%   +0.11%     
==========================================
  Files          51       52       +1     
  Lines       10655    10756     +101     
  Branches    10655    10756     +101     
==========================================
+ Hits         8399     8491      +92     
- Misses       1788     1796       +8     
- Partials      468      469       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

OussamaSaoudi-db · 2024-10-18T22:00:50Z

kernel/src/expressions/mod.rs

+
+    /// Creates a new column reference, for a splittable nested column name.
+    /// See [`ColumnName::split`] for details.
+    pub fn split_column(name: impl AsRef<str>) -> Self {


Seems that I need to know a priori whether a column name is nested or not when calling Expression::simple_column vs Expression::split_column. Do we want devs to have to keep track of that? Why not have a column(name: impl AsRef<str>), and the function can figure out if it's split or simple?

Great questions! Here's what I've been thinking but had not written down previously:

It's common for column and field names to be "simple" identifiers satisfying the regexp [_a-zA-Z][_a-zA-Z0-9]*. That corresponds to a simple_column (e.g. 'my_column').

When specifying nested column paths, most systems use . to separate field names, which corresponds to a split_column (e.g. "a.b.c").

But field names are actually allowed to contain any characters they want -- including spaces, periods, and wacky unicode characters, and such columns are not rare. Especially if a language other than English is used, and characters from other languages are needed.

So now we have ambiguity -- does the string "a.b.c.d" correspond to a simple nested column name with three parts? Or a single column whose name happens to contain periods? What about "a.b c.d" -- most parsers would reject that as a syntax error. What about "a.b, c.d" -- in the wrong place, most parsers would interpret that as two column names.

So, systems need a way to unambiguously identify the fields in a string that represents a nested column name. Microsoft ecosystem uses square brackets (e.g. a.[b.c].d or a.[b, c].d), MySQL uses backticks (e.g. a.`b.c`.d or a.`b, c`.d), and SQL officially uses double-quote characters (e.g. a."b.c".d or a."b, c".d). All have an escape character mechanism to allow parsing names that actually contain the field name delimiter.

Assuming the column name has been successfully parsed, it should be represented as a Vec<String> or similar, in order to eliminate any possible confusion from then on.

The next PR after this one will add a proper vec representation, but we're hoping to sidestep the whole delimiter and parsing situation in kernel because different engines do it so differently (and parsing is messy business in general).

However -- for internal kernel use when referencing Delta metadata fields, we know the column names are all simple identifiers (and usually string literals known at compile time as well). For convenience in such cases, this PR adds the simple_column and split_column constructors. By using them, the caller asserts that there are no fancy characters involved, thus avoiding the parsing problem without the ugliness and frustration of passing vectors of literals all over the place (e.g. Expression::Column(ColumnName::nested(vec!["add", "deletionVector", "storageType"])) becomes just Expression::split_column("add.deletionVector.storageType")).

The next PR will introduce ColumnName::nested that takes a (pre-parsed) list of field names. Once that exists, a call to ColumnName::simple(name) will be sugar for ColumnName::nested(vec![name]) -- and allows arbitrary characters because it no longer requires parsing of any kind. Similarly, ColumnName::split(name) will be sugar for ColumnName::nested(name.split('.')). In any case where kernel doesn't KNOW the structure of the column name, the caller (usually engine) is required to use ColumnName::nested to avoid ambiguity.

Naming is hard tho -- we want something short and intuitive for the convenience helpers. I'm not sure simple and split are quite the right names for the job.

I'm also debating whether the two convenience constructors should actually be macros, so we can enforce at compile time that the input is a string literal containing only unremarkable characters. That would prevent anyone from misusing them in cases where we don't actually know the column name's structure.

I went ahead and defined macros that give provable compile-time safety

awesome i really like the macros :)

at risk of a dumb question: could we just have a simple_column be something like: either a nested column split by periods (and no periods in field names) like a.b.c or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col]) and simple_column!("not_nested") -> ColumnName::new([not_nested]) or something like that?

and, come to think of it, regarding naming: I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

Woah that was an excellent overview of the problems of parsing and how this addresses them. Thank you!

Makes sense that we're avoiding parsing in the kernel. We're at a low enough level that any names we're passed are already parsed.

I'll take a look at the macros this review. Excited to see what you cooked up 👀

I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

That's a great question. I went back and forth quite a bit and settled on the current form because it requires developer to state intent and allows the compiler to easily validate intent (e.g. simple_column! blows up if it sees any dots. I even debated having nested_column! blow up unless it sees at least one dot, but that seemed overkill.

That said, I'm very open to discussion if we think the current split is also overkill. Maybe we're already safe enough from the combination of string literals and the fact that Delta spec column names never contain dots.

Thinking more, I agree that two sets of macros is overkill. Simplified to just one concept: column_name! for a ColumnName and column_expr! for a Expression::Column.

cool, SGTM! thanks!

OussamaSaoudi-db · 2024-10-18T22:04:34Z

ffi/src/expressions.rs

@@ -146,7 +146,8 @@ fn visit_expression_column_impl(
    state: &mut KernelExpressionVisitorState,
    name: DeltaResult<String>,
 ) -> DeltaResult<usize> {
-    Ok(wrap_expression(state, Expression::Column(name?)))
+    // TODO: We can't actually assume the column name is so easily splittable!
+    Ok(wrap_expression(state, Expression::split_column(name?)))


Out of curiosity, what are cases when a column name isn't splittable?

See other comment... we can't safely split an arbitrary column name like "a.b.c.d" because one of the field names might be b.c (probably not, but no way to rule it out).

nicklan

Nice, this is a cool way of handling it. One suggestion for join.

nicklan · 2024-10-23T00:11:49Z

kernel/src/expressions/column_name.rs

+    pub fn join(left: ColumnName, right: ColumnName) -> ColumnName {
+        [left, right].into_iter().collect()
+    }


I think either this should take &self and then work as if self was left and join to the arg (similar to Url's join), or it should live outside the impl.

I'd prefer:

Suggested change

pub fn join(left: ColumnName, right: ColumnName) -> ColumnName {

[left, right].into_iter().collect()

}

pub fn join(&self, right: ColumnName) -> ColumnName {

[self.clone(), right].into_iter().collect()

}

How could it live outside the impl without causing massive namespace collisions? join is a pretty generic name and rust doesn't have function overloading...

I was avoiding the &self join route because it potentially means copying a bunch of strings for deeply nested paths. But (a) that's probably optimizing rather than merely not-pessimizing; and (b) the left side of the join is usually a single field name and we're anyway consuming the right arg.

We could also define a join_with method that takes self. Maybe I'll just do that.

Note: std::path::Path::join actually takes both left and right sides by reference. Let's go simpler-is-better here.

kernel/src/expressions/column_name.rs

kernel/src/scan/data_skipping.rs

zachschuermann

looks great just had two main things below (1) if we can unify the nested/simple APIs and (2) how we could possibly test the compilation failure

ffi/src/expressions.rs

zachschuermann · 2024-10-23T16:34:30Z

kernel/src/expressions/mod.rs

+
+    /// Creates a new column reference, for a splittable nested column name.
+    /// See [`ColumnName::split`] for details.
+    pub fn split_column(name: impl AsRef<str>) -> Self {


awesome i really like the macros :)

zachschuermann · 2024-10-23T16:40:26Z

kernel/src/expressions/mod.rs

+
+    /// Creates a new column reference, for a splittable nested column name.
+    /// See [`ColumnName::split`] for details.
+    pub fn split_column(name: impl AsRef<str>) -> Self {


at risk of a dumb question: could we just have a simple_column be something like: either a nested column split by periods (and no periods in field names) like a.b.c or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col]) and simple_column!("not_nested") -> ColumnName::new([not_nested]) or something like that?

zachschuermann · 2024-10-23T16:42:34Z

kernel/src/expressions/mod.rs

+
+    /// Creates a new column reference, for a splittable nested column name.
+    /// See [`ColumnName::split`] for details.
+    pub fn split_column(name: impl AsRef<str>) -> Self {


and, come to think of it, regarding naming: I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

kernel/src/expressions/column_name.rs

scovich · 2024-10-23T17:06:40Z

how we could possibly test the compilation failure

This PR includes doc tests for that!
https://github.com/delta-incubator/delta-kernel-rs/pull/400/files#diff-3630fe5d8f28ded079faa2603c1d1d2e152516527bbe5c58fba77fbba4f93b89R114-R123
and
https://github.com/delta-incubator/delta-kernel-rs/pull/400/files#diff-3630fe5d8f28ded079faa2603c1d1d2e152516527bbe5c58fba77fbba4f93b89R147-R151

kernel/src/expressions/column_name.rs

nicklan

nice. lgtm!

zachschuermann · 2024-10-23T19:41:20Z

This PR includes doc tests for that!

Awesome, sorry I missed it!

scovich added 4 commits October 15, 2024 14:28

[Chore] Use existing helpers to simplify expression handling code

47d7201

add into-expr support for binary operators and use it

6960544

Merge remote-tracking branch 'oss/main' into cleanup-expression-creation

a1d8f81

Expression::Column references a ColumnName object instead of a String

1687d55

scovich requested review from nicklan, hntd187 and zachschuermann October 15, 2024 23:14

github-actions bot added the breaking-change Change that will require a version bump label Oct 15, 2024

Merge remote-tracking branch 'oss/main' into column-name-expr

5e1fb6d

OussamaSaoudi-db reviewed Oct 18, 2024

View reviewed changes

scovich added 2 commits October 21, 2024 14:40

use proc macro for safety

491d946

Merge remote-tracking branch 'oss/main' into column-name-expr

de963f7

scovich requested a review from OussamaSaoudi-db October 21, 2024 21:56

nicklan requested changes Oct 23, 2024

View reviewed changes

scovich added 2 commits October 23, 2024 08:32

join takes references now

5629e18

Merge remote-tracking branch 'oss/main' into column-name-expr

be7c33b

scovich requested a review from nicklan October 23, 2024 15:35

zachschuermann approved these changes Oct 23, 2024

View reviewed changes

Merge remote-tracking branch 'oss/main' into column-name-expr

8a43b45

OussamaSaoudi-db reviewed Oct 23, 2024

View reviewed changes

kernel/src/expressions/column_name.rs Outdated Show resolved Hide resolved

scovich added 2 commits October 23, 2024 10:44

merge simple and nested column name concept into just column name

74f5f7f

Merge remote-tracking branch 'oss/main' into column-name-expr

28ac979

nicklan approved these changes Oct 23, 2024

View reviewed changes

scovich merged commit e48d238 into delta-io:main Oct 23, 2024
19 checks passed

scovich deleted the column-name-expr branch November 8, 2024 21:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expression::Column references a ColumnName object instead of a String #400

Expression::Column references a ColumnName object instead of a String #400

scovich commented Oct 15, 2024 •

edited

Loading

codecov bot commented Oct 15, 2024 •

edited

Loading

OussamaSaoudi-db Oct 18, 2024

scovich Oct 19, 2024 •

edited

Loading

scovich Oct 21, 2024

zachschuermann Oct 23, 2024

zachschuermann Oct 23, 2024

zachschuermann Oct 23, 2024

OussamaSaoudi-db Oct 23, 2024

scovich Oct 23, 2024 •

edited

Loading

scovich Oct 23, 2024

zachschuermann Oct 23, 2024

OussamaSaoudi-db Oct 18, 2024

scovich Oct 19, 2024

nicklan left a comment

nicklan Oct 23, 2024

scovich Oct 23, 2024

scovich Oct 23, 2024

scovich Oct 23, 2024

zachschuermann left a comment

zachschuermann Oct 23, 2024

zachschuermann Oct 23, 2024

zachschuermann Oct 23, 2024

scovich commented Oct 23, 2024

nicklan left a comment

zachschuermann commented Oct 23, 2024

Expression::Column references a ColumnName object instead of a String #400

Expression::Column references a ColumnName object instead of a String #400

Conversation

scovich commented Oct 15, 2024 • edited Loading

codecov bot commented Oct 15, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

scovich Oct 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scovich Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicklan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zachschuermann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

scovich commented Oct 23, 2024

nicklan left a comment

Choose a reason for hiding this comment

zachschuermann commented Oct 23, 2024

scovich commented Oct 15, 2024 •

edited

Loading

codecov bot commented Oct 15, 2024 •

edited

Loading

scovich Oct 19, 2024 •

edited

Loading

scovich Oct 23, 2024 •

edited

Loading