Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expression::Column references a ColumnName object instead of a String #400

Merged
merged 12 commits into from
Oct 23, 2024

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Oct 15, 2024

Kernel only partly supports nested column names today. One of the biggest barriers to closing the gap is that Expression tracks column names as simple strings, so there is no reliable way to deal with nesting.

This PR takes a first step of defining a proper ColumnName struct for hosting column names. So far, it behaves the same as before -- internally it's just a simple string, occasionally interpreted as nested by splitting at periods. Changing the code to use this new construct is noisy, so we make the changes as a pre-factor. A follow-up PR will actually add nesting support.

Resolves #422

Copy link

codecov bot commented Oct 15, 2024

Codecov Report

Attention: Patch coverage is 93.12169% with 13 lines in your changes missing coverage. Please review.

Project coverage is 78.94%. Comparing base (b09078d) to head (28ac979).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
derive-macros/src/lib.rs 58.33% 4 Missing and 1 partial ⚠️
kernel/src/expressions/column_names.rs 96.62% 3 Missing ⚠️
kernel/src/expressions/mod.rs 75.00% 3 Missing ⚠️
ffi/src/expressions.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
+ Coverage   78.82%   78.94%   +0.11%     
==========================================
  Files          51       52       +1     
  Lines       10655    10756     +101     
  Branches    10655    10756     +101     
==========================================
+ Hits         8399     8491      +92     
- Misses       1788     1796       +8     
- Partials      468      469       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 15, 2024

/// Creates a new column reference, for a splittable nested column name.
/// See [`ColumnName::split`] for details.
pub fn split_column(name: impl AsRef<str>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that I need to know a priori whether a column name is nested or not when calling Expression::simple_column vs Expression::split_column. Do we want devs to have to keep track of that? Why not have a column(name: impl AsRef<str>), and the function can figure out if it's split or simple?

Copy link
Collaborator Author

@scovich scovich Oct 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions! Here's what I've been thinking but had not written down previously:

It's common for column and field names to be "simple" identifiers satisfying the regexp [_a-zA-Z][_a-zA-Z0-9]*. That corresponds to a simple_column (e.g. 'my_column').

When specifying nested column paths, most systems use . to separate field names, which corresponds to a split_column (e.g. "a.b.c").

But field names are actually allowed to contain any characters they want -- including spaces, periods, and wacky unicode characters, and such columns are not rare. Especially if a language other than English is used, and characters from other languages are needed.

So now we have ambiguity -- does the string "a.b.c.d" correspond to a simple nested column name with three parts? Or a single column whose name happens to contain periods? What about "a.b c.d" -- most parsers would reject that as a syntax error. What about "a.b, c.d" -- in the wrong place, most parsers would interpret that as two column names.

So, systems need a way to unambiguously identify the fields in a string that represents a nested column name. Microsoft ecosystem uses square brackets (e.g. a.[b.c].d or a.[b, c].d), MySQL uses backticks (e.g. a.`b.c`.d or a.`b, c`.d), and SQL officially uses double-quote characters (e.g. a."b.c".d or a."b, c".d). All have an escape character mechanism to allow parsing names that actually contain the field name delimiter.

Assuming the column name has been successfully parsed, it should be represented as a Vec<String> or similar, in order to eliminate any possible confusion from then on.

The next PR after this one will add a proper vec representation, but we're hoping to sidestep the whole delimiter and parsing situation in kernel because different engines do it so differently (and parsing is messy business in general).

However -- for internal kernel use when referencing Delta metadata fields, we know the column names are all simple identifiers (and usually string literals known at compile time as well). For convenience in such cases, this PR adds the simple_column and split_column constructors. By using them, the caller asserts that there are no fancy characters involved, thus avoiding the parsing problem without the ugliness and frustration of passing vectors of literals all over the place (e.g. Expression::Column(ColumnName::nested(vec!["add", "deletionVector", "storageType"])) becomes just Expression::split_column("add.deletionVector.storageType")).

The next PR will introduce ColumnName::nested that takes a (pre-parsed) list of field names. Once that exists, a call to ColumnName::simple(name) will be sugar for ColumnName::nested(vec![name]) -- and allows arbitrary characters because it no longer requires parsing of any kind. Similarly, ColumnName::split(name) will be sugar for ColumnName::nested(name.split('.')). In any case where kernel doesn't KNOW the structure of the column name, the caller (usually engine) is required to use ColumnName::nested to avoid ambiguity.

Naming is hard tho -- we want something short and intuitive for the convenience helpers. I'm not sure simple and split are quite the right names for the job.

I'm also debating whether the two convenience constructors should actually be macros, so we can enforce at compile time that the input is a string literal containing only unremarkable characters. That would prevent anyone from misusing them in cases where we don't actually know the column name's structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and defined macros that give provable compile-time safety

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome i really like the macros :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at risk of a dumb question: could we just have a simple_column be something like: either a nested column split by periods (and no periods in field names) like a.b.c or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col]) and simple_column!("not_nested") -> ColumnName::new([not_nested]) or something like that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, come to think of it, regarding naming: I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah that was an excellent overview of the problems of parsing and how this addresses them. Thank you!

Makes sense that we're avoiding parsing in the kernel. We're at a low enough level that any names we're passed are already parsed.

I'll take a look at the macros this review. Excited to see what you cooked up 👀

Copy link
Collaborator Author

@scovich scovich Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

That's a great question. I went back and forth quite a bit and settled on the current form because it requires developer to state intent and allows the compiler to easily validate intent (e.g. simple_column! blows up if it sees any dots. I even debated having nested_column! blow up unless it sees at least one dot, but that seemed overkill.

That said, I'm very open to discussion if we think the current split is also overkill. Maybe we're already safe enough from the combination of string literals and the fact that Delta spec column names never contain dots.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more, I agree that two sets of macros is overkill. Simplified to just one concept: column_name! for a ColumnName and column_expr! for a Expression::Column.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, SGTM! thanks!

@@ -146,7 +146,8 @@ fn visit_expression_column_impl(
state: &mut KernelExpressionVisitorState,
name: DeltaResult<String>,
) -> DeltaResult<usize> {
Ok(wrap_expression(state, Expression::Column(name?)))
// TODO: We can't actually assume the column name is so easily splittable!
Ok(wrap_expression(state, Expression::split_column(name?)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what are cases when a column name isn't splittable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment... we can't safely split an arbitrary column name like "a.b.c.d" because one of the field names might be b.c (probably not, but no way to rule it out).

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is a cool way of handling it. One suggestion for join.

Comment on lines 22 to 24
pub fn join(left: ColumnName, right: ColumnName) -> ColumnName {
[left, right].into_iter().collect()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either this should take &self and then work as if self was left and join to the arg (similar to Url's join), or it should live outside the impl.

I'd prefer:

Suggested change
pub fn join(left: ColumnName, right: ColumnName) -> ColumnName {
[left, right].into_iter().collect()
}
pub fn join(&self, right: ColumnName) -> ColumnName {
[self.clone(), right].into_iter().collect()
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could it live outside the impl without causing massive namespace collisions? join is a pretty generic name and rust doesn't have function overloading...

I was avoiding the &self join route because it potentially means copying a bunch of strings for deeply nested paths. But (a) that's probably optimizing rather than merely not-pessimizing; and (b) the left side of the join is usually a single field name and we're anyway consuming the right arg.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also define a join_with method that takes self. Maybe I'll just do that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: std::path::Path::join actually takes both left and right sides by reference. Let's go simpler-is-better here.

kernel/src/expressions/column_name.rs Outdated Show resolved Hide resolved
kernel/src/scan/data_skipping.rs Outdated Show resolved Hide resolved
@scovich scovich requested a review from nicklan October 23, 2024 15:35
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great just had two main things below (1) if we can unify the nested/simple APIs and (2) how we could possibly test the compilation failure

ffi/src/expressions.rs Show resolved Hide resolved

/// Creates a new column reference, for a splittable nested column name.
/// See [`ColumnName::split`] for details.
pub fn split_column(name: impl AsRef<str>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome i really like the macros :)


/// Creates a new column reference, for a splittable nested column name.
/// See [`ColumnName::split`] for details.
pub fn split_column(name: impl AsRef<str>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at risk of a dumb question: could we just have a simple_column be something like: either a nested column split by periods (and no periods in field names) like a.b.c or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col]) and simple_column!("not_nested") -> ColumnName::new([not_nested]) or something like that?


/// Creates a new column reference, for a splittable nested column name.
/// See [`ColumnName::split`] for details.
pub fn split_column(name: impl AsRef<str>) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, come to think of it, regarding naming: I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc

kernel/src/expressions/column_name.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. lgtm!

@zachschuermann
Copy link
Collaborator

This PR includes doc tests for that!

Awesome, sorry I missed it!

@scovich scovich merged commit e48d238 into delta-io:main Oct 23, 2024
19 checks passed
@scovich scovich deleted the column-name-expr branch November 8, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make column names a first-class concept (not just strings)
4 participants