-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expression::Column references a ColumnName object instead of a String #400
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #400 +/- ##
==========================================
+ Coverage 78.82% 78.94% +0.11%
==========================================
Files 51 52 +1
Lines 10655 10756 +101
Branches 10655 10756 +101
==========================================
+ Hits 8399 8491 +92
- Misses 1788 1796 +8
- Partials 468 469 +1 ☔ View full report in Codecov by Sentry. |
kernel/src/expressions/mod.rs
Outdated
|
||
/// Creates a new column reference, for a splittable nested column name. | ||
/// See [`ColumnName::split`] for details. | ||
pub fn split_column(name: impl AsRef<str>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that I need to know a priori whether a column name is nested or not when calling Expression::simple_column
vs Expression::split_column
. Do we want devs to have to keep track of that? Why not have a column(name: impl AsRef<str>)
, and the function can figure out if it's split or simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great questions! Here's what I've been thinking but had not written down previously:
It's common for column and field names to be "simple" identifiers satisfying the regexp [_a-zA-Z][_a-zA-Z0-9]*
. That corresponds to a simple_column
(e.g. 'my_column'
).
When specifying nested column paths, most systems use .
to separate field names, which corresponds to a split_column
(e.g. "a.b.c"
).
But field names are actually allowed to contain any characters they want -- including spaces, periods, and wacky unicode characters, and such columns are not rare. Especially if a language other than English is used, and characters from other languages are needed.
So now we have ambiguity -- does the string "a.b.c.d"
correspond to a simple nested column name with three parts? Or a single column whose name happens to contain periods? What about "a.b c.d"
-- most parsers would reject that as a syntax error. What about "a.b, c.d"
-- in the wrong place, most parsers would interpret that as two column names.
So, systems need a way to unambiguously identify the fields in a string that represents a nested column name. Microsoft ecosystem uses square brackets (e.g. a.[b.c].d
or a.[b, c].d
), MySQL uses backticks (e.g. a.`b.c`.d
or a.`b, c`.d
), and SQL officially uses double-quote characters (e.g. a."b.c".d
or a."b, c".d
). All have an escape character mechanism to allow parsing names that actually contain the field name delimiter.
Assuming the column name has been successfully parsed, it should be represented as a Vec<String>
or similar, in order to eliminate any possible confusion from then on.
The next PR after this one will add a proper vec representation, but we're hoping to sidestep the whole delimiter and parsing situation in kernel because different engines do it so differently (and parsing is messy business in general).
However -- for internal kernel use when referencing Delta metadata fields, we know the column names are all simple identifiers (and usually string literals known at compile time as well). For convenience in such cases, this PR adds the simple_column
and split_column
constructors. By using them, the caller asserts that there are no fancy characters involved, thus avoiding the parsing problem without the ugliness and frustration of passing vectors of literals all over the place (e.g. Expression::Column(ColumnName::nested(vec!["add", "deletionVector", "storageType"]))
becomes just Expression::split_column("add.deletionVector.storageType")
).
The next PR will introduce ColumnName::nested
that takes a (pre-parsed) list of field names. Once that exists, a call to ColumnName::simple(name)
will be sugar for ColumnName::nested(vec![name])
-- and allows arbitrary characters because it no longer requires parsing of any kind. Similarly, ColumnName::split(name)
will be sugar for ColumnName::nested(name.split('.'))
. In any case where kernel doesn't KNOW the structure of the column name, the caller (usually engine) is required to use ColumnName::nested
to avoid ambiguity.
Naming is hard tho -- we want something short and intuitive for the convenience helpers. I'm not sure simple
and split
are quite the right names for the job.
I'm also debating whether the two convenience constructors should actually be macros, so we can enforce at compile time that the input is a string literal containing only unremarkable characters. That would prevent anyone from misusing them in cases where we don't actually know the column name's structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and defined macros that give provable compile-time safety
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome i really like the macros :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at risk of a dumb question: could we just have a simple_column
be something like: either a nested column split by periods (and no periods in field names) like a.b.c
or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col])
and simple_column!("not_nested") -> ColumnName::new([not_nested])
or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and, come to think of it, regarding naming: I wonder if just something like column_name!()
macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.
-delimited' column name without special chars, etc, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah that was an excellent overview of the problems of parsing and how this addresses them. Thank you!
Makes sense that we're avoiding parsing in the kernel. We're at a low enough level that any names we're passed are already parsed.
I'll take a look at the macros this review. Excited to see what you cooked up 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if just something like column_name!() macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.-delimited' column name without special chars, etc, etc
That's a great question. I went back and forth quite a bit and settled on the current form because it requires developer to state intent and allows the compiler to easily validate intent (e.g. simple_column!
blows up if it sees any dots. I even debated having nested_column!
blow up unless it sees at least one dot, but that seemed overkill.
That said, I'm very open to discussion if we think the current split is also overkill. Maybe we're already safe enough from the combination of string literals and the fact that Delta spec column names never contain dots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more, I agree that two sets of macros is overkill. Simplified to just one concept: column_name!
for a ColumnName
and column_expr!
for a Expression::Column
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, SGTM! thanks!
ffi/src/expressions.rs
Outdated
@@ -146,7 +146,8 @@ fn visit_expression_column_impl( | |||
state: &mut KernelExpressionVisitorState, | |||
name: DeltaResult<String>, | |||
) -> DeltaResult<usize> { | |||
Ok(wrap_expression(state, Expression::Column(name?))) | |||
// TODO: We can't actually assume the column name is so easily splittable! | |||
Ok(wrap_expression(state, Expression::split_column(name?))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, what are cases when a column name isn't splittable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment... we can't safely split an arbitrary column name like "a.b.c.d"
because one of the field names might be b.c
(probably not, but no way to rule it out).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is a cool way of handling it. One suggestion for join
.
pub fn join(left: ColumnName, right: ColumnName) -> ColumnName { | ||
[left, right].into_iter().collect() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either this should take &self
and then work as if self was left
and join to the arg (similar to Url
's join), or it should live outside the impl
.
I'd prefer:
pub fn join(left: ColumnName, right: ColumnName) -> ColumnName { | |
[left, right].into_iter().collect() | |
} | |
pub fn join(&self, right: ColumnName) -> ColumnName { | |
[self.clone(), right].into_iter().collect() | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could it live outside the impl without causing massive namespace collisions? join
is a pretty generic name and rust doesn't have function overloading...
I was avoiding the &self
join route because it potentially means copying a bunch of strings for deeply nested paths. But (a) that's probably optimizing rather than merely not-pessimizing; and (b) the left side of the join is usually a single field name and we're anyway consuming the right arg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also define a join_with
method that takes self
. Maybe I'll just do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: std::path::Path::join actually takes both left and right sides by reference. Let's go simpler-is-better here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great just had two main things below (1) if we can unify the nested/simple APIs and (2) how we could possibly test the compilation failure
kernel/src/expressions/mod.rs
Outdated
|
||
/// Creates a new column reference, for a splittable nested column name. | ||
/// See [`ColumnName::split`] for details. | ||
pub fn split_column(name: impl AsRef<str>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome i really like the macros :)
kernel/src/expressions/mod.rs
Outdated
|
||
/// Creates a new column reference, for a splittable nested column name. | ||
/// See [`ColumnName::split`] for details. | ||
pub fn split_column(name: impl AsRef<str>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at risk of a dumb question: could we just have a simple_column
be something like: either a nested column split by periods (and no periods in field names) like a.b.c
or just a single top-level column? then we wouldn't need two APIs for top-level/nested column names? It could just unify under simple_column!("nested.col") -> ColumnName::new([name, col])
and simple_column!("not_nested") -> ColumnName::new([not_nested])
or something like that?
kernel/src/expressions/mod.rs
Outdated
|
||
/// Creates a new column reference, for a splittable nested column name. | ||
/// See [`ColumnName::split`] for details. | ||
pub fn split_column(name: impl AsRef<str>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and, come to think of it, regarding naming: I wonder if just something like column_name!()
macro would suffice, since it's compile-time checked if any error occurs we can point back to macro docs and have a clear error message that says 'column_name!' is for generating a simple '.
-delimited' column name without special chars, etc, etc
This PR includes doc tests for that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. lgtm!
Awesome, sorry I missed it! |
Kernel only partly supports nested column names today. One of the biggest barriers to closing the gap is that Expression tracks column names as simple strings, so there is no reliable way to deal with nesting.
This PR takes a first step of defining a proper
ColumnName
struct for hosting column names. So far, it behaves the same as before -- internally it's just a simple string, occasionally interpreted as nested by splitting at periods. Changing the code to use this new construct is noisy, so we make the changes as a pre-factor. A follow-up PR will actually add nesting support.Resolves #422