-
Notifications
You must be signed in to change notification settings - Fork 2.9k
[SPEC | API | CORE] Bound Expressions / Transforms SerDe #14462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> bucket(NamedReference<S> resolvedRef, int numBuckets) { | ||
| Transform<S, T> transform = (Transform<S, T>) Transforms.bucket(numBuckets); | ||
| return new UnboundTransform<>(resolvedRef, transform); | ||
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> year(NamedReference<S> resolvedRef) { | ||
| return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.year()); | ||
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> month(NamedReference<S> resolvedRef) { | ||
| return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.month()); | ||
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> day(NamedReference<S> resolvedRef) { | ||
| return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.day()); | ||
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> hour(NamedReference<S> resolvedRef) { | ||
| return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.hour()); | ||
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static <S, T> UnboundTerm<T> truncate(NamedReference<S> resolvedRef, int width) { | ||
| Transform<S, T> transform = (Transform<S, T>) Transforms.truncate(width); | ||
| return new UnboundTransform<>(resolvedRef, transform); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Present signatures :
public static <T> UnboundTerm<T> bucket(String name, int numBuckets) {
Transform<?, T> transform = (Transform<?, T>) Transforms.bucket(numBuckets);
return new UnboundTransform<>(ref(name), transform);
}
The signatures for those are tied to the String name, i can move this, the UnboundTerm as even in this function calls ref(name) which returns NamedReference, so we can mark above as deprecated and move to this new one, this way we have just one api and the caller needs to make sure it passes appropriate unboundTerm
f54dd90 to
1b0c49b
Compare
1b0c49b to
48ae6a8
Compare
| class NamedReference(BaseModel): | ||
| __root__: str = Field( | ||
| ..., description='Reference to a column by its name', example=['column-name'] | ||
| ) | ||
|
|
||
|
|
||
| class IDReference(BaseModel): | ||
| """ | ||
| Reference to a column by its field ID | ||
| """ | ||
|
|
||
| type: str = Field('reference', const=True) | ||
| source_id: int = Field(..., alias='source-id') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always dislike that it can be either a primitive (str in this case) or an object. Why not just encode the field-ID as an integer? When we find an integer, we know that we need to interpret it as a field-ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, i initially modeled IDReference extends NamedReference the representation of id-reference would be something like
{
"type" : "reference",
"source-id" : <field-id>
"name" : <colum-name> <--- this is informational only
}
Hence an object was required, I agree field-id / source-id is sufficient in this case, though this would mean IDReference would just be an UnboundReference much like NamedReference, we need to adapt Unbound etc to operate on UnboundReference which would ideally be fine but will be a bigger change.
relevant discussion : singhpk234#270 (comment)
Please let me know you thoughts considering above.
I am not sure why this doesn't show up in the python class as the yaml i use
allOf:
- $ref: '#/components/schemas/NamedReference'
- type: object
required:
- type
- source-id
properties:
type:
type: string
const: "reference"
source-id:
type: integer
About the change
Presently the Expressions and the tranfroms are when serialized they are first converted to an unbound expression / transform which reference stuff by name, and then we look up stuff by name when we deserialze and want to bind them back to a given schema. This proposal introduces a notion of
IDReferencewhich apart from name includes field ID of the columns and when this info is Serialized in the representation.This is super helpful in cases such as Row Access Policy where the catalog returns back an expression which needs to enforced by the engine, for that its important for the catalog to give back bound expression to protect the cases of column rename / drop (in case reading old snaphot).
Read Restrictions spec here :
This has been discussed a couple of times in the Catalog community Syncs :
Never the less we have been iterating on this PR in my fork as well in the community sync to get some initial feedbacks
Opening this PR since the feedbacks looks positive, for wider forum !
Note: This has spec change too