Skip to content

Conversation

@singhpk234
Copy link
Contributor

@singhpk234 singhpk234 commented Nov 1, 2025

About the change

Presently the Expressions and the tranfroms are when serialized they are first converted to an unbound expression / transform which reference stuff by name, and then we look up stuff by name when we deserialze and want to bind them back to a given schema. This proposal introduces a notion of IDReference which apart from name includes field ID of the columns and when this info is Serialized in the representation.

This is super helpful in cases such as Row Access Policy where the catalog returns back an expression which needs to enforced by the engine, for that its important for the catalog to give back bound expression to protect the cases of column rename / drop (in case reading old snaphot).

Read Restrictions spec here :

This has been discussed a couple of times in the Catalog community Syncs :

Never the less we have been iterating on this PR in my fork as well in the community sync to get some initial feedbacks

Opening this PR since the feedbacks looks positive, for wider forum !

Note: This has spec change too

Comment on lines 105 to 136
@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> bucket(NamedReference<S> resolvedRef, int numBuckets) {
Transform<S, T> transform = (Transform<S, T>) Transforms.bucket(numBuckets);
return new UnboundTransform<>(resolvedRef, transform);
}

@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> year(NamedReference<S> resolvedRef) {
return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.year());
}

@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> month(NamedReference<S> resolvedRef) {
return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.month());
}

@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> day(NamedReference<S> resolvedRef) {
return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.day());
}

@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> hour(NamedReference<S> resolvedRef) {
return new UnboundTransform<>(resolvedRef, (Transform<S, T>) Transforms.hour());
}

@SuppressWarnings("unchecked")
public static <S, T> UnboundTerm<T> truncate(NamedReference<S> resolvedRef, int width) {
Transform<S, T> transform = (Transform<S, T>) Transforms.truncate(width);
return new UnboundTransform<>(resolvedRef, transform);
}

Copy link
Contributor Author

@singhpk234 singhpk234 Nov 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Present signatures :

  public static <T> UnboundTerm<T> bucket(String name, int numBuckets) {
    Transform<?, T> transform = (Transform<?, T>) Transforms.bucket(numBuckets);
    return new UnboundTransform<>(ref(name), transform);
  }

The signatures for those are tied to the String name, i can move this, the UnboundTerm as even in this function calls ref(name) which returns NamedReference, so we can mark above as deprecated and move to this new one, this way we have just one api and the caller needs to make sure it passes appropriate unboundTerm

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/serde-bound-terms branch from f54dd90 to 1b0c49b Compare November 1, 2025 12:06
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/serde-bound-terms branch from 1b0c49b to 48ae6a8 Compare November 2, 2025 01:54
@singhpk234 singhpk234 added the Specification Issues that may introduce spec changes. label Nov 2, 2025
Comment on lines +146 to +158
class NamedReference(BaseModel):
__root__: str = Field(
..., description='Reference to a column by its name', example=['column-name']
)


class IDReference(BaseModel):
"""
Reference to a column by its field ID
"""

type: str = Field('reference', const=True)
source_id: int = Field(..., alias='source-id')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always dislike that it can be either a primitive (str in this case) or an object. Why not just encode the field-ID as an integer? When we find an integer, we know that we need to interpret it as a field-ID.

Copy link
Contributor Author

@singhpk234 singhpk234 Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, i initially modeled IDReference extends NamedReference the representation of id-reference would be something like

{
  "type" : "reference", 
   "source-id" : <field-id> 
   "name" : <colum-name> <--- this is informational only 
} 

Hence an object was required, I agree field-id / source-id is sufficient in this case, though this would mean IDReference would just be an UnboundReference much like NamedReference, we need to adapt Unbound etc to operate on UnboundReference which would ideally be fine but will be a bigger change.

relevant discussion : singhpk234#270 (comment)

Please let me know you thoughts considering above.


I am not sure why this doesn't show up in the python class as the yaml i use

      allOf:
        - $ref: '#/components/schemas/NamedReference'
        - type: object
          required:
            - type
            - source-id
          properties:
            type:
              type: string
              const: "reference"
            source-id:
              type: integer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API core OPENAPI Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants