
source-redshift-batch: Enable schema inference (while still doing full discovery) #2178

Open · wants to merge 2 commits into base: main

Commits on Dec 3, 2024

  1. 154a374
  2. source-redshift-batch: Turn schema inference back on

    In some sense this is a fairly large change, because it's pulling
    in all of the schema inference machinery. But in another sense it
    isn't that big a deal because we're still discovering a full schema
    for ourselves which documents must satisfy, so all schema inference
    can do is narrow down the types further.
    
    Which is exactly what we need to solve the problem of collection
    keys coming from nullable source columns and being materialized
    as the actual destination primary key. Currently we discover all
    columns with their actual source-DB nullability, which is fine.
    
    But a lot of Redshift users don't have declared primary keys on
    their table and instead have to explicitly specify which properties
    should be the collection key. Which is fine.
    
    Except that a lot of the time, those columns in the Redshift source
    are technically nullable, simply because the user never bothered to
    put a `NOT NULL` constraint on them. This is _also_ fine, Flow can
    handle it.
    
    Except, SQL materializations often can't. They need the collection
    key to be non-nullable so that it can be the primary key of the
    materialized tables.
    
    In the past, this has been worked around by specifying a default
    value annotation in the collection schema. Which _should_ not do
    anything at all, but the materialization incorrectly assumes that
    if a default value annotation is present then the property can't
    ever be null. This is not actually true, as the default value only
    applies to _omitted_ values and not explicitly-null values.
    
    We should fix that, but until this PR goes live, setting a default
    value is the only knob we have available to tell a materialization
    "it's okay, just assume it won't actually be null".
    
    After this PR is merged, the inferred schema will be used again
    and (in cases where these collection-key properties aren't ever
    actually null) will provide a tighter schema describing the actual
    data observed in the collection, which means it won't include null,
    which means the materialization won't get upset that the collection
    key is potentially nullable.
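
    As a sketch with a hypothetical column, the discovered schema preserves
    the source's declared nullability, while the inferred schema narrows to
    the types actually observed in the collection, and documents must
    satisfy both:

    ```json
    {
      "discovered": { "id": { "type": ["integer", "null"] } },
      "inferred":   { "id": { "type": "integer" } }
    }
    ```

    The effective type for `id` is the intersection of the two, plain
    `integer`, so the materialization no longer sees a nullable
    collection key.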
    
    If a null value were to be inserted into the source table, we would
    capture it into the collection, the materialization would read it and
    fail with a schema violation error, and at some point schema inference
    would note that null is now a possible value, but be unable to publish
    that schema update because it would fail validation.
    
    That isn't ideal, but in general null values don't suddenly appear in
    this context and there are two solutions:
     1. Just alter the source DB column to be non-nullable.
     2. Disable schema inference and manually tell us what schema the
        user wants us to use.
    
    The combination of these two escape hatches seems like plenty to me.
    
    Phew. That was a lot of description for a one-line PR.
    willdonnelly committed Dec 3, 2024
    a57a3be