source-redshift-batch: Enable schema inference (while still doing full discovery) #2178

Status: Open — wants to merge 2 commits into main
Conversation

@willdonnelly (Member) commented on Dec 3, 2024

Description:

In some sense this is a fairly large change, because it's pulling in all of the schema inference machinery. But in another sense it isn't that big a deal because we're still discovering a full schema for ourselves which documents must satisfy, so all schema inference can do is narrow down the read-time types further.

Which is exactly what we need to solve the problem of collection keys coming from nullable source columns and being materialized as the actual destination primary key. Currently we discover all columns with their actual source-DB nullability, which is fine.

But a lot of Redshift users don't have declared primary keys on their table and instead have to explicitly specify which properties should be the collection key. Which is fine.

Except that a lot of the time, those columns in the Redshift source are technically nullable, simply because the user never bothered to put a NOT NULL constraint on them. This is also fine, Flow can handle it.

Except, SQL materializations often can't. They need the collection key to be non-nullable so that it can be the primary key of the materialized tables.

In the past, this has been worked around by specifying a default value annotation in the collection schema. Which should not do anything at all, but the materialization incorrectly assumes that if a default value annotation is present then the property can't ever be null. This is not actually true, as the default value only applies to omitted values and not explicitly-null values.
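A minimal sketch (hypothetical code, not Flow's actual implementation) of why a `default` annotation can't stand in for non-nullability: the default only applies when the property is omitted entirely, so an explicit null passes through untouched.

```python
# Hypothetical sketch: in JSON Schema, `default` is an annotation that
# applies only to *omitted* properties, never to explicit nulls.
def resolve(doc, prop, schema):
    if prop not in doc:
        # Property omitted entirely: the default (if any) may be assumed.
        return schema.get("default")
    # Property present, even as an explicit null: the default is irrelevant.
    return doc[prop]

key_schema = {"type": ["integer", "null"], "default": 0}
print(resolve({}, "id", key_schema))            # omitted value -> 0
print(resolve({"id": None}, "id", key_schema))  # explicit null stays None
```

So a materialization that treats "has a default" as "can never be null" is making an unsound assumption, which is exactly the bug described above.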

We should fix that, but until this PR goes live setting a default value is the only knob we have available to tell a materialization "it's okay, just assume it won't actually be null".

After this PR is merged, the inferred schema will be used again. In cases where these collection-key properties are never actually null, it provides a tighter schema describing the actual data observed in the collection. That tighter schema won't include null, so the materialization won't get upset about the collection key being potentially nullable.
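As a rough illustration (hypothetical schemas, not the connector's actual output): the effective read-time type must satisfy both the discovered schema and the inferred one, which amounts to an intersection of the allowed types, so a key that was never observed as null loses its null type.

```python
# Hypothetical sketch: a document must satisfy both the discovered schema
# (reflecting source-DB nullability) and the inferred schema (reflecting
# types actually observed), i.e. an allOf, so the allowed types intersect.
def type_set(schema):
    t = schema["type"]
    return set(t) if isinstance(t, list) else {t}

discovered = {"type": ["integer", "null"]}  # column is nullable in Redshift
inferred = {"type": "integer"}              # no null ever observed in the data

effective = type_set(discovered) & type_set(inferred)
print(sorted(effective))  # null has been narrowed away
```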

If a null value were to be inserted into the source table, we would capture it into the collection, the materialization would read it and fail with a schema violation error, and at some point schema inference would note that null is now a possible value, but be unable to publish that schema update because it would fail validation.

That isn't ideal, but in general null values don't suddenly appear in this context and there are two solutions:

  1. Just alter the source DB column to be non-nullable.
  2. Disable schema inference and manually tell us what schema the user wants us to use.

The combination of these two escape hatches seems like plenty to me.

Phew. That was a lot of description for a one-line PR.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

This PR hasn't been tested against an actual Redshift instance, because the one in our 1Password appears to either no longer exist or have had its password changed, and I don't think I currently have a working AWS account with which to fix that. However, I ran the test suite against Postgres locally to make sure I got all the snapshot updates from this change (plus some other updates from months-old changes). Since this doesn't change anything about how the connector interacts with the server, it should be fine.



@willdonnelly requested a review from a team on December 3, 2024
@willdonnelly changed the title from "Wgd/redshift schema inference 20241203" to "source-redshift-batch: Enable schema inference (while still doing full discovery)" on Dec 3, 2024
@Alex-Bair (Contributor) left a comment
LGTM! Thanks for the detailed PR description. It helped me understand the purpose of this change a lot.
