Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Airbtye platform should validate that not all PKs are null, if stream has at least one PK #31758

Closed
evantahler opened this issue Oct 23, 2023 · 9 comments

Comments

@evantahler
Copy link
Contributor

evantahler commented Oct 23, 2023

The Airbyte platform should fail the sync if a record is found with all null primary keys.

In #31926, we decided that if there are multiple PKs in the stream, that means the source is attempting to describe a composite primary key. Any of the keys can be null, but at lease one of them must be non-null.

The platform should fail the sync if a record is found which has all primary-key columns as null. This works for the degenerate case of 1 PK. If there is no primary key for the stream this check should be skipped.

@evantahler
Copy link
Contributor Author

This is related to #31926

@evantahler evantahler changed the title Airbtye platform should validate that no PKs are null Airbtye platform should validate that not all PKs are null Oct 31, 2023
@evantahler
Copy link
Contributor Author

@davinchia this is ready to be worked on now

@evantahler evantahler reopened this Nov 1, 2023
@davinchia
Copy link
Contributor

Note:

  • what is the performance impact of this?
  • @evantahler can we confirm, the primary key is something we always expect to be present? just want to make sure there aren't cases where the primary key isn't present and we need to account for that.

@evantahler evantahler changed the title Airbtye platform should validate that not all PKs are null Airbtye platform should validate that not all PKs are null, if stream has at least one PK Nov 8, 2023
@evantahler
Copy link
Contributor Author

There are sources/streams which do not have primary any primary keys. These streams should (1) always be full refresh and (2) not be subject to this check. I updated the title to make this more clear

@davinchia
Copy link
Contributor

davinchia commented Nov 8, 2023

thanks.

for a stream configured with primary keys, is it problematic if we enforce this heavily? e.g. are we worried about the case where a user wants to use a field as a primary key, and this field is actually missing in some records.

it sounds like we aren't - wanted to double check.

@evantahler
Copy link
Contributor Author

are we worried about the case where a user wants to use a field as a primary key, and this field is actually missing in some records.

That's exactly what this check should detect and fail the sync. Then, there are some next steps:

  • The source has bad data and the user can fix it up (in DB or File sources)
  • The source connector is presenting a bad schema and/or the API isn't as reliable as we think it is. The source should be changed to only allow full-refresh syncs for this stream (and remove the PK)

@benmoriceau
Copy link
Contributor

@evantahler Just to be sure that I am doing the right thing. My understanding is that this error will be qualified as a Source error, am I right?

@evantahler
Copy link
Contributor Author

Yep! Not having the catalog correct is a source issue

@benmoriceau benmoriceau assigned benmoriceau and unassigned davinchia Nov 16, 2023
@benmoriceau
Copy link
Contributor

@evantahler The fix has been merged, I'll start to progressive rollout on Wednesday 29th. I will start with 1%, then 10% on the same day and I'm aiming for 100% on Thursday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants