
bug: @dlt.transformer ignores write_disposition="replace" #2108

Closed

Conversation

@joscha (Contributor) commented on Nov 28, 2024

Description

When a transformer yields duplicates, they are not merged as defined by the given write_disposition.
See the attached test, which is expected to pass but currently fails.

Expected

Multiple yields overwrite each other; the last yield wins.

Thus the expected resulting table should look like this:

| id | current_pass |
|----|--------------|
| 1  | 2            |
| 2  | 2            |

however, it looks like this:

| id | current_pass |
|----|--------------|
| 1  | 1            |
| 2  | 1            |
| 1  | 2            |
| 2  | 2            |

even though `primary_key` is defined and `write_disposition` is `"replace"`.
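
For reference, a minimal sketch approximating the attached test; the resource and transformer names, the two-pass structure, and the duckdb destination are assumptions for illustration, not taken verbatim from the test:

```python
import dlt

@dlt.resource
def passes():
    # two passes over the same source data
    yield from [1, 2]

@dlt.transformer(write_disposition="replace", primary_key="id")
def records(current_pass):
    # every pass yields the same ids, so rows with duplicate
    # primary keys accumulate within a single pipeline run
    for id_ in (1, 2):
        yield {"id": id_, "current_pass": current_pass}

pipeline = dlt.pipeline(
    pipeline_name="transformer_repro",
    destination="duckdb",
    dataset_name="repro",
)
pipeline.run(passes | records)
# the records table ends up with 4 rows instead of the expected 2
```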


netlify bot commented Nov 28, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 7441141
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/674d74cb487070000822abe9

@joscha force-pushed the joscha/transformer-ignores branch from 3fd9d25 to 16a94de on Nov 28, 2024, 16:05
@sh-rp (Collaborator) commented on Dec 2, 2024

Hey @joscha, without looking at your test too deeply, are you maybe confusing write_disposition "replace" with "merge"? If you set it to "replace", merge keys are not used; the table is fully replaced with each pipeline run.
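
To make the run-level semantics concrete, a small sketch (names and the duckdb destination are illustrative assumptions): with "replace", each pipeline run truncates the table and loads only that run's rows, and the primary_key hint plays no part in it.

```python
import dlt

@dlt.resource(write_disposition="replace", primary_key="id")
def items(rows):
    yield from rows

pipeline = dlt.pipeline(
    pipeline_name="replace_demo",
    destination="duckdb",
    dataset_name="demo",
)

# first run: the table contains rows for ids 1 and 2
pipeline.run(items([{"id": 1, "v": "a"}, {"id": 2, "v": "a"}]))

# second run: the table is fully replaced; only id 1 remains,
# regardless of the primary_key hint
pipeline.run(items([{"id": 1, "v": "b"}]))
```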

@joscha (Contributor, Author) commented on Dec 2, 2024

> Hey @joscha, without looking at your test too deeply, are you maybe confusing write_disposition "replace" with "merge"? If you set it to "replace", merge keys are not used; the table is fully replaced with each pipeline run.

Hi @sh-rp, I updated the description and code to remove the merge key; it's the primary key I was after, sorry for the confusion. Would you mind looking at the code? I am also happy to pair, have a call, or add more detail if needed.

@joscha changed the title from "test: @dlt.transformer ignores write_disposition="replace"" to "bug: @dlt.transformer ignores write_disposition="replace"" on Dec 2, 2024
@sh-rp (Collaborator) commented on Dec 2, 2024

@joscha the primary key is also not used when using "replace". What exactly do you expect the primary key to do? If you need deduplication or merging, you'll have to use the "merge" write_disposition. In your example the transformer yields 4 rows altogether, and these end up in the destination table, which is the correct behavior.

@joscha (Contributor, Author) commented on Dec 2, 2024

> the primary key is also not used when using "replace". What exactly do you expect the primary key to do?

I was under the impression that the primary key denotes uniqueness; do I have that wrong?

> If you need deduplication or merging, you'll have to use the "merge" write_disposition. In your example the transformer yields 4 rows altogether, and these end up in the destination table, which is the correct behavior.

I expect a newer (later) record entering the system to replace an older (earlier) record if they have equal primary keys. So if I have a stream of data that yields multiple versions of the same record, the database ends up with only one of them.

I.e.:

[
 { id: 1, seq: 0 },
 { id: 1, seq: 1 },
 { id: 1, seq: 2 },
 { id: 1, seq: 3 },
 { id: 1, seq: 4 },
 { id: 1, seq: 5 },
]

with `primary_key="id"` would yield:

| id | seq |
|----|-----|
| 1  | 5   |
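
In plain Python terms, the expectation amounts to last-write-wins deduplication keyed on the primary key; a sketch of that expectation (not of what dlt actually does):

```python
records = [
    {"id": 1, "seq": 0},
    {"id": 1, "seq": 1},
    {"id": 1, "seq": 2},
    {"id": 1, "seq": 3},
    {"id": 1, "seq": 4},
    {"id": 1, "seq": 5},
]

deduped = {}
for rec in records:
    deduped[rec["id"]] = rec  # a later record overwrites an earlier one with the same id

print(list(deduped.values()))  # [{'id': 1, 'seq': 5}]
```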

> the table is fully replaced with each pipeline run.

I do want that. Basically I want to treat whatever comes from the external API as the truth, without keeping any previous data. How do I combine both of these behaviours?

@sh-rp (Collaborator) commented on Dec 2, 2024

The primary key is just a column hint; it does not do anything on its own, and dlt does not enforce uniqueness. You can configure it to translate to a primary key or unique constraint on some destinations, which would actually make your example fail, because you'd be trying to insert two records with the same primary key.

If you want deduplication of the incoming records but still do a replace, you'll have to truncate the table before the pipeline run and use a merge write disposition with primary keys and possibly a dedup_sort hint:

https://dlthub.com/docs/general-usage/incremental-loading#delete-insert-strategy
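
A minimal sketch of that suggestion (resource name, data, and the duckdb destination are assumptions for illustration): "merge" with a primary_key deduplicates rows within a load, and the dedup_sort column hint picks which duplicate survives. The up-front truncation to get replace-like behavior would still be a separate step.

```python
import dlt

@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={"seq": {"dedup_sort": "desc"}},  # keep the row with the highest seq per id
)
def events():
    yield from [
        {"id": 1, "seq": 0},
        {"id": 1, "seq": 1},
        {"id": 1, "seq": 5},
        {"id": 2, "seq": 3},
    ]

pipeline = dlt.pipeline(
    pipeline_name="dedup_demo",
    destination="duckdb",
    dataset_name="demo",
)
pipeline.run(events())
# the events table holds one row per id: (1, 5) and (2, 3)
```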

@joscha (Contributor, Author) commented on Dec 2, 2024

Okay, thank you, I will give that a try. This behaviour was not obvious to me, I think for two reasons:

  • The notion of a primary key is overloaded; because of the name, I automatically assumed it would enforce uniqueness.
  • "replace" suggested to me that it works on a per-row basis rather than applying to the whole table, so I expected later records to overwrite earlier ones.

I wonder if it would make sense for me to open a pull request against the docs to add this information? It would definitely have helped me. WDYT?

I will try your suggestion above and close this issue in the meantime.

@joscha closed this on Dec 2, 2024