
bug: @dlt.transformer ignores write_disposition="replace" #2108

Closed

Conversation

@joscha (Contributor) commented on Nov 28, 2024

Description

When a transformer yields duplicates, they are not merged as defined by the given write_disposition.
See the attached test, which is expected to pass but currently fails.

Expected

Multiple yields overwrite each other; the last yield wins.

Thus the expected resulting table should look like this:

| id | current_pass |
|----|--------------|
| 1  | 2            |
| 2  | 2            |

however, it looks like this:

| id | current_pass |
|----|--------------|
| 1  | 1            |
| 2  | 1            |
| 1  | 2            |
| 2  | 2            |

even though `primary_key` is defined and `write_disposition` is `"replace"`.
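
For reference, a minimal sketch approximating the attached test; the resource and transformer names, the two-pass structure, and the duckdb destination are assumptions for illustration, not taken verbatim from the test:

```python
import dlt

@dlt.resource
def passes():
    # two passes over the same source data
    yield from [1, 2]

@dlt.transformer(write_disposition="replace", primary_key="id")
def records(current_pass):
    # every pass yields the same ids, so rows with duplicate
    # primary keys accumulate within a single pipeline run
    for id_ in (1, 2):
        yield {"id": id_, "current_pass": current_pass}

pipeline = dlt.pipeline(
    pipeline_name="transformer_repro",
    destination="duckdb",
    dataset_name="repro",
)
pipeline.run(passes | records)
# the records table ends up with 4 rows instead of the expected 2
```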


netlify bot commented Nov 28, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 7441141
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/674d74cb487070000822abe9

@joscha force-pushed the joscha/transformer-ignores branch from 3fd9d25 to 16a94de on Nov 28, 2024, 16:05
@sh-rp (Collaborator) commented on Dec 2, 2024

Hey @joscha, without looking at your test too deeply, are you maybe confusing write_disposition "replace" with "merge"? If you set it to "replace", merge keys are not used; the table is fully replaced with each pipeline run.
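
To make the run-level semantics concrete, a small sketch (names and the duckdb destination are illustrative assumptions): with "replace", each pipeline run truncates the table and loads only that run's rows, and the primary_key hint plays no part in it.

```python
import dlt

@dlt.resource(write_disposition="replace", primary_key="id")
def items(rows):
    yield from rows

pipeline = dlt.pipeline(
    pipeline_name="replace_demo",
    destination="duckdb",
    dataset_name="demo",
)

# first run: the table contains rows for ids 1 and 2
pipeline.run(items([{"id": 1, "v": "a"}, {"id": 2, "v": "a"}]))

# second run: the table is fully replaced; only id 1 remains,
# regardless of the primary_key hint
pipeline.run(items([{"id": 1, "v": "b"}]))
```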

@joscha (Contributor, Author) commented on Dec 2, 2024

> Hey @joscha, without looking at your test too deeply, are you maybe confusing write_disposition "replace" with "merge"? If you set it to "replace", merge keys are not used; the table is fully replaced with each pipeline run.

Hi @sh-rp, I updated the description and code to remove the merge key; it's the primary key I was after, sorry for the confusion. Would you mind looking at the code? I am also happy to pair, have a call, or add more detail if needed.

@joscha changed the title from "test: @dlt.transformer ignores write_disposition="replace"" to "bug: @dlt.transformer ignores write_disposition="replace"" on Dec 2, 2024
@sh-rp (Collaborator) commented on Dec 2, 2024

@joscha the primary key is also not used when using "replace". What exactly do you expect the primary key to do? If you need deduplication or merging, you'll have to use the "merge" write_disposition. In your example the transformer yields 4 rows altogether, and these end up in the destination table, which is the correct behavior.

@joscha (Contributor, Author) commented on Dec 2, 2024

> the primary key is also not used when using "replace". What exactly do you expect the primary key to do?

I was under the impression that the primary key denotes uniqueness; do I have that wrong?

> If you need deduplication or merging, you'll have to use the "merge" write_disposition. In your example the transformer yields 4 rows altogether, and these end up in the destination table, which is the correct behavior.

I expect a newer (later) record entering the system to replace an older (earlier) record if they have equal primary keys. So if I have a stream of data that yields multiple versions of the same record, the database ends up with only one of them.

I.e.:

[
 { id: 1, seq: 0 },
 { id: 1, seq: 1 },
 { id: 1, seq: 2 },
 { id: 1, seq: 3 },
 { id: 1, seq: 4 },
 { id: 1, seq: 5 },
]

with `primary_key="id"` would yield:

| id | seq |
|----|-----|
| 1  | 5   |
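
In plain Python terms, the expectation amounts to last-write-wins deduplication keyed on the primary key; a sketch of that expectation (not of what dlt actually does):

```python
records = [
    {"id": 1, "seq": 0},
    {"id": 1, "seq": 1},
    {"id": 1, "seq": 2},
    {"id": 1, "seq": 3},
    {"id": 1, "seq": 4},
    {"id": 1, "seq": 5},
]

deduped = {}
for rec in records:
    deduped[rec["id"]] = rec  # a later record overwrites an earlier one with the same id

print(list(deduped.values()))  # [{'id': 1, 'seq': 5}]
```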

> the table is fully replaced with each pipeline run.

I do want that. Basically I want to treat whatever comes from the external API as the truth, without keeping any previous data. How do I combine both of these behaviours?

@sh-rp (Collaborator) commented on Dec 2, 2024

The primary key is just a column hint; it does not do anything on its own, and dlt does not enforce uniqueness. You can configure it to translate to a primary key or unique constraint on some destinations, which would actually make your example fail, because you'd be trying to insert two records with the same primary key.

If you want deduplication of the incoming records but still do a replace, you'll have to truncate the table before the pipeline run and use a merge write disposition with primary keys and possibly a dedup_sort hint:

https://dlthub.com/docs/general-usage/incremental-loading#delete-insert-strategy
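
A minimal sketch of that suggestion (resource name, data, and the duckdb destination are assumptions for illustration): "merge" with a primary_key deduplicates rows within a load, and the dedup_sort column hint picks which duplicate survives. The up-front truncation to get replace-like behavior would still be a separate step.

```python
import dlt

@dlt.resource(
    write_disposition="merge",
    primary_key="id",
    columns={"seq": {"dedup_sort": "desc"}},  # keep the row with the highest seq per id
)
def events():
    yield from [
        {"id": 1, "seq": 0},
        {"id": 1, "seq": 1},
        {"id": 1, "seq": 5},
        {"id": 2, "seq": 3},
    ]

pipeline = dlt.pipeline(
    pipeline_name="dedup_demo",
    destination="duckdb",
    dataset_name="demo",
)
pipeline.run(events())
# the events table holds one row per id: (1, 5) and (2, 3)
```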

@joscha (Contributor, Author) commented on Dec 2, 2024

Okay, thank you, I will give that a try. This behaviour was not obvious to me, I think for two reasons:

  • The notion of a primary key is overloaded; because of the name, I automatically assumed it would enforce uniqueness.
  • "replace" suggested to me that it works on a per-row basis rather than applying to the whole table, so I expected later records to overwrite earlier ones.

I wonder if it would make sense for me to open a pull request against the docs to add this information? It would definitely have helped me. WDYT?

I will try your suggestion above and close this issue in the meantime.

@joscha closed this on Dec 2, 2024