
[CT-1356] [Feature] incremental models' merge conditional update #6077

Closed
3 tasks done
Yotamho opened this issue Oct 17, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request incremental Incremental modeling with dbt stale Issues that have gone stale Team:Adapters Issues designated for the adapter area of the code

Comments

@Yotamho

Yotamho commented Oct 17, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

In some databases (namely Snowflake and Postgres), it is possible to add a condition to a merge update clause:
when matched and <condition> then ...
I want to allow adding this kind of condition to incremental models (by making it an incremental model configuration).
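
As a sketch of how this might look from a model's perspective (the config name `merge_update_condition` is hypothetical, chosen here only for illustration; it is not an existing dbt config):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    -- hypothetical config name, for illustration only
    merge_update_condition='DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp'
) }}

select * from {{ ref('upstream_table') }}
```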

Describe alternatives you've considered

One alternative would be to join the records in the incremental run against their destination records, check which rows fail the condition, and omit those rows.
Such a solution would be less performant and would make the model harder to understand.

Who will this benefit?

For example, when out-of-order data is arbitrarily inserted into the source table of a model, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.
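
Concretely, the merge statement dbt generates would gain an extra predicate on the update branch, roughly like this (a hedged sketch of the desired output, with illustrative table names, not dbt's actual template):

```sql
merge into analytics.target as DBT_INTERNAL_DEST
using analytics.target__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_DEST.name = DBT_INTERNAL_SOURCE.name
when matched and DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp
    then update set ...
when not matched then insert ...
```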

Are you interested in contributing this feature?

yes

Anything else?

I have forks with suggested implementation:
https://github.com/Yotamho/dbt-core
https://github.com/Yotamho/dbt-snowflake

@Yotamho Yotamho added enhancement New feature or request triage labels Oct 17, 2022
@github-actions github-actions bot changed the title [Feature] incremental models' merge conditional update [CT-1356] [Feature] incremental models' merge conditional update Oct 17, 2022
@jtcohen6
Contributor

jtcohen6 commented Oct 30, 2022

@Yotamho Thanks for opening!

This ask sounds to me a lot like the ask to pass additional custom "predicates" to the merge performed by dbt incremental models. There's been longstanding discussion and work in progress on that. See: #3293, #5680, dbt-labs/dbt-snowflake#231, ...

There's a subtle distinction between:

merge into ...
    from ...
    on <unique_key_match> and <custom_condition> ...
    when matched then update ...
    when not matched then insert ...

and:

merge into ...
    from ...
    on <unique_key_match> ...
    when matched and <custom_condition> then update ...
    when not matched then insert ...

In the first case, if the unique key matches but the custom condition isn't met, the row will be inserted as a new record instead of updating an existing record. (Could lead to duplicates in the target table.)

In the second case, if the unique key matches, but the custom condition isn't met, that new row goes ... nowhere. Is that preferable behavior? Or risk greater confusion?


It may help to get concrete about specific use cases. Looking at yours:

For example, when out-of-order data is arbitrarily inserted into the source table of a model, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.

Is this different from the common logic that we ask users to stick in incremental models? Yes, this is an implicit join within the incremental model logic instead, but I'd rather have it live in the model logic, where it can be more easily seen / debugged, than tucked away in the materialization logic / merge statement:

select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
where timestamp_col > (select max(timestamp_col) as max_ts from {{ this }})
{% endif %}

@jtcohen6 jtcohen6 added incremental Incremental modeling with dbt Team:Adapters Issues designated for the adapter area of the code awaiting_response and removed triage labels Oct 30, 2022
@Yotamho
Author

Yotamho commented Oct 31, 2022

Hi @jtcohen6, thanks for responding!
I see the benefit and readability of sticking to the incremental model pattern, but in certain use cases I think the discussed enhancement is still needed. I would like to demonstrate.
Let's say the model runs twice, on two batches of records:

# run 1:
# name, timestamp
john, 01:00
alex, 03:00

# run 2:
# name, timestamp
john, 01:00
alex, 03:00
john, 02:00

If the name is the unique key, and we wish to update records when they arrive with a newer timestamp, then simply basing the incremental filter on the timestamp will ignore the updated "john" record.
If we instead add the condition src.timestamp > dst.timestamp to the merge join condition, the name will no longer act as a unique key and we will end up with two records for "john".

When data might come out of order, I can think of two implementations of incremental models (I'm sure there are more approaches):

  1. Use a column that doesn't guarantee order but does guarantee you don't miss data (e.g. a sequential id atomically generated upon first insertion into the database) as the incremental filter, and in addition use the order column (the timestamp column in the example above) as a "merge condition", so that an incoming record is thrown away when it is older than the existing one.
  2. Add an overlap between model runs: in the example above, also run on the hour before "max_ts" and filter out records using the same "merge condition" (I think this is more applicable when you know the window within which out-of-order records should arrive).
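
Approach 2 above might look something like this in the model SQL (the one-hour overlap window and column name are illustrative):

```sql
select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
-- re-scan a one-hour overlap window to catch late-arriving rows
where timestamp_col > (
    select max(timestamp_col) - interval '1 hour' from {{ this }}
)
{% endif %}
```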

@jtcohen6
Contributor

(Jumping in quickly because @dbeatty10 has pointed out to me that, where I said "subtle distinction" above, I had initially pasted the exact same code twice. Sorry about that!!)

@jtcohen6
Contributor

(Doug - assigning both of us, just so that one of us writes a quick follow-up to the conversation we had on Monday!)

@Yotamho
Author

Yotamho commented Dec 11, 2022

Hey,
@jtcohen6 @dbeatty10 ,
Just commenting for your attention because I feel it was forgotten,
Thanks

@tphillip33

I am very interested in this.

My use-case is that I only want to update when a column "hashdiff" has changed:
when matched and old.hashdiff <> new.hashdiff then update

Without that ability, a self-join is required to identify the rows that actually changed. That costs extra processing, whereas adding this simple comparison would avoid it.

I am considering a custom materialization to work around this issue.
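
For reference, the self-join workaround described above looks roughly like this (table and column names are illustrative):

```sql
-- keep only rows that are new or whose hashdiff actually changed
select src.*
from staging_table as src
left join target_table as tgt
    on src.id = tgt.id
where tgt.id is null                    -- brand-new rows
   or src.hashdiff <> tgt.hashdiff     -- rows that actually changed
```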

@jtcohen6
Contributor

After giving this another read-through, I see significant overlap between this issue and one I just responded to yesterday: #6415 (comment)

Two points that carry over in particular:

  • As a general rule, we want out-of-the-box materialization behavior to be as simple & easy to reason about as possible. If given the choice between two approaches, with functionally identical outcomes, we should opt for the approach that keeps more of the business logic within the model SQL select statement.
  • At the same time, we want it to be possible to "register" a custom incremental strategy, when you want merge to work differently for certain models. There's a new & better pattern for doing this, starting in v1.3 (but we still need to document it! New components for incremental strategies docs.getdbt.com#1761). You could define (e.g.) a macro named get_incremental_merge_conditional_update_sql, and set incremental_strategy: 'merge_conditional_update' in your model config, along with any other configs that would be relevant to templating your custom merge statement. All that, without having to copy-paste-edit the incremental materialization, or defining one of your own.
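
As a sketch of the v1.3+ pattern described above (the macro body is illustrative; in recent dbt versions the custom strategy macro receives a dict of arguments, but verify the exact contract against your dbt version before relying on it):

```sql
-- macros/merge_conditional_update.sql
{% macro get_incremental_merge_conditional_update_sql(arg_dict) %}
    {# arg_dict is expected to carry target_relation, temp_relation,
       unique_key, dest_columns, etc. -- check your dbt version's docs #}
    merge into {{ arg_dict["target_relation"] }} as DBT_INTERNAL_DEST
    using {{ arg_dict["temp_relation"] }} as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_DEST.{{ arg_dict["unique_key"] }}
         = DBT_INTERNAL_SOURCE.{{ arg_dict["unique_key"] }}
    {# merge_update_condition is a hypothetical model config for illustration #}
    when matched and {{ config.get("merge_update_condition", "true") }}
        then update set ...
    when not matched then insert ...
{% endmacro %}
```

A model would then opt in with `incremental_strategy: 'merge_conditional_update'` in its config, as described above.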

@jtcohen6 jtcohen6 removed their assignment Dec 14, 2022
@tphillip33

Interesting. I will need to update to 1.3 and see if that will allow me to fix my problem.

@tphillip33

In reviewing this implementation, for my purposes I think I would rather simply override default__get_merge_sql to accept a new config setting, "merge_update_only_when_columns_different" (or something), and append it to the "when matched {merge_update_only_when_columns_different code here} then update set ..." clause.
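
A hedged sketch of that override (only the relevant when-matched fragment is shown; the real default__get_merge_sql in dbt-core also builds the column lists and the insert branch, so an actual override would need to reproduce those parts too):

```sql
-- macros/get_merge_sql.sql (fragment of a project-level override, not complete)
{% set update_condition = config.get("merge_update_only_when_columns_different") %}
...
when matched {% if update_condition %}and {{ update_condition }}{% endif %}
    then update set ...
...
```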

I will try that and see how that goes.

@github-actions
Contributor

github-actions bot commented Aug 9, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 9, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
