
[CT-1356] [Feature] incremental models' merge conditional update #6077

Closed
3 tasks done
Yotamho opened this issue Oct 17, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request incremental Incremental modeling with dbt stale Issues that have gone stale Team:Adapters Issues designated for the adapter area of the code

Comments

@Yotamho

Yotamho commented Oct 17, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

In some databases (namely Snowflake and Postgres), it is possible to add a condition to a merge update clause:
when matched and <condition> then ...
I want to allow adding this kind of condition to incremental models (by making it an incremental model configuration).
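
As a sketch of how this might look from a model's perspective (the config name `merge_update_condition` is hypothetical, chosen here only for illustration; it is not an existing dbt config):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    -- hypothetical config name, for illustration only
    merge_update_condition='DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp'
) }}

select * from {{ ref('upstream_table') }}
```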

Describe alternatives you've considered

One alternative would be to join the records in the incremental run against their destination records, check which rows fail the condition, and omit those rows.
Such a solution would be less performant and would make the model harder to understand.

Who will this benefit?

For example, when out-of-order data is arbitrarily inserted into the source table of a model, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.
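
Concretely, the merge statement dbt generates would gain an extra predicate on the update branch, roughly like this (a hedged sketch of the desired output, with illustrative table names, not dbt's actual template):

```sql
merge into analytics.target as DBT_INTERNAL_DEST
using analytics.target__dbt_tmp as DBT_INTERNAL_SOURCE
    on DBT_INTERNAL_DEST.name = DBT_INTERNAL_SOURCE.name
when matched and DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp
    then update set ...
when not matched then insert ...
```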

Are you interested in contributing this feature?

yes

Anything else?

I have forks with suggested implementation:
https://github.com/Yotamho/dbt-core
https://github.com/Yotamho/dbt-snowflake

@Yotamho Yotamho added enhancement New feature or request triage labels Oct 17, 2022
@github-actions github-actions bot changed the title [Feature] incremental models' merge conditional update [CT-1356] [Feature] incremental models' merge conditional update Oct 17, 2022
@jtcohen6
Contributor

jtcohen6 commented Oct 30, 2022

@Yotamho Thanks for opening!

This ask sounds to me a lot like the ask to pass additional custom "predicates" to the merge performed by dbt incremental models. There's been longstanding discussion and work in progress on that. See: #3293, #5680, dbt-labs/dbt-snowflake#231, ...

There's a subtle distinction between:

merge into ...
    from ...
    on <unique_key_match> and <custom_condition> ...
    when matched then update ...
    when not matched then insert ...

and:

merge into ...
    from ...
    on <unique_key_match> ...
    when matched and <custom_condition> then update ...
    when not matched then insert ...

In the first case, if the unique key matches but the custom condition isn't met, the row will be inserted as a new record instead of updating an existing record. (Could lead to duplicates in the target table.)

In the second case, if the unique key matches, but the custom condition isn't met, that new row goes ... nowhere. Is that preferable behavior? Or risk greater confusion?


It may help to get concrete about specific use cases. Looking at yours:

For example, when out-of-order data is arbitrarily inserted into the source table of a model, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.

Is this different from the common logic that we ask users to stick in incremental models? Yes, this is an implicit join within the incremental model logic instead, but I'd rather have it live in the model logic, where it can be more easily seen / debugged, than tucked away in the materialization logic / merge statement:

select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
where timestamp_col > (select max(timestamp_col) as max_ts from {{ this }})
{% endif %}

@jtcohen6 jtcohen6 added incremental Incremental modeling with dbt Team:Adapters Issues designated for the adapter area of the code awaiting_response and removed triage labels Oct 30, 2022
@Yotamho
Author

Yotamho commented Oct 31, 2022

Hi @jtcohen6, thanks for responding!
I see the benefit and readability of sticking to the incremental model pattern, but in certain use cases I think the discussed enhancement is still needed. I would like to demonstrate.
Let's say the model runs twice, on two batches of records:

# run 1:
# name, timestamp
john, 01:00
alex, 03:00

# run 2:
# name, timestamp
john, 01:00
alex, 03:00
john, 02:00

If the name is the unique key, and we wish to update records when they arrive with a newer timestamp, then simply basing the incremental filter on the timestamp will ignore the updated "john" record.
If we instead add the condition src.timestamp > dst.timestamp to the merge join condition, the name will no longer act as a unique key and we will end up with two records for "john".

When data might come out of order, I can think of two implementations of incremental models (I'm sure there are more approaches):

  1. Use a column that doesn't guarantee order but does guarantee you don't miss data (e.g. a sequential id atomically generated upon first insertion into the database) as the incremental filter, and in addition use the order column (the timestamp column in the example above) as a "merge condition", so that an incoming record is thrown away when it is older than the existing one.
  2. Add an overlap between model runs: in the example above, also run on the hour before "max_ts" and filter out records using the same "merge condition" (I think this is more applicable when you know the window within which out-of-order records should arrive).
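
Approach 2 above might look something like this in the model SQL (the one-hour overlap window and column name are illustrative):

```sql
select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
-- re-scan a one-hour overlap window to catch late-arriving rows
where timestamp_col > (
    select max(timestamp_col) - interval '1 hour' from {{ this }}
)
{% endif %}
```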

@jtcohen6
Contributor

(Jumping in quickly because @dbeatty10 has pointed out to me that, where I said "subtle distinction" above, I had initially pasted the exact same code twice. Sorry about that!!)

@jtcohen6
Contributor

(Doug - assigning both of us, just so that one of us writes a quick follow-up to the conversation we had on Monday!)

@Yotamho
Author

Yotamho commented Dec 11, 2022

Hey,
@jtcohen6 @dbeatty10 ,
Just commenting for your attention because I feel it was forgotten,
Thanks

@tphillip33

I am very interested in this.

My use-case is that I only want to update when a column "hashdiff" has changed:
when matched and old.hashdiff <> new.hashdiff then update

Without that ability, a self-join is required to identify the rows that actually changed. That costs extra processing, whereas adding this simple comparison would avoid it.

I am considering a custom materialization to work around this issue.
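
For reference, the self-join workaround described above looks roughly like this (table and column names are illustrative):

```sql
-- keep only rows that are new or whose hashdiff actually changed
select src.*
from staging_table as src
left join target_table as tgt
    on src.id = tgt.id
where tgt.id is null                    -- brand-new rows
   or src.hashdiff <> tgt.hashdiff     -- rows that actually changed
```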

@jtcohen6
Contributor

After giving this another read-through, I see significant overlap between this issue and one I just responded to yesterday: #6415 (comment)

Two points that carry over in particular:

  • As a general rule, we want out-of-the-box materialization behavior to be as simple & easy to reason about as possible. If given the choice between two approaches, with functionally identical outcomes, we should opt for the approach that keeps more of the business logic within the model SQL select statement.
  • At the same time, we want it to be possible to "register" a custom incremental strategy, when you want merge to work differently for certain models. There's a new & better pattern for doing this, starting in v1.3 (but we still need to document it! New components for incremental strategies docs.getdbt.com#1761). You could define (e.g.) a macro named get_incremental_merge_conditional_update_sql, and set incremental_strategy: 'merge_conditional_update' in your model config, along with any other configs that would be relevant to templating your custom merge statement. All that, without having to copy-paste-edit the incremental materialization, or defining one of your own.
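
As a sketch of the v1.3+ pattern described above (the macro body is illustrative; in recent dbt versions the custom strategy macro receives a dict of arguments, but verify the exact contract against your dbt version before relying on it):

```sql
-- macros/merge_conditional_update.sql
{% macro get_incremental_merge_conditional_update_sql(arg_dict) %}
    {# arg_dict is expected to carry target_relation, temp_relation,
       unique_key, dest_columns, etc. -- check your dbt version's docs #}
    merge into {{ arg_dict["target_relation"] }} as DBT_INTERNAL_DEST
    using {{ arg_dict["temp_relation"] }} as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_DEST.{{ arg_dict["unique_key"] }}
         = DBT_INTERNAL_SOURCE.{{ arg_dict["unique_key"] }}
    {# merge_update_condition is a hypothetical model config for illustration #}
    when matched and {{ config.get("merge_update_condition", "true") }}
        then update set ...
    when not matched then insert ...
{% endmacro %}
```

A model would then opt in with `incremental_strategy: 'merge_conditional_update'` in its config, as described above.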

@jtcohen6 jtcohen6 removed their assignment Dec 14, 2022
@tphillip33

Interesting. I will need to update to 1.3 and see if that will allow me to fix my problem.

@tphillip33

In reviewing this implementation, for my purposes I think I would rather simply override default__get_merge_sql to accept a new config setting, "merge_update_only_when_columns_different" (or something), and append it to the "when matched {merge_update_only_when_columns_different code here} then update set ..." clause.
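
A hedged sketch of that override (only the relevant when-matched fragment is shown; the real default__get_merge_sql in dbt-core also builds the column lists and the insert branch, so an actual override would need to reproduce those parts too):

```sql
-- macros/get_merge_sql.sql (fragment of a project-level override, not complete)
{% set update_condition = config.get("merge_update_only_when_columns_different") %}
...
when matched {% if update_condition %}and {{ update_condition }}{% endif %}
    then update set ...
...
```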

I will try that and see how that goes.

@github-actions
Contributor

github-actions bot commented Aug 9, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 9, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
