[CT-1639] [Feature] NULL-safe incremental merge strategy #6415
Comments
@raphaelvarieras Thanks for opening! Context for those watching at home: https://docs.snowflake.com/en/sql-reference/functions/is-distinct-from.html This is a neat function in Snowflake. As I understand it, the idea is that:

```sql
with some_data as (
    select 1 as id
    union all
    select 2 as id
    union all
    select null as id
)

select
    a.id,
    b.id,
    a.id = b.id as are_equal,
    a.id is not distinct from b.id as are_not_distinct,
    coalesce(a.id, 0) = coalesce(b.id, 0) as are_equal_with_coalesce -- same effect
from some_data as a
cross join some_data as b
```
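For the row pairs involving null, the three expressions disagree. Trimming the 3×3 cross join down to the illustrative rows, the result on Snowflake looks like this (note that `are_equal` is null, not false, when either side is null):

```
 A.ID | B.ID | ARE_EQUAL | ARE_NOT_DISTINCT | ARE_EQUAL_WITH_COALESCE
------+------+-----------+------------------+-------------------------
    1 |    1 | TRUE      | TRUE             | TRUE
    1 | NULL | NULL      | FALSE            | FALSE
 NULL | NULL | NULL      | TRUE             | TRUE
```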
Should this be possible? Yeah prob! We did some refactoring work in v1.3 that made it simpler & easier to "register" your own custom incremental strategy, by defining a macro named `get_incremental_<strategy>_sql`. For example:

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge_null_safe',
    unique_key = 'id'
) }}

{% if not is_incremental() %}

-- first run
select 1 as id, 'blue'::varchar(1000) as color
union all
select 2 as id, 'red' as color
union all
select null as id, 'green' as color

{% else %}

-- second run (+ subsequent)
select 1 as id, 'mauve' as color
union all
select 2 as id, 'purple' as color
union all
select null as id, 'yellow' as color

{% endif %}
```

And the custom strategy macro itself:

```sql
{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    -- {# these components are passed in as a dictionary from the incremental materialization #}
    {% set target, source, unique_key, dest_columns, predicates =
        arg_dict['target_relation'],
        arg_dict['temp_relation'],
        arg_dict['unique_key'],
        arg_dict['dest_columns'],
        arg_dict['predicates'] %}

    -- {# the vast majority of this code is copy-pasted from the 'get_merge_sql' macro #}
    {%- set predicates = [] if predicates is none else [] + predicates -%}
    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    {%- set update_columns = config.get('merge_update_columns', default = dest_columns | map(attribute="quoted") | list) -%}
    {%- set sql_header = config.get('sql_header', none) -%}

    {% if unique_key %}
        {% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
            {% for key in unique_key %}
                {% set this_key_match %}
                    -- this is different: use 'is not distinct from' instead of '='
                    DBT_INTERNAL_SOURCE.{{ key }} is not distinct from DBT_INTERNAL_DEST.{{ key }}
                {% endset %}
                {% do predicates.append(this_key_match) %}
            {% endfor %}
        {% else %}
            {% set unique_key_match %}
                -- this is different: use 'is not distinct from' instead of '='
                DBT_INTERNAL_SOURCE.{{ unique_key }} is not distinct from DBT_INTERNAL_DEST.{{ unique_key }}
            {% endset %}
            {% do predicates.append(unique_key_match) %}
        {% endif %}
    {% else %}
        {% do predicates.append('FALSE') %}
    {% endif %}

    {{ sql_header if sql_header is not none }}

    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
    on {{ predicates | join(' and ') }}

    {% if unique_key %}
    when matched then update set
        {% for column_name in update_columns -%}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

    when not matched then insert
        ({{ dest_cols_csv }})
    values
        ({{ dest_cols_csv }})

{% endmacro %}
```
From the logs:

```sql
merge into analytics.dbt_jcohen.my_model as DBT_INTERNAL_DEST
using analytics.dbt_jcohen.my_model__dbt_tmp as DBT_INTERNAL_SOURCE
on
    -- this is different: use 'is not distinct from' instead of '='
    DBT_INTERNAL_SOURCE.id is not distinct from DBT_INTERNAL_DEST.id
when matched then update set
    "ID" = DBT_INTERNAL_SOURCE."ID","COLOR" = DBT_INTERNAL_SOURCE."COLOR"
when not matched then insert
    ("ID", "COLOR")
values
    ("ID", "COLOR")
```

After the second run, the special `merge_null_safe` strategy has matched the null-keyed row and updated its color to 'yellow'. Whereas the traditional `merge` strategy compares keys with `=`, and since `null = null` evaluates to null rather than true, the null-keyed row never matches: the second run would insert a second null-keyed row instead of updating the existing one.
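To make the difference concrete, here is what the example model's target table would contain after the second run under each strategy (derived from the model SQL above, not from the issue's logs):

```
-- merge_null_safe: the null-keyed row is updated in place
(1, 'mauve'), (2, 'purple'), (null, 'yellow')

-- default merge: the null-keyed row never matches, so it is duplicated
(1, 'mauve'), (2, 'purple'), (null, 'green'), (null, 'yellow')
```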
**Should this be out-of-the-box behavior?**

My personal opinion: the question is, when do you ever want your `unique_key` to be null? I know it can happen, but it can also be handled. This is a matter of some religious debate in data modeling. Also think about the case where there are multiple records with null keys. Options:

1. Build the null-safe comparison into the built-in `merge` strategy, so null keys match out of the box.
2. Keep the built-in `merge` comparison as-is, and leave null-key handling to the user, in their model SQL or in a custom strategy like the one above.

I hold two principles when thinking about what logic should go where, and given those principles, I prefer 2! So I do think this should remain the default behavior of the simple `merge` strategy.

**Next steps: docs docs docs**

We definitely need to document the new approach to incremental strategies in v1.3+: dbt-labs/docs.getdbt.com#1761

In incremental models & snapshots, it is on the user to ensure that the `unique_key` they configure is truly unique (and not null). I'm less inclined to add this as a built-in strategy — but I would be willing to reconsider if lots of folks are running into the same problem, and we see the same custom strategy macro proliferating in the wild. I'm going to close this issue in the meantime, as something that can be resolved with a little bit of custom code — whether a custom strategy macro like the one above, or a coalesce on the key within the model itself.
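For completeness, here is a minimal sketch of the "handle it in your model SQL" option. Everything in it is illustrative rather than from the issue: it assumes a nullable `id` column and uses `-1` as a sentinel value that must not collide with real ids.

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'id_key'
) }}

select
    -- coalesce the nullable natural key to a sentinel, so the default
    -- merge strategy's `=` comparison matches the "null" row deterministically
    coalesce(id, -1) as id_key,
    id,
    color
from {{ ref('stg_colors') }}  -- hypothetical upstream model
```

With this, the built-in `merge` strategy works unchanged, since `id_key` is never null.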
@jtcohen6 one question you posed above is "When do you ever want your `unique_key` to be null?" My team is running into the need for null-safe merging due to the following use case: our incremental models key on a combination of columns, and some of those columns can legitimately be null. As a result, we need a null-safe merge, because the default `=` comparison never matches rows whose key columns are null. This is using Snowflake, by the way.
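To illustrate that kind of use case (the column names here are hypothetical, not from the comment above): a composite key with a nullable component could be configured like this, and the custom strategy macro above would emit one `is not distinct from` predicate per key column.

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge_null_safe',
    -- hypothetical composite key; campaign_id is legitimately null sometimes
    unique_key = ['customer_id', 'campaign_id']
) }}
```

The generated ON clause would then be (comments elided):

```sql
on DBT_INTERNAL_SOURCE.customer_id is not distinct from DBT_INTERNAL_DEST.customer_id
   and DBT_INTERNAL_SOURCE.campaign_id is not distinct from DBT_INTERNAL_DEST.campaign_id
```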
Is this your first time submitting a feature request?
Describe the feature
Add support for null-safe equality predicates using Snowflake's `IS NOT DISTINCT FROM` function, or an `IFNULL`- or `COALESCE`-based workaround. My suggestion would be to maybe create a third Snowflake strategy, in addition to `merge` and `delete+insert`, called `merge_null_safe`, that would use a different set of instructions based on `IS NOT DISTINCT FROM` predicates.

Describe alternatives you've considered
Adding a special purpose key to all incremental models that are using nullable keys as part of a set, but that's a lot of extra work.
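That "special purpose key" might look like the sketch below: a surrogate key built with `dbt_utils.generate_surrogate_key`, which casts each column to a string and coalesces nulls to a placeholder before hashing, so the resulting key is never null. Model and column names are hypothetical.

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'row_key'
) }}

select
    -- never-null surrogate key over the (possibly null) natural key columns
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'campaign_id']) }} as row_key,
    *
from {{ ref('stg_events') }}  -- hypothetical upstream model
```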
Who will this benefit?
Everybody who uses incremental models with potentially nullable keys.
Are you interested in contributing this feature?
Sure!
Anything else?
No response