
[CT-1639] [Feature] NULL-safe incremental merge strategy #6415

Closed
3 tasks done
raphaelvarieras opened this issue Dec 9, 2022 · 2 comments
Labels
enhancement New feature or request incremental Incremental modeling with dbt user docs [docs.getdbt.com] Needs better documentation

Comments


raphaelvarieras commented Dec 9, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Add support for null-safe equality predicates using Snowflake's IS NOT DISTINCT FROM operator, or an IFNULL- or COALESCE-based workaround. My suggestion would be to create a third Snowflake strategy, in addition to merge and delete+insert, called merge_null_safe, that would build its match condition from IS NOT DISTINCT FROM predicates instead of plain equality.
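To make the problem concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module. SQLite's IS operator is that dialect's null-safe equality, analogous to Snowflake's IS NOT DISTINCT FROM; the table and column names are invented for illustration.

```python
import sqlite3

# With plain `=`, a NULL key never matches, so a merge keyed on a
# nullable column re-inserts the NULL row instead of updating it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(1, "blue"), (None, "green")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "mauve"), (None, "yellow")])

# Plain equality: the NULL-keyed row finds no match.
eq_matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id = src.id"
).fetchone()[0]

# Null-safe equality (`IS` in SQLite): NULL matches NULL.
null_safe_matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id IS src.id"
).fetchone()[0]

print(eq_matches)         # 1 -> only id = 1 matched
print(null_safe_matches)  # 2 -> the NULL row matched too
```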

Describe alternatives you've considered

Adding a special-purpose key to every incremental model that uses nullable keys as part of a key set would work, but that's a lot of extra work.

Who will this benefit?

Everybody who uses incremental models with potentially nullable keys.

Are you interested in contributing this feature?

Sure!

Anything else?

No response

@raphaelvarieras raphaelvarieras added enhancement New feature or request triage labels Dec 9, 2022
@github-actions github-actions bot changed the title [Feature] NULL-safe incremental merge strategy [CT-1639] [Feature] NULL-safe incremental merge strategy Dec 9, 2022
@jtcohen6 jtcohen6 added the incremental Incremental modeling with dbt label Dec 13, 2022
@jtcohen6 jtcohen6 self-assigned this Dec 13, 2022
@jtcohen6 (Contributor) commented:

@raphaelvarieras Thanks for opening!

Context for those watching at home: https://docs.snowflake.com/en/sql-reference/functions/is-distinct-from.html

This is a neat operator in Snowflake. As I understand it, <anything> = null and <anything> != null always return null (and therefore never satisfy a match in a merge statement), but this operator treats null as a discrete value unto itself (which can then satisfy a when matched clause).

with some_data as (

    select 1 as id
    union all
    select 2 as id
    union all
    select null as id

)

select

    a.id,
    b.id,
    a.id = b.id as are_equal,
    a.id is not distinct from b.id as are_not_distinct,
    coalesce(a.id, 0) = coalesce(b.id, 0) as are_equal_with_coalesce  -- same effect

from some_data as a
cross join some_data as b
ID    ID    ARE_EQUAL  ARE_NOT_DISTINCT  ARE_EQUAL_WITH_COALESCE
1     1     TRUE       TRUE              TRUE
1     2     FALSE      FALSE             FALSE
1     null  NULL       FALSE             FALSE
2     1     FALSE      FALSE             FALSE
2     2     TRUE       TRUE              TRUE
2     null  NULL       FALSE             FALSE
null  1     NULL       FALSE             FALSE
null  2     NULL       FALSE             FALSE
null  null  NULL       TRUE              TRUE

Should this be possible? Yeah prob!

We did some refactoring work in v1.3 that made it simpler & easier to "register" your own custom incremental strategy, by defining a macro named get_incremental_<yourcoolstrat>_sql. But this still needs to be documented!!

For example:

{{ config(
  materialized = 'incremental',
  incremental_strategy = 'merge_null_safe',
  unique_key = 'id'
) }}

{% if not is_incremental() %}

-- first run
select 1 as id, 'blue'::varchar(1000) as color
union all
select 2 as id, 'red' as color
union all
select null as id, 'green' as color

{% else %}

-- second run (+ subsequent)
select 1 as id, 'mauve' as color
union all
select 2 as id, 'purple' as color
union all
select null as id, 'yellow' as color

{% endif %}
And the custom strategy macro:

{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    
    -- {# these components are passed in as a dictionary from the incremental materialization #}
    {% set target, source, unique_key, dest_columns, predicates =
      arg_dict['target_relation'],
      arg_dict['temp_relation'],
      arg_dict['unique_key'],
      arg_dict['dest_columns'],
      arg_dict['predicates'] %}
    
    -- {# the vast majority of this code is copy-pasted from the 'get_merge_sql' macro #}
    {%- set predicates = [] if predicates is none else [] + predicates -%}
    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    {%- set update_columns = config.get('merge_update_columns', default = dest_columns | map(attribute="quoted") | list) -%}
    {%- set sql_header = config.get('sql_header', none) -%}

    {% if unique_key %}
        {% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
            {% for key in unique_key %}
                {% set this_key_match %}
                    -- this is different: use 'is not distinct from' instead of '='
                    DBT_INTERNAL_SOURCE.{{ key }} is not distinct from DBT_INTERNAL_DEST.{{ key }}
                {% endset %}
                {% do predicates.append(this_key_match) %}
            {% endfor %}
        {% else %}
            {% set unique_key_match %}
                -- this is different: use 'is not distinct from' instead of '='
                DBT_INTERNAL_SOURCE.{{ unique_key }} is not distinct from DBT_INTERNAL_DEST.{{ unique_key }}
            {% endset %}
            {% do predicates.append(unique_key_match) %}
        {% endif %}
    {% else %}
        {% do predicates.append('FALSE') %}
    {% endif %}

    {{ sql_header if sql_header is not none }}

    merge into {{ target }} as DBT_INTERNAL_DEST
        using {{ source }} as DBT_INTERNAL_SOURCE
        on {{ predicates | join(' and ') }}

    {% if unique_key %}
    when matched then update set
        {% for column_name in update_columns -%}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

    when not matched then insert
        ({{ dest_cols_csv }})
    values
        ({{ dest_cols_csv }})

{% endmacro %}
Running the model twice:

$ dbt run --full-refresh
$ dbt run

From the logs:

    merge into analytics.dbt_jcohen.my_model as DBT_INTERNAL_DEST
        using analytics.dbt_jcohen.my_model__dbt_tmp as DBT_INTERNAL_SOURCE
        on 
                -- this is different: use 'is not distinct from' instead of '='
                DBT_INTERNAL_SOURCE.id is not distinct from DBT_INTERNAL_DEST.id
            

    
    when matched then update set
        "ID" = DBT_INTERNAL_SOURCE."ID","COLOR" = DBT_INTERNAL_SOURCE."COLOR"
    

    when not matched then insert
        ("ID", "COLOR")
    values
        ("ID", "COLOR")

After the second run, the special merge statement, templated by the merge_null_safe strategy, will have successfully overwritten the null row from green to yellow:

ID COLOR
1 mauve
2 purple
null yellow

Whereas the traditional merge strategy would yield:

ID COLOR
1 mauve
2 purple
null green
null yellow

The get_incremental_merge_null_safe_sql macro is a bit ugly, since it's doing a lot, and most of the code is copy-pasted from the existing get_merge_sql macro. But on the plus side, you only need to define one macro, and can keep using the built-in incremental materialization.

Should this be out-of-the-box behavior? My personal opinion follows. From the issue:

> Who will this benefit? Everybody who uses incremental models with potentially nullable keys.

> Describe alternatives you've considered: Adding a special purpose key to all incremental models that are using nullable keys as part of a set, but that's a lot of extra work.

The question is: When do you ever want your unique_key to be null?

I know it can happen, but it can also be handled. This is a matter of some religious debate in data modeling, but I think null values for primary keys are never a good idea — always better to coalesce() to some placeholder value (e.g. 0), within your modeling logic, or combine a frequently nullable column with other columns in a surrogate key, so that the materialization logic can remain a simple and unopinionated equality comparison (good old =). That should yield the same effect as is not distinct from, which requires a bit more thinking, and seems to have less consistent support across SQL dialects.
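The coalesce()-to-placeholder approach described above can be sketched with Python's sqlite3 (table names are invented; 0 is an assumed sentinel value that never occurs as a real key):

```python
import sqlite3

# Fix it in modeling logic: coalesce the nullable key to a sentinel on
# both sides, so good old `=` matches the NULL rows to each other.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(1, "blue"), (None, "green")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "mauve"), (None, "yellow")])

matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src "
    "ON coalesce(dest.id, 0) = coalesce(src.id, 0)"
).fetchone()[0]
print(matches)  # 2 -> both rows matched, including the NULL pair
```

In a real dbt project the coalesce would live in the model's select, so the materialization's equality predicate stays untouched.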

Also think about the case where there are multiple records with null for their value of unique_key.

Options:

  1. Stop the world during the merge statement. This would be the effect of using is not distinct from, since null is treated as a distinct value. Snowflake will raise an error saying Duplicate row detected during DML action.
  2. Not lossy & debuggable after the fact. Insert all records with null as their key, and then (hopefully) fail a not_null test soon after!
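Option 1's multi-row-match condition can be simulated with Python's sqlite3 (IS is SQLite's null-safe equality; names are illustrative). Snowflake would raise the error at merge time, whereas SQLite lets us count the colliding matches directly:

```python
import sqlite3

# Two destination rows share a NULL key; under a null-safe predicate,
# the single NULL-keyed source row matches BOTH of them, which is the
# multi-row match that Snowflake rejects during the merge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(None, "green"), (None, "teal")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.execute("INSERT INTO src VALUES (NULL, 'yellow')")

matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id IS src.id"
).fetchone()[0]
print(matches)  # 2 -> one source row matches two destination rows
```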

I hold two principles when thinking about what logic should go where:

  • More complexity in modeling logic is good if it enables simpler materialization logic that's doing fewer magical/implicit things. Materialization code is "behind the scenes" and harder to debug.
  • If your modeling logic succeeds and satisfies your expectations, your materialization logic should also succeed. Better to see an error or unexpected result in your previewed select statement than in your dbt run.

Given those principles, I prefer 2! So I do think this should remain the default behavior of the simple merge strategy.


Next steps: docs docs docs

We definitely need to document the new approach to incremental strategies in v1.3+: dbt-labs/docs.getdbt.com#1761

In incremental models & snapshots, it is on the user to ensure that the unique_key they have configured is in fact unique. I think we really mean unique_and_not_null_key — it's at least worth documenting that here!

I'm less inclined to add this as a built-in strategy — but I would be willing to reconsider if lots of folks are running into the same problem, and we see the same custom strategy macro proliferating in the wild. I'm going to close this issue in the meantime, as something that can be resolved with a little bit of custom code — whether a coalesce() or surrogate key in the modeling logic, or a custom strategy macro (get_incremental_merge_null_safe_sql).

@mmoyer-pax commented:
@jtcohen6 One question you posed above is: "When do you ever want your unique_key to be null?"

My team is running into the need for null-safe merging due to the following use case:

  • Our incremental table has a unique key column named unique_key, but the table is clustered on a column transaction_type (which can be null)
  • For efficiency, we include both columns in the dbt model's unique key config via unique_key=['transaction_type','unique_key'] because that forces the merge to also check the clustered transaction_type column, which is much more performant than just having it compare unique_key values

As a result, we need a null-safe merge because the transaction_type column can be null.

This is using Snowflake, by the way.
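A hedged sketch of this composite-key predicate, using Python's sqlite3 (IS is SQLite's null-safe equality; the table and column names are assumptions based on the description above): the real key is unique and non-null, so it keeps plain `=`, while the nullable clustering column gets the null-safe comparison.

```python
import sqlite3

# Composite key: `uk` is the true unique key (never NULL), while
# `transaction_type` is a nullable clustering column included in the
# predicate for merge pruning/performance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (transaction_type TEXT, uk INTEGER)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [("sale", 1), (None, 2)])
conn.execute("CREATE TABLE src (transaction_type TEXT, uk INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [("sale", 1), (None, 2)])

# Null-safe comparison only where NULLs can occur; `=` elsewhere.
matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src "
    "ON dest.transaction_type IS src.transaction_type "
    "AND dest.uk = src.uk"
).fetchone()[0]
print(matches)  # 2 -> the NULL transaction_type row still matches
```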
