
[CT-1639] [Feature] NULL-safe incremental merge strategy #6415

Closed
3 tasks done
raphaelvarieras opened this issue Dec 9, 2022 · 2 comments
Labels
enhancement New feature or request incremental Incremental modeling with dbt user docs [docs.getdbt.com] Needs better documentation

Comments


raphaelvarieras commented Dec 9, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Add support for null-safe equality predicates using Snowflake's IS NOT DISTINCT FROM operator, or an IFNULL- or COALESCE-based workaround. My suggestion would be to create a third Snowflake strategy, in addition to merge and delete+insert, called merge_null_safe, that would build its match condition from IS NOT DISTINCT FROM predicates instead of plain equality.
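To make the problem concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module. SQLite's IS operator is that dialect's null-safe equality, analogous to Snowflake's IS NOT DISTINCT FROM; the table and column names are invented for illustration.

```python
import sqlite3

# With plain `=`, a NULL key never matches, so a merge keyed on a
# nullable column re-inserts the NULL row instead of updating it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(1, "blue"), (None, "green")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "mauve"), (None, "yellow")])

# Plain equality: the NULL-keyed row finds no match.
eq_matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id = src.id"
).fetchone()[0]

# Null-safe equality (`IS` in SQLite): NULL matches NULL.
null_safe_matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id IS src.id"
).fetchone()[0]

print(eq_matches)         # 1 -> only id = 1 matched
print(null_safe_matches)  # 2 -> the NULL row matched too
```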

Describe alternatives you've considered

Adding a special-purpose key to every incremental model that uses nullable keys as part of a key set would work, but that's a lot of extra work.

Who will this benefit?

Everybody who uses incremental models with potentially nullable keys.

Are you interested in contributing this feature?

Sure!

Anything else?

No response

@raphaelvarieras raphaelvarieras added enhancement New feature or request triage labels Dec 9, 2022
@github-actions github-actions bot changed the title [Feature] NULL-safe incremental merge strategy [CT-1639] [Feature] NULL-safe incremental merge strategy Dec 9, 2022
@jtcohen6 jtcohen6 added the incremental Incremental modeling with dbt label Dec 13, 2022
@jtcohen6 jtcohen6 self-assigned this Dec 13, 2022
@jtcohen6 (Contributor) commented:

@raphaelvarieras Thanks for opening!

Context for those watching at home: https://docs.snowflake.com/en/sql-reference/functions/is-distinct-from.html

This is a neat operator in Snowflake. As I understand it, <anything> = null and <anything> != null always return null (and therefore never satisfy a match in a merge statement), but this operator treats null as a discrete value unto itself (which can then satisfy a when matched clause).

with some_data as (

    select 1 as id
    union all
    select 2 as id
    union all
    select null as id

)

select

    a.id,
    b.id,
    a.id = b.id as are_equal,
    a.id is not distinct from b.id as are_not_distinct,
    coalesce(a.id, 0) = coalesce(b.id, 0) as are_equal_with_coalesce  -- same effect

from some_data as a
cross join some_data as b
ID    ID    ARE_EQUAL  ARE_NOT_DISTINCT  ARE_EQUAL_WITH_COALESCE
1     1     TRUE       TRUE              TRUE
1     2     FALSE      FALSE             FALSE
1     null  NULL       FALSE             FALSE
2     1     FALSE      FALSE             FALSE
2     2     TRUE       TRUE              TRUE
2     null  NULL       FALSE             FALSE
null  1     NULL       FALSE             FALSE
null  2     NULL       FALSE             FALSE
null  null  NULL       TRUE              TRUE

Should this be possible? Yeah prob!

We did some refactoring work in v1.3 that made it simpler & easier to "register" your own custom incremental strategy, by defining a macro named get_incremental_<yourcoolstrat>_sql. But this still needs to be documented!!

For example:

{{ config(
  materialized = 'incremental',
  incremental_strategy = 'merge_null_safe',
  unique_key = 'id'
) }}

{% if not is_incremental() %}

-- first run
select 1 as id, 'blue'::varchar(1000) as color
union all
select 2 as id, 'red' as color
union all
select null as id, 'green' as color

{% else %}

-- second run (+ subsequent)
select 1 as id, 'mauve' as color
union all
select 2 as id, 'purple' as color
union all
select null as id, 'yellow' as color

{% endif %}
And the custom strategy macro:

{% macro get_incremental_merge_null_safe_sql(arg_dict) %}
    
    -- {# these components are passed in as a dictionary from the incremental materialization #}
    {% set target, source, unique_key, dest_columns, predicates =
      arg_dict['target_relation'],
      arg_dict['temp_relation'],
      arg_dict['unique_key'],
      arg_dict['dest_columns'],
      arg_dict['predicates'] %}
    
    -- {# the vast majority of this code is copy-pasted from the 'get_merge_sql' macro #}
    {%- set predicates = [] if predicates is none else [] + predicates -%}
    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}
    {%- set update_columns = config.get('merge_update_columns', default = dest_columns | map(attribute="quoted") | list) -%}
    {%- set sql_header = config.get('sql_header', none) -%}

    {% if unique_key %}
        {% if unique_key is sequence and unique_key is not mapping and unique_key is not string %}
            {% for key in unique_key %}
                {% set this_key_match %}
                    -- this is different: use 'is not distinct from' instead of '='
                    DBT_INTERNAL_SOURCE.{{ key }} is not distinct from DBT_INTERNAL_DEST.{{ key }}
                {% endset %}
                {% do predicates.append(this_key_match) %}
            {% endfor %}
        {% else %}
            {% set unique_key_match %}
                -- this is different: use 'is not distinct from' instead of '='
                DBT_INTERNAL_SOURCE.{{ unique_key }} is not distinct from DBT_INTERNAL_DEST.{{ unique_key }}
            {% endset %}
            {% do predicates.append(unique_key_match) %}
        {% endif %}
    {% else %}
        {% do predicates.append('FALSE') %}
    {% endif %}

    {{ sql_header if sql_header is not none }}

    merge into {{ target }} as DBT_INTERNAL_DEST
        using {{ source }} as DBT_INTERNAL_SOURCE
        on {{ predicates | join(' and ') }}

    {% if unique_key %}
    when matched then update set
        {% for column_name in update_columns -%}
            {{ column_name }} = DBT_INTERNAL_SOURCE.{{ column_name }}
            {%- if not loop.last %}, {%- endif %}
        {%- endfor %}
    {% endif %}

    when not matched then insert
        ({{ dest_cols_csv }})
    values
        ({{ dest_cols_csv }})

{% endmacro %}
Running the model twice:

$ dbt run --full-refresh
$ dbt run

From the logs:

    merge into analytics.dbt_jcohen.my_model as DBT_INTERNAL_DEST
        using analytics.dbt_jcohen.my_model__dbt_tmp as DBT_INTERNAL_SOURCE
        on 
                -- this is different: use 'is not distinct from' instead of '='
                DBT_INTERNAL_SOURCE.id is not distinct from DBT_INTERNAL_DEST.id
            

    
    when matched then update set
        "ID" = DBT_INTERNAL_SOURCE."ID","COLOR" = DBT_INTERNAL_SOURCE."COLOR"
    

    when not matched then insert
        ("ID", "COLOR")
    values
        ("ID", "COLOR")

After the second run, the special merge statement, templated by the merge_null_safe strategy, will have successfully overwritten the null row from green to yellow:

ID COLOR
1 mauve
2 purple
null yellow

Whereas the traditional merge strategy would yield:

ID COLOR
1 mauve
2 purple
null green
null yellow

The get_incremental_merge_null_safe_sql macro is a bit ugly, since it's doing a lot, and most of the code is copy-pasted from the existing get_merge_sql macro. But on the plus side, you only need to define one macro, and can keep using the built-in incremental materialization.

Should this be out-of-the-box behavior? My personal opinion follows. From the issue:

> Who will this benefit? Everybody who uses incremental models with potentially nullable keys.

> Describe alternatives you've considered: Adding a special purpose key to all incremental models that are using nullable keys as part of a set, but that's a lot of extra work.

The question is: When do you ever want your unique_key to be null?

I know it can happen, but it can also be handled. This is a matter of some religious debate in data modeling, but I think null values for primary keys are never a good idea — always better to coalesce() to some placeholder value (e.g. 0), within your modeling logic, or combine a frequently nullable column with other columns in a surrogate key, so that the materialization logic can remain a simple and unopinionated equality comparison (good old =). That should yield the same effect as is not distinct from, which requires a bit more thinking, and seems to have less consistent support across SQL dialects.
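The coalesce()-to-placeholder approach described above can be sketched with Python's sqlite3 (table names are invented; 0 is an assumed sentinel value that never occurs as a real key):

```python
import sqlite3

# Fix it in modeling logic: coalesce the nullable key to a sentinel on
# both sides, so good old `=` matches the NULL rows to each other.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(1, "blue"), (None, "green")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "mauve"), (None, "yellow")])

matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src "
    "ON coalesce(dest.id, 0) = coalesce(src.id, 0)"
).fetchone()[0]
print(matches)  # 2 -> both rows matched, including the NULL pair
```

In a real dbt project the coalesce would live in the model's select, so the materialization's equality predicate stays untouched.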

Also think about the case where there are multiple records with null for their value of unique_key.

Options:

  1. Stop the world during the merge statement. This would be the effect of using is not distinct from, since null is treated as a distinct value. Snowflake will raise an error saying Duplicate row detected during DML action.
  2. Not lossy & debuggable after the fact. Insert all records with null as their key, and then (hopefully) fail a not_null test soon after!
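Option 1's multi-row-match condition can be simulated with Python's sqlite3 (IS is SQLite's null-safe equality; names are illustrative). Snowflake would raise the error at merge time, whereas SQLite lets us count the colliding matches directly:

```python
import sqlite3

# Two destination rows share a NULL key; under a null-safe predicate,
# the single NULL-keyed source row matches BOTH of them, which is the
# multi-row match that Snowflake rejects during the merge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [(None, "green"), (None, "teal")])
conn.execute("CREATE TABLE src (id INTEGER, color TEXT)")
conn.execute("INSERT INTO src VALUES (NULL, 'yellow')")

matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src ON dest.id IS src.id"
).fetchone()[0]
print(matches)  # 2 -> one source row matches two destination rows
```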

I hold two principles when thinking about what logic should go where:

  • More complexity in modeling logic is good if it enables simpler materialization logic that's doing fewer magical/implicit things. Materialization code is "behind the scenes" and harder to debug.
  • If your modeling logic succeeds and satisfies your expectations, your materialization logic should also succeed. Better to see an error or unexpected result in your previewed select statement than in your dbt run.

Given those principles, I prefer 2! So I do think this should remain the default behavior of the simple merge strategy.


Next steps: docs docs docs

We definitely need to document the new approach to incremental strategies in v1.3+: dbt-labs/docs.getdbt.com#1761

In incremental models & snapshots, it is on the user to ensure that the unique_key they have configured is in fact unique. I think we really mean unique_and_not_null_key — it's at least worth documenting that here!

I'm less inclined to add this as a built-in strategy — but I would be willing to reconsider if lots of folks are running into the same problem, and we see the same custom strategy macro proliferating in the wild. I'm going to close this issue in the meantime, as something that can be resolved with a little bit of custom code — whether a coalesce() or surrogate key in the modeling logic, or a custom strategy macro (get_incremental_merge_null_safe_sql).

@mmoyer-pax commented:
@jtcohen6 One question you posed above is: "When do you ever want your unique_key to be null?"

My team is running into the need for null-safe merging due to the following use case:

  • Our incremental table has a unique key column named unique_key, but the table is clustered on a column transaction_type (which can be null)
  • For efficiency, we include both columns in the dbt model's unique key config via unique_key=['transaction_type','unique_key'] because that forces the merge to also check the clustered transaction_type column, which is much more performant than just having it compare unique_key values

As a result, we need a null-safe merge because the transaction_type column can be null.

This is using Snowflake, by the way.
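A hedged sketch of this composite-key predicate, using Python's sqlite3 (IS is SQLite's null-safe equality; the table and column names are assumptions based on the description above): the real key is unique and non-null, so it keeps plain `=`, while the nullable clustering column gets the null-safe comparison.

```python
import sqlite3

# Composite key: `uk` is the true unique key (never NULL), while
# `transaction_type` is a nullable clustering column included in the
# predicate for merge pruning/performance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (transaction_type TEXT, uk INTEGER)")
conn.executemany("INSERT INTO dest VALUES (?, ?)",
                 [("sale", 1), (None, 2)])
conn.execute("CREATE TABLE src (transaction_type TEXT, uk INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [("sale", 1), (None, 2)])

# Null-safe comparison only where NULLs can occur; `=` elsewhere.
matches = conn.execute(
    "SELECT count(*) FROM dest JOIN src "
    "ON dest.transaction_type IS src.transaction_type "
    "AND dest.uk = src.uk"
).fetchone()[0]
print(matches)  # 2 -> the NULL transaction_type row still matches
```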
