[BUG] Wrong Cast in macro type_string for Databricks Hash Function #238

ambullo · 2024-07-15T14:43:52Z

Describe the bug
In type_string.sql, lines 22 to 31, you cast to a VARCHAR whose length depends on the hash type. This is not appropriate for Databricks, as Databricks does not know the type VARCHAR. The comparable type in Databricks is STRING, which has no defined length.

This then leads to output from the hash algorithm like this:
UPPER(SHA2(NULLIF(CONCAT(
IFNULL(NULLIF(UPPER(TRIM(CAST(NaturalKey AS VARCHAR(32)))), ''), '^^')
), '^^'), 256)) AS NaturalKey_HK,

This is bad, since the cast to VARCHAR(32) could potentially truncate the column to be hashed and lead to unwanted hash collisions. Luckily, Databricks does not truncate the input column. Nevertheless, it should be fixed, as it has the potential to corrupt all hash keys.
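To illustrate the collision risk described above, here is a minimal Python sketch of what would happen on a platform that did enforce the VARCHAR(32) length by truncating. The helper `hash_key` and the sample keys are hypothetical, and `hashlib.sha256` stands in for `SHA2(..., 256)`; the NULLIF/IFNULL handling from the generated SQL is left out for brevity.

```python
import hashlib
from typing import Optional


def hash_key(value: str, max_len: Optional[int] = None) -> str:
    """Roughly mimic UPPER(SHA2(UPPER(TRIM(CAST(value AS VARCHAR(max_len)))), 256)).

    max_len=None models a cast to STRING (no length limit).
    A truncating cast is an assumption here; per the report, Databricks
    itself does not truncate.
    """
    # The cast happens first, so truncation is applied before TRIM and UPPER.
    casted = value if max_len is None else value[:max_len]
    normalized = casted.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest().upper()


# Two distinct natural keys that share their first 32 characters.
prefix = "X" * 32
key_a = prefix + "A"
key_b = prefix + "B"

# With a truncating VARCHAR(32) cast, both keys collapse to the same hash.
collides = hash_key(key_a, 32) == hash_key(key_b, 32)

# With an unbounded STRING cast, the hashes differ.
distinct = hash_key(key_a) != hash_key(key_b)

print(collides, distinct)
```

Under these assumptions both comparisons hold: the truncated keys hash identically, while the untruncated keys do not.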

Environment

dbt version: 1.8.3
automate_dv version: tested on 0.10.2, but most likely also affects 0.11.0
Database/Platform: Databricks
hash algorithm: SHA
enable_binary_hash: false

Expected behavior
Replace the cast to VARCHAR with a cast to STRING.

AB#5567

azure-boards · 2024-07-15T14:44:12Z

✅ Successfully linked to Azure Boards work item(s):

Bug 5567

DVAlexHiggs · 2024-07-15T14:52:27Z

Hi @ambullo, thanks for this thorough and detailed report.

I believe we fixed this as part of v0.11.0. v0.11.0 was explicitly an update to resolve hashing and handling of hash types correctly in both Databricks and BigQuery. Please can you upgrade and try it out?

If it's still an issue then we'd be happy to take a look, however 😄 Thanks!

ambullo · 2024-07-15T15:11:51Z

I just checked. This bug report also applies to v0.11.0.

mzemp · 2024-12-19T07:25:15Z

Hello @DVAlexHiggs

The fix for this is straightforward. The macro databricks__type_string in macros/supporting/data_types/type_string.sql needs to be corrected.

It should be

{%- macro databricks__type_string(is_hash, char_length) -%}
    STRING
{%- endmacro -%}

(similar to the BigQuery case) instead of the current

{%- macro databricks__type_string(is_hash=false, char_length=255) -%}
    {%- if is_hash -%}
        {%- if var('hash', 'MD5') | lower == 'md5' -%}
            VARCHAR(16)
        {%- elif var('hash', 'MD5') | lower == 'sha' -%}
            VARCHAR(32)
        {%- elif var('hash', 'MD5') | lower == 'sha1' -%}
            VARCHAR(20)
        {%- endif -%}
    {%- else -%}
        VARCHAR({{ char_length }})
    {%- endif -%}
{%- endmacro -%}

See also link above.

It also looks like the arguments for all these macros could be removed, since they are not really used. I assume these macros served a different purpose at some point.

For others with the same issue: until this is fixed by the AutomateDV team, you can create a macro with the correct definition above in the macros folder of your project. The AutomateDV version can then be overridden by adding

dispatch:
  - macro_namespace: automate_dv
    search_order: ["<my_project>", "automate_dv"]

in the dbt_project.yml file.

I hope this helps! 🙂

ambullo added the bug Something isn't working label Jul 15, 2024

ambullo assigned DVAlexHiggs Jul 15, 2024
