Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default implementation of deduplicate is not null-safe #621

Closed
1 of 5 tasks
dbeatty10 opened this issue Jul 16, 2022 · 7 comments
Closed
1 of 5 tasks

Default implementation of deduplicate is not null-safe #621

dbeatty10 opened this issue Jul 16, 2022 · 7 comments
Labels
bug Something isn't working Stale

Comments

@dbeatty10
Copy link
Contributor

Describe the bug

@belasobral93 discovered that deduplicate doesn't work for Spark when any of the order_by columns are null.

Root cause

The root cause is that Spark defaults to NULLS FIRST for the null_sort_order for the ORDER BY clause.

Why dbt_utils rather than spark_utils?

Explanation

But dbt_utils doesn't officially support Spark, so why is this being reported here (instead of spark_utils)?

dbt_utils is providing five implementations for deduplicate:

  1. default
  2. redshift
  3. postgres
  4. snowflake
  5. bigquery

Since dbt_utils only tests against the four databases listed above and each of those has an override, it means there is no testing for the default implementation which the other dbt adapters inherit (like Spark).

Steps to reproduce

SQL example

The current implementation essentially acts like the following example:

with relation as (

    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by

),

row_numbered as (

    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner

)

select
    distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
;

Expected results

partition_by order_by
1 a
2 a

Actual results

partition_by order_by
1 a

Screenshots and log output

Not provided.

System information

Not provided.

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other: spark

The output of dbt --version:
Not provided.

Additional context

To make the fix, it might be as simple as:

  • adding nulls last after order_by (when an order by clause exists, of course)

To confirm the bug and establish sufficient test cases, we could:

  1. comment out all adapter-specific overrides for deduplicate
  2. re-run the test suite
  3. add test cases as needed
  4. re-run the test suite
  5. make fixes to the implementation(s)
  6. re-run the test suite

Are you interested in contributing the fix?

I will contribute the fix.

@dbeatty10 dbeatty10 added bug Something isn't working triage and removed triage labels Jul 16, 2022
@dbeatty10 dbeatty10 changed the title Default implementation of deduplicate is not covered by testing Default implementation of deduplicate is not null-safe Jul 16, 2022
@joellabes
Copy link
Contributor

To make the fix, it might be as simple as:

  • adding nulls last after order_by (when an order by clause exists, of course)

Is nulls [last|first] a well-known part of the SQL spec? I've never heard of it, a bit of quick Googling implies that it's specific-ish

A couple of things that are only in my head/true by convention but not documented:

  • Deviations from the norm (default__ implementations) should be handled in an override. (Strongly held)
  • "The norm" is vanilla-flavoured things that can reasonably be expected of pretty much everything. 1 (Pretty strongly held)
  • I tend to assume that Postgres is vanilla-flavoured, as the longest-lived and only-OSS DB in the original core four. (Not at all strongly held, but don't have a better idea)
    1 Lookin' at you, BigQuery types

The approach that would be truest to points 1 and 3 would be:

  • The default implementation changes to Postgres'
  • Anyone who can't do that builds their own

Which is right if we want a globally coherent and understandable set of principles. It's not super pragmatic though; it appears that 10 data platforms support our current implementation, and as far as I can see most of them would not support Postgres' because they don't have distinct on.

In my opinion, "Vanilla as default" is more important than "every default implementation is what would work best in Postgres". (In fact, this might be a case where the default implementation works on PGSQL, but there is a special optimisation that led to the custom version? I haven't checked).

So that's a lot of words to say that if MySQL, Redshift, Materialize, SingleStore don't understand nulls last then we are inconveniencing some new adapters by throwing it in, and maybe we're better off just specifying nulls last in the Spark version.

@joellabes
Copy link
Contributor

Related (in that it's also a deduplicate issue): #713

@github-actions
Copy link

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 29, 2023
@github-actions
Copy link

github-actions bot commented Aug 5, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 5, 2023
@dbeatty10 dbeatty10 removed the Stale label Apr 18, 2024
@dbeatty10 dbeatty10 reopened this Apr 18, 2024
@nitindatta
Copy link

anyone any workarounds for this

Copy link

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 25, 2024
Copy link

github-actions bot commented Nov 1, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants