Default implementation of `deduplicate` is not null-safe #621
A couple of things that are only in my head/true by convention but not documented:
The approach that would be truest to points 1 and 3 would be:
Which would be right if we want a globally coherent and understandable set of principles. It's not super pragmatic, though: it appears that 10 data platforms support our current implementation, and as far as I can see most of them would not support Postgres' implementation, because they lack the feature it relies on. In my opinion, "vanilla as default" is more important than "every default implementation is what would work best in Postgres". (In fact, this might be a case where the default implementation works on PGSQL, but a special optimisation led to the custom version? I haven't checked.) So that's a lot of words to say that if MySQL, Redshift, Materialize and SingleStore don't understand the syntax, it shouldn't be the default implementation.
Related (in that it's also a deduplicate issue): #713
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Anyone have any workarounds for this?
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Describe the bug
@belasobral93 discovered that `deduplicate` doesn't work for Spark when any of the `order_by` columns are `null`.

Root cause

The root cause is that Spark defaults to `NULLS FIRST` as the null sort order for the `ORDER BY` clause.

Why `dbt_utils` rather than `spark_utils`?

Explanation
`dbt_utils` provides five implementations of `deduplicate`: a `default` implementation plus four adapter-specific overrides. Since `dbt_utils` only tests against those four databases, and each of them has an override, there is no test coverage for the `default` implementation that the other dbt adapters (like Spark) inherit.

Steps to reproduce
SQL example
The current implementation essentially acts like the following example:
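The original SQL is not reproduced here, but a minimal sketch of the same `row_number()`-based pattern (against the hypothetical `raw_events` table above) might look like:

```sql
with row_numbered as (
    select
        *,
        row_number() over (
            partition by user_id
            -- With an ascending sort, Spark implicitly applies NULLS FIRST,
            -- so a row whose updated_at is null sorts ahead of every real value
            order by updated_at
        ) as rn
    from raw_events
)

select *
from row_numbered
where rn = 1
-- On Spark, for any user_id that has a row with a null updated_at,
-- that null row gets rn = 1 and is the row that survives deduplication.
```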
Expected results
Actual results
Screenshots and log output
Not provided.
System information
Not provided.
Which database are you using dbt with?
The output of `dbt --version`: Not provided.
Additional context
To make the fix, it might be as simple as adding `nulls last` after the `order_by` expression (when an `order by` clause exists, of course).

To confirm the bug and establish sufficient test cases, we could add test coverage for the `default` implementation of `deduplicate`.
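Sketching that tweak against the same hypothetical example, the window's `order by` would become (note that a real fix would also have to cope with multi-column `order_by` expressions and explicit `asc`/`desc` modifiers):

```sql
row_number() over (
    partition by user_id
    -- nulls last overrides Spark's NULLS FIRST default for ascending sorts,
    -- so a row with a null sort key can no longer win rn = 1
    order by updated_at nulls last
) as rn
```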
Are you interested in contributing the fix?
I will contribute the fix.