Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate macro adds additional "with: clause, fails with a cte #642

Closed
mr78623451915 opened this issue Aug 16, 2022 · 6 comments
Closed
Labels
1.0 Changes to include in version 1.0 (especially breaking changes) bug Something isn't working good first issue Stale

Comments

@mr78623451915
Copy link

Describe the bug

The deduplicate macro adds a "with" clause, making it not work a cte, i.e., when there is already a with clause in the file.

Steps to reproduce

A basic cte and then deduplicate (silly example):

with my_cte as (
    select 1 as user_id
)

{{ dbt_utils.deduplicate(
    relation='my_cte',
    partition_by='user_id',
    order_by='user_id desc',
   )
}}

Compiles to

with my_cte as (
    select 1 as user_id
)

with row_numbered as (
        select
            _inner.*,
            row_number() over (
                partition by user_id
                order by user_id desc
            ) as rn
        from my_cte as _inner
    )

    select
        distinct data.*
    from my_cte as data
    
    natural join row_numbered
    where row_numbered.rn = 1

Expected results

The macro to work also with a cte, i.e., when there is already a with clause in the file.

Screenshots and log output

Compiler error for the above:

Server error: Runtime Error in rpc request (from remote system)
('42000', "[42000] [Simba][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: org.apache.hive.service.cli.HiveSQLException: Error running query: [PARSE_SYNTAX_ERROR] org.apache.spark.sql.catalyst.parser.ParseException: \n[PARSE_SYNTAX_ERROR] Syntax error at or near 'with'(line 5, pos 0)\n\n== SQL ==\nwith my_cte as (\n select 1 as us (80) (SQLExecDirectW)")

System information

Which database are you using dbt with?
databricks with dbt cloud v1.1.58.42

Additional context

"with" statement is in:
https://github.com/dbt-labs/dbt-utils/blob/0.8.6/macros/sql/deduplicate.sql line 42

Are you interested in contributing the fix?

Not experienced at all in DBT, so I should stay far away from trying to fix things ;-)

@mr78623451915 mr78623451915 added bug Something isn't working triage labels Aug 16, 2022
@joellabes
Copy link
Contributor

Good spotting! The interesting thing with this is that the Postgres, Snowflake and BigQuery variants don't require a CTE. It would be very good for all implementations to be consistent, and not needing a with clause seems easiest.

Lazy Joel thinks that maybe this could be done by wrapping it all in a select * from (with .... ) block, but I haven't tested that.

What we don't want: a flag on the macro to say whether the with is needed, because that will do nothing in the PG/SF/BQ cases.

I'm going to mark this as a good first issue for a community member to investigate. @mr78623451915 if you worked out a way to achieve it in Databricks SQL (disregarding the fact that it's in a macro), then we can provide the dbt experience to help you get it merged into the project. But right now all we need is SQL chops - I believe in you!

@joellabes joellabes added the 1.0 Changes to include in version 1.0 (especially breaking changes) label Sep 15, 2022
@joellabes
Copy link
Contributor

@ethanotran picking up from this discussion, this is another issue which I'd love to get in to dbt utils 1.0, and I think should be well-suited to a first contribution. To flesh out my example from above, I think the macro would be

select * from (
    [all the code that's already in default__deduplicate]
)

The piece that would be really valuable on top of this would be adding another test in the similar vein to @mr78623451915's example above, so that we can validate that we get the same behaviour on each of the warehouses we support.

You can see an example test here:

with
source as (
select *
from {{ ref('data_deduplicate') }}
where user_id = 1
),
deduped as (
{{
dbt_utils.deduplicate(
'source',
partition_by='user_id',
order_by='version desc',
) | indent
}}
)
select * from deduped

and its generic test defined here

- name: test_deduplicate
tests:
- dbt_utils.equality:
compare_model: ref('data_deduplicate_expected')

You're welcome to use the same input and output files, and just have another copy of the model file which behaves as above instead of wrapping inside a CTE.

@grhaonan
Copy link

Hi team,

Any update on this issue, I am facing the same issue with

  • dbt core 1.4.5
  • athena adaptor 1.4.2
  • dbt_utils 1.0.0

Thanks

@NagarajNune
Copy link

Hi @joellabes I created a PR for issue based on the hints and guidelines you provided and tested the same in the local environment also and it is working as expected.

Please let me know if this works.

This is my first PR any suggestions for improvements are welcome

Copy link

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 29, 2024
Copy link

github-actions bot commented Apr 5, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0 Changes to include in version 1.0 (especially breaking changes) bug Something isn't working good first issue Stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants