Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt diff() macro #42

Open
dbeatty10 opened this issue Mar 8, 2022 · 1 comment
Open

dbt diff() macro #42

dbeatty10 opened this issue Mar 8, 2022 · 1 comment

Comments

@dbeatty10
Copy link

I'm interested in prototyping the following proposal.

Goal

Enable the DIFF operator within a dbt project.

Implementation proposal

  • Create a suite of dbt macros that mimic the SQL generated within datools/explanations.py
  • diff() macro as the main interface
  • Result is a relation like this:
    image

Potential syntax

with
this_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-28' and '2018-09-04'

),

last_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-21' and '2018-08-28'

)

{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}

Examples of dbt macros

These two dbt packages contain macros that might be useful for inspiration:

For example:

@marcua
Copy link
Owner

marcua commented Mar 13, 2022

Thank you so much for your interest @dbeatty10! This is a really exciting proposal, as I'd love to see the library hooked into the existing data ecosystem. I can think of a few considerations. Hopefully this long-ish response comes off as excited and welcoming rather than off-putting or overwhelming. After you read it, let me know where/how I can help!

Separating query generation from query execution

Right now the explanations.diff function does two things: 1) Goes through the process of generating a query, 2) Executes the query to return potential explanations.

In the code, the crossover from step 1 to step 2 is here:

diff_query = _diff_query(

In your proposal, you'd like to use part 1 without relying on part 2. Since part 1 isn't exposed as a public API just yet, one way to accomplish that would be to extract diff_query (all the part 1 stuff) from diff, and then both the dbt wrapper and diff can call it for execution purposes.

Query generation executes queries as well

The explanations.diff query doesn't just "statically compile" SQL---it has to execute queries in order to know how to parametrize the SQL it's generating. This is a bit tougher to solve than the previous issue. Here are a few examples:

  • To filter the explanations to only ones that pass the support threshold, the code queries for the number of rows in the test relation:
    min_support_rows = floor(num_test_rows * min_support)
  • In order to consider range-valued columns (e.g., floating point numbers) for explanations, the code breaks those columns into 15 equi-sized buckets (it queries the underlying relations to determine those buckets), and then rewrites the test/control relations to include bucketed versions of the columns:
    range_statistics = range_valued_statistics(

The first issue seems easier to solve --- we can turn the support query into a subquery, and even benefit from one less round-trip to the database. The second issue will require a little bit of thought. Can dbt handle "I need to run a query inside this macro," or should we brainstorm ways to push the bucketing/rewrite logic into the query as well?

If neither queries-inside-macros nor pushing down the queries works, we can still explore an alternative where bucketing is its own macro distinct from diffing, but I'd love to avoid changing the diff API if at all possible.

Generating SQL beyond datools.diff

While diff is certainly the most exciting part of datools, there are other query generation goodies. For example, as part of making diff work on postgres, I've been implementing GROUPING SETS support for databases that don't implement grouping sets natively (e.g., SQLite, Redshift) and ones that do (e.g., DuckDB, Postgres). You could imagine dbt macros for that as well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants