Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2609] [Feature] Allow defining a default backfill value for incremental models with constraints #7732

Open
3 tasks done
b-per opened this issue May 29, 2023 · 8 comments
Labels
enhancement New feature or request incremental Incremental modeling with dbt model_contracts Refinement Maintainer input needed

Comments

@b-per
Copy link
Contributor

b-per commented May 29, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When adding a new column to an incremental model (set with append_new_columns) with a not null constraint, the new rows are checked to see if the new column is NULL but the table itself doesn't get its overall metadata/DDL updated (which is normal as there are now NULL values for previous rows).

This is good as data is checked as configured, but the table attributes in the Warehouse then doesn't reflect the fact the new column is verified to be not null.

The feature suggested would be to:

  1. allow setting a default value for historical rows in case a new column is added
  2. alter the table to signify that the column has a not null constraint

Describe alternatives you've considered

Alternative 1

It is currently possible to achieve the same outcome by manually:

  1. running update statements to update the historical rows with NULL values
  2. running some alter statement on the table to update the column as being not null

But this approach requires writing DDL directly when the suggested feature would handle this automatically.

Alternative 2

Another approach is to run a --full-refresh but this might not be possible for huge tables.

Who will this benefit?

People wanting to leverage the new constraints capabilities and handling large amount of data stored in incremental models.

Are you interested in contributing this feature?

If need be

Anything else?

No response

@b-per b-per added enhancement New feature or request triage labels May 29, 2023
@github-actions github-actions bot changed the title [Feature] Allow defining a default backfill value for incremental models with constraints [CT-2609] [Feature] Allow defining a default backfill value for incremental models with constraints May 29, 2023
@dbeatty10
Copy link
Contributor

This is an intriguing idea @b-per ! 💡

Database / schema migrations

The manual steps you described are basically like database / schema migrations in Django, Rails, and other similar frameworks. Incremental models already do similar operations as those migrations (i.e., adding columns like you described), but don't do others (like replacing NULLs with some default value).

Things to solve for

The piece of the design that seems most impactful is how to specify the value to use for the new column(s). There's a couple awkward things to solve for:

  1. There can be more than one column that is applicable
  2. Specifying which column(s) need default values
  3. This is essentially a one-time backfill per column and then never needed again

Specifying default values for not null constraints

Idea 1

One option would be to introduce a default_values dictionary that maps column names to a SQL expression to be used as the default value instead of NULLs.

It could look similar this:

models/dim_customers.sql

{{
    config(
        materialized='incremental',
        unique_key='id',
        on_schema_change='append_new_columns',
        default_values={
          "some_string_column": "'foobar'",
          "some_timestamp_column": "cast('0000-01-01' as timestamp)",
        }
    )
}}

Idea 2

An alternative could be to have the default value added to the configuration of the constraint itself, like this:

models/_models.yml

models:
  - name: dim_customers
    config:
      contract:
        enforced: true
    columns:
      - name: some_string_column
        data_type: varchar
        constraints:
          - type: not_null
            - default_value: "foobar"
      - name: some_timestamp_column
        data_type: timestamp
        constraints:
          - type: not_null
            - default_value: "cast('0000-01-01' as timestamp)"

Idea 3

Full-blown support for general database migrations. Such a feature would only be relevant to models materialized as incremental because ephemeral, table, and view don't need migrations.

Summary

I prefer idea 2 over idea 1 since there's already a full listing of the column names when a contract is enforced. It solves for the first two things listed in the "things to solve for" section. But neither idea solves for the one-time nature of these migrations. To solve for that one, we'd need idea 3. That typically involves storing state, which we've historically shied away from.

What do you think of those first two options versus each other @b-per? Did you have something different in mind?

@b-per
Copy link
Contributor Author

b-per commented Jun 1, 2023

Idea 1 and Idea 2 are tackling slightly different use cases

  1. allow it for all incrementals
  2. allow it for incrementals with constraints/contracts (second question, what about people who don't want an auto-backfill?)

Knowing that the original issue came more from a constraints use case I would be edging towards Idea 2, supporting it when contraints are defined.

Couple of additional ideas about it:

  • wouldn't it make more sense to call it backfill_value instead of default_value?
  • the logic would be:
    • if default_value/backfill_value is set, backfill the value when the new column is added AND alter the incremental table to set the correct constraint
    • if default_value/backfill_value is not set, continue like today, don't backfill, test the new rows for the constraint but don't set the constraint on the incremental table

@dbeatty10
Copy link
Contributor

@b-per good insight about Idea 1 and Idea 2 tackling slightly different use cases. I agree with your assessment and edging towards Idea 2!

wouldn't it make more sense to call it backfill_value instead of default_value?

Calling it default_value (or just default) was inspired by databases that support specifying a default within DDL like Postgres and SQL Server.

if default_value/backfill_value is not set, continue like today, don't backfill, test the new rows for the constraint but don't set the constraint on the incremental table

I'm concerned that it doesn't actually add the constraint. It feels like it should be all or nothing. Otherwise, it's really more like a pre-insert/merge data test on a batch than a true constraint. Allowing specification of a not_null constraint without actually apply it is non-intuitive to the users. To reduce confusion requires more documentation, more log output from the system, etc.

What do you think?

Discussing the details here makes me wonder if this type of backfill activity is necessarily difficult. i.e., the Alternative 1 that you mentioned originally is definitely overhead for the end user, but it's clear to them what they are doing and how the system is behaving.

@b-per
Copy link
Contributor Author

b-per commented Jun 5, 2023

I think that default and backfill_value are different concepts here:

  • default is the value we want to give to the column if it is NULL
  • backfill_value is the value we want to use for backfills, most likely, if we were to do a --full-refresh of the model, the value would be different from the default one

if default_value/backfill_value is not set, continue like today, don't backfill, test the new rows for the constraint but don't set the constraint on the incremental table

I'm concerned that it doesn't actually add the constraint. It feels like it should be all or nothing. Otherwise, it's really more like a pre-insert/merge data test on a batch than a true constraint. Allowing specification of a not_null constraint without actually apply it is non-intuitive to the users. To reduce confusion requires more documentation, more log output from the system, etc.

It might be confusing but it is the current behaviour. Would we be ok changing the current behaviour and making default_value/backfill_value mandatory for incremental models with append_new_columns? Would this be considered a breaking change? (technically nothing will break until new columns are added, but if they are added without the default_value/backfill_value it will then break.

Discussing the details here makes me wonder if this type of backfill activity is necessarily difficult. i.e., the Alternative 1 that you mentioned originally is definitely overhead for the end user, but it's clear to them what they are doing and how the system is behaving.

I think that not having it managed in dbt and letting people do it by hand is difficult. dev is easy, but what about CI/CD, what about Prod? Having it in dbt makes it part of your logic and makes it go through all the steps of moving a change from dev to prod.

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Dec 23, 2023
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 30, 2023
@b-per
Copy link
Contributor Author

b-per commented Jan 8, 2024

I think that this is still worth considering, hence why I reopen it but feel free to close it for good!

@b-per b-per reopened this Jan 8, 2024
@github-actions github-actions bot removed the stale Issues that have gone stale label Jan 9, 2024
@dbeatty10 dbeatty10 removed the triage label Apr 5, 2024
@dbeatty10 dbeatty10 added the Refinement Maintainer input needed label Apr 5, 2024
@boxysean
Copy link
Contributor

I have a client who asked for the default behavior as described by @b-per. The main reason is because they are accustomed to setting DEFAULT in DDL using non-dbt techniques, and so such a feature would help them achieve this with dbt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request incremental Incremental modeling with dbt model_contracts Refinement Maintainer input needed
Projects
None yet
Development

No branches or pull requests

4 participants