Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Optimize incremental 'insert_overwrite' strategy #1409

Open
3 tasks done
AxelThevenot opened this issue Nov 21, 2024 · 10 comments · May be fixed by #1410 or #1415
Open
3 tasks done

[Feature] Optimize incremental 'insert_overwrite' strategy #1409

AxelThevenot opened this issue Nov 21, 2024 · 10 comments · May be fixed by #1410 or #1415
Labels
enhancement New feature or request incremental

Comments

@AxelThevenot
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The MERGE statement is sub-optimized in BigQuery.
This is of course ok for unique keys as this is what we are looking for.

But for the insert_overwrite strategy where we are looking to rows at the partition-level there is a better solution and here is why:

  • a DELETE or INSERT statement is cheapest than a MERGE statement.
  • incremental tables are the most expensive tables in real-world projects.
  • The DELETE statement in BigQuery is free at the partition-level.

This has been tested at Carrefour which is my company.

  • On this replacement of the MERGE statement it reduces the cost by 50.4% and the elapsed time by 35.2%
  • On the overall procedure it reduces the cost by 26.1% and the elapsed time by 23.1%

This is wrapped in a transaction to avoid deleting rows if any error occurs.

Describe alternatives you've considered

I have considered an alternative to create an additional delete+insert incremental strategy.
But overriding the existing insert_overwrite is the more convenient as this is exactly what we are looking for and the features are the same.

Who will this benefit?

Of course, everyone without any changes.

Are you interested in contributing this feature?

Yes

Anything else?

No response

@AxelThevenot AxelThevenot added enhancement New feature or request triage labels Nov 21, 2024
AxelThevenot pushed a commit to AxelThevenot/dbt-bigquery that referenced this issue Nov 21, 2024
@philBatardiere
Copy link

Hey @AxelThevenot did you tried the https://docs.getdbt.com/reference/resource-configs/bigquery-configs#copying-partitions?

@AxelThevenot
Copy link
Author

Hi,

yes this is also something i've tried but:

  • It is sequential and not in parallel so really slow compare to this method
  • I prefer to have the entire view of what is executed in my BigQuery procedure instead of having the temp table creation in one side and a BigQuery job in an other side :)

I also though about the static partition definition but uses cases requires to have a dynamic partition overwrite :)

@philBatardiere
Copy link

Yes agreed if your incremental refresh a lot of partitions back then it can takes time (usually i refresh around 3 partitions back so the process isn’t so long).

For debuging purposes, I have seen some benefits having the tmp table by restoring it with the time travel feature. But its clearly worst for auditing purposes, it should be nice to have labels related to the dbt run id in the copy config btw 😀.

I keep your feature in mind in case of future need.

Thanks!

@AxelThevenot
Copy link
Author

For debuging purposes, I have seen some benefits having the tmp table by restoring it with the time travel feature. But its clearly worst for auditing purposes, it should be nice to have labels related to the dbt run id in the copy config btw 😀.

Oh yes it could be nice to have those labels

So you won't valide my pull request for now from what I understand ?

In any case, feel free to reach me if you want some help or anything else
https://www.linkedin.com/in/axel-thevenot/

@philBatardiere
Copy link

Unfortunately I don't have yet the authority to do it :) But I hope someone will do it

@AxelThevenot
Copy link
Author

AxelThevenot commented Nov 25, 2024

Ok, let's pray someone with this authority will go through this pull request/issue

I can create an additional delete+insert strategy instead of overriding the insert_overwrite, even if it make less sense to me 👍

@amychen1776
Copy link

@AxelThevenot, I don't think I quite understand your justification for not creating a new delete+insert materialization strategy. Changing a commonly used incremental strategy can cause some unintended results, especially since it seems like your intent is to remove the merge statement from the insert_override statement (or am I misunderstanding?)

@borjavb
Copy link

borjavb commented Nov 27, 2024

This is great stuff! But I think we should create a delete+insert operation instead of changing the already widely used insert_overwrite, as it could have unexpected consequences across all users. (+1 on @amychen1776)

For example, the DELETE operation being free is only under certain conditions as stated in the documentation and you might not always get the slot-free operation unfortunately. I’ve seen cases where the costs of the DELETE + INSERT end up being higher in aggregate to the whole MERGE. It’s rare, but It can happen and rolling this change for such a fundamental macro can backfire. At the end of the day is all about testing strategies that better suit the use case.

For example granular partitioned tables do not benefit from free DELETEs 😢 :
image

Using a delete+insert will definitely be a performance boost in most cases (if using daily partitions specially), but users have to be aware of how to get the best of it. Better to be clear through documentation and through new methods rather than changing the current one.

@AxelThevenot
Copy link
Author

AxelThevenot commented Nov 27, 2024

@amychen1776 and @borjavb I got your points, I will go for a delete+insert strategy if this is more convenient to you and avoid having edge cases for specific users

I will make the change as soon as possible and tell you when it's done :)

But delete+insert is at the unique-key-level in other adapters and I want a partition-level
Any idea of if I reuse this name even if this is not doing the same or if I give a new name ? (and what name?)

@borjavb
Copy link

borjavb commented Nov 27, 2024

Been thinking about this and... how do we feel about an operator/function configuration within instert_overwrite?

The way I've hijacked the insert_overwrite strategy in the past to implement this was through the config block and it always worked really well. Check this draft PR with the gist behind this idea.

Technically we have an incremental strategy that can work with different functions, but apply the same semantics of replacing a whole partition. So basically we can define an incremental_strategy plus the operator/function we want to use: either a merge or a delete+insert. So we get the best of both worlds:

  1. If it's not defined, we default to merge, so we keep backwards compatibility
  2. If the user selects the new delete+insert underlying function, we swap the macro that works under the hood

We already have custom configurations for merge like merge_update_columns and merge_exclude_columns, so this doesn't break with how we do things with other incremental strategies.

So the config block would look like something like this:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        insert_overwrite_function='delete+insert',
        cluster_by="id",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=partitions_to_replace,
    )
}}

Although I'm torn if we should add the incremental_strategy as a dict... but given that we haven't done that for the other configurations for merge... it feels now weird to add it as a dictionary? Not sure.

incremental_strategy={
   strategy = "insert_overwrite",
   operation = "delete+insert"
}

But this would just be some syntax sugar of how we want to expose the config. But I think with this we can let the user decide which operator is the most performant, and keep full backwards compatibility.

What do you think, @amychen1776 and @AxelThevenot?

@borjavb borjavb linked a pull request Nov 27, 2024 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request incremental
Projects
None yet
4 participants