[async] Evaluate the possibility of using dbt itself to create the full SQL command #1266

Closed
1 task done
tatiana opened this issue Oct 21, 2024 · 5 comments · Fixed by #1474
Labels: dbt:compile, execution:async

@tatiana
Collaborator

tatiana commented Oct 21, 2024

Context

When implementing #1230, we realised that the dbt compile command outputs the select statements related to models and transformations, but not necessarily the remaining relevant parts of the query (including creates, updates, inserts, drops).

This logic lives partially in dbt-core code and partially in the dbt adapters of interest.

Could we leverage the --empty flag (dbt-labs/dbt-core#8980 (comment)) in any way?

Acceptance criteria

  • Analyse whether, during dbt compile or in a related setup task, we can pre-create the full queries that we want to run with the async operators afterwards.
@tatiana tatiana added the execution:async Related to the Async execution mode label Oct 21, 2024
@dosubot dosubot bot added the dbt:compile Primarily related to dbt compile command or functionality label Oct 21, 2024
@tatiana tatiana added this to the Cosmos 1.9.0 milestone Oct 30, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Nov 29, 2024
@tatiana tatiana removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 13, 2025
@tatiana
Collaborator Author

tatiana commented Jan 13, 2025

@pankajastro
Contributor

pankajastro commented Jan 16, 2025

DBT Compile

  • This only generates the select statements for models, tests, and analyses
  • The --empty flag avoids expensive reads on the data warehouse, but we still get only the select query

it's often useful to execute the underlying select statement to find the source of the bug
https://docs.getdbt.com/reference/commands/compile

DBT Run

  • There is no --dry-run option in the dbt run command
  • --empty flag: the dbt run command supports the --empty flag. When used with dbt run, it offers behaviour similar to a dry run: dbt generates and runs the SQL statements without reading data from, or loading data into, the data warehouse. https://docs.getdbt.com/reference/commands/run#the---empty-flag

In conclusion, the --empty flag offers a lightweight, dry-run-like behaviour. We may be able to use it in the compile task to generate the SQL query, but if we do, we must keep in mind that:

  • The compile task will create tables, views, etc. in the data warehouse
  • The overall run command may not be idempotent
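
As a rough illustration of the dry-run-like invocation discussed above, the sketch below only assembles the argument list for `dbt run --empty`; it does not call dbt. The helper name `build_empty_run_args` and the model name `stg_orders` are hypothetical, while `--empty` and `--select` are existing dbt flags.

```python
from typing import List, Optional


def build_empty_run_args(select: Optional[str] = None) -> List[str]:
    """Assemble arguments for a dry-run-like `dbt run --empty` invocation."""
    args = ["run", "--empty"]
    if select:
        # Restrict the run to a single model or selector.
        args += ["--select", select]
    return args


args = build_empty_run_args("stg_orders")
print(" ".join(["dbt", *args]))
```

These arguments could be passed to the dbt CLI or to `dbtRunner().invoke(args)` from `dbt.cli.main`. Either way, the relations are still created in the warehouse, which is the caveat listed above.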

@pankajastro
Contributor

pankajastro commented Jan 20, 2025

@pankajkoti and I discussed this.

1. Empty Flag in Run Command:

  • The --empty flag in the dbt run command introduces additional query elements related to limits and parsing, which can make the query more prone to bugs.

Example

With --empty

...
orders as (
    select * from (select * from "postgres"."postgres"."stg_orders" where false limit 0) _dbt_limit_subq_stg_orders
)
...

Without --empty

...
orders as (
    select * from "postgres"."postgres"."stg_orders"
),
...
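
To make the difference between the two snippets concrete, here is a small sketch that reproduces the extra zero-row subquery seen in the first one. It only mimics the output shown above and is not how dbt implements the flag; the function name `wrap_ref_for_empty` is hypothetical.

```python
def wrap_ref_for_empty(relation: str) -> str:
    """Mimic the zero-row subquery seen in the `--empty` output above."""
    # Use the last identifier of the quoted relation as the alias suffix.
    name = relation.split(".")[-1].strip('"')
    return f"(select * from {relation} where false limit 0) _dbt_limit_subq_{name}"


relation = '"postgres"."postgres"."stg_orders"'
print(f"select * from {wrap_ref_for_empty(relation)}")
```

Every referenced relation gains a subquery, a filter, a limit, and an alias, which is the additional surface area the comment above is concerned about.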

2. Empty Flag in Run Command to Create Table/View + Compile command to get full query

  • Use the --empty flag in the dbt run command to create a table or view without data
  • Use the dbt compile command to generate the SQL
  • Use the SQL output by the dbt compile command in the async task
  • Problem: the compile command does not generate SQL that handles incremental dbt models

3. Monkey Patching

We decided to go with option (3)

@tatiana
Collaborator Author

tatiana commented Jan 21, 2025

Excellent analysis @pankajastro and @pankajkoti - I'm glad we have a way forward. I know monkey patching can be controversial, but I'm optimistic this will allow us to make progress on the async support.

@tatiana tatiana closed this as completed Jan 21, 2025
pankajkoti added a commit that referenced this issue Feb 5, 2025
…LOW_ASYNC` (#1474)

# Overview

This PR introduces a reliable way to extract SQL statements run by
`dbt-core` so Airflow asynchronous operators can use them. It fixes the
experimental BQ implementation of `ExecutionMode.AIRFLOW_ASYNC`
introduced in Cosmos 1.7 (#1230).

Previously, in #1230, we attempted to understand the implementation of
how `dbt-core` runs `--full-refresh` for BQ, and we hard-coded the SQL
header in Cosmos as an experimental feature. Since then, we realised
that this approach was prone to errors (e.g. #1260) and that it is
unrealistic for Cosmos to try to recreate the logic of how `dbt-core`
and its adapters generate all the SQL statements for different
operations, data warehouses, and types of materialisation.

With this PR, we use `dbt-core` to create the complete SQL statements
without `dbt-core` running those transformations. This enables better
compatibility with various `dbt-core` features while ensuring
correctness in running models.

The drawback of the current approach is that it relies on monkey
patching, a technique used to dynamically update the behaviour of a
piece of code at run-time. Cosmos monkey patches the `dbt-core` adapter
methods at the point where they would normally execute SQL statements:
Cosmos modifies this behaviour so that the SQL statements are written to
disk without performing any operations on the actual data warehouse.
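
The technique can be illustrated with a minimal, self-contained sketch. It does not use Cosmos or `dbt-core` code: `FakeConnectionManager`, `run_model`, `capture_sql`, the file name, and the SQL string are all hypothetical stand-ins, and the patching is done with the standard library's `unittest.mock.patch.object`.

```python
import tempfile
from pathlib import Path
from unittest.mock import patch


class FakeConnectionManager:
    """Hypothetical stand-in for an adapter's connection manager."""

    def execute(self, sql: str) -> str:
        raise RuntimeError("would have reached the data warehouse")


def run_model(manager: FakeConnectionManager) -> str:
    """Stand-in for the code path that would normally execute the model SQL."""
    return manager.execute('create or replace table "stg_orders" as (select 1)')


def capture_sql(target_dir: Path):
    """Build a replacement for `execute` that writes SQL to disk instead."""

    def _execute(self, sql: str) -> str:
        (target_dir / "stg_orders.sql").write_text(sql)
        return "skipped"

    return _execute


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    # The original method is restored when the `with` block exits.
    with patch.object(FakeConnectionManager, "execute", capture_sql(target)):
        status = run_model(FakeConnectionManager())
    written = (target / "stg_orders.sql").read_text()

print(status)
print(written)
```

The calling code is unchanged; only the method that would have talked to the warehouse is swapped, which is why the full statement, including the DDL header, ends up on disk.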

The main risk of this strategy is that dbt may change its interface.
For this reason, we logged the follow-up ticket
#1489 to make sure
we test the latest version of dbt and its adapters and confirm the
monkey patching works as expected regardless of the version being used.
That said, since the method being monkey patched is part of the
`dbt-core` interface with its adapters, we believe the risk of breaking
changes is low.

The other challenge with the current approach is that every Cosmos task
relies on the following:
1. `dbt-core` being installed alongside the Airflow installation
2. the execution of a significant part of the `dbtRunner` logic

We have logged a follow-up ticket to evaluate the possibility of
overcoming these challenges: #1477

## Key Changes

1. Mocked BigQuery Adapter Execution:
- Introduced `_mock_bigquery_adapter()` to override
`BigQueryConnectionManager.execute`, ensuring SQL is only written to the
`target` directory and skipping execution in the warehouse.
- The generated SQL is then submitted using Airflow’s
BigQueryInsertJobOperator in deferrable mode.
2. Refactoring `AbstractDbtBaseOperator`:
- Previously, `AbstractDbtBaseOperator` inherited from `BaseOperator`,
causing conflicts when used with `BigQueryInsertJobOperator` in
our `ExecutionMode.AIRFLOW_ASYNC` classes and the interface built in
#1483
- Refactored to `AbstractDbtBase` (no longer inheriting `BaseOperator`),
requiring explicit `BaseOperator` initialization in all derived
operators.
- Updated the following existing operators to account for this
refactoring, which requires derived classes to initialise `BaseOperator`:
        - `DbtAzureContainerInstanceBaseOperator`
        - `DbtDockerBaseOperator`
        - `DbtGcpCloudRunJobBaseOperator`
        - `DbtKubernetesBaseOperator`
3. Changes to dbt Compilation Workflow
- Removed `_add_dbt_compile_task`, which previously pre-generated SQL
and uploaded it to remote storage; subsequent tasks then downloaded this
compiled SQL for their execution.
- Instead, `dbt run` is now directly invoked in each task using the
mocked adapter to generate the full SQL.
- A future
[issue](#1477)
will assess whether we should reintroduce a compile task using the
mocked adapter for SQL generation and upload, reducing redundant dbt
calls in each task.
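
The `AbstractDbtBase` refactoring described above can be pictured with the following sketch. It is not the Cosmos implementation: `BaseOperator` here is a plain stand-in for the Airflow class, and `DbtExampleOperator`, its arguments, and the project path are hypothetical names used only to show a derived class initialising both parents explicitly.

```python
class BaseOperator:
    """Hypothetical stand-in for Airflow's BaseOperator."""

    def __init__(self, task_id: str, **kwargs) -> None:
        self.task_id = task_id


class AbstractDbtBase:
    """dbt-specific behaviour only; deliberately not an operator."""

    def __init__(self, project_dir: str, **kwargs) -> None:
        self.project_dir = project_dir

    def build_cmd(self) -> list:
        return ["dbt", "run", "--project-dir", self.project_dir]


class DbtExampleOperator(AbstractDbtBase, BaseOperator):
    """Derived operator that initialises both parents explicitly."""

    def __init__(self, task_id: str, project_dir: str, **kwargs) -> None:
        AbstractDbtBase.__init__(self, project_dir=project_dir, **kwargs)
        BaseOperator.__init__(self, task_id=task_id, **kwargs)


op = DbtExampleOperator(task_id="run_orders", project_dir="/tmp/jaffle_shop")
print(op.task_id, op.build_cmd())
```

Because the dbt base class no longer brings an operator with it, a derived class is free to combine it with a different operator parent, which is the kind of conflict the refactoring is meant to avoid.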

## Issue updates
The PR fixes the following issues:
1. closes: #1260 
- Previously, we only supported --full-refresh dbt run with static SQL
headers (e.g., CREATE/DROP TABLE).
- Now, we support dynamic SQL headers based on materializations,
including CREATE OR REPLACE TABLE, CREATE OR REPLACE VIEW, etc.
2. closes: #1271 
- dbt macros are evaluated at runtime during the dbt run invocation using
the mocked adapter, and this PR lays the groundwork for supporting them in
async execution mode.
3. closes: #1265 
- Now, large datasets can avoid full drops and recreations, enabling
incremental model updates.
4. closes: #1261 
- Previously, only tables (--full-refresh) were supported; this PR
implements logic for handling different materializations that dbt
supports like table, view, incremental, ephemeral, and materialized
views.
5. closes: #1266 
- Instead of relying on dbt compile (which only outputs SELECT
statements), we now let dbt generate complete SQL queries, including SQL
headers/DDL statements for the queries corresponding to the resource
nodes and state of tables/views in the backend warehouse
6. closes: #1264 
- With this PR, we also support emitting datasets for
`ExecutionMode.AIRFLOW_ASYNC`

## Example DAG showing `ExecutionMode.AIRFLOW_ASYNC` deferring tasks and the dynamic query submitted in the logs

<img width="1532" alt="Screenshot 2025-02-04 at 1 02 42 PM"
src="https://github.com/user-attachments/assets/baf15864-9bf8-4f35-95b7-954a1f547bfe"
/>


## Next Steps & Considerations:
- We acknowledge that monkey patching has downsides; however, it
currently seems the best approach to achieve our goals, and we
understand and accept the risks associated with it. To
mitigate them, we are expanding our test coverage to include all
currently supported dbt adapter versions in our test matrix in #1489.
This will ensure compatibility across different dbt versions and help
us catch potential issues early.
- Further validate different dbt macros and materializations with
`ExecutionMode.AIRFLOW_ASYNC` by seeking feedback from users testing the
alpha release created with the changes from this PR:
https://github.com/astronomer/astronomer-cosmos/releases/tag/astronomer-cosmos-v1.9.0a5
- #1477: compare
the efficiency of generating SQL dynamically vs. pre-compiling and
uploading SQL via a separate task.
- Add compatibility across all major cloud data warehouse backends (dbt
adapters).

---------

Co-authored-by: Tatiana Al-Chueyr <[email protected]>
Co-authored-by: Pankaj Singh <[email protected]>