
[Bug] adapter response returns incorrect data_scanned_in_bytes when incremental model is running #585

jvyoralek opened this issue Feb 22, 2024 · 3 comments

@jvyoralek

Is this a new bug in dbt-athena?

  • I believe this is a new bug in dbt-athena
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

We have defined an incremental model, model1, in dbt. When the dbt model population starts, two queries are issued to AWS Athena:

  1. create a model1__dbt_tmp table and populate it with data
    dbt-query1

  2. insert data from model1__dbt_tmp into model1
    dbt_query2

The cost of this incremental run should be the data_scanned_in_bytes from step 1 plus step 2: 3.95 kB + 0.79 kB = 4.74 kB.

#353 added new functionality to return data_scanned_in_bytes, but for an incremental build it returns only the step 2 value: 810 bytes.

Excerpt from the run_results.json file:

      "execution_time": 15.616735935211182,
      "adapter_response": {
        "_message": "OK 10",
        "code": "OK",
        "rows_affected": 10,
        "data_scanned_in_bytes": 810
      },
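For reference, the per-query scan sizes can be cross-checked outside dbt through Athena's GetQueryExecution API; a minimal sketch using boto3, where the two execution IDs are placeholders you would take from the Athena query history:

    import boto3

    # Placeholder execution IDs for step 1 (CTAS) and step 2 (insert);
    # take the real ones from the Athena query history.
    QUERY_IDS = ["<query-id-step-1>", "<query-id-step-2>"]

    athena = boto3.client("athena")

    total_scanned = 0
    for query_id in QUERY_IDS:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        # Athena keeps DataScannedInBytes in the per-query statistics.
        total_scanned += execution["QueryExecution"]["Statistics"]["DataScannedInBytes"]

    print(f"data scanned across both steps: {total_scanned} bytes")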

Expected Behavior

The cost reported for this incremental run should be the sum of data_scanned_in_bytes across all queries executed on AWS Athena.

Steps To Reproduce

No response

Environment

- OS: macOS 14.2.1
- Python: 3.11.7
- dbt: 1.7.4
- dbt-athena-community: 1.7.1

Additional Context

DBT command

Running with dbt=1.7.4
Registered adapter: athena=1.7.1
Found 225 models, 1961 tests, 2 seeds, 160 sources, 0 exposures, 0 metrics, 918 macros, 0 groups, 0 semantic models

Concurrency: 8 threads (target='dev')

1 of 1 START sql incremental model hd_dev_playground.model1 .................... [RUN]
1 of 1 OK created sql incremental model hd_dev_playground.model1 ............... [OK 10 in 14.26s]

Finished running 1 incremental model in 0 hours 0 minutes and 20.22 seconds (20.22s).

Completed successfully

Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

DBT model

{{
    config(
        table_type="iceberg",
        format="parquet",
        is_external=false,
        materialized="incremental",
        incremental_strategy="append",
        schema="playground",
        table_properties={"optimize_rewrite_delete_file_threshold": "2"},
        tags=["playground"],
    )
}}
select partition, kafka_offset, time_created, transaction_id, message_type
from {{ source("internal_source", "account_info_json_partitioned") }}
where "year" = 2024 and "month" = 2 and "day" = 21
limit 10

run_results.json - full file

@jvyoralek jvyoralek added the bug Something isn't working label Feb 22, 2024
@nicor88
Contributor

nicor88 commented Feb 23, 2024

This is a known issue, and it's particularly tricky: the original implementation from #353 no longer works as-is, due to how we handle the partition limitation introduced by #360.

Specifically, for tables with more than 100 partitions, as you noticed, there will be a CTAS plus many batch inserts. Accumulating the statistics of every executed query and returning the final sum is quite an effort, so when we implemented #375 we preferred simplicity over accuracy.
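(For illustration only, the accumulation described here could amount to summing the statistics over every statement a model run executes, instead of keeping just the last one; the names below are hypothetical, not the adapter's actual internals:)

    from dataclasses import dataclass

    # Hypothetical accumulator -- not the adapter's real response plumbing.
    @dataclass
    class ScanStats:
        rows_affected: int = 0
        data_scanned_in_bytes: int = 0

        def add_query(self, rows: int, scanned: int) -> None:
            # Called once per executed statement (CTAS, each batch insert, ...).
            self.rows_affected += rows
            self.data_scanned_in_bytes += scanned

    stats = ScanStats()
    stats.add_query(rows=10, scanned=4045)  # step 1: CTAS (~3.95 kB in the report)
    stats.add_query(rows=10, scanned=810)   # step 2: insert into model1
    print(stats.data_scanned_in_bytes)      # sum of both steps, not just step 2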

Could you please clarify why such a feature would be relevant for you? What use cases do you have?

@jvyoralek
Author

The use case involves monitoring the AWS Athena cost of model population. We have a bunch of models defined in dbt, using AWS Athena for storage. These models can be populated automatically or manually from Dagster, which is the UI for model orchestration.

We conceived an idea to incorporate the AWS Athena cost (the number of bytes scanned during model population) into the model metadata within Dagster for each run. This addition could help us identify models that are problematic from a cost perspective.

Example of how the metadata could look in Dagster:

{ 
  "unique_id": "model.project.model1",
  "invocation_id": "c8814bf2-e82a-412b-95b3-8df55b7b0bf1",
  "exucution_type": "incremental",
  "execution_duration_seconds": 1708,
  "rows_affected": 313,
  "total_data_scanned_mb": 122942,
  "total_spent_usd": 0.59
}

... populated from the dbt run_results.json file and Dagster internal variables.
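(As an illustration, that post-processing could be as little as the following sketch; it assumes dbt's standard run_results.json layout, and the $5-per-TB constant is Athena's published on-demand price, with 1 TB treated as 10^12 bytes for simplicity:)

    import json

    ATHENA_USD_PER_TB = 5.0  # AWS on-demand price: $5 per TB scanned

    with open("target/run_results.json") as f:
        run_results = json.load(f)

    for result in run_results["results"]:
        scanned = result["adapter_response"].get("data_scanned_in_bytes", 0)
        cost_usd = scanned / 1e12 * ATHENA_USD_PER_TB
        print(result["unique_id"], scanned, f"${cost_usd:.6f}")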

However, in the case of incremental models, this approach is problematic in the current version: it returns only the last part of the population, which could be just a small portion of the real 'price'.

Does it make sense?

@nicor88
Contributor

nicor88 commented Feb 23, 2024

Thanks, that makes sense. It would be neat to have what you requested; I'm not sure how much effort the change requires. As we are community-based, we rely a lot on OSS contributions, so feel free to take a spin at it, and we can guide/review what you propose.
