[Bug] metadata source freshness does not work as intended if table name contains wildcard. #536

Open
2 tasks done
tanukifk opened this issue Oct 3, 2024 · 3 comments
Labels
good-first-issue Good for newcomers pkg:dbt-bigquery Issue affects dbt-bigquery type:bug Something isn't working as documented

tanukifk commented Oct 3, 2024

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I define a table with a wildcard in its name as a source and then run a source freshness check without loaded_at_field, so that freshness is retrieved from metadata.

version: 2

sources:

  - name: dummy_source
    database: your-project
    schema: your-dataset
    freshness:
      warn_after: {count: 1, period: day}
      error_after: {count: 2, period: day}
    tables:
      - name: dummy_*

Even if the table was created a few days ago, max_loaded_at_time_ago_in_s is less than a second.

source.json looks like this:

{
    "metadata": {
        "dbt_schema_version": "https://schemas.getdbt.com/dbt/sources/v3.json",
        "dbt_version": "1.8.3",
        "generated_at": "2024-09-27T10:10:58.326203Z",
        "invocation_id": "xxxxxx",
        "env": {}
    },
    "results": [
        {
            "unique_id": "source.my_new_project.dummy_source.dummy_*",
            "max_loaded_at": "2024-09-27T10:10:57.753000+00:00",
            "snapshotted_at": "2024-09-27T10:10:58.320901+00:00",
            "max_loaded_at_time_ago_in_s": 0.567901,
            "status": "pass",
            "criteria": {
                "warn_after": {
                    "count": 1,
                    "period": "day"
                },
                "error_after": {
                    "count": 2,
                    "period": "day"
                },
                "filter": null
            },
            ...
        }
    ],
    "elapsed_time": 7.178150415420532
}
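
To make the numbers above concrete, here is a minimal sketch that recomputes the age from the two timestamps in the result. It assumes max_loaded_at_time_ago_in_s is simply snapshotted_at minus max_loaded_at in seconds, which is consistent with the values shown; the variable names are only for illustration.

```python
from datetime import datetime

# Timestamps copied from the sources.json result above.
max_loaded_at = datetime.fromisoformat("2024-09-27T10:10:57.753000+00:00")
snapshotted_at = datetime.fromisoformat("2024-09-27T10:10:58.320901+00:00")

# Assumption: the reported age is the difference between the two timestamps.
age_in_s = (snapshotted_at - max_loaded_at).total_seconds()
print(age_in_s)  # 0.567901

# warn_after is {count: 1, period: day}, expressed here in seconds.
warn_after_s = 1 * 24 * 60 * 60
print(age_in_s < warn_after_s)  # True, so the status is "pass"
```

Because max_loaded_at is effectively the time of the run, the age stays far below the one-day threshold no matter how old the matched tables are.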

Expected Behavior

max_loaded_at should be the table creation time, not the current timestamp.

Steps To Reproduce

  1. Define a source table with a wildcard in its name and omit loaded_at_field, so that source freshness is calculated from metadata.
  2. Run dbt source freshness.

Relevant log output

No response

Environment

- OS: Windows 11
- Python: 3.11.9
- dbt-core: 1.8.3
- dbt-bigquery: 1.8.2

Additional Context

According to the issue, the get_table method in calculate_freshness_from_metadata creates a temp table containing all matched tables and returns that temp table's creation time as modified, instead of the actual latest modified time that we want to retrieve.

Therefore, we should either raise an error or implement an alternative approach for tables whose names contain a wildcard.

https://github.com/dbt-labs/dbt-bigquery/blob/v1.8.2/dbt/adapters/bigquery/impl.py#L726-L744
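
The two options could look roughly like the sketch below. This is not the adapter's code: TableMetadata, check_no_wildcard, and latest_modified are hypothetical names, the per-table metadata is passed in as plain objects rather than fetched from BigQuery, and the wildcard is assumed to be a single trailing `*`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical stand-in for per-table metadata (not a dbt or BigQuery type)."""
    table_id: str
    modified: datetime


def check_no_wildcard(table_name: str) -> None:
    """Option 1: refuse metadata-based freshness for wildcard table names."""
    if "*" in table_name:
        raise ValueError(
            f"Metadata-based freshness is not supported for wildcard table "
            f"'{table_name}'; set loaded_at_field instead."
        )


def latest_modified(pattern: str, tables: Iterable[TableMetadata]) -> datetime:
    """Option 2: return the newest modified time among tables matching the pattern.

    Assumes the only wildcard is one trailing '*', so matching is a prefix check.
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        matched = [t for t in tables if t.table_id.startswith(prefix)]
    else:
        matched = [t for t in tables if t.table_id == pattern]
    if not matched:
        raise LookupError(f"No tables match '{pattern}'")
    return max(t.modified for t in matched)


# Illustrative shards; the names and dates are made up for this example.
shards = [
    TableMetadata("dummy_20241027", datetime(2024, 10, 27, tzinfo=timezone.utc)),
    TableMetadata("dummy_20241029", datetime(2024, 10, 29, tzinfo=timezone.utc)),
    TableMetadata("other_20241030", datetime(2024, 10, 30, tzinfo=timezone.utc)),
]
print(latest_modified("dummy_*", shards))  # 2024-10-29 00:00:00+00:00
```

Option 1 is the smaller change and at least makes the failure visible; option 2 matches what users would expect, but it needs a way to list the matched tables and their modified times, which is left out of this sketch.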

@tanukifk tanukifk added type:bug Something isn't working as documented triage:product In Product's queue labels Oct 3, 2024
amychen1776 (Contributor) commented

@tanukifk so I can fully understand the use case here - what are the use cases in which you need to use a wildcard in a table name?

@amychen1776 amychen1776 added triage:awaiting-response Awaiting a response from the reporter and removed triage:product In Product's queue labels Oct 28, 2024

tanukifk commented Oct 29, 2024

hi @amychen1776,

I typically use a wildcard for sharded tables like dummy_20241029.
(I know both dbt and BigQuery recommend using partitioned tables, but there are still plenty of sharded source tables.
ref: https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard)

If I specify loaded_at_field for a source with a wildcard, the source freshness command successfully fetches the latest timestamp across all matched tables. It is therefore natural for users to assume that the source freshness command also works correctly without loaded_at_field. However, it currently doesn't.
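
To illustrate why the loaded_at_field path works with a wildcard, here is a rough sketch of the kind of query it amounts to. The helper name build_freshness_query and the column name updated_at are hypothetical, and the SQL only approximates what dbt generates; the point is that a max() over the wildcard table scans every matched shard.

```python
def build_freshness_query(
    project: str, dataset: str, table_pattern: str, loaded_at_field: str
) -> str:
    """Build an approximate freshness query over a BigQuery wildcard table."""
    # Backticks are required around a wildcard table reference in BigQuery SQL.
    relation = f"`{project}.{dataset}.{table_pattern}`"
    return (
        f"select max({loaded_at_field}) as max_loaded_at, "
        "current_timestamp() as snapshotted_at "
        f"from {relation}"
    )


# 'updated_at' is an assumed column name, not taken from the issue.
query = build_freshness_query("your-project", "your-dataset", "dummy_*", "updated_at")
print(query)
```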

@github-actions github-actions bot added triage:product In Product's queue and removed triage:awaiting-response Awaiting a response from the reporter labels Oct 29, 2024
amychen1776 (Contributor) commented

Thank you for the information!

@amychen1776 amychen1776 removed the triage:product In Product's queue label Oct 30, 2024
@colin-rogers-dbt colin-rogers-dbt added the good-first-issue Good for newcomers label Nov 21, 2024
@mikealfare mikealfare added the pkg:dbt-bigquery Issue affects dbt-bigquery label Jan 14, 2025
@mikealfare mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025
colin-rogers-dbt pushed a commit that referenced this issue Feb 3, 2025