
[destination-bigquery] uses MAX(_airbyte_extracted_at) Query which results in high costs #48846

Open
timorkal opened this issue Dec 9, 2024 · 2 comments
Labels
area/connectors Connector related issues community connectors/destination/bigquery team/destinations Destinations team's backlog type/bug Something isn't working

Comments


timorkal commented Dec 9, 2024

Connector Name

destination-bigquery

Connector Version

2.9.0

What step the error happened?

None

Relevant information

I saw high costs when periodically syncing (every 5 minutes) my dataset from PostgreSQL to BigQuery.
When I drilled down a bit, I saw that most of the cost comes from this kind of query, run from the Airbyte side on each table:
SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`

Some of the tables are very heavy (~750 GB).
When I run the query manually it processes about 3.5 GB.
This query apparently runs on every sync (every 5 minutes), accumulating to more than ~300 runs a day.

Is there a way to lower the costs?
Maybe by changing the way I sync to BigQuery?
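
For anyone who wants to reproduce this kind of cost breakdown, a rough sketch against INFORMATION_SCHEMA.JOBS_BY_PROJECT; the `region-eu` qualifier and the seven-day window are assumptions, adjust them for your project:

-- Rough cost attribution for the MAX(_airbyte_extracted_at) probes over the last 7 days.
SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS runs,
  SUM(total_bytes_billed) / POW(2, 30) AS gib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND query LIKE '%MAX(_airbyte_extracted_at)%'
GROUP BY day
ORDER BY day;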

Relevant log output

Contribute

  • Yes, I want to contribute
timorkal added area/connectors Connector related issues, needs-triage, and type/bug Something isn't working labels Dec 9, 2024
timorkal changed the title BigQuery Connector Using MAX(_airbyte_extracted_at) Query which is results in costs → BigQuery Connector Using MAX(_airbyte_extracted_at) Query which results in high costs Dec 10, 2024
justbeez (Contributor) commented:

@timorkal I'm not sure this is actually a defect, as Airbyte appropriately clusters by _airbyte_extracted_at. Oddly, BigQuery still seems to scan the data rather than pruning via the clustering (which was unexpected to me).
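
For context, clustering only prunes blocks when the query filters on the clustered column; a bare MAX() with no WHERE clause still scans the whole column. A bounded probe like the sketch below would be pruned; the one-day lookback is purely illustrative, and Airbyte can't use it because it doesn't know a safe lower bound up front:

-- Illustrative only: a filter on the clustering column enables block pruning,
-- whereas SELECT MAX(...) without a filter scans the full column.
SELECT MAX(_airbyte_extracted_at)
FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`
WHERE _airbyte_extracted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);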

But even as things are, this query running every 5 minutes will only miss the cache when the underlying table data has changed. Assuming worst-case cache behavior and the non-discounted on-demand pricing model of $6.25/TiB processed, that would be:
3.5 GB * 288 runs per day * ~$0.0061 per GB ($6.25 / 1,024) ≈ $6.15 daily cost

While not "cheap", at the volumes you're talking about I don't consider that particularly expensive either. I imagine the actual MERGE queries to the destination table are likely far more expensive (assuming you're deduping), and even the INSERTs themselves are likely a lot more. So I would argue this is probably not the area that most needs optimizing, if there is one.
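
If you want to check where the spend actually goes, a rough breakdown by statement type would look something like this (the `region-eu` qualifier and seven-day window are assumptions; adjust for your project):

-- Illustrative breakdown: compare billed bytes for the SELECT probes vs. the MERGE/INSERT statements.
SELECT
  statement_type,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 30) AS gib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY statement_type
ORDER BY gib_billed DESC;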

I will say that if these types of queries are common, you may be able to optimize them by creating a Materialized View in BigQuery that aggregates the raw table in question. A unique attribute of Materialized Views in BigQuery is that they'll be used to automatically optimize queries against not only themselves, but the source tables they represent. Therefore you can do something like:

CREATE MATERIALIZED VIEW `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx_airbyte_extracted_at` AS
    SELECT MAX(_airbyte_extracted_at) AS _airbyte_extracted_at FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`

In testing this on a similarly sized Airbyte table (3.7GB to scan this column in my case), this reduced the cost of the aggregate query to processing only 8 Bytes (resulting in the minimum billable of 10MB). My only concern is that in your case, the data is changing more frequently—and I'm not sure what the underlying cost dynamics of maintaining the state of the materialized view would be for you. Usually the cost is just slots and storage, and in this case storage would only be a single value. But something I have noticed is that MVs tend to be very smart about how they update themselves, and generally I see them only process new data. So in your case, I'm expecting you'd see a positive trade-off.

So that's something you could try, but even if the result is positive for you, I'm not sure I would expect Airbyte to make those types of decisions for me (and the explainability of what all these random MVs are doing is . . . challenging).

Feel free to give it a try and report back though—if nothing else, it may be a workaround. But in this case, I'm assuming the $2.2K/year you're talking about in compute is likely a fairly small portion of your overall cloud data warehousing costs.

Maybe for high frequency syncs, Airbyte could store the high water mark . . . but this makes transferability/upgradeability difficult unless they actually store it in the destination somewhere (which means more random metadata tables that need to be managed).
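
To make that concrete, here is a purely hypothetical sketch of what a destination-side high-water-mark table might look like; nothing like this exists in the connector today, and the table and column names are invented:

-- Hypothetical: a tiny metadata table the destination could maintain after each sync,
-- so the next sync reads one row instead of scanning the raw table.
CREATE TABLE IF NOT EXISTS airbyte_internal._airbyte_stream_watermarks (
  stream_name STRING,
  max_extracted_at TIMESTAMP
);

-- Upsert the latest cursor value at the end of a sync (the value shown is a placeholder).
MERGE airbyte_internal._airbyte_stream_watermarks t
USING (
  SELECT 'xxxxx_db_eu_raw__stream_xxxxx' AS stream_name,
         CURRENT_TIMESTAMP() AS max_extracted_at
) s
ON t.stream_name = s.stream_name
WHEN MATCHED THEN UPDATE SET max_extracted_at = s.max_extracted_at
WHEN NOT MATCHED THEN INSERT (stream_name, max_extracted_at) VALUES (s.stream_name, s.max_extracted_at);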

marcosmarxm changed the title BigQuery Connector Using MAX(_airbyte_extracted_at) Query which results in high costs → [destination-bigquery] uses MAX(_airbyte_extracted_at) Query which results in high costs Dec 16, 2024
ProdigyMaster commented:
I totally relate to your experience, @timorkal. Incremental sync in Airbyte isn't working as well as I hoped. It felt more efficient back when Airbyte used dbt for incremental syncs; right now it seems to scan a lot of data both during ingestion and when building the final tables in BigQuery.
