
[destination-bigquery] uses MAX(_airbyte_extracted_at) Query which results in high costs #48846

Open
timorkal opened this issue Dec 9, 2024 · 2 comments
Labels
area/connectors Connector related issues community connectors/destination/bigquery team/destinations Destinations team's backlog type/bug Something isn't working

Comments


timorkal commented Dec 9, 2024

Connector Name

destination-bigquery

Connector Version

2.9.0

What step the error happened?

None

Relevant information

I saw high costs when periodically syncing (every 5 minutes) my dataset from PostgreSQL to BigQuery.
When I drilled down a bit, I saw that most of the cost comes from this kind of query, run from the Airbyte side on each table:
SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`

Some of the tables are very heavy (~750 GB).
When I run the query manually it processes about 3.5 GB.
This query apparently runs on every sync (every 5 minutes), accumulating to more than ~300 runs a day.

Is there a way to lower the costs?
Maybe by changing the way I sync to BigQuery?
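
For anyone who wants to reproduce this kind of cost breakdown, a rough sketch against INFORMATION_SCHEMA.JOBS_BY_PROJECT; the `region-eu` qualifier and the seven-day window are assumptions, adjust them for your project:

-- Rough cost attribution for the MAX(_airbyte_extracted_at) probes over the last 7 days.
SELECT
  DATE(creation_time) AS day,
  COUNT(*) AS runs,
  SUM(total_bytes_billed) / POW(2, 30) AS gib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND query LIKE '%MAX(_airbyte_extracted_at)%'
GROUP BY day
ORDER BY day;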

Relevant log output

Contribute

  • Yes, I want to contribute
timorkal added area/connectors Connector related issues, needs-triage, and type/bug Something isn't working labels Dec 9, 2024
timorkal changed the title BigQuery Connector Using MAX(_airbyte_extracted_at) Query which is results in costs → BigQuery Connector Using MAX(_airbyte_extracted_at) Query which results in high costs Dec 10, 2024
justbeez (Contributor) commented:

@timorkal I'm not sure this is actually a defect, as Airbyte appropriately clusters by _airbyte_extracted_at. Oddly, BigQuery still seems to scan the data rather than pruning via the clustering (which was unexpected to me).
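
For context, clustering only prunes blocks when the query filters on the clustered column; a bare MAX() with no WHERE clause still scans the whole column. A bounded probe like the sketch below would be pruned; the one-day lookback is purely illustrative, and Airbyte can't use it because it doesn't know a safe lower bound up front:

-- Illustrative only: a filter on the clustering column enables block pruning,
-- whereas SELECT MAX(...) without a filter scans the full column.
SELECT MAX(_airbyte_extracted_at)
FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`
WHERE _airbyte_extracted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);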

But even as things are, this query running every 5 minutes will only miss the cache when the underlying table data has changed. Assuming worst-case cache behavior and the non-discounted on-demand pricing model of $6.25/TiB processed, that would be:
3.5 GB * 288 runs per day * ~$0.0061 per GB ($6.25 / 1,024) ≈ $6.15 daily cost

While not "cheap", at the volumes you're talking about I don't consider that particularly expensive either. I imagine the actual MERGE queries to the destination table are likely far more expensive (assuming you're deduping), and even the INSERTs themselves are likely a lot more. So I would argue this is probably not the area that most needs optimizing, if there is one.
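
If you want to check where the spend actually goes, a rough breakdown by statement type would look something like this (the `region-eu` qualifier and seven-day window are assumptions; adjust for your project):

-- Illustrative breakdown: compare billed bytes for the SELECT probes vs. the MERGE/INSERT statements.
SELECT
  statement_type,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 30) AS gib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY statement_type
ORDER BY gib_billed DESC;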

I will say that if these types of queries are common, you may be able to optimize them by creating a Materialized View in BigQuery that aggregates the raw table in question. A unique attribute of Materialized Views in BigQuery is that they'll be used to automatically optimize queries against not only themselves, but the source tables they represent. Therefore you can do something like:

CREATE MATERIALIZED VIEW `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx_airbyte_extracted_at` AS
    SELECT MAX(_airbyte_extracted_at) AS _airbyte_extracted_at FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`

In testing this on a similarly sized Airbyte table (3.7GB to scan this column in my case), this reduced the cost of the aggregate query to processing only 8 Bytes (resulting in the minimum billable of 10MB). My only concern is that in your case, the data is changing more frequently—and I'm not sure what the underlying cost dynamics of maintaining the state of the materialized view would be for you. Usually the cost is just slots and storage, and in this case storage would only be a single value. But something I have noticed is that MVs tend to be very smart about how they update themselves, and generally I see them only process new data. So in your case, I'm expecting you'd see a positive trade-off.

So that's something you could try, but even if the result is positive for you, I'm not sure I would expect Airbyte to make those types of decisions for me (and the explainability of what all these random MVs are doing is . . . challenging).

Feel free to give it a try and report back though—if nothing else, it may be a workaround. But in this case, I'm assuming the $2.2K/year you're talking about in compute is likely a fairly small portion of your overall cloud data warehousing costs.

Maybe for high frequency syncs, Airbyte could store the high water mark . . . but this makes transferability/upgradeability difficult unless they actually store it in the destination somewhere (which means more random metadata tables that need to be managed).
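
To make that concrete, here is a purely hypothetical sketch of what a destination-side high-water-mark table might look like; nothing like this exists in the connector today, and the table and column names are invented:

-- Hypothetical: a tiny metadata table the destination could maintain after each sync,
-- so the next sync reads one row instead of scanning the raw table.
CREATE TABLE IF NOT EXISTS airbyte_internal._airbyte_stream_watermarks (
  stream_name STRING,
  max_extracted_at TIMESTAMP
);

-- Upsert the latest cursor value at the end of a sync (the value shown is a placeholder).
MERGE airbyte_internal._airbyte_stream_watermarks t
USING (
  SELECT 'xxxxx_db_eu_raw__stream_xxxxx' AS stream_name,
         CURRENT_TIMESTAMP() AS max_extracted_at
) s
ON t.stream_name = s.stream_name
WHEN MATCHED THEN UPDATE SET max_extracted_at = s.max_extracted_at
WHEN NOT MATCHED THEN INSERT (stream_name, max_extracted_at) VALUES (s.stream_name, s.max_extracted_at);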

marcosmarxm changed the title BigQuery Connector Using MAX(_airbyte_extracted_at) Query which results in high costs → [destination-bigquery] uses MAX(_airbyte_extracted_at) Query which results in high costs Dec 16, 2024
ProdigyMaster commented:
I totally relate to your experience, @timorkal. Incremental sync in Airbyte isn't working as well as I hoped. It felt more efficient back when Airbyte used dbt for incremental syncs; right now it seems to scan a lot of data both during ingestion and when building the final tables in BigQuery.
