[destination-bigquery] uses MAX(_airbyte_extracted_at) Query which results in high costs #48846
Comments
@timorkal I'm not sure that this is actually a defect, as Airbyte appropriately clusters the raw table by `_airbyte_extracted_at`. But even as things are, this query being run every 5 minutes will only be uncached when the underlying table data has changed.

In the worst case (no cache hits), at the non-discounted On-Demand pricing model of $6.25/TiB processed, that would be: 3.5 GB per run × ~300 runs/day ≈ 1 TiB/day, or about $6.25/day, roughly $2.2–2.3K/year. While not "cheap", at the volumes you're talking about I don't consider that particularly expensive either. I imagine the actual figure is lower, since cached runs are free.

I will say that if these types of queries are common, you may be able to optimize them by creating a Materialized View in BigQuery that aggregates the raw table in question. A unique attribute of Materialized Views in BigQuery is that they'll be used to automatically optimize queries against not only themselves, but also the source tables they represent. Therefore you can do something like:

```sql
CREATE MATERIALIZED VIEW `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx_airbyte_extracted_at` AS
SELECT MAX(_airbyte_extracted_at) AS _airbyte_extracted_at
FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`
```

In testing this on a similarly sized Airbyte table (3.7 GB to scan this column in my case), it reduced the cost of the aggregate query to processing only 8 bytes (resulting in the minimum billable of 10 MB).

My only concern is that in your case the data is changing more frequently, and I'm not sure what the underlying cost dynamics of maintaining the state of the materialized view would be for you. Usually the cost is just slots and storage, and here the storage would only be a single value. But something I have noticed is that MVs tend to be very smart about how they update themselves, and generally I see them process only new data. So in your case, I'm expecting you'd see a positive trade-off.

So that's something you could try, but even if the result is positive for you, I'm not sure I would expect Airbyte to make those types of decisions for me (and the explainability of what all these random MVs are doing is... challenging). Feel free to give it a try and report back, though; if nothing else, it may be a workaround.

But in this case, I'm assuming the ~$2.2K/year of compute you're talking about is likely a fairly small portion of your overall cloud data warehousing costs. Maybe for high-frequency syncs, Airbyte could store the high-water mark... but this makes transferability/upgradeability difficult unless they actually store it in the destination somewhere (which means more random metadata tables that need to be managed).
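If you try it, one way to confirm the rewrite is actually happening is to compare bytes billed for these jobs before and after creating the MV. A minimal sketch, assuming the dataset lives in the EU (the `region-eu` qualifier is my guess from your dataset name; adjust as needed):

```sql
-- Recent runs of the high-water-mark query: bytes billed and cache hits.
-- Assumes EU region based on the dataset name; adjust the region qualifier as needed.
SELECT
  creation_time,
  cache_hit,
  total_bytes_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND query LIKE '%MAX(_airbyte_extracted_at)%'
ORDER BY creation_time DESC;
```

With the MV in place, `total_bytes_billed` for these jobs should drop to the 10 MB minimum (or 0 on cache hits).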
I totally relate to your experience, @timorkal. The incremental sync in Airbyte isn't working as well as I hoped. It feels like it performed better and was more efficient back when Airbyte used dbt for incremental syncs. Right now, it seems to scan a lot of data both during ingestion and when creating the final tables in BigQuery.
Connector Name
destination-bigquery
Connector Version
2.9.0
What step the error happened?
None
Relevant information
I saw high costs when periodically syncing (every 5 minutes) my dataset from PostgreSQL to BigQuery.
When I drilled down a bit, I saw that most of the cost comes from these kinds of queries performed from the Airbyte side (on each table):
```sql
SELECT MAX(_airbyte_extracted_at) FROM `airbyte_internal.xxxxx_db_eu_raw__stream_xxxxx`
```

Some of the tables are very heavy (~750 GB).
When running it manually, it processes about 3.5 GB.
This kind of query probably runs on every sync (every 5 minutes), accumulating to more than ~300 runs a day.
Is there a way to lower the costs?
Maybe by changing the way I sync to BigQuery?
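For reference, how much such a query scans depends on the raw table's layout. A metadata sketch like this (table names anonymized as above; it reads only metadata, not table data) shows whether `_airbyte_extracted_at` is a partitioning or clustering column on the raw table:

```sql
-- Check whether _airbyte_extracted_at is a partitioning or clustering column
-- on the raw table (table names anonymized as above).
SELECT
  column_name,
  is_partitioning_column,
  clustering_ordinal_position
FROM `airbyte_internal`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'xxxxx_db_eu_raw__stream_xxxxx'
  AND column_name = '_airbyte_extracted_at';
```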
Relevant log output
Contribute