BigQuery Cost Monitoring, Insights and Cost Extraction with Apache Airflow

Problem

Thousands of businesses use BigQuery as the data warehouse in their data pipeline. When data engineering resources are scarce, the BigQuery Data Transfer Service becomes useful, since it can pull data into BigQuery from a variety of sources: Google products (YouTube, Google Analytics, Cloud resources), some AWS products (Redshift, S3), and hundreds of third-party transfer tools. After a while, you figure out that the majority of your BigQuery bill comes from Analysis, not from Storage. At that point you might look for ways to reduce the analysis cost and find some articles about query optimization in BigQuery. You pick up a few rules to follow in BigQuery SQL, like "LIMIT does not reduce the bytes processed" and "Only SELECT the columns you actually need". But you still don't see any decrease in the bill. That is where this repository comes into play :)

Approach and Solution

My approach was to first understand the root cause of the problem. What is the major cost driver in BigQuery? Is the cost high because of queries run by users, or because of automated jobs? Which tables and views contribute most to the cost? How do we prioritize the views to optimize?

Using the BigQuery Python client library, we can access, manage, and interact with the elements of BigQuery. Within the library, the bigquery.job classes give access to the history and details of every query and load job run in BigQuery.

1- BigQuery Class
Creates a BigQuery client from the path to a Google credentials JSON file.
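A minimal sketch of this step, assuming a service-account key file (the path and wiring here are illustrative, not the repository's exact code):

```python
# Create a BigQuery client from a service-account JSON key file.
# The key-file path below is a placeholder.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service_account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```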

2- BigQueryJobs Class
Inherits from the BigQuery class and uses the client to list the historical job records, iterating over the jobs to extract the following details (a sketch follows the list):

  1. Job Type (Load or Query)
  2. Creation Time
  3. User Email
  4. Query
  5. Query run time
  6. Billed bytes for the query run
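A hedged sketch of that extraction, reusing the `client` from the previous snippet; the repository's BigQueryJobs class may structure this differently:

```python
# List historical jobs and collect the fields listed above.
records = []
for job in client.list_jobs(all_users=True, max_results=1000):
    if job.job_type not in ("query", "load"):
        continue
    records.append({
        "job_type": job.job_type,
        "created": job.created,
        "user_email": job.user_email,
        # load jobs carry no query text
        "query": getattr(job, "query", None),
        "run_time_s": (job.ended - job.started).total_seconds()
        if job.started and job.ended else None,
        "billed_bytes": getattr(job, "total_bytes_billed", None),
    })
```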

To understand which tables are used in a query, I used the Parser class from the sql_metadata Python module. sql_metadata does the following:

SQL query: SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.x = t2.x
sql_metadata output: ["table1", "table2"]
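In code, that lookup looks like this:

```python
from sql_metadata import Parser

query = "SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.x = t2.x"
print(Parser(query).tables)  # ['table1', 'table2']
```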

So we can apply the same logic to BigQuery queries and extract the tables referenced in each one. Then, assuming that every table in a query contributes an equal share of the processed bytes, we can attribute each query's billed bytes to its tables and sum the total billed bytes per table.
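A sketch of that equal-split attribution, building on the `records` list from the earlier snippet:

```python
# Credit each table referenced in a query with an equal share of the
# query's billed bytes, then accumulate totals per table.
from collections import defaultdict
from sql_metadata import Parser

bytes_per_table = defaultdict(float)
for rec in records:
    if not rec["query"] or not rec["billed_bytes"]:
        continue
    tables = Parser(rec["query"]).tables
    if not tables:
        continue
    share = rec["billed_bytes"] / len(tables)
    for table in tables:
        bytes_per_table[table] += share
```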

Now that we have a rough estimate of bytes per table, we can apply BigQuery's on-demand pricing for the relevant region and currency and calculate the total cost per query by processing all the job data returned by the client.
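For illustration, here is the conversion from billed bytes to cost applied to the per-table totals above; the rate is an assumed placeholder, so substitute the actual on-demand price for your region and currency:

```python
# Convert accumulated billed bytes into an approximate cost per table.
PRICE_PER_TIB = 5.0        # assumed example rate, not the official price
BYTES_PER_TIB = 2 ** 40

cost_per_table = {
    table: (billed / BYTES_PER_TIB) * PRICE_PER_TIB
    for table, billed in bytes_per_table.items()
}
```

The same arithmetic applies per query, using each query's own billed bytes.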

The rest of the job is to schedule this code with Apache Airflow, which is already implemented in this repository :).
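A minimal DAG sketch for such a schedule (the DAG id, schedule, and callable here are hypothetical; the repository's actual DAG may differ):

```python
# Run the cost-extraction pipeline once a day with Airflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_bigquery_costs():
    # would run the job listing, table parsing, and cost calculation steps above
    ...


with DAG(
    dag_id="bigquery_cost_monitoring",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_costs",
        python_callable=extract_bigquery_costs,
    )
```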

To use the code: replace the XXXXXX placeholders in SqlQueryManager.py with your own credentials.
