Thousands of businesses prefer BigQuery as the data warehouse in their data pipelines. When data engineering resources are scarce, the BigQuery Data Transfer Service becomes useful, since it offers a variety of sources to integrate into BigQuery: Google products (YouTube, Google Analytics, Cloud resources), some AWS products (Redshift, S3), and hundreds of third-party transfer tools. After a while, you figure out that the majority of your BigQuery bill comes from Analysis, not from Storage. At that point you might decide to look for ways to reduce the analysis cost, and you find some articles about query optimization in BigQuery. You pick up a few rules of thumb for BigQuery SQL, like "LIMIT doesn't reduce the bytes processed" and "only SELECT the columns you are interested in". But you still don't observe any decrease in the bill. That is where this repository comes into play :)
My approach was to first understand the root cause. What drives most of the BigQuery cost? Is it high because of queries run by users or by automated jobs? Which tables and views contribute the most to the cost? How do we prioritize the views to optimize?
Using the BigQuery Python module, we can access, manage, and interact with the elements of BigQuery. Within the module, the bigquery.job classes give access to the history and details of every query and load job run in BigQuery.
1- BigQuery Class
Creates a BigQuery client from the path of a Google service-account JSON credentials file.
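A minimal sketch of what such a client setup can look like, using the google-cloud-bigquery and google-auth libraries; the file path is a placeholder:

```python
from google.cloud import bigquery
from google.oauth2 import service_account

# Placeholder path to your service-account JSON key file.
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/credentials.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
```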
2- BigQueryJobs Class
Inherits from the BigQuery class and uses the client to list the historical job records, iterating over the jobs
to extract the following details (see the sketch after the list):
- Job Type (Load or Query)
- Creation Time
- User Email
- Query
- Query run time
- Billed bytes for the query run
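A sketch of that extraction; `list_jobs` and the job attributes come from the google-cloud-bigquery library, while the `jobs` list and the dictionary keys are just illustrative names:

```python
from google.cloud import bigquery

client = bigquery.Client()  # or the client built from a credentials file as above

jobs = []
# Walk through recent job history and pull out the fields listed above.
for job in client.list_jobs(max_results=1000, all_users=True):
    jobs.append({
        "job_type": job.job_type,                 # "load" or "query"
        "created": job.created,                   # creation time
        "user_email": job.user_email,             # who ran the job
        "query": getattr(job, "query", None),     # only query jobs carry SQL text
        "run_time": (job.ended - job.started) if job.started and job.ended else None,
        "billed_bytes": getattr(job, "total_bytes_billed", None),
    })
```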
To understand which tables are used in a query, I used the Parser class from the sql_metadata Python module. sql_metadata does the following:
SQL Query: SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.x = t2.x
sql_metadata output:
["table1", "table2"]
So we can apply the same logic to BigQuery queries and extract the tables referenced in each one. Then, assuming that every
table in a query contributes equally to the processed bytes, we can split each query's billed bytes across its tables and accumulate total billed bytes per table, as sketched below.
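A sketch of that per-table attribution, reusing the `jobs` list and the Parser from the sketches above (the variable names are illustrative):

```python
from collections import defaultdict

from sql_metadata import Parser

billed_bytes_per_table = defaultdict(float)

for job in jobs:
    if not job["query"] or not job["billed_bytes"]:
        continue
    tables = Parser(job["query"]).tables
    if not tables:
        continue
    # Split the query's billed bytes evenly across the tables it references.
    share = job["billed_bytes"] / len(tables)
    for table in tables:
        billed_bytes_per_table[table] += share
```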
Now that we have a rough estimate of billed bytes per table, we can apply BigQuery's pricing for our region and currency and calculate the total cost per query by processing all of the data returned by the client.
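A sketch of the bytes-to-cost conversion, applied here to the per-table totals from the previous sketch; the rate below is an example only, the actual on-demand price depends on your region and currency:

```python
# Illustrative on-demand analysis rate; check the BigQuery pricing page
# for the rate that applies to your region and currency.
PRICE_PER_TIB = 6.25          # example rate in USD per TiB
BYTES_PER_TIB = 1024 ** 4

cost_per_table = {
    table: (billed_bytes / BYTES_PER_TIB) * PRICE_PER_TIB
    for table, billed_bytes in billed_bytes_per_table.items()
}
```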
The rest of the job is to schedule this code with Apache Airflow, which is already implemented in this repo :)
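A minimal sketch of such a schedule; the DAG id, the schedule, and the `run_cost_analysis` callable are placeholders, not the exact names used in this repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical entry point; the actual module and function names may differ.
from sql_query_manager import run_cost_analysis

with DAG(
    dag_id="bigquery_cost_analysis",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="analyze_query_costs",
        python_callable=run_cost_analysis,
    )
```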
To use the code: replace the XXXXXX placeholders in SqlQueryManager.py with your own credentials.