A utility that reads an entire SQL query and provides a list of suggestions to help optimize it and avoid long-running issues.
One of the most frequently occurring issues in BigQuery is long-running queries. This is seen in two cases:
- The query goes into long-running mode and eventually fails with a timeout error after running for 6+ hours.
- The query runs for many hours, consumes more compute and finally produces an output.

In both cases, the root cause is most likely an un-optimized BigQuery SQL query. The utility helps identify whether a query is likely to run into long-running issues before execution, saving time, resources and cost.
- It is recommended to avoid joins that generate more output rows than input rows. The utility advises avoiding `CROSS JOIN` and points out the line number of the SQL query where it is present.
- When multiple `CTE`s are used, BigQuery processes them each time they are referenced. It is better to replace them with `temporary tables`, which are computed only once (a sketch of this rewrite follows this list). The utility points out the number of times a `CTE` is used, along with the line number, and recommends replacing it with a `temporary table`.
- The utility also recommends using `GROUP BY` instead of `DISTINCT` (see the sketch after this list). It points out the line number of `DISTINCT` and recommends `GROUP BY` with the object name to give more clarity.
- The utility is able to identify if aggregation happens after joins, which leads to more processing. It recommends filtering data before joins by identifying the line number and the filter clause (see the sketch after this list).
- BigQuery also recommends using `APPROX_QUANTILE` instead of `NTILE`. The utility identifies the line number of `NTILE` and provides a recommendation.
- Writing large queries can make them harder to read, and reusing them also becomes an issue. The utility points out large queries separated by `UNION` that should be converted to `temporary tables`, by pointing to the line numbers where these queries start.
- `REGEXP_CONTAINS` is slower in compute time. The utility identifies the line number where `REGEXP_CONTAINS` was used and recommends using `LIKE` instead (see the sketch after this list).
- It is recommended to use only the needed columns in the `SELECT` statement. The utility identifies the use of `*` along with the line number and statement.
- Self joins should be avoided in BigQuery. The utility identifies where the self joins occur and the table name.
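As a rough illustration of the `CTE`-to-`temporary table` recommendation, the sketch below shows a CTE referenced twice being materialized once instead. The dataset, table and column names (`my_dataset.orders`, `customer_id`, `amount`) are hypothetical and only serve as an example.

```sql
-- Hypothetical table my_dataset.orders: CTE is re-evaluated for each reference.
WITH big_cte AS (
  SELECT customer_id, SUM(amount) AS total
  FROM my_dataset.orders
  GROUP BY customer_id
)
SELECT * FROM big_cte WHERE total > 100
UNION ALL
SELECT * FROM big_cte WHERE total <= 100;

-- Rewritten: materialize the result once as a temporary table, then reuse it.
CREATE TEMP TABLE big_tmp AS
SELECT customer_id, SUM(amount) AS total
FROM my_dataset.orders
GROUP BY customer_id;

SELECT * FROM big_tmp WHERE total > 100
UNION ALL
SELECT * FROM big_tmp WHERE total <= 100;
```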
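The `DISTINCT`-to-`GROUP BY` recommendation amounts to expressing the same deduplication with an explicit grouping key, as in this sketch (same hypothetical table):

```sql
-- Before: deduplication via DISTINCT.
SELECT DISTINCT customer_id FROM my_dataset.orders;

-- After: equivalent result expressed with GROUP BY.
SELECT customer_id FROM my_dataset.orders GROUP BY customer_id;
```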
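For the filter-before-join recommendation, a minimal sketch (again with hypothetical tables `my_dataset.customers` and `my_dataset.orders`) is:

```sql
-- Before: the join processes all rows, and the filter is applied afterwards.
SELECT c.customer_id, o.amount
FROM my_dataset.customers AS c
JOIN my_dataset.orders AS o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01';

-- After: reduce the join input by filtering first.
SELECT c.customer_id, o.amount
FROM my_dataset.customers AS c
JOIN (
  SELECT customer_id, amount
  FROM my_dataset.orders
  WHERE order_date >= '2023-01-01'
) AS o ON c.customer_id = o.customer_id;
```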
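Finally, where a `REGEXP_CONTAINS` call is only doing a simple pattern match, it can usually be replaced with `LIKE`, as in this sketch (hypothetical table and column):

```sql
-- Before: regular-expression match for a simple prefix.
SELECT name FROM my_dataset.customers WHERE REGEXP_CONTAINS(name, r'^abc');

-- After: the same prefix match expressed with LIKE.
SELECT name FROM my_dataset.customers WHERE name LIKE 'abc%';
```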
The following packages are needed to run the script: `pandas`, `sqlparse`.
- Log in to gcloud using `gcloud auth application-default login`. Make sure the environment variable `GOOGLE_APPLICATION_CREDENTIALS` points to the credential file.
- Install the dependencies listed in requirements.txt using pip3: `pip3 install -r requirements.txt`
- Add your test files to a folder. You can add them in the `test_files` folder.
- Run the utility: `python3 main_dag.py -input-path=path/to/input/folder -bucket=bucket_name -project-id=project_name -output-path=path/to/output/folder`
- The results are stored in the `bucket` specified above, under the `output-path` folder.
The order of execution of the scripts is as follows:
- ./custom_table_expression/custom_table_expression.py
- ./select_wildcard/select_wildcard.py
- ./filter_data_before_join/filter_data_before_join.py
- ./distinct/distinct.py
- ./reduce_to_tmp_tables/reduce_to_tmp_tables.py
- ./cross_joins/cross_joins.py
- ./self_join/self_join.py
- ./ntile_to_appprox_quant/ntile_to_appprox_quant.py
- ./regex_contains/regex_contains.py