Use CTEs instead of subqueries #1523

siljamardla · 2024-11-12T12:49:49Z

Idea

At the moment, MF queries are built entirely from nested subqueries.
To make the queries more efficient and easier to read, they should start with CTEs that read the data that will be needed.

For example:

WITH 
subq_1 AS (
SELECT 
  /*all the columns we will need from metric1_upstream_table*/
FROM metric1_upstream_table
WHERE metric1_upstream_table__agg_time_dimension BETWEEN /*start_time*/ AND /*end_time*/
)
, subq_2 AS (
SELECT 
  /*all the columns we will need from metric2_upstream_table*/
FROM metric2_upstream_table
WHERE metric2_upstream_table__agg_time_dimension BETWEEN /*start_time*/ AND /*end_time*/
)
SELECT ... FROM ...

Impact

Real-world example: I have a saved query with 200 metrics that depends on 22 tables, but the compiled SQL reads from a table in about 630 places.

By using CTEs:

the query would be much easier to verify by human review (yes, I expected to read this amount of data from these tables)
the output SQL would be much shorter
we'd be able to rewrite some of the SQL for metrics with filters in them

Sample compiled SQL

Here's an example with two metrics that are the same except for the filter value, plus a third metric that comes from another table.

mf query --metrics rides_orders_in_finished_state_local,rides_orders_in_cancelled_state_local,delivery_orders_in_finished_state_local --explain --group-by calendar_date_local --start-time '2023-08-02' --end-time '2023-08-02'

This currently compiles to:

SELECT
  COALESCE(subq_13.calendar_date_local, subq_21.calendar_date_local, subq_29.calendar_date_local) AS calendar_date_local
  , MAX(subq_13.rides_orders_in_finished_state_local) AS rides_orders_in_finished_state_local
  , MAX(subq_21.rides_orders_in_cancelled_state_local) AS rides_orders_in_cancelled_state_local
  , MAX(subq_29.delivery_orders_in_finished_state_local) AS delivery_orders_in_finished_state_local
FROM (
  SELECT
    calendar_date_local
    , SUM(rides_orders_local) AS rides_orders_in_finished_state_local
  FROM (
    SELECT
      created_date_local AS calendar_date_local
      , order_state AS rides_order_key__order_state
      , 1 AS rides_orders_local
    FROM `fact_rides_order` fact_rides_order_src_10000
    WHERE DATE_TRUNC('day', created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
  ) subq_9
  WHERE rides_order_key__order_state = 'finished'
  GROUP BY
    calendar_date_local
) subq_13
FULL OUTER JOIN (
  SELECT
    calendar_date_local
    , SUM(rides_orders_local) AS rides_orders_in_cancelled_state_local
  FROM (
    SELECT
      created_date_local AS calendar_date_local
      , order_state AS rides_order_key__order_state
      , 1 AS rides_orders_local
    FROM `fact_rides_order` fact_rides_order_src_10000
    WHERE DATE_TRUNC('day', created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
  ) subq_17
  WHERE rides_order_key__order_state = 'cancelled'
  GROUP BY
    calendar_date_local
) subq_21
ON
  subq_13.calendar_date_local = subq_21.calendar_date_local
FULL OUTER JOIN (
  SELECT
    calendar_date_local
    , SUM(delivery_orders_local) AS delivery_orders_in_finished_state_local
  FROM (
    SELECT
      order_created_date_local AS calendar_date_local
      , order_state AS global_order_id__order_state
      , 1 AS delivery_orders_local
    FROM `fact_order_delivery` fact_order_delivery_src_10000
    WHERE DATE_TRUNC('day', order_created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
  ) subq_25
  WHERE global_order_id__order_state = 'delivered'
  GROUP BY
    calendar_date_local
) subq_29
ON
  COALESCE(subq_13.calendar_date_local, subq_21.calendar_date_local) = subq_29.calendar_date_local
GROUP BY
  COALESCE(subq_13.calendar_date_local, subq_21.calendar_date_local, subq_29.calendar_date_local)

Notice how we query the rides order table twice.

This could instead compile to something like this (keeping the logic of filtering the whole subquery):

WITH 
--Read data
fact_rides_order_src_10000 AS (
SELECT
      created_date_local AS calendar_date_local
      , order_state AS rides_order_key__order_state
      , 1 AS rides_orders_local
    FROM `fact_rides_order` fact_rides_order_src_10000
    WHERE DATE_TRUNC('day', created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
)
, fact_order_delivery_src_10000 AS (
SELECT
      order_created_date_local AS calendar_date_local
      , order_state AS global_order_id__order_state
      , 1 AS delivery_orders_local
    FROM `fact_order_delivery` fact_order_delivery_src_10000
    WHERE DATE_TRUNC('day', order_created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
)
--Calculate metrics
, rides_orders_in_finished_state_local AS (
  SELECT
    calendar_date_local
    , SUM(rides_orders_local) AS rides_orders_in_finished_state_local
  FROM fact_rides_order_src_10000
  WHERE rides_order_key__order_state = 'finished'
  GROUP BY calendar_date_local
)
, rides_orders_in_cancelled_state_local AS (
  SELECT
    calendar_date_local
    , SUM(rides_orders_local) AS rides_orders_in_cancelled_state_local
  FROM fact_rides_order_src_10000
  WHERE rides_order_key__order_state = 'cancelled'
  GROUP BY calendar_date_local
)
, delivery_orders_in_finished_state_local AS (
  SELECT
    calendar_date_local
    , SUM(delivery_orders_local) AS delivery_orders_in_finished_state_local
  FROM fact_order_delivery_src_10000
  WHERE global_order_id__order_state = 'delivered'
  GROUP BY calendar_date_local
)
--Final join to put metrics side by side
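SELECT
  COALESCE(rides_orders_in_finished_state_local.calendar_date_local, rides_orders_in_cancelled_state_local.calendar_date_local, delivery_orders_in_finished_state_local.calendar_date_local) AS calendar_date_local
  , MAX(rides_orders_in_finished_state_local.rides_orders_in_finished_state_local) AS rides_orders_in_finished_state_local
  , MAX(rides_orders_in_cancelled_state_local.rides_orders_in_cancelled_state_local) AS rides_orders_in_cancelled_state_local
  , MAX(delivery_orders_in_finished_state_local.delivery_orders_in_finished_state_local) AS delivery_orders_in_finished_state_local
FROM rides_orders_in_finished_state_local
FULL OUTER JOIN rides_orders_in_cancelled_state_local
ON rides_orders_in_finished_state_local.calendar_date_local = rides_orders_in_cancelled_state_local.calendar_date_local
FULL OUTER JOIN delivery_orders_in_finished_state_local
ON COALESCE(rides_orders_in_finished_state_local.calendar_date_local, rides_orders_in_cancelled_state_local.calendar_date_local) = delivery_orders_in_finished_state_local.calendar_date_local
GROUP BY
  COALESCE(rides_orders_in_finished_state_local.calendar_date_local, rides_orders_in_cancelled_state_local.calendar_date_local, delivery_orders_in_finished_state_local.calendar_date_local)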

The example here, where each metric has a different filter, makes it harder to merge the metric subqueries (there's a COUNT_IF(...) function, but I couldn't find a corresponding SUM_IF, at least not in Databricks). For use cases where we just need multiple columns from the same table, used in different downstream derived metrics, the benefit would be more obvious. I can come back here later and provide examples.
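
One way the two filtered rides metrics might still be merged without a SUM_IF is conditional aggregation with SUM(CASE WHEN ...), which is standard SQL. The sketch below is only an illustration of that idea, not something MF generates today; it reuses the table, column, and metric names from the sample above and assumes the time filter is identical for both metrics.

```sql
-- Sketch only: one scan of the rides table feeds both filtered metrics.
WITH fact_rides_order_src_10000 AS (
  SELECT
    created_date_local AS calendar_date_local
    , order_state AS rides_order_key__order_state
    , 1 AS rides_orders_local
  FROM `fact_rides_order`
  WHERE DATE_TRUNC('day', created_date_local) BETWEEN '2023-08-02' AND '2023-08-02'
)
SELECT
  calendar_date_local
  -- No ELSE branch: a date with no matching rows yields NULL rather than 0,
  -- which is what the FULL OUTER JOIN version returns for a missing metric.
  , SUM(CASE WHEN rides_order_key__order_state = 'finished' THEN rides_orders_local END) AS rides_orders_in_finished_state_local
  , SUM(CASE WHEN rides_order_key__order_state = 'cancelled' THEN rides_orders_local END) AS rides_orders_in_cancelled_state_local
FROM fact_rides_order_src_10000
-- Keep only rows that contribute to at least one metric, so dates that have
-- only other order states do not produce an all-NULL row.
WHERE rides_order_key__order_state IN ('finished', 'cancelled')
GROUP BY calendar_date_local
```

This removes one of the two FULL OUTER JOINs for metrics that share a source table and grain. The delivery metric would still need its own CTE and a join, since it reads from a different table.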
