[batch] make batches query go brrrrrrr (for realsies) (#14649)

#14629 improved the speed of listing a user's batches but introduced a large regression for listing all batches readble by that user. This change fixes that regression by making use index hints and `STRAIGHT_JOIN`s. The index hint tells MySQL to never consider the index `batches_deleted` as it has very low cardinality. In some forms of this query, the planner tries to use it to its peril. A problem in query 0 with #14629 (see below) was that fewer filters on batches made the optimiser consider joins in a suboptimal order - it did a table scan on `job_groups` first then sorted the results by to `batches.id DESC` instead of doing an index scan on `batches` in reverse. Using `STRAIGHT_JOIN`s instead of `INNER JOIN` mades the optimiser start from `batches` and read its index in reverse before considering other tables in subsequent joins. From the [documentation](https://dev.mysql.com/doc/refman/8.4/en/join.html): > STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. This is advantageous for a couple of reasons: - We want to list newer batches first - For this query, the `batches` table has more applicables indexes - We want the variable to order by to be in the primary key of the first table so we can read the index in reverse Before and after timings, collected by running the query 5 times, then using profiles gathered by MySQL. ``` +-------+---------------------------------------------------* | query | description | +-------+---------------------------------------------------+ | 0 | All batches accessible to user `ci` | | 1 | All batches accessible to user `ci` owned by `ci` | +-------+---------------------------------------------------* +-------+--------+--------------------------------------------------------+------------+------------+ | query | branch | timings | mean | stdev | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | main | 0.05894400,0.05207850,0.07067875,0.06281800,0.060250 | 0.06095385 | 0.00602255 | | 1 | main | 14.1106150,12.2619323,13.8442850,12.0749633,14.0297822 | 13.2643156 | 0.90087263 | +-------+--------+--------------------------------------------------------+------------+------------+ | 0 | PR | 0.04717375,0.04974350,0.04312150,0.04070850,0.04193650 | 0.04453675 | 0.00339069 | | 1 | PR | 0.04423925,0.03967550,0.03935425,0.04056875,0.05286850 | 0.04334125 | 0.00507140 | +-------+--------+--------------------------------------------------------+------------+------------+ ``` I'm hopeful that this won't introduce regressions for most use cases. While I haven't benchmarked other queries, the MySQL client does feel more responsive for a wider array of users. One notable exception is for the user `dking` who owns 3.7x more batches than has access to, of which all have been deleted. I don't think this is a common enough use case to make this query even more complicated than it already is. Resolves #14599
hail-is · Aug 8, 2024 · 7d25779 · 7d25779
1 parent ba1ac58
commit 7d25779
Showing 1 changed file with 32 additions and 29 deletions.
diff --git a/batch/batch/front_end/query/query_v2.py b/batch/batch/front_end/query/query_v2.py
@@ -125,40 +125,43 @@ def parse_list_batches_query_v2(user: str, q: str, last_batch_id: Optional[int])
         where_conditions.append(f'({cond})')
         where_args += args
 
-    sql = f"""
-SELECT batches.*,
-  cancelled_t.cancelled IS NOT NULL AS cancelled,
-  job_groups_n_jobs_in_complete_states.n_completed,
-  job_groups_n_jobs_in_complete_states.n_succeeded,
-  job_groups_n_jobs_in_complete_states.n_failed,
-  job_groups_n_jobs_in_complete_states.n_cancelled,
-  cost_t.cost, cost_t.cost_breakdown
-FROM job_groups
-LEFT JOIN batches ON batches.id = job_groups.batch_id
-LEFT JOIN billing_projects ON batches.billing_project = billing_projects.name
-LEFT JOIN job_groups_n_jobs_in_complete_states ON job_groups.batch_id = job_groups_n_jobs_in_complete_states.id AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
-LEFT JOIN LATERAL (
-  SELECT 1 AS cancelled
-  FROM job_group_self_and_ancestors
-  INNER JOIN job_groups_cancelled
-    ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
-      job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
-  WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
-    job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
-) AS cancelled_t ON TRUE
-STRAIGHT_JOIN billing_project_users ON batches.billing_project = billing_project_users.billing_project
-LEFT JOIN LATERAL (
-  SELECT COALESCE(SUM(`usage` * rate), 0) AS cost, JSON_OBJECTAGG(resources.resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown
-  FROM (
+    sql = f"""\
+SELECT batches.*
+     , cancelled_t.cancelled IS NOT NULL AS cancelled
+     , job_groups_n_jobs_in_complete_states.n_completed
+     , job_groups_n_jobs_in_complete_states.n_succeeded
+     , job_groups_n_jobs_in_complete_states.n_failed
+     , job_groups_n_jobs_in_complete_states.n_cancelled
+     , cost_t.cost
+     , cost_t.cost_breakdown
+FROM batches IGNORE INDEX (batches_deleted)
+INNER JOIN billing_projects
+  ON batches.billing_project = billing_projects.name
+INNER JOIN billing_project_users
+  ON batches.billing_project = billing_project_users.billing_project
+STRAIGHT_JOIN job_groups
+  ON job_groups.batch_id = batches.id
+STRAIGHT_JOIN job_groups_n_jobs_in_complete_states
+  ON job_groups.batch_id     = job_groups_n_jobs_in_complete_states.id
+ AND job_groups.job_group_id = job_groups_n_jobs_in_complete_states.job_group_id
+LEFT JOIN (SELECT *, 1 AS cancelled FROM job_groups_cancelled) AS cancelled_t
+  ON job_groups.batch_id     = cancelled_t.id
+ AND job_groups.job_group_id = cancelled_t.job_group_id
+INNER JOIN LATERAL (
+  WITH resource_costs AS (
     SELECT resource_id, CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) AS `usage`
     FROM aggregated_job_group_resources_v3
-    WHERE job_groups.batch_id = aggregated_job_group_resources_v3.batch_id AND job_groups.job_group_id = aggregated_job_group_resources_v3.job_group_id
+    WHERE     batch_id = batches.id
+      AND job_group_id = job_groups.job_group_id
     GROUP BY resource_id
-  ) AS usage_t
-  LEFT JOIN resources ON usage_t.resource_id = resources.resource_id
+  )
+  SELECT COALESCE(SUM(`usage` * rate), 0) AS cost
+       , JSON_OBJECTAGG(resource, COALESCE(`usage` * rate, 0)) AS cost_breakdown
+  FROM resource_costs
+  INNER JOIN resources USING (resource_id)
 ) AS cost_t ON TRUE
 WHERE {' AND '.join(where_conditions)}
-ORDER BY job_groups.batch_id DESC
+ORDER BY batches.id DESC
 LIMIT 51;
 """