-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] make batches query go brrrrrrr (for realsies) #14649
[batch] make batches query go brrrrrrr (for realsies) #14649
Conversation
NB: brrrrr to brrrrr comparison: https://github.com/hail-is/hail/compare/5dbf80e..08e7f6f#diff-0f931312c631fea66daf5de2961a2df18757e69d33e661390c1c2837c4a2efec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good reinterpretation of the original PR, with more specific control over the index being used and the ordering of the joins.
Since this is some specific customization based on knowledge of the underlying data, it may be worth documenting in the PR description (or code? or docs?) why this particular ordering (and choice of indexes) is important?
, cancelled_t.cancelled IS NOT NULL AS cancelled | ||
, job_groups_n_jobs_in_complete_states.n_completed | ||
, job_groups_n_jobs_in_complete_states.n_succeeded | ||
, job_groups_n_jobs_in_complete_states.n_failed | ||
, job_groups_n_jobs_in_complete_states.n_cancelled | ||
, cost_t.cost | ||
, cost_t.cost_breakdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably this layout makes adding and removing lines easier? But did you intend to keep this in the final version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did. MySQL (rightly) does not support trailing commas. This is the One True Way (TM) to do split comma-separated lists.
, job_groups_n_jobs_in_complete_states.n_cancelled | ||
, cost_t.cost | ||
, cost_t.cost_breakdown | ||
FROM batches IGNORE INDEX (batches_deleted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slight forward-looking concern over introducing so much mysql-specific syntax to the query. I have a suspicion that postgres support might be a useful option to have in certain potential terra futures. Not enough to block this, but enough to make me mildly uneasy...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, rawls and sam use MySQL (or cloud sql). Anyway, all that is kind of irrelevant as
- different databases have different optimisers and may plan queries differently
- many of the queries in batch are very tuned to MySQL (eg
STRAIGHT_JOIN
,LATERAL
,(a, b, ..) IN Expression
, etc) - we'd therefore have to consider using different DBMSs very carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point well taken that this is not "the thin end of the wedge" and more like "a pattern already throughout batch". Point also well taken that adding support is not necessarily a trivial thing to do. So 👍 for following the pattern here.
For more context on my thoughts here - there are certainly 'ways to terra' that do not go through psql, but our terra-on-azure app, for example, cannot have a managed database because of this reliance, though maybe that's more because it was never added than some fundamental blocker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I have a natural prior that 'all this optimization for mysql' is just not as necessary in psql because their optimizer is less likely to do the weird things that mysql does. But there's not data behind that, just biases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our terra-on-azure app, for example, cannot have a managed database because of this reliance, though maybe that's more because it was never added than some fundamental blocker
That's good to know for future work, thanks for sharing.
Good idea. I'll do that. |
#14629 improved the speed of listing a user's batches but introduced a large
regression for listing all batches readble by that user. This change fixes that
regression by making use index hints and
STRAIGHT_JOIN
s.The index hint tells MySQL to never consider the index
batches_deleted
as ithas very low cardinality. In some forms of this query, the planner tries to use
it to its peril.
A problem in query 0 with #14629 (see below) was that fewer filters on batches
made the optimiser consider joins in a suboptimal order - it did a table scan
on
job_groups
first then sorted the results by tobatches.id DESC
insteadof doing an index scan on
batches
in reverse.Using
STRAIGHT_JOIN
s instead ofINNER JOIN
mades the optimiser start frombatches
and read its index in reverse before considering other tables insubsequent joins. From the documentation:
This is advantageous for a couple of reasons:
batches
table has more applicables indexestable so we can read the index in reverse
Before and after timings, collected by running the query 5 times, then using
profiles gathered by MySQL.
I'm hopeful that this won't introduce regressions for most use cases. While I
haven't benchmarked other queries, the MySQL client does feel more responsive
for a wider array of users. One notable exception is for the user
dking
whoowns 3.7x more batches than has access to, of which all have been deleted. I
don't think this is a common enough use case to make this query even more
complicated than it already is.
Resolves #14599