-
Notifications
You must be signed in to change notification settings - Fork 55
Are we really needed a smart statement timeout?
If we make 'JOB' benchmark results with specific preferences (script):
fsync = off
max_parallel_workers_per_gather = 0
statement_timeout = 600000
from_collapse_limit = 20
join_collapse_limit = 20
we will see the result on the graph below (gnuplot template, raw data):
As you can see, we have some queries that can't be executed in reasonable time. This situation looks better with the parallel workers machinery usage, but for clarity we turned it off. What if we had a machinery to learn on a partial query plan, interrupted by a statement_timeout? We should have a chance to recognize situations, when we were interrupted under AQO control and have an option to make one more attempt with, possible, different query plan.
The benchmark showed that we can get different plans on different computers with the same parameters. For example, AMD5800/48GB RAM/SSD gives us set of 17 timed out queries:
14a, 14c, 19a, 21a, 21b, 22a, 22b, 22c, 22d, 25a, 25c, 28b, 29c, 31a, 31b, 31c, 7a
.
Benchmarking on Macbook Pro I see another set of 19 queries:
14a, 14c, 21a, 21b, 22a, 22b, 22c, 22d, 23b, 25a, 25c, 28b, 29a, 30a, 31a, 31b, 31c, 7a, 7b
.
See raw data here and here respectively. In summary we have a divergency table:
Test Number | Query Name | AMD | Mac |
---|---|---|---|
35 | 19a.sql | - | 210433 |
54 | 23b.sql | 13485 | - |
70 | 29a.sql | 53164 | - |
72 | 29c.sql | - | 120075 |
77 | 30a.sql | 14684 | - |
104 | 7b.sql | 579937 | - |
Quick look into the explains show that it is a problem (here and here) of unsteadiness of estimations.
After enabling the partial learning feature (aqo.learn_query_statement = 'on'
) we got such a picture for the timed out queries:
Test Number | Query | Execution time, s | Interrupts |
---|---|---|---|
15 | 14a.sql | 8 | 1 |
17 | 14c.sql | 8 | 1 |
32 | 18a.sql | 13 | 0+1 |
37 | 19c.sql | 18 | 0+1 |
46 | 21a.sql | 7 | 3+1 |
47 | 21b.sql | 5 | 1 |
49 | 22a.sql | 7 | 3+1+1 |
51 | 22c.sql | 418 | 7 |
52 | 22d.sql | 10 | 3+2 |
58 | 25a.sql | 12 | 1+1 |
60 | 25c.sql | 16 | 4 |
67 | 28a.sql | 306 | 0+1 |
70 | 29a.sql | 14 | 0+1+1 |
72 | 29c.sql | 14 | 0+2+2+2+1 |
77 | 30a.sql | 13 | 1+1 |
78 | 30b.sql | 13 | 1 |
79 | 30c.sql | 15 | 1+3+1 |
80 | 31a.sql | 14 | 1+5+2 |
81 | 31b.sql | 86 | 1+1+1 |
82 | 31c.sql | 15 | 3+5+1 |
103 | 7a.sql | 9 | 1+1 |
All queries fit into 10min limit now. Interruptions in the table - number of learnings on timed out queries. Learning process on some queries increased execution time. So, after successful execution we observe interruptions too. In the table we use '+' to show the fact of timed out query after successful execution. Also, for the most part of queries optimizer has found a good solution.