Optimizer does not successfully time out when query has a limit #1414
Comments
So it seems like my guess was probably somewhat right. In order to time out, we need some low-cost expression for the root group of our op tree. When we don't have the
noisepage/src/optimizer/optimizer_task.cpp
Lines 242 to 384 in d770eb9
Specifically, the calls to SetExpressionCost. However, according to this part:
// Check whether we successfully optimize all child group
if (cur_child_idx_ == static_cast<int>(group_expr_->GetChildrenGroupsSize())) {
we don't generate any cost for the current node until all child nodes are fully optimized and costed. So I think this means that the
One suggestion that @thepinetree had was to move the LIMIT into a property. That way there wouldn't actually be a LIMIT node, the join node would be the root, and it could properly time out. Sorts are handled as properties, so the implementation would probably be similar. If we go with this approach we may want to wait until after
I don't know if it's possible to reproduce this issue without a limit, but I think a related issue is how the timeout works in general. The timeout in the optimizer is really just a timeout on the root node, not on the entire op tree. So if some inner node takes a long time to optimize, we have no way of timing out. If this issue exists with other queries, we may want to consider adding functionality to time out inner nodes that take too long, and not just the root node, if that's even possible.
These are things we should probably discuss at the next meeting.
I was thinking about this today and came up with one possible solution. Instead of waiting for the child to be completely costed, we could just use the best cost available for the child. So when we check whether the child is done with costing, we also check whether there is any cost available for the child. If there is some cost, we cost the parent using the child's best cost so far, and then we push the parent task back on the task stack so we can cost it again when the child is finished or has more costs available. That way, when we reach the timeout, we are more likely to have some cost for the root node, even if it's not the best cost.
This is more to address the underlying problem of having to wait for all inner nodes to be fully costed before timing out. It still may make sense to convert LIMIT to a property.
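
To make the idea above concrete, here is a minimal sketch of "cost the parent with whatever the child has so far, then re-push". It is not NoisePage code: ToyCostGroup, CostParentTask, and ExecuteOnce are hypothetical names made up for illustration, and the real optimizer tasks and groups carry far more state (properties, multiple children, cost models).

```cpp
#include <cassert>

// Hypothetical stand-in for an optimizer group (not the NoisePage class).
struct ToyCostGroup {
  bool has_cost = false;         // true once any cost has been recorded
  double best_cost = 0.0;        // best cost found so far (valid if has_cost)
  bool fully_optimized = false;  // true once no better cost can appear
};

// Hypothetical task that costs a parent with a single child group.
struct CostParentTask {
  ToyCostGroup *parent;
  const ToyCostGroup *child;
  double local_cost;  // assumed cost of the parent operator itself
};

// Runs one costing step. Returns true if the task should be pushed back on
// the task stack because the child may still produce better costs.
bool ExecuteOnce(const CostParentTask &task) {
  if (task.child->has_cost) {
    // Use the child's best cost so far instead of waiting for it to finish.
    const double candidate = task.local_cost + task.child->best_cost;
    if (!task.parent->has_cost || candidate < task.parent->best_cost) {
      task.parent->has_cost = true;
      task.parent->best_cost = candidate;
    }
  }
  if (task.child->fully_optimized) {
    task.parent->fully_optimized = true;
    return false;  // nothing more to learn from the child
  }
  return true;  // re-push: cost again once the child has made progress
}
```

In this toy version the root gets a usable (possibly suboptimal) cost as soon as its child has any cost at all, which is the condition the timeout check would need in order to fire.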
Bug Report
Summary
When you have a complicated join with a LIMIT clause, the optimizer fails to time out. Suppose that we have the tables from this DDL: https://github.com/oltpbenchmark/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/ddls/tpch-postgres-ddl.sql. Then if we run the following query:
Then we should expect that there are so many join orderings that the DBMS optimizer will time out. In fact, if you run that query, it will time out, print the following in the console:
[2020-12-23 22:05:03.397] [optimizer_logger] [warning] Optimize Loop ended prematurely: Optimizer task execution timed out
, and return a result. However, if we run the following query (the only difference is the added LIMIT clause):
Then the optimizer WILL NOT time out.
The offending line is right here:
noisepage/src/optimizer/optimizer.cpp
Lines 154 to 156 in 8855886
Although elapsed_time gets much greater than task_execution_timeout_, root_group->HasExpressions(required_props) will return false. The code for HasExpressions is here:
noisepage/src/optimizer/group.cpp
Lines 72 to 75 in 8855886
For this query, lowest_cost_expressions_ is empty, so HasExpressions will always return false. For the original query, lowest_cost_expressions_ is not empty and contains some result. For both queries, required_props is empty, which is a little weird to me, but I don't really know what required_props is, so this might be fine.
I don't know enough about the optimizer to know exactly why this is happening, but my guess at a very high conceptual level is that generating a physical plan for the LIMIT happens after we enumerate the join orderings. Since we don't have a physical plan for the limit, we can't time out during the join enumeration.
It's very likely that #1229 is suffering from this.
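
The behavior described in this summary can be modeled with a toy loop. This is only a sketch under assumptions, not the actual code in optimizer.cpp: ToyRootGroup and RunLoop are hypothetical names, time is counted in abstract ticks (one per task), and the only thing taken from the report is that the loop stops early when the elapsed time exceeds the timeout and the root group already has an expression.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the root group: it only records whether some
// lowest-cost expression has been stored for the required properties.
struct ToyRootGroup {
  bool has_lowest_cost_expression = false;
  bool HasExpressions() const { return has_lowest_cost_expression; }
};

// Runs up to total_tasks tasks, one per tick. The root group first receives
// a costed expression at tick root_costed_at. Returns the tick at which the
// loop stops.
int64_t RunLoop(int64_t total_tasks, int64_t timeout, int64_t root_costed_at) {
  ToyRootGroup root;
  int64_t elapsed = 0;
  while (elapsed < total_tasks) {
    // The timeout only fires if the root already has a plan to fall back on.
    if (elapsed >= timeout && root.HasExpressions()) break;
    ++elapsed;  // execute one task
    if (elapsed >= root_costed_at) root.has_lowest_cost_expression = true;
  }
  return elapsed;
}
```

With an early root cost, RunLoop(1000, 10, 5) stops at the timeout (tick 10). If the root is only costed by the very last task, as seems to happen when a LIMIT sits above the join, RunLoop(1000, 10, 1000) runs all 1000 ticks and the timeout never takes effect.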
Environment
OS: Ubuntu (LTS) 20.04
Compiler: GCC 7.0+
CMake Profile: Debug
Jenkins/CI: N/A
Steps to Reproduce
Expected Behavior
Actual Behavior
This keeps running for a long time without ever timing out. I've never stuck around long enough to confirm that it ever finishes, but I'm assuming that it does eventually. As you remove more and more tables from the FROM clause, it will start to terminate and execute faster.