feat: parallel queries #475
Conversation
This reverts commit 67eb20c.
LGTM, once comments are addressed.
packages/core/src/QueryManager.js
Outdated
this._logger.debug('Cache');
result.fulfill(cached);
result.prepare(type === 'exec' ? null : data);
It's a separate issue, but this makes me think we should double check caching interactions with exec queries. I don't think we want to be going to the cache for any exec queries...
It looks like cache is set to false by default, and exec makes a query without cache, so it should always be off.
Good, sounds like younger me did think it through :)
One last question: did you try testing this against all of our current backends (WASM, Node.js, Python, Rust)? One of the reasons I first ensured serialized queries was due to Node.js DuckDB failures. I believe it was due to an interaction at the time with the Arrow extension.
I tested (at least casually) the flight 200k and 10m examples on
There is a problem when I brush on one chart and then move the cursor over another chart: Mosaic tries to run the query that builds the index and the query that reads the index at the same time, and the latter fails because the index does not exist yet. The same problem (index not ready yet) can also occur if index creation is slow and you start interacting before it finishes. I think the solution here is to always prioritize exec queries and hold back any other queries until the exec is done. I'll implement that unless you have another suggestion. We may still want to prevent activation of a chart while the mouse is down. At some point it would be nice to have an actual dependency model for queries, but that's a lot more invasive.
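To make the proposed fix concrete, here is a minimal sketch of an exec-priority scheduler: exec requests wait for all previously submitted queries to settle and then run exclusively, while regular queries run in parallel between execs. All names here are illustrative, not Mosaic's actual QueryManager API.

```javascript
// Sketch: 'exec' requests run exclusively; other query types run in
// parallel between them. Hypothetical wrapper, not Mosaic's real API.
class ExecBarrierQueue {
  constructor() {
    this.pending = Promise.resolve(); // chain of exclusive exec sections
    this.running = new Set();         // in-flight non-exec queries
  }

  submit(type, run) {
    if (type === 'exec') {
      const prev = this.pending;
      // snapshot queries already in flight at submission time
      const drained = Promise.allSettled([...this.running]);
      const p = (async () => {
        await prev;     // wait for earlier execs
        await drained;  // wait for earlier queries
        return run();   // then run exclusively
      })();
      this.pending = p.catch(() => {});
      return p;
    }
    // regular queries wait for any pending exec, then run in parallel
    const p = this.pending.then(() => run());
    this.running.add(p);
    p.finally(() => this.running.delete(p));
    return p;
  }
}
```

Queries submitted after an exec automatically queue behind it, which matches the "don't send any queries until the exec is done" behavior described above.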
I agree that a full dependency model is a bigger lift. That said, checking (and, if necessary, blocking) on data cube table construction within the data cube indexer might not be so bad as a stop-gap. Either way, we should make sure we have accompanying unit tests for these cases.
# Conflicts:
#   packages/core/src/Coordinator.js
#   packages/core/src/QueryManager.js
Added logic to prevent multiple parallel exec requests and also added a unit test.
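For illustration, a unit test guarding against parallel exec requests could take roughly this shape; serializeExec is a hypothetical stand-in for the new QueryManager logic, and the stub connector records peak exec concurrency:

```javascript
// Hypothetical stand-in for the QueryManager's exec-serialization logic:
// chains exec requests so at most one runs at a time.
function serializeExec(connector) {
  let chain = Promise.resolve();
  return {
    query(req) {
      if (req.type !== 'exec') return connector.query(req);
      const p = chain.then(() => connector.query(req));
      chain = p.catch(() => {}); // keep the chain alive on failure
      return p;
    }
  };
}

// Stub connector that records how many execs overlap.
let inFlight = 0, peak = 0;
const stub = {
  async query({ type }) {
    if (type !== 'exec') return null;
    peak = Math.max(peak, ++inFlight);
    await new Promise(r => setTimeout(r, 5)); // simulate work
    --inFlight;
    return null;
  }
};

const qm = serializeExec(stub);
```

The test then fires several exec requests concurrently and asserts that peak concurrency never exceeded one.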
Overall looks good to me, pending my review comments. But please do not merge until after #519 lands and I've cut a v0.11.0 release. I plan to cut a release without parallel queries and then merge this PR for the subsequent release, in part to permit more careful testing.
In addition, we need to check all of the database connectors. I think the rest and wasm connectors should be fine. I didn't re-check the Jupyter widget, but I think it will be OK for parallel use, too. However, the socket connector includes its own blocking, forcing sequential query execution. We need to revisit that.
packages/core/src/QueryManager.js
Outdated
this._logger.debug('Cache');
result.fulfill(cached);
result.ready(type === 'exec' ? null : data);
This line looks harmless, but it raises a question: do we ever see exec calls hit the cache? I don't think we should... they should be issued with cache set to false or undefined. And even if they did, shouldn't the cache result resolve to undefined anyway? So I don't think we need this conditional.
Right, we don't cache exec requests, but we could actually benefit from it. Below, we add the promise for pending requests to the cache so we don't submit the same request twice.
This makes me think that we may want a separate map of pending requests, into which we put requests regardless of whether their results should be cached.
I don’t think it is safe to cache exec queries, as they can modify DB state. For example, if I add a table, drop it, then add it again (which could maybe happen for a data cube index table), I don’t want the last exec call ignored due to caching.
Good example. Caching a common request such as creating an index would be nice, though. I guess for now we can skip caching and hopefully improve the parallelization once we have full query dependency tracking.
this._logger.debug(`Request: ${(performance.now() - t0).toFixed(1)}`);
result.fulfill(data);
result.ready(type === 'exec' ? null : data);
Same comment as above.
We need this here so that we don't err in QueryResult.fulfill.
@@ -42,7 +42,7 @@ async fn handle_post(

 pub fn app() -> Result<Router> {
     // Database and state setup
-    let db = ConnectionPool::new(":memory:", 16)?;
+    let db = ConnectionPool::new(":memory:", 1)?; // TODO: we can only use one connection since temp tables are scoped per connection
Once #519 lands data cube indexes will no longer be temp tables. But other Mosaic exec calls may make temp tables, so I don't think we can expand the connection pool...
Yes, will leave this at 1 for now.
What are your thoughts on not using temp tables at all anymore? We could change all the examples to write tables into a separate namespace so they are still easy to clean up.
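One way the namespace idea could look, sketched with a hypothetical schema name and helper functions; DuckDB's DROP SCHEMA ... CASCADE makes cleanup a single statement. The schema name and helpers are assumptions for illustration, not an existing Mosaic convention.

```javascript
// Sketch: write scratch tables into a dedicated schema instead of using
// temp tables, so they are visible across connections but easy to drop.
// 'mosaic_scratch' and these helpers are hypothetical.
const SCRATCH = 'mosaic_scratch';

function createScratchTable(name, selectQuery) {
  // statements to run via an exec request
  return [
    `CREATE SCHEMA IF NOT EXISTS ${SCRATCH}`,
    `CREATE OR REPLACE TABLE ${SCRATCH}.${name} AS ${selectQuery}`
  ];
}

function dropScratch() {
  // one statement removes all scratch state
  return `DROP SCHEMA IF EXISTS ${SCRATCH} CASCADE`;
}
```

Unlike temp tables, tables in a regular schema would not be scoped per connection, which would also lift the pool-size-of-one restriction discussed above.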
Updated the code to address your comments.
I tested with the ws connector and it works fine. We are unnecessarily trying to parallelize queries that are then executed in series by the connector, yes. I could add a flag to not parallelize in the query manager, but that introduces an extra code path, so I'd be in favor of merging as is.
I agree that we don't need another flag. My point was just to note that we need to update the socket connector to actually get the benefits of parallel queries there.
Yes, that would be nice. I don't see an obvious way to do that right now, though. When we get a result on the socket, we just send the binary data, so we can't easily change the order of responses. We would need to add some control signal, sent before a result, to allow query results to go out of order. Should we do this in this pull request or in a follow-up (I can file an issue for it)?
I filed #547 as a follow-up.
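For reference, the control-signal idea could be sketched like this: a small JSON control frame announcing the request id precedes each binary payload, so responses can arrive in any order and still be routed to the right caller. The frame layout and field names here are assumptions, not the actual protocol.

```javascript
// Sketch of out-of-order socket framing: each binary result is preceded
// by a JSON control frame carrying the originating request id.
function encodeFrames(requestId, payload) {
  const header = JSON.stringify({ type: 'result', id: requestId });
  return [header, payload]; // sent as two consecutive socket messages
}

// Client side: route each binary payload to the resolver whose id was
// announced by the preceding control frame.
function makeDecoder(resolvers) {
  let expecting = null; // id from the last control frame
  return function onMessage(msg) {
    if (typeof msg === 'string') {
      expecting = JSON.parse(msg).id; // control frame
    } else {
      resolvers.get(expecting)(msg);  // binary payload for that id
      resolvers.delete(expecting);
      expecting = null;
    }
  };
}
```

This assumes the socket preserves per-connection message order (true for WebSockets), so the header/payload pairing cannot interleave.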
Updated version of #438
Rather than sending requests in series, send them in parallel but serialize the order before returning them from the query manager.
Fixes #388
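The "send in parallel but serialize the order" approach described above can be sketched as follows (illustrative helper, not the actual QueryManager code): all requests are dispatched immediately, but each returned promise is gated on the delivery of every earlier result.

```javascript
// Sketch: dispatch all queries concurrently, but fulfill results in
// submission order so consumers observe the same sequence as before.
function orderedDispatch(requests, send) {
  const inFlight = requests.map(send); // all sent immediately
  let gate = Promise.resolve();
  return inFlight.map(p => {
    // each result waits for all earlier ones to be delivered first
    const ordered = gate.then(() => p);
    gate = ordered.catch(() => {}); // a failure doesn't block later results
    return ordered;
  });
}
```

A fast query submitted after a slow one still runs on the backend right away; only its fulfillment is deferred until the earlier result has been delivered.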