feat(NODE-6882): close outstanding connections #4499
Conversation
Force-pushed from 3b95935 to 4b798d7
Force-pushed from 0d71ec5 to 008c6dc
Force-pushed from 008c6dc to 67421a1
src/cmap/connection_pool.ts (Outdated)
for (const conn of this.checkedOut) {
  this.emitAndLog(
    ConnectionPool.CONNECTION_CLOSED,
    new ConnectionClosedEvent(this, conn, 'poolClosed')
  );
  conn.onError(new ConnectionPoolClosedError());
}
Can we follow a similar pattern to interruptInUseConnection, where we error the connection and then check it in, relying on the checkIn logic to emit the appropriate events? (I actually think we're missing a checkIn event here, with the current implementation.)
What should happen is (sketched below):
- connection is destroyed
- connection is checked in
- checkIn event emitted
- connection is errored, so we destroy the connection using destroyConnection(), which already handles emitting a ConnectionClosedEvent with the correct configuration when the pool is closed
edit: I see you must do this manually, because you're killing connections before the pool has been closed. This question still stands, because I'd expect server.close() to be the method responsible for cleaning up each pool.
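A minimal sketch of the ordering being suggested, reusing the names from the snippet above (this is illustrative, not necessarily the final implementation):
// Sketch: error the connection first, then check it in, letting checkIn()
// emit the checked-in event and then the closed event, in that order.
for (const conn of this.checkedOut) {
  conn.onError(new ConnectionPoolClosedError());
  this.checkIn(conn);
}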
I think my reply to the other comment thread clarifies the reason for this, lmk if not.
I added a test to demonstrate we're emitting check-ins as expected; the handling for check-in already happens where the operation is being awaited in server.command().
I still think we should have exactly one mechanism for interrupting in-use connections. It probably doesn't make sense to share a method, because the other interruptInUse method has particular conditions about what it closes (i.e., pool generation requirements), but we should use the same pattern in both cases.
Here's the current implementation of the old interruptInUse:
for (const connection of this.checkedOut) {
if (connection.generation <= minGeneration) {
connection.onError(new PoolClearedOnNetworkError(this));
this.checkIn(connection);
}
}
iirc, the implementation was important for two reasons:
- according to the spec tests, check in events must be emitted before connection closed events (this is handled by checkIn())
- ordering of these events relative to other events matters, so we needed to process them synchronously per-connection (i.e., call checkIn() manually instead of letting the connection destroy itself, and relying on the error handling in server.ts to check the connection back in)
Even if we don't consider 2. to be important for your current PR, I think we should ensure 1. is satisfied. It currently isn't, because we manually emit the closed event before destroying the connection. So at a minimum, I'd suggest removing that logic. I'd also suggest using the same logic as we use in interruptInUse already, for consistency.
Thoughts?
(and related to my last comment): can we assert on the ordering of checkin + close events in the test you added?
Updated to assert the ordering of events and simplified the helper on the pool to make that the case.
Why should we call checkIn twice? We should be able to remove the checkIn call from interruptInUse.
I answered that above, although we can re-evaluate now I suppose because the connection layer has changed. But:
iirc, the implementation was important for two reasons:
- according to the spec tests, check in events must be emitted before connection closed events (this is handled by checkIn())
- ordering of these events relative to other events matters, so we needed to process them synchronously per-connection (i.e., call checkIn() manually instead of letting the connection destroy itself, and relying on the error handling in server.ts to check the connection back in)
checking in immediately ensures that we get a checkIn + closed event synchronously
checkIn and close events are both emitted from the checkIn() function, so they're always emitted one after the other synchronously. Which part is async? Is there a test I can write to assert this behavior?
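A simplified sketch of what I mean (not the actual implementation; field and event names are illustrative):
// Simplified sketch, not the real checkIn(): both events come from the same
// synchronous call, so they are always emitted back to back.
checkIn(connection: Connection): void {
  this.checkedOut.delete(connection);
  this.emitAndLog(
    ConnectionPool.CONNECTION_CHECKED_IN,
    new ConnectionCheckedInEvent(this, connection)
  );
  if (this.closed || connection.closed) {
    // destroyConnection emits CONNECTION_CLOSED with the appropriate reason.
    this.destroyConnection(connection, this.closed ? 'poolClosed' : 'error');
  }
}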
Try making the change and seeing if anything fails? Maybe nothing will fail. My memory is failing me and I don't remember exactly why this was so important / which tests failed.
Sorry for not explaining thoroughly though. The issue was specifically with the ordering of events relative to other pool + SDAM events, not the immediacy of a checkIn + close event for the same connection (although that is also important, it won't change because checkIn is synchronous). The checkIn + close events will always be next to each other, you're right, but there's nothing that guarantees, if we remove the call to checkIn from interruptInUse, that the checkIn + close will happen immediately with no other events before it.
Removed checkIn from interruptInUse and nothing is failing
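For reference, roughly what that leaves in interruptInUse (assuming the rest of the method is unchanged):
for (const connection of this.checkedOut) {
  if (connection.generation <= minGeneration) {
    // No explicit checkIn() here anymore; the errored connection is checked
    // back in by the normal error-handling path.
    connection.onError(new PoolClearedOnNetworkError(this));
  }
}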
@@ -677,6 +679,8 @@ export class MongoClient extends TypedEventEmitter<MongoClientEvents> implements
  writable: false
});

this.topology?.closeCheckedOutConnections();
I'm confused by the ordering of events here. Why are we explicitly closing in-use connections before we do other cleanup logic? I'd expect that this would just be handled automatically by topology.close(). Using topology.close() also means that we don't need to add new methods to the topology + server classes just for this use case.
Could this also lead to connection establishments during close? Hypothetically, if we kill all connections in a pool, killCursors or endSessions will require establishing new connections for these commands.
Why are we explicitly closing in-use connections before we do other cleanup logic?
To avoid a deadlock: since we're about to run operations (killCursors, abortTransaction, endSessions), we can deadlock if the pool is at maximum use (e.g. 100 checked-out connections). Closing in-use connections first means we have a pool that is ready for use, so we can proceed with the cleanup operations.
I'd expect that this would just be handled automatically by topology.close()
Maybe this was answered by the previous paragraph, but to be clear: we still need an open and operating topology, servers, pools, and connections to be able to perform best-effort clean-up.
Could this also lead to connection establishments during close?
Yes, but this is not new: a MongoClient can already have empty pools when we call MongoClient.close(), and running clean-up operations will create new connections. Even though we're closing in-use connections, that doesn't mean we will need to make a new connection; the pool can still have idle connections that won't be closed by this.
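Roughly the ordering being described (a sketch; apart from closeCheckedOutConnections and topology.close, the helper names here are illustrative, not actual driver methods):
// Sketch of the close() ordering; cleanup helper names are illustrative.
async close(): Promise<void> {
  // 1. Close checked-out connections so the pool has capacity for the
  //    cleanup commands below (avoids the maxPoolSize deadlock).
  this.topology?.closeCheckedOutConnections();

  // 2. Best-effort cleanup that still needs a working topology and pools;
  //    these commands may check out idle connections or create new ones.
  await this.abortOpenTransactions();   // abortTransaction
  await this.killOutstandingCursors();  // killCursors
  await this.endOutstandingSessions();  // endSessions

  // 3. Finally tear down servers, pools, and the topology itself.
  this.topology?.close();
}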
Alternate approach: since cleanup is best-effort, could we instead just attempt the commands we need to send with a low waitQueueTimeoutMS, to set an upper bound on how long we wait in case of a deadlock?
The same proposal could be made for the killCursors run inside cursor.close() (not here in client.close(), but normally by the user).
But I think the reason we do "risk" the full waitQueueTimeoutMS and serverSelectionTimeoutMS is that I don't think we can predict good timeout values that will fit some common denominator, given the driver is general purpose and deployed in many ways, so we don't have a common case to target (e.g. the timing for mongosh is vastly different from an edge server app).
Closing in-use connections separately is specifically so our connection count isn't at maxPoolSize when we reach the next steps.
Can we add a test demonstrating that closing connections before killCursors prevents the deadlock you mentioned? Seems worth capturing.
Added a test, and I checked that it fails if I comment out closeCheckedOutConnections.
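For context, a sketch of the shape of such a test (assuming a failCommand failpoint with blockConnection to keep a connection checked out, a mocha-style harness, and a `uri` fixture; the actual test may differ):
import { once } from 'node:events';
import { MongoClient } from 'mongodb';

it('does not deadlock when every connection is checked out at close time', async function () {
  // With maxPoolSize: 1, one in-flight operation pins the pool's only connection.
  const client = new MongoClient(uri, { maxPoolSize: 1 });
  await client.connect();

  // Block the next find server-side so its connection stays checked out.
  await client.db('admin').command({
    configureFailPoint: 'failCommand',
    mode: { times: 1 },
    data: { failCommands: ['find'], blockConnection: true, blockTimeMS: 10000 }
  });

  // Start the blocked operation without awaiting it.
  const pending = client
    .db('test')
    .collection('docs')
    .find({})
    .toArray()
    .catch(() => null); // its connection is closed out from under it

  // Wait until the find has actually checked out the connection.
  await once(client, 'connectionCheckedOut');

  // Without closeCheckedOutConnections(), the endSessions sent by close()
  // would wait on a connection that is never checked back in.
  await client.close();
  await pending;
});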
Force-pushed from 3159584 to 99d6a90
Force-pushed from 233531a to 91a3103
src/error.ts (Outdated)
 *
 * @public
 **/
constructor(message = 'Operation interrupted because client was closed') {
Suggested change:
- constructor(message = 'Operation interrupted because client was closed') {
+ constructor() {
suggest: don't bother with a message, we only construct this in one place with the same error every time.
No message at all? That seems unconventional.
No, just hard code the message on the error. There's no reason to provide it in the constructor if we only use it in one place and always expect it to be the same.
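i.e. something like this (a sketch; the class and parent names here are illustrative):
// Sketch: hard-code the message rather than accepting it as a parameter.
export class MongoClientClosedError extends MongoAPIError {
  constructor() {
    super('Operation interrupted because client was closed');
  }
  override get name(): string {
    return 'MongoClientClosedError';
  }
}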
Oh gotcha
const insertCheckInAndCloses = allClientEvents
  .filter(e => e.name === 'connectionCheckedIn' || e.name === 'connectionClosed')
  .filter(({ address, connectionId }) =>
    insertConnectionIds.includes(`${address} + ${connectionId}`)
  );

expect(insertCheckInAndCloses).to.have.lengthOf(6);

// check that each check-in is followed by a close (not preceded by one)
expect(insertCheckInAndCloses.map(e => e.name)).to.deep.equal(
  Array.from({ length: 3 }, () => ['connectionCheckedIn', 'connectionClosed']).flat(1)
);
suggest: can we just deep assert on the events, like the UTR does?
expect(
allClientEvents.map(event => event.name)
).to.deep.equal(
[checkout, checkout, checkout, checkIn, close, checkIn, close, checkIn, close]
)
Or is there something that asserting on the actual connection ids captures here that my proposed assertion doesn't?
There are no assertions on the ids; they're just used to filter for the insert-related check-ins/check-outs/closes. There are more events because close runs endSessions, and there's a connection created for the initial ping. So there's a handful of things that aren't relevant to the test and would need revising if unrelated details about the other operations changed.
Description
What is changing?
Is there new documentation needed for these changes?
Yes
What is the motivation for this change?
Graceful event loop exit
Release Highlight
MongoClient.close() closes outstanding in-use connections
The MongoClient.close() method now closes connections that are in use, allowing the event loop to exit if the only remaining resource was the MongoClient.
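For example (an illustration of the new behavior, not code from the driver's test suite):
import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
const docs = client.db('app').collection('docs');

// An operation still in flight when close() is called keeps its connection
// checked out of the pool.
const pending = docs.find({}).toArray().catch(() => null);

// close() now also closes checked-out (in-use) connections, so no open socket
// keeps the event loop alive once close() resolves.
await client.close();
await pending;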
Double check the following
- Ran the npm run check:lint script
- PR title follows the format type(NODE-xxxx)[!]: description, e.g. feat(NODE-1234)!: rewriting everything in coffeescript