
CSHARP-3550: CSOT: Server Selection #1705

Open
wants to merge 4 commits into
base: main

Conversation

sanych-sun
Member

No description provided.

@sanych-sun sanych-sun requested a review from a team as a code owner June 6, 2025 01:11
@sanych-sun sanych-sun removed the request for review from a team June 6, 2025 01:16

public TimeSpan Elapsed => _stopwatch.Elapsed;

public TimeSpan Timeout { get; }
Contributor

I think we need TWO properties:

public TimeSpan OperationTimeout { get; }
public TimeSpan StepTimeout { get; }

to keep track of BOTH kinds of timeouts.

That probably also means we need two stopwatches: one for the operation as a whole, and the other for whatever sub step we are in (server selection, etc.).

The reason I say this is that when a timeout occurs we should be able to say in the message whether the whole operation timed out or whether a sub step timed out.

Member Author

@sanych-sun sanych-sun Jun 6, 2025

Interesting idea. What about keeping the single timeout as is and adding a ParentContext property instead? I would like to keep the context more or less immutable.

Contributor

Having a ParentContext might be a good idea in its own right.

You still need a way to detect when a timeout occurs whether it was the OperationTimeout or the StepTimeout that timed out. Does a ParentContext help you with that?

Member Author

Pseudo-code to check which timeout occurred:

if (operationContext.IsTimedOut())
{
    // Local timeout: there is no parent context, or the parent has not timed out yet.
    var isLocalTimeout = operationContext.ParentContext?.IsTimedOut() != true;
}

So the logic is: the timeout is "local" when there is no parent context or the parent context has not timed out yet.

Member Author

This code works because a child context is always stricter than its parent. That means there should never be a case where the ParentContext has timed out but the child has not.

Contributor

In a way we are talking about the same thing, since a context represents (among other things) a timeout.

d) whole operation is timed out, but server selection is not - this is impossible state as per spec.

I still think this is possible. I just have to configure the operation timeout to be smaller than the server selection timeout.

Member Author

@sanych-sun sanych-sun Jun 6, 2025

The spec says we should use the stricter timeout in this situation:

If timeoutMS is set, drivers MUST use min(serverSelectionTimeoutMS, remaining timeoutMS), referred to as computedServerSelectionTimeout as the timeout for server selection and connection checkout.

https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.md#server-selection
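
As a rough illustration of the rule quoted above, the computation could be sketched as follows. `CsotSketch` and `ComputeServerSelectionTimeout` are hypothetical names, not driver APIs, and the sketch assumes both timeouts are finite, with `null` standing in for "timeoutMS is not set".

```csharp
using System;

// Hypothetical helper for illustration only; not part of the driver.
internal static class CsotSketch
{
    // Returns min(serverSelectionTimeout, remaining operation timeout).
    // Assumption: both values are finite; null means timeoutMS is not set.
    public static TimeSpan ComputeServerSelectionTimeout(
        TimeSpan serverSelectionTimeout,
        TimeSpan? remainingOperationTimeout)
    {
        if (remainingOperationTimeout is null)
        {
            return serverSelectionTimeout;
        }

        var remaining = remainingOperationTimeout.Value;
        return remaining < serverSelectionTimeout ? remaining : serverSelectionTimeout;
    }
}
```

With a 30 second server selection timeout and 5 seconds left on the operation, the sketch yields 5 seconds. With no operation timeout it falls back to the configured server selection timeout.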

Contributor

I think we aren't talking about the same thing again.

I'm saying that the operation timeout can be less than the server selection timeout. Of course I can do that.

That's not the same thing as the "computedServerSelectionTimeout".

When the "computedServerSelectionTimeout" (NOT a configured value, but a value that is computed and is different every time) times out, we probably need to know WHY it timed out. Did it time out because the operation timed out or because server selection timed out? I think the error message should say why. The user would likely want to know which timeout fired, because that tells them which timeout value to reconfigure if they didn't want to time out so quickly.

Contributor

@rstam rstam Jun 6, 2025

d) whole operation is timed out, but server selection is not - this is impossible state as per spec.

I don't think the spec says this is impossible. The spec describes how to calculate the "computedServerSelectionTimeout", but if the configured operation timeout is less than the configured server selection timeout and we time out during server selection, it is NOT the server selection that timed out, but the WHOLE operation.
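
To make the distinction concrete, here is a minimal sketch that computes the effective timeout and also records which configured value produced it, so that an error message can name the setting to reconfigure. All names (`SketchTimeoutSource`, `TimeoutSourceSketch`, `Compute`, `Describe`) are hypothetical, the message wording is invented for illustration, and both inputs are assumed to be finite. Attributing a tie to server selection is an arbitrary choice.

```csharp
using System;

// Hypothetical types for illustration only; not part of the driver.
internal enum SketchTimeoutSource { ServerSelection, Operation }

internal static class TimeoutSourceSketch
{
    // Picks the stricter timeout and remembers where it came from.
    public static (TimeSpan Timeout, SketchTimeoutSource Source) Compute(
        TimeSpan serverSelectionTimeout,
        TimeSpan remainingOperationTimeout)
    {
        return remainingOperationTimeout < serverSelectionTimeout
            ? (remainingOperationTimeout, SketchTimeoutSource.Operation)
            : (serverSelectionTimeout, SketchTimeoutSource.ServerSelection);
    }

    // Illustrative wording only.
    public static string Describe(SketchTimeoutSource source)
    {
        return source == SketchTimeoutSource.Operation
            ? "The operation timeout expired during server selection."
            : "The server selection timeout expired.";
    }
}
```

If the operation has 1 second left and server selection is configured for 30 seconds, `Compute` reports a 1 second timeout whose source is the operation, which is the case described in the comment above.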

Member Author

As discussed offline:
We were both talking about the same thing, but there was some confusion between the "configured" timeout and the "computed" timeout. We confirmed that we DO need to distinguish which timeout occurred: the "local" timeout or the whole-operation timeout. We also decided to go with the ParentContext approach, since it covers our needs and allows more than two levels of nesting. To check whether a local timeout or the whole-operation timeout occurred, we examine the ParentContext property and see whether it has timed out. If there is no ParentContext and the context has timed out, the whole operation timed out.
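
A minimal sketch of the ParentContext idea, to show how the check composes across several nesting levels. This is not the PR's OperationContext: `SketchContext`, the injected clock delegate, and `IsLocalTimeout` are assumptions made for illustration, and only finite timeouts are handled. The child's timeout is capped at the parent's remaining time, which is what keeps a child at least as strict as its parent.

```csharp
using System;

// Hypothetical sketch; not the driver's actual type.
internal sealed class SketchContext
{
    private readonly Func<TimeSpan> _clock;   // monotonic clock shared by parent and children
    private readonly TimeSpan _startedAt;

    public SketchContext(TimeSpan timeout, Func<TimeSpan> clock)
        : this(timeout, clock, null)
    {
    }

    private SketchContext(TimeSpan timeout, Func<TimeSpan> clock, SketchContext parentContext)
    {
        _clock = clock;
        _startedAt = clock();
        Timeout = timeout;
        ParentContext = parentContext;
    }

    public TimeSpan Timeout { get; }
    public SketchContext ParentContext { get; }
    public TimeSpan Elapsed => _clock() - _startedAt;
    public TimeSpan RemainingTimeout => Timeout - Elapsed;

    public bool IsTimedOut() => RemainingTimeout < TimeSpan.Zero;

    // The child never outlives its parent: its timeout is capped
    // by the parent's remaining time at the moment of creation.
    public SketchContext WithTimeout(TimeSpan timeout)
    {
        var remaining = RemainingTimeout;
        var childTimeout = timeout < remaining ? timeout : remaining;
        return new SketchContext(childTimeout, _clock, this);
    }

    // "Local" means this context timed out while its parent (if any) has not.
    public bool IsLocalTimeout() => IsTimedOut() && ParentContext?.IsTimedOut() != true;
}
```

In production code the clock would typically wrap a `Stopwatch` (for example `() => stopwatch.Elapsed`). Injecting it here only makes the behaviour easy to exercise deterministically. A root context that has timed out reports a local timeout because it has no parent, which corresponds to the whole operation timing out.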

{
_stopwatch = stopwatch;
Timeout = timeout;
CancellationToken = cancellationToken;
Contributor

I recommend not mixing backing fields and auto-properties in the same class.

return remainingTimeout < TimeSpan.Zero;
}

public OperationCancellationContext WithTimeout(TimeSpan timeout)
Contributor

Maybe should be called WithStepTimeout?


namespace MongoDB.Driver
{
internal sealed class OperationCancellationContext
Contributor

OperationContext?

Member Author

Done


@@ -47,7 +47,7 @@ private Exception CreateTimeoutException(Stopwatch stopwatch, string message)
var checkOutsForOtherCount = checkOutsCount - checkOutsForCursorCount - checkOutsForTransactionCount;

message =
-$"Timed out after {stopwatch.ElapsedMilliseconds}ms waiting for a connection from the connection pool. " +
+$"Timed out after {operationContext.Elapsed.TotalMilliseconds}ms waiting for a connection from the connection pool. " +
Contributor

minor: Elapsed.TotalMilliseconds seems to be used frequently. Does this warrant a shortcut property, e.g. ElapsedMilliseconds?

// TODO: this static field is temporary here and will be removed in a future PRs in scope of CSOT.
public static readonly OperationContext NoTimeout = new(System.Threading.Timeout.InfiniteTimeSpan, CancellationToken.None);

private readonly Stopwatch _stopwatch;
Contributor

So we avoid creating multiple stopwatches on the operation execution path. It's minor but still nice.

}

stopwatch.Stop();
-throw CreateException(stopwatch);
+throw CreateException(operationContext);
Contributor

Is the stopwatch stopped at some point?
Should it be stopped after the first IsTimedOut call (or StopAndValidateTimeout or similar)? It seems more accurate to report the elapsed time as of the first validation rather than when the actual report/log is created later.

Member Author

@sanych-sun sanych-sun Jun 6, 2025

It's not stopped, because the same context may still be in use afterwards. For example, if IsTimedOut returns false, we could decide to wait longer. Maybe we should replace IsTimedOut with an EnsureTimeout method that throws on timeout and saves the current elapsed time in the exception together with the timeout source (whether it was the whole-operation timeout, the server selection timeout, or something else).
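
One possible shape for that idea, as a sketch only. `SketchTimeoutException`, `SketchTimeoutGuard`, and the injected elapsed-time delegate are hypothetical and not taken from the PR. The point is that the elapsed time is read once, at the moment of the check, and travels with the exception, so a later log line does not report a larger value.

```csharp
using System;

// Hypothetical exception type; the name and shape are assumptions.
internal sealed class SketchTimeoutException : TimeoutException
{
    public SketchTimeoutException(string timeoutSource, TimeSpan elapsed)
        : base($"{timeoutSource} timed out after {elapsed.TotalMilliseconds}ms.")
    {
        TimeoutSource = timeoutSource;
        Elapsed = elapsed;
    }

    public string TimeoutSource { get; }   // e.g. "Operation" or "Server selection"
    public TimeSpan Elapsed { get; }       // elapsed time captured when the check failed
}

// Hypothetical stand-in for the context; only the timeout check is modelled.
internal sealed class SketchTimeoutGuard
{
    private readonly string _name;
    private readonly TimeSpan _timeout;
    private readonly Func<TimeSpan> _elapsed;

    public SketchTimeoutGuard(string name, TimeSpan timeout, Func<TimeSpan> elapsed)
    {
        _name = name;
        _timeout = timeout;
        _elapsed = elapsed;
    }

    // Throws if the timeout has been exceeded; otherwise the caller may keep waiting.
    public void EnsureTimeout()
    {
        var elapsed = _elapsed();   // read once so the reported value matches the check
        if (elapsed > _timeout)
        {
            throw new SketchTimeoutException(_name, elapsed);
        }
    }
}
```

The stopwatch behind the delegate keeps running, so the same guard can be checked repeatedly while waiting, and whoever catches the exception sees the elapsed time as of the failed check.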

[Values(false, true)]
bool async)
{
-var subject = CreateSubject();
+var subject = CreateSubject(serverSelectionTimeout: TimeSpan.FromMilliseconds(10));
Member Author

This adjusted serverSelectionTimeout is a test optimization: we probably do not have to wait the whole 2 seconds here (the default server selection timeout for this test class, see line 49).

@@ -134,23 +134,13 @@ await Record.ExceptionAsync(() => subject.ExecuteWriteOperationAsync(null, opera
.Subject.ParamName.Should().Be("session");
}

private OperationExecutor CreateSubject(out Mock<IClusterInternal> clusterMock, out Mock<ICoreSessionHandle> implicitSessionMock)
Member Author

Leftovers of the implicit session creation, which was factored out in one of the latest commits related to the OperationExecutor refactoring.

@sanych-sun sanych-sun requested a review from rstam June 7, 2025 00:55
3 participants