Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transient error drops the subscription. (Connection reset by peer) #307

Open
PehrGit opened this issue Jan 23, 2024 · 7 comments
Open

Transient error drops the subscription. (Connection reset by peer) #307

PehrGit opened this issue Jan 23, 2024 · 7 comments

Comments

@PehrGit
Copy link
Contributor

PehrGit commented Jan 23, 2024

Describe the bug
We noticed that a subscription had stopped processing. We discovered that it was due to a SqlException with inner exception SocketException with message:

"A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 35 - An internal exception was caught) Unable to read data from the transport connection: Connection reset by peer. Connection reset by peer".

I believe this should be recognized as a transient error and retried, like the other error numbers listed in

static readonly List<int> TransientErrorNumbers = new() { 4060, 40197, 40501, 40613, 49918, 49919, 49920, 11001 };

To Reproduce
Steps to reproduce the behavior:

  • Have SQL Server subscription
  • Have server close the connection

Expected behavior
The error is recognized as transient and the message is retried.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: Linux Azure App Service
  • Eventuous version 0.15.0-beta.7

Additional context
There is no additional logging because this didn't occur during the processing of a message, it was in the middle of the night and nobody was using the system. So we assume it was just a hiccup on the Azure side.

Full stack trace:

Microsoft.Data.SqlClient.SqlException:
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Eventuous.SqlServer.Subscriptions.SqlServerSubscriptionBase`1+<PollingQuery>d__15.MoveNext (Eventuous.SqlServer, Version=0.15.0.0, Culture=neutral, PublicKeyToken=null)
Inner exception System.IO.IOException handled at Microsoft.Data.SqlClient.SqlCommand.EndExecuteReaderAsync:
   at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.ThrowException (System.Net.Sockets, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult (System.Net.Sockets, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at Microsoft.Data.SqlClient.SNI.SNINetworkStream+<ReadAsync>d__1.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SNI.SslOverTdsStream+<ReadAsync>d__5.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Net.Security.SslStream+<EnsureFullTlsFrameAsync>d__186`1.MoveNext (System.Net.Security, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Net.Security.SslStream+<ReadAsyncInternal>d__188`1.MoveNext (System.Net.Security, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SNI.SNISslStream+<ReadAsync>d__1.MoveNext (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
Inner exception System.Net.Sockets.SocketException handled at System.Net.Sockets.Socket+AwaitableSocketAsyncEventArgs.ThrowException:
@alexeyzimarev
Copy link
Contributor

alexeyzimarev commented Jan 23, 2024

I will accept the PR if you add this error to the list you mentioned. I am not sure that SQL error number will be 35 though.

@alexeyzimarev
Copy link
Contributor

Look also here dotnet/SqlClient#2103 (comment), it seems that on Windows it will produce 10053, but on Linux it's impossible to figure out.

@alexeyzimarev
Copy link
Contributor

Strangely enough, the only thing that should have happened is that the subscription would drop and resubscribe.
Can you confirm that the subscription just silently died?
Do you have any health checks set up using the provided diagnostics methods? https://eventuous.dev/docs/subscriptions/subs-diagnostics/#health-checks

@PehrGit
Copy link
Contributor Author

PehrGit commented Jan 24, 2024

Strangely enough, the only thing that should have happened is that the subscription would drop and resubscribe. Can you confirm that the subscription just silently died? Do you have any health checks set up using the provided diagnostics methods? https://eventuous.dev/docs/subscriptions/subs-diagnostics/#health-checks

I've tested again by running the app locally and stopping the SqlServer instance. I see the "Dropped" message in the logs but it doesn't resubscribe, and the health check keeps outputting "Healthy".

It makes sense, as the Resubscribe() method is only called from Dropped(), which is not called when the polling connection fails.

async Task PollingQuery(ulong? position, CancellationToken cancellationToken) {
var start = position.HasValue ? (long)position : -1;
while (!cancellationToken.IsCancellationRequested) {
try {
await using var connection = await OpenConnection(cancellationToken).NoContext();
await using var cmd = PrepareCommand(connection, start);
await using var reader = await cmd.ExecuteReaderAsync(cancellationToken).NoContext();
var result = reader.ReadEvents(cancellationToken);
await foreach (var persistedEvent in result.WithCancellation(cancellationToken).ConfigureAwait(false)) {
await HandleInternal(ToConsumeContext(persistedEvent, cancellationToken)).NoContext();
start = MoveStart(persistedEvent);
}
} catch (OperationCanceledException) {
// Nothing to do
} catch (SqlException e) when (TransientErrorNumbers.Contains(e.Number)) {
// Try again
} catch (SqlException e) when (e.Message.Contains("Operation cancelled by user.")) {
IsDropped = true;
Log.InfoLog?.Log("Polling query stopped");
}
catch (SqlException e) when (e.Number == 3980 && e.Message.Contains("Operation cancelled by user.")) {
IsDropped = true;
Log.InfoLog?.Log("Polling query stopped");
} catch (Exception e) {
IsDropped = true;
Log.WarnLog?.Log(e, "Dropped");
throw;
}
}
}

It is only called from HandleInternal, and only in the case of an OperationCanceledException

protected async ValueTask HandleInternal(IMessageConsumeContext context) {
try {
Logger.Current = Log;
var ctx = new AsyncConsumeContext(context, Ack, Nack);
await Handler(ctx).NoContext();
} catch (OperationCanceledException e) when (context.CancellationToken.IsCancellationRequested) {
context.LogContext.MessageHandlingFailed(Options.SubscriptionId, context, e);
Dropped(DropReason.Stopped, e);
} catch (Exception e) {
context.LogContext.MessageHandlingFailed(Options.SubscriptionId, context, e);
if (Options.ThrowOnError) throw;
}
}

@PehrGit
Copy link
Contributor Author

PehrGit commented Jan 24, 2024

Look also here dotnet/SqlClient#2103 (comment), it seems that on Windows it will produce 10053, but on Linux it's impossible to figure out.

Ah that's too bad. Thanks for looking into that!

I suppose we should focus on getting the SQL subscription to resubscribe when the connection drops, that should also fix this issue?

@PehrGit
Copy link
Contributor Author

PehrGit commented Jan 24, 2024

Looking at the ESDB AllStreamSubscription, I see that EventSubscription.Dropped() is called when the subscription drops.

Could it be that we just need to replace IsDropped = true; with a call to .Dropped() in this method?

Edit: it looks like this is already fixed in dev, where Dropped(DropReason.ServerError, e); is called in the catch:

async Task<PollingResult> Poll() {
try {
await using var connection = await OpenConnection(cancellationToken).NoContext();
await using var cmd = PrepareCommand(connection, start);
await using var reader = await cmd.ExecuteReaderAsync(cancellationToken).NoContext();
var result = reader.ReadEvents(cancellationToken);
var received = 0;
await foreach (var persistedEvent in result.NoContext(cancellationToken)) {
await HandleInternal(ToConsumeContext(persistedEvent, cancellationToken)).NoContext();
start = MoveStart(persistedEvent);
received++;
}
return new PollingResult(true, false, received);
} catch (Exception e) {
if (IsStopping(e)) {
IsDropped = true;
return new PollingResult(false, false, 0);
}
if (IsTransient(e)) {
return new PollingResult(true, true, 0);
}
Dropped(DropReason.ServerError, e);
return new PollingResult(false, false, 0);
}
}

@alexeyzimarev
Copy link
Contributor

@PehrGit do you think this issue has been solved? I believe all transient errors are added to the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants