
Tunnel connections get corrupted leading to 502 errors #12

Open
mode777 opened this issue Feb 23, 2024 · 13 comments

Comments

@mode777

mode777 commented Feb 23, 2024

When tunneling a reasonably complex website, at some point the connections in the frontend become corrupt and unusable, leading to a 502 error for the end user. Once this happens the YARP frontend stays unusable and needs to be restarted. This happens sporadically, but the longer the system has been running, the more likely it becomes. We had some success with closing the corrupt connection and waiting for the backend to re-establish it; that at least makes the system usable again until the problem occurs the next time.

Repro Steps

This is how I was able to reliably reproduce it. Please note that the Azure site is only an example here, as we have experienced this with multiple sites.

  1. Set the Destination Cluster Address to "https://azure.microsoft.com"
  2. Start both Frontend and Backend application
  3. Open the Frontend URL: https://localhost:7244/ --> Microsoft Azure Site will be shown
  4. Wait for 5 minutes
  5. Refresh the page twice (!) --> Loading will fail with error 502

I've reproduced this with both WS and HTTP2 connections on multiple machines.

Additional Information

For WebSockets this leads to corrupt data coming out of the stream when reading the response from the tunnel in the backend. This then causes non-deterministic errors when parsing the HTTP response.

for example...

Yarp.ReverseProxy.Forwarder.HttpForwarder[48]
      ResponseBodyDestination: The destination reported an error when copying the response body.
      System.Net.Http.HttpIOException: Received an invalid chunk terminator: '0'. (InvalidResponse)
         at System.Net.Http.HttpConnection.ChunkedEncodingReadStream.ReadChunkFromConnectionBuffer(Int32 maxBytesToRead, 

but more often...

Request: An error was encountered before receiving a response.
      System.Net.Http.HttpRequestException: Received an invalid status line: '1000'.
         at System.Net.Http.HttpConnection.ParseStatusLineCore(Span`1 line, HttpResponseMessage response)
CancellationTokenRegistration cancellationRegistration)

(The invalid status line bit will often also contain garbage)

For HTTP2 this always leads to an error like the following, with the same results:

Request: An error was encountered before receiving a response.
      System.Net.Http.HttpRequestException: An error occurred while sending the request.
       ---> System.InvalidOperationException: Concurrent reads or writes are not supported.
         at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)

This makes me wonder if race conditions might be involved here.

We already spent quite some time debugging this but haven't been able to determine the root cause. Maybe someone can help.

@piamaydpsbs

During our tests with this demo we've observed the same error from time to time. We tested it with a simple test console and multiple requests. Sometimes everything stays fine, but then we also got System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.InvalidOperationException: Concurrent reads or writes are not supported. It would be great if someone could find a solution.

@jimmycartrette

I'm finding that, when using this in our practical circumstances, any reuse of the stream on the frontend side works 0% of the time for both HTTP and WebSockets. It might require a timeout on one side to replicate: if I leave the page sitting, a polling interval will eventually hang after around 2 minutes. So changing the following code areas in TunnelExtensions:

            // Keep reusing this connection while, it's still open on the backend
            // while (ws.State == WebSocketState.Open)
            // {
                // Make this connection available for requests
                await responses.Writer.WriteAsync(stream, context.RequestAborted);

                await stream.StreamCompleteTask;

                await stream.DisposeAsync();
            // }

and

            // Keep reusing this connection while, it's still open on the backend
            // while (!context.RequestAborted.IsCancellationRequested)
            // {
                // Make this connection available for requests
                await responses.Writer.WriteAsync(stream, context.RequestAborted);

                await stream.StreamCompleteTask;

                await stream.DisposeAsync();
            // }

Allows it to work for more than 2 minutes or so. However, I eventually saw "Concurrent reads or writes are not supported." on the HTTP connection after about an hour with this, but didn't get a chance to investigate that at the time.

I'm still looking into which underlying issues are breaking the streams. Obviously removing the reuse should affect throughput and maybe allocations, but the tunnel isn't really usable in its current form.

@davidfowl
Owner

Did anyone figure it out? I haven't had any time to spend on this recently.

@jimmycartrette

I haven't. I've moved on with the non-reused WebSocket streams code above, and it is good enough to continue into a demo environment so far. Once I get into resiliency and performance testing we may have to revisit.

@Noahnc

Noahnc commented Jun 18, 2024

I had the same issue: the WebSocket stream got corrupted after it had been disposed once. The following fixes the issue for me: #14

@davidfowl
Owner

@Noahnc you're only seeing this issue for websockets and not http2?

@Noahnc

Noahnc commented Jun 18, 2024

@davidfowl I only tested with websocket as transport, but for http2 the issue is probably the same.

@Noahnc

Noahnc commented Jun 18, 2024

@davidfowl I just tested with HTTP2 transport and the fix does not work there, only for WebSocket. I probably have time tomorrow to look further into this. Regarding the cancellation token not being canceled, I just checked the source code of the HttpConnection and there is currently no token passed to the _stream.ReadAsync() call for the zero-byte read-ahead. Is this intentional or a bug?

@mode777
Author

mode777 commented Jun 19, 2024

@Noahnc I can confirm that #14 fixes the issue with WS following the steps described above. HTTP2 shows the same error behaviour.

@jimmycartrette

This seems to remove the error server-side, but I still get an error client-side, and it is still not successfully reusing the connection for me. The WebSocket is always in state Aborted. On the client side I get:
System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.

On the server side, the logged state is always Aborted:
await stream.StreamCompleteTask;
logger.LogInformation("On complete, WS state is {state}", ws.State);

It does seem to recover a little better with the abort, since it also prevents re-use (as my change above was doing) rather than failing badly on the next attempt to send. Although I do still get 502s, just more rarely.

From what I read here:
https://mcguirev10.com/2019/08/17/how-to-close-websocket-correctly.html

This matches exactly what we see with the new linked token from the Dispose cancelling the ReceiveAsync:

A very surprising problem I ran into is how ReceiveAsync reacts when you actually cancel the CancellationToken passed into it. In short, it trashes the WebSocket and sets the State to Aborted. In other words, if you cancel ReceiveAsync you can’t ever use that WebSocket again for anything, including trying to close it gracefully. (For anyone who has ever suffered through low-level socket programming against the low-level winsock API, this actually makes sense – but I had hoped the managed implementation was a little more friendly.) All you can do is Dispose it and move on. Also, the party at the other end of the connection will (as usual) throw the exception The remote party closed the WebSocket connection without completing the close handshake.

I don't understand whether it can be reused. I'm also not sure why the TunnelClientFactory seems to dispose. I'd rather it stay alive forever than appear to time out after about 90-120 seconds and then either try to reuse and fail, or (with the while loop gone) just cause a reconnect. Sure, I get that it makes sense to have the number of connections go up and down with demand, but I'd like to opt out if I can.

I'll have to go back to removing the while loop in TunnelExtensions that attempts to re-use apparently un-reusable connections. I've got no 502s at all with that so far. The TunnelClientFactory handles the normal send of many requests until whatever timeout happens; once that timeout happens and the socket is aborted, I don't see how it is recoverable.
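
For reference, the graceful alternative the quoted article implies looks roughly like this (a generic .NET sketch, not code from this repo; the class name is illustrative): send your own close frame and keep receiving until the peer acknowledges it, instead of cancelling a pending ReceiveAsync.

    using System;
    using System.Net.WebSockets;
    using System.Threading;
    using System.Threading.Tasks;

    static class WebSocketShutdown
    {
        // Closes without cancelling a pending ReceiveAsync, so the socket is not aborted.
        public static async Task CloseGracefullyAsync(WebSocket ws, CancellationToken ct)
        {
            // Send our close frame first...
            await ws.CloseOutputAsync(WebSocketCloseStatus.NormalClosure, "shutting down", ct);

            // ...then drain incoming frames until the peer echoes the close handshake.
            var buffer = new byte[1024];
            while (ws.State == WebSocketState.CloseSent)
            {
                var result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
                if (result.MessageType == WebSocketMessageType.Close)
                    break; // handshake complete; State is now Closed
            }
        }
    }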

@jimmycartrette

Thinking about this a little further, there is probably little difference between the posted PR and removing the while loop, except that removing the whole loop and not cancelling the ReceiveAsync will cause a clean close, because the cancellation token ensures the while loop doesn't come into play anyway. Unless there really is a way to reuse an aborted WebSocket connection, fixing the issue may mean not trying to reuse the WebSocket at that stage, but instead finding which side of the WebSocket is causing the disconnect after 1-2 minutes of idle activity and resolving that. I won't be able to fire up Wireshark for a couple of weeks to investigate.

The other challenge with the timeout disconnect is that the connection ramp-up tends to happen all at once in testing, so the disconnects all happen at the same time as well. When I try to reload right at that disconnect period, all connections are trying to recover simultaneously.

@Noahnc

Noahnc commented Jun 23, 2024

@jimmycartrette I can confirm that my previous fix doesn't work, since the stream is aborted by cancelling the cancellation token (I totally missed that in my testing). I instead tried the following, which in my initial testing seems to work: Noahnc@78f5e5a

This works by checking whether the stream has been disposed after a zero-byte read returns to the ReadAsync function. If the stream has been disposed in the meantime, we throw an exception to stop the calling code from reading any further from the stream.

Sadly, this also only works for the WebSocket stream; the HTTP2 stream still breaks after dispose.
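
Roughly, the idea is the following (a simplified sketch of the concept only, not the actual commit; the wrapper name and members are illustrative):

    using System;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    public sealed class DisposeTrackingStream : Stream
    {
        private readonly Stream _inner;
        private volatile bool _disposed;

        public DisposeTrackingStream(Stream inner) => _inner = inner;

        public override async ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
        {
            int read = await _inner.ReadAsync(buffer, cancellationToken);

            // SocketsHttpHandler issues zero-byte reads as a read-ahead. If the tunnel
            // stream was disposed while that read was pending, fail fast so the caller
            // doesn't keep parsing responses from a connection that is being recycled.
            if (buffer.Length == 0 && _disposed)
                throw new ObjectDisposedException(nameof(DisposeTrackingStream));

            return read;
        }

        protected override void Dispose(bool disposing)
        {
            _disposed = true;
            if (disposing) _inner.Dispose();
            base.Dispose(disposing);
        }

        // Remaining members delegate to the inner stream.
        public override bool CanRead => _inner.CanRead;
        public override bool CanSeek => false;
        public override bool CanWrite => _inner.CanWrite;
        public override long Length => throw new NotSupportedException();
        public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
        public override void Flush() => _inner.Flush();
        public override int Read(byte[] buffer, int offset, int count) => _inner.Read(buffer, offset, count);
        public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
        public override void SetLength(long value) => throw new NotSupportedException();
        public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    }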

@Noahnc

Noahnc commented Aug 6, 2024

Was anyone able to validate the fix in #15?
For the solution I built for my employer, I used a different approach for connection management with the following key concepts:

  1. Each BackendProxy connection is registered as its own Yarp backend target on the frontend. Each target has a max stream count set to 1.
  2. Requests sent from the frontend to the backend are forced to use HTTP2 for better multiplexing over one stream.
  3. The idle timeout for each stream in the handler is increased from 1 minute to 1 hour.
  4. For each registered backend target, a Yarp active health check is configured for monitoring and connection keep-alive. The health check sends requests to an internal endpoint of the BackendProxy.

So far, this approach appears to be very stable and performant.
Unfortunately, I can't share any code since the solution is for my employer.
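
Since I can't share the code, here is a rough, hypothetical sketch of what points 1, 2 and 4 could look like using YARP's standard configuration model (the cluster ID, destination addresses, health path and intervals are placeholders; the per-connection stream limit and the longer idle timeout from points 1 and 3 live in a custom tunnel handler/factory, which is not shown):

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Net.Http;
    using Yarp.ReverseProxy.Configuration;
    using Yarp.ReverseProxy.Forwarder;

    // One cluster for the tunnel; each connected BackendProxy instance gets its own destination.
    var tunnelCluster = new ClusterConfig
    {
        ClusterId = "backend-tunnel",
        HttpRequest = new ForwarderRequestConfig
        {
            // Force HTTP/2 so requests are multiplexed over the tunnel stream (point 2).
            Version = HttpVersion.Version20,
            VersionPolicy = HttpVersionPolicy.RequestVersionExact
        },
        HealthCheck = new HealthCheckConfig
        {
            // Active health checks double as monitoring and keep-alive traffic (point 4).
            Active = new ActiveHealthCheckConfig
            {
                Enabled = true,
                Interval = TimeSpan.FromSeconds(30),
                Timeout = TimeSpan.FromSeconds(5),
                Policy = "ConsecutiveFailures",
                Path = "/internal/health" // placeholder for the BackendProxy's internal endpoint
            }
        },
        Destinations = new Dictionary<string, DestinationConfig>
        {
            // Registered dynamically, one entry per tunnel connection (point 1).
            ["tunnel-connection-1"] = new DestinationConfig { Address = "https://backend-proxy-1/" }
        }
    };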
