Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle certain QUICSocket.send errors #86

Merged
merged 1 commit into from
Feb 6, 2024

Conversation

tegefaulkes
Copy link
Contributor

@tegefaulkes tegefaulkes commented Feb 1, 2024

Description

This PR attempts to fix the problem with socket send errors not being handled properly when they're specific to the connection triggering the send.

Issues Fixed

Tasks

  • 1. Send errors specific to the connection triggering send should bubble back to that connection to be handled properly.
  • 2. Starting a connection should fail if the target address is invalid. In this case due to external addresses being invalid for loop back bound sockets.
  • 3. Temp network failure does not cause a crash but connections eventually time out.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@tegefaulkes tegefaulkes self-assigned this Feb 1, 2024
@ghost
Copy link

ghost commented Feb 1, 2024

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

@tegefaulkes
Copy link
Contributor Author

I think this stems from a design decision where data flowing downwards are normal function calls where data flowing upwards would be events. In this case it's making it inconvenient to bubble the send error where we need it.

Consider the following.

image

The QUICConnection emits a send event, the QUICClient OR QUICServer takes this and passes it along to the QUICSocket. This is already a little round-about but it breaks any reasonable flow back the the QUICConnection if there was an error.

The recv path takes a much cleaner flow through normal function calls.

For our needs, it would be much simpler if the QUICConnection makes the QUICSocket calls directly, then it can catch the errors it needs. Something like the following.

image

@CMCDragonkai
Copy link
Member

I want to maintain the design decision for now.

To solve the #78 I think the proper solution is about distinguishing these kinds of errors.

This is still doable with event/push flow.

It's a matter of the the thing catching the exception and then deciding exactly what kind of event to dispatch.

So instead of changing our control flow and data flow design, keep it the same and figure out how to distinguish 2 different event types.

@CMCDragonkai
Copy link
Member

Remember push flow is inherently multi-consumer possible. So QUIC connections are all registered on the the socket events, but one can also register a specific kind of event type - if they all register to the same event type, they can further check the contents of the event to see if it is relevant to them.

Generally it's more efficient to register unique event IDs - but this may not always be possible.

@CMCDragonkai
Copy link
Member

Any design decisions on how events is going to work needs to take into the global consideration of MatrixAI/Polykey#444, and to add commentary there.

@CMCDragonkai
Copy link
Member

We need to make sure our event abstractions and push flow abstraction makes sense in all our work.

@tegefaulkes
Copy link
Contributor Author

If we want to go the event route, then there are two things to do.

  1. QUICConnection needs to have a reference to QUICSocket to be aware of any send errors.
  2. QUICConnection needs to add an event handler to QUICSocket send errors to handle the errors.

Since there are many connections to 1 socket, each connection will have to filter for the events only relevant to them. This means every active connection will have to do some processing for every error event filtering for the only one relevant to itself. I'm not sure I like that.

Say we have 100 connections and the network cuts out, Now we have 100 sends failing. Each failed send will have an event being handled by all 100 connections trying to filter for what it wants. So we have 10000 if-checks happening. It's not going to scale very well.

@CMCDragonkai
Copy link
Member

Firstly QUICConnection already has reference to the QUICSocket. And there's only 1 socket for all of them.

As for scaling the dispatch, there are ways to do this, lots of ways to achieve it, but the original dataflow/controlflow design should be maintained.

@CMCDragonkai
Copy link
Member

Like I said, it's possible for lots of connections to register specific event IDs, thus enabling O(1) dispatch.

@CMCDragonkai
Copy link
Member

The event names right now correspond to the to the name of the type. But I believe if you go to js-events you can imagine how it can be architected to support parameterised names.

And this should lead you to understand how that might work seamlessly with observables.

@tegefaulkes
Copy link
Contributor Author

Okay, I'll look into it.

@CMCDragonkai
Copy link
Member

And every importantly, the node specific warnings on event listener counts might need to be entirely eliminated, and instead what we do is manage the total counter of events in a global event bus system that keeps track of it, and can do a total warning if it detects resource leaks at a global level.

@tegefaulkes
Copy link
Contributor Author

EventTarget is kind of not great to use, I've been thinking we'd just make our own pure TS event class at some point.

@CMCDragonkai
Copy link
Member

That's kind of what AbstractEvent is atm. I think that js-events is what needs to do its overrides/abstractions there, but the underlying system should still be EventTarget to keep it compatible with the rest of the JS ecosystem.

@tegefaulkes
Copy link
Contributor Author

I've got the fix down but just need to review and polish.

I'm thinking I can do a fix for MatrixAI/Polykey-CLI#115 in this PR as well. At minimum I have the groundwork down for it. I just need to work out what errors happen when the internet drops out.

@tegefaulkes
Copy link
Contributor Author

Ultimately the solution was to create a new event for a send error that is coded with the connection ID of the connection so the connection can listen for the error event for it specifically. Right now it only this logic applies mostly to QUICClient since the EINVAL error can only happen when specifying bad parameters for new connections. server connections shouldn't run into this.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Feb 5, 2024

I think if we just ignore or do warnings about the following error then we can prevent crashes when the network cuts out.

Error: send ENETUNREACH ::ffff:13.54.214.222:1314
    at doSend (node:dgram:716:16)
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)
    at afterDns (node:dgram:662:5)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  errno: -101,
  code: 'ENETUNREACH',
  syscall: 'send',
  address: '::ffff:13.54.214.222',
  port: 1314
}

@tegefaulkes
Copy link
Contributor Author

I'm almost done with this. I'm just adding some tests to test network outages. Mainly mocking the udpsocket send to throw the same kind of error I posted in the above comment.

The first test is working fine. But the 2nd where the connection times out when the sends is failing is working mostly. The only problem is the ErrorQUICConnectionIdleTimeout is being thrown somewhere even after the test ends successfully. Trying to track down the error through all the event handling is tricky since we're loosing basically all useful stack information in the process.

@CMCDragonkai
Copy link
Member

Error handling for push flows is inherently different from pull flows. Errors have to explicitly handled at the handler and possibly flow to a different global error handler. Don't rely on the error stack for push flows.

@tegefaulkes tegefaulkes force-pushed the feature-socket-errors branch from 657f032 to e2c084b Compare February 6, 2024 00:13
@tegefaulkes
Copy link
Contributor Author

I'm considering this one done now if you want to look it over.

@tegefaulkes tegefaulkes marked this pull request as ready for review February 6, 2024 01:37
This commit addresses the send failures by taking any send specific errors and passing them back to the connection to be handled.
Any errors such as bad arguments results in the connection throwing the problem proactivity.
Any network failures are generally ignored and the Connections are left to time out. This allows for the network to drop for a short amount of time without failure of the connection and streams.

[ci skip]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

js-quic hard crashes when bound to loopback and creating connection to an external address.
2 participants