Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timeout support to the Agnostic RPC system #510

Closed
tegefaulkes opened this issue Mar 3, 2023 · 45 comments · Fixed by #513
Closed

Add timeout support to the Agnostic RPC system #510

tegefaulkes opened this issue Mar 3, 2023 · 45 comments · Fixed by #513
Assignees
Labels
development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management

Comments

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Mar 3, 2023

Specification

We need to support default and client supplied timeouts for the RPCServer handlers.

We need to support some features with this.

  • User can provide a timeout when making a client call
  • Client calls have default timeouts,
  • When a client times out the connection is immediately ended with a timeout code.
  • Handlers are provided with a timeout timer as a TimedCancellable context
  • Handlers have a default timeout provided
  • Handler timeout can be overridden by the client supplied value but only if it's less than the default.
  • When a handler times out is the abort is advisory and then the connection is forced closed after a small delay.

I need to do some prototyping with how I want to implement this. We have two options.

  1. The RPC system supports timeouts directly.
  2. RPC system allows timeouts to be supplied via the middleware.

Note that the raw handlers don't support the middleware, It's pretty low level in that respect. So raw handlers wouldn't support timeouts with option 2. Conversely Option 1 means adding a timeout field to the RPC messages since by design the RPC system doesn't interpret or apply constraints to the params.

Given some constraints the first option would mean the timeout feature is always available and enforced. The second option means it's up to the implementation of the middleware and handlers to enforce it. I need to weigh both options after some prototyping. I'm leaning towards option 1, but It may require appending a field to the JSONRPC request message format.

RPCServer changes
  1. Two parameters have been added to the RPCServer when creating it. These are defaultTimeout and forceEndDelay. defaultTimeout sets the default amount of time before a handler times out and an abort signal is sent. Since this signal is advisory the forceEndDelay (name subject to change) sets the time between the signal and the streamPair being directly aborted.
  2. When handling a stream a timeout timer is created sends a timeout abort to the abort controller. When created it is started with a delay of defaultTimeout. When a handler is selected the timer is reset with the provided timeout for the handler if is is less than the default. If none is set then the timer is just refresh-ed.
  3. The duplex handler and by extension the other 3 basic handler types will refresh the timeout timer every time a message is sent or received. This is not done by the raw handler for now. A raw handler can directly affect the timeout timer by refreshing or resetting it.
  4. The handlers have a timeout parameter that can be set when implementing the handler. This will be used as the timeout if it defined and less than the RPCServer defaultTimeout that was provided.
  5. The signal and timer are provided to the handlers as a ctx: ContextTimed. The timer can be refresh-ed or reset(delay) or even cancel(reason). This gives us the freedom to override the timeout behaviour from the handlers if we want that. If that is not desired, we can reduce the functionality by casting it with a more restrictive interface.

Additional context

Tasks

  1. RPCServer handle timeouts and provide them to the handlers
  2. RPCClient takes a timeout value or default and force closes the connection when exceeded.
  3. RPCClient can communicate it's timeout to the RPCServer.
@tegefaulkes tegefaulkes added the development Standard development label Mar 3, 2023
@tegefaulkes tegefaulkes self-assigned this Mar 3, 2023
@tegefaulkes
Copy link
Contributor Author

This requires some prototyping and expansion of the spec. I think the cleanest way to implement this is option 1. But that means modifying the JSON RPC messages structure which we are trying to avoid. Mind that if the timeout parameter is optional it shouldn't conflict with the normal structure. Think of it as extending the message structure.
RPCMessageRequest & {timeout? number};

@CMCDragonkai
Copy link
Member

On the receiving side we should default to a value even if not set.

Absolute timeout vs keep alive timeout. For streams it could default to null for absolute timeout.

Keep alive timeout should be something that each rpc params sets up themselves.

I'm just wondering how we would prevent any DoS and we should avoid providing infinite streams for agent service so that we don't allow any DoS between agents.

@tegefaulkes
Copy link
Contributor Author

This is comprised of a two stages.

  1. Updating the RPC code to support timeouts and provide them to handlers via the ctx: ContextTimed parameter.
  2. updating all handlers to make use of the timeouts as well as set default timeouts.

The first taking maybe 1-2 days, while the 2nd is tedious given the amount of handlers so It could take 2-3 days. This could be done before or after the agent handler migration.

@CMCDragonkai
Copy link
Member

I think this or QUIC should be tackled first. The agent rpc migration has to wait until quic is ready and I'd like to have this ready for testing. So this and quic then the agent handler migration.

@tegefaulkes tegefaulkes mentioned this issue Mar 13, 2023
24 tasks
@tegefaulkes
Copy link
Contributor Author

I'll start with the server side default timeouts. For this the following changes need to be made

  1. Timeout parameter added to the handler classes so they can be set when defining the handler.
  2. Handlers now take a TimedCancellable context when they are called. Can we use the timed cancellable decorator here? How does that work with extended methods?
  3. The RPCServer will take a default timeout parameter. This should be a relatively large value to accommodate long running RPC handlers.
  4. When handling a stream we create the timeout and abort signal and pass it along to the handler. The value of the timeout should be the minimum out of the global default, handler default and the client provided value.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Mar 13, 2023 via email

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Mar 13, 2023 via email

@tegefaulkes
Copy link
Contributor Author

That's possible but a little more tricky. It requires some feed back compared to the easier method of creating the timeout and forgetting about it.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Mar 14, 2023

The reason is because a timeout for the entire duration of a stream is kind of useless, we don't really have a situation where that would be useful.

However a timeout for unary is ultimately intended to prevent DOS and agents that are just stuck.

Therefore a similar reason should apply for streams, it should be a per-message timer.

Furthermore:

  1. null - explicitly infinite
  2. undefined - should mean I don't care, set the default
  3. number - set a specific number

On the client side, during any given call, the client can set one of the 3 values above.

On the server side, it can do one of several things:

  1. Accept number
  2. Consider that number is too high, and set to the max
  3. Accept undefined and use the default
  4. Accept null and use the default
  5. Consider that null is too high, and set to the max
  6. Consider that number is too low, and set to the min

So therefore on the server side the number of configurations is a bit more sophisticated.

You could say:

  1. This handler has max timeout of 3000 (therefore anything higher than 3000 or null is set to 3000)
  2. This handler has max timeout of null - meaning infinity
  3. This handler has a min timeout of 100
  4. This handler has a min timeout of 0

So that could mean a handler should be able to set a min and max timeout to provide a range.

I'm actually not sure whether we even should have min/max, or just have a default.

Furthermore, I'm also not sure if we even allow the client to "request" a timeout. Perhaps it makes more sense that timeouts can be set on both sides, but there's no actual negotiation.

Here's an idea. Timeouts can be set on both sides, but they are only advisory to the other side.

  1. If a client sets a timeout (and the server doesn't set one), it's a deadline for the response. If a response does not arrive. Then the call is timed out with an abortion.
  2. If a server sets a timeout (and the client doesn't set one), it's a deadline for responding - as in the default timer to pass down to its operations.
  3. If both the client and server sets a timeout, then the resulting timeout will be the lower of the 2.
  4. If both the client and server does not set a timeout, then the resulting timeout is infinity.

@CMCDragonkai
Copy link
Member

I think the last idea is clearer... to do. And that is also applied per-message when in CS, DS, SS.

As for timeout applied to replies from the client, let's address this later, it's not that important.

@CMCDragonkai
Copy link
Member

@tegefaulkes the timed cancellable functions works by wrapping functions. It's a higher order function that wraps a lower order function.

There are actually 2 ways:

  1. Using the decorator applies to only class methods.
  2. Use the HOF wrapper

The first way can only work if handlers are defined as methods in a class. Arrow functions or just plain exported functions can only use the HOF wrapper. See the tests for how to use it.

@CMCDragonkai
Copy link
Member

BTW, there are 2 possible abortions.

  1. Timer expires, that aborts all subsequent operations - remember when we had the agent still doing work even when the client already died or disappeared. We want to prevent this!
  2. The client just cancelling the stream, if that occurs, we should also abort.

TimedCancellable supports both.

You may want to create symbols specific to the 2nd case though. See how we did something similar in the tasks domain.

@CMCDragonkai
Copy link
Member

#243 might not be fixable here under this issue until agent RPC migration has occurred.

In particular I wanted #243 to change how we do cross-signing from in-band duplex stream protocol to a nested series of unary calls.

@CMCDragonkai
Copy link
Member

If you're going to add a "default" timer to the RPC server class... you may need to have that as the 3 options: undefined, null, number. The undefined should be default.

@tegefaulkes
Copy link
Contributor Author

Two small problems with having the timer reset after every sent message.

  1. the Timer from @matrixai/timer doesn't support resetting it's delay. The NodeJS.Timeout supports it with timer.refresh(). So it's possible to cheese this feature by accessing the protected property Timer.timeoutRef. Care should be taken here to update the timestamp scheduled values as well when doing this.
  2. From the context of outside a stream itself, we can't really tell if data is being sent through it. So for the raw handler we can't know when to refresh the timer when a message is sent. We'll have to do it from inside a transform stream (not ideal) or when a message is read and written. this is not a problem for the other handlers since we convert to a stream in the duplex handler. We can refresh the timeout then. But for the raw handler it will be up to the implementation of the handler to refresh this timer. Just something to keep in mind.

To address 1 for now i'm going to make a refreshTimer(timer: Timer) utility function that will modify the private properties to refresh the timer. Ideally this should me a method supported by Timer itself.

@tegefaulkes
Copy link
Contributor Author

As for the timeout delays we can set. for the handlers I'm going with number | undefined, number sets the timeout for the handler where the lower value between it and the default. Undefined defaults to the RPCServer default value witch must be set. I don't see a need for null for infinite to be an option. It's only really needed for debug? and in that case you can just set a large value.

@tegefaulkes
Copy link
Contributor Author

I can refresh the timer but since the timestamp and scheduled are readonly I can't update them. So these values will be invalid. I'll need to update @matrixai/timer to support refreshing the delay of a timer.

@CMCDragonkai
Copy link
Member

Does the Timer have access to an infinite configuration? The null is just for the JSON to communicate that.

@CMCDragonkai
Copy link
Member

Timer method can be called refresh too. But if you want to parameterize it to a specific amount of time the it can be called reset.

However beware of what happens if the timer is already activated? I think the call should fail. Since if you reset the timer it could mean the designated function is called twice. The Timer is intended for a only once execution. We should maintain such behaviour.

@tegefaulkes tegefaulkes linked a pull request Mar 14, 2023 that will close this issue
11 tasks
@CMCDragonkai
Copy link
Member

Two small problems with having the timer reset after every sent message.

  1. the Timer from @matrixai/timer doesn't support resetting it's delay. The NodeJS.Timeout supports it with timer.refresh(). So it's possible to cheese this feature by accessing the protected property Timer.timeoutRef. Care should be taken here to update the timestamp scheduled values as well when doing this.
  2. From the context of outside a stream itself, we can't really tell if data is being sent through it. So for the raw handler we can't know when to refresh the timer when a message is sent. We'll have to do it from inside a transform stream (not ideal) or when a message is read and written. this is not a problem for the other handlers since we convert to a stream in the duplex handler. We can refresh the timeout then. But for the raw handler it will be up to the implementation of the handler to refresh this timer. Just something to keep in mind.

To address 1 for now i'm going to make a refreshTimer(timer: Timer) utility function that will modify the private properties to refresh the timer. Ideally this should me a method supported by Timer itself.

You mean on every received message right? Not on every sent message.

@CMCDragonkai
Copy link
Member

For the raw handlers you can reset the timer upon every byte of data.

Actually that may end up being a bit slow. Ideally you can just update a property. But I think that's what your function does anyway.

Its a keep alive timer at this point in time.

@CMCDragonkai
Copy link
Member

Ok instead of updating upon EVERY byte of data. You want to update when a "block" is received. That's how must sockets work. So a block could be 1 byte, it could be 64 bytes... etc.

@tegefaulkes
Copy link
Contributor Author

The timer is available in the context that is provided to the raw handler. I was planning to just have the raw handler implement it's timer refresh logic. I'm hesitant to refresh on the raw Uint8Array stream since it seems possible to DDOS by holding a bunch of connections open for a long time by sending a single byte every 60 seconds. This is compared to the duplex stream handler that refreshes after every valid JSONRPC message.

I'ts possible to add a transparent transform stream to the handleStream logic to track if data is sent and received. I just don't think it's worth adding.

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Mar 16, 2023

I'm quoting #510 (comment) here, the quote reply didn't seem to work.

For 1., I'm working on this, but for some extra protection I want to add a grace period after the timeout where the abort signal is sent before the stream is forced closed. This is to prevent the possibility of the handler failing to respect the abort signal.

for 2, if the client cancels it's stream then the stream ends on the server side. I can send the abort signal at this point just as an extra measure.

I don't think using symbols here is appropriate since the handler's perspective is external to the RPC. They should be descriptive errors.

@tegefaulkes
Copy link
Contributor Author

If you're going to add a "default" timer to the RPC server class... you may need to have that as the 3 options: undefined, null, number. The undefined should be default.

I've gone with two options, number | undefined for the handlers and number for the default. I don't like the idea of allowing for infinite timeout. Now that I think about it, since we using Timer for the timeout it does actually support Infinite as a number value so it is ultimately supported.

@CMCDragonkai
Copy link
Member

The timer is available in the context that is provided to the raw handler. I was planning to just have the raw handler implement it's timer refresh logic. I'm hesitant to refresh on the raw Uint8Array stream since it seems possible to DDOS by holding a bunch of connections open for a long time by sending a single byte every 60 seconds. This is compared to the duplex stream handler that refreshes after every valid JSONRPC message.

I'ts possible to add a transparent transform stream to the handleStream logic to track if data is sent and received. I just don't think it's worth adding.

I believe we should separate these duties.

The timer is best used for "keepalive".

For preventing DOS, we should use a separate mechanism, one that tracks resource usage and starts killing things. That's more of an OOM. Not in-scope here. In fact, we can just leave that to the Operating System to deal with.

@CMCDragonkai
Copy link
Member

If you're going to add a "default" timer to the RPC server class... you may need to have that as the 3 options: undefined, null, number. The undefined should be default.

I've gone with two options, number | undefined for the handlers and number for the default. I don't like the idea of allowing for infinite timeout. Now that I think about it, since we using Timer for the timeout it does actually support Infinite as a number value so it is ultimately supported.

No the problem is that Infinite is not a valid JSON value. So when you are sending this information over JSON, you have to distinguish it.

In fact... I believe undefined is actually not a valid JSON value.

You are only given 2 options: number or null.

So if you want to represent something as infinity, then I guess a very large number can be used. Something like Number.MAX_SAFE_INTEGER can be used.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Mar 16, 2023

I'm ok with not having an explicit infinite option.

These are going to be valid values in the RPC JSON:

deadline: 123
deadline: 0
deadline: null
deadline: Number.MAX_SAFE_INTEGER

Note that if you provide a negative deadline, it should be clipped to 0.

However a 0 deadline is technically invalid too. Therefore valid values is > 1. meaning 0 should be clipped to 1.

Furthermore our deadline should be on the order of milliseconds. Not seconds.

@tegefaulkes
Copy link
Contributor Author

The RPCServer now supports timeouts. When creating the server you can provide defaultTimeout and forceEndDelay. defaultTimeout is the default timeout used and also the maximum timeout. When a timeout happens an abort signal is sent. Since a signal can just be ignored, forceEndDelay defines how long to wait between sending the signal and force aborting the stream.

A timeout value can be provided when implementing the handler. This value is used for the timeout if it is provided. If it is undefined or larger than defaultTimeout then defaultTimeout is used.

@CMCDragonkai
Copy link
Member

What happens if forceEndDelay is not set?

Also forceEndDelay sounds confusing? Is there a better for this?

@CMCDragonkai
Copy link
Member

Can you make sure to preserve the semantics of the words used? If you're using timeout then timeout must be used as well during RPC calls. Consider unifying all the words used for tasks domain too and whatever is used when js-timer is used.

@CMCDragonkai
Copy link
Member

What did you end up doing for JSON? The number and null?

@CMCDragonkai
Copy link
Member

Remember to update the spec above with the final design you arrived on.

@tegefaulkes
Copy link
Contributor Author

forceEndDelay has a default set to 2 seconds and yes it's a not a great name. It's the delay between the timeout ending and aborting the streams. I've been thinking of it as a grace period? timeoutGrace? I don't know, it's a hard one to have descriptive and succinct.

As for the JSON, I haven't started on that or fully specced it out. The underlying functionality of the RPCServer and RPCClient for the timeouts is separate enough to do first. I'm still not certain how to implement this nicely with the constraints we have. I think ultimately if it's handled via the middleware then so long as I provide the context with the timeout timer to the middleware factory, the middleware can update the timeout using timer.reset(delay). It gives the middleware a lot of freedom here.

If it's up to the middleware to communicate the timeouts then the message structure is out of scope for the agnostic RPC system. It's up to the middleware to define and implement this behaviour.

As for what to set a timeout value in JSON if we want to use the default. undefined technically is allowed, it's just stripped out when the JSON is stringed. If you don't provide timeout parameter it will be undefined when parsed. and you try to read it with message.timeout.

@tegefaulkes
Copy link
Contributor Author

I'll need to re-base this on #509. However I think Updating the client and agent handlers to use cancellation can be it's own issue/PR. Atomic tasks and all that.

@CMCDragonkai
Copy link
Member

forceEndDelay has a default set to 2 seconds and yes it's a not a great name. It's the delay between the timeout ending and aborting the streams. I've been thinking of it as a grace period? timeoutGrace? I don't know, it's a hard one to have descriptive and succinct.

As for the JSON, I haven't started on that or fully specced it out. The underlying functionality of the RPCServer and RPCClient for the timeouts is separate enough to do first. I'm still not certain how to implement this nicely with the constraints we have. I think ultimately if it's handled via the middleware then so long as I provide the context with the timeout timer to the middleware factory, the middleware can update the timeout using timer.reset(delay). It gives the middleware a lot of freedom here.

If it's up to the middleware to communicate the timeouts then the message structure is out of scope for the agnostic RPC system. It's up to the middleware to define and implement this behaviour.

As for what to set a timeout value in JSON if we want to use the default. undefined technically is allowed, it's just stripped out when the JSON is stringed. If you don't provide timeout parameter it will be undefined when parsed. and you try to read it with message.timeout.

Can you provide some code examples of how these are used.

@tegefaulkes
Copy link
Contributor Author

about what specifically?

@CMCDragonkai
Copy link
Member

about what specifically?

About how timers are used, how to configure the middleware with respect to the timers. How timers can be propagated downstream contexts. Even execution with asciicast.

@CMCDragonkai
Copy link
Member

If we debug what the JSON will look like.

@tegefaulkes
Copy link
Contributor Author

Ok, I can do that.

@tegefaulkes
Copy link
Contributor Author

Progress update:

  1. core support for timeouts for RPCServer and handlers is done.
  2. this branch needs to be re-based on staging.
  3. Need to add timeout support to RPCClient. about 1-2 days
  4. Need to update client handlers to make use of the timeout/cancellation. About 2ish days, changes should be simple but there are 60ish handlers.
  5. Need to update agent handlers to use timeout/cancellation. half a day, blocked by agent migration.
  6. Communicating timeouts between server and client. I think technically this can be implemented via the middleware if we provide the TimedCancellable ctx to the middleware. The timer's deadline can be updated any time with the timer.reset(delay) method. 1 day, low priority, could be skipped. I don't see us actually needing or using this feature.

@tegefaulkes
Copy link
Contributor Author

First blush spec for communicating timeouts between client and server.

Basically, the client and the server can communicate how long it will wait before timing out. Ideally the client and server can change their timeout behaviour based on this. There are a few parts to this.

  1. Client needs to send it's timeout delay as part of the message.
  2. server needs to respond with it's timeout delay as part of a response.
  3. Both the client and server could reduce it's timeout in response to the communicated values. Increasing timeouts are not reccomended.

There are some constraints to implementing this.

  1. Ideally we can't modify the JSONRPC message structure. However adding an optional timeout field to the message won't interfere with the message structure.
  2. If we can't add an optional timeout parameter to the JSONRPC message structure, then this needs to be a value in the params or response data. The RPC system by design can't apply any restrictions to the params beyond it being a JsonValue. so any built in systems for the RPC can't depend on any params values.
  3. Given the restrictions above, the only place that can augment the messages with this data is the middleware but the middleware is strictly implemented outside of the RPC system and passed in.

So to implement this feature, we need to provide the tools for a middleware to communicate the timeout values and update the timeouts. I don't see any other way without appending a timeout parameter to the JSONRPC message structure. This does give us a lot of freedom to specify the timeout update logic.

To allow the middleware to update the timer, all we need to do is provide the ctx to the middleware. From there updating the time is a simple as ctx.timer.reset(delay) to change the delay or ctx.timer.refresh(). then it's up to the middleware implementation to make use of it.

@CMCDragonkai
Copy link
Member

Only problem with not putting the timeout into the params is that it is no longer compatible with JSONRPC.

I think you should try to put it in the params and just not require strict typing. The timeout is an entirely optional property that you have to dynamically check no matter what because you cannot control who is sending requests.

@CMCDragonkai
Copy link
Member

Generic error code for web sockets is now 4000.

@tegefaulkes did you create the new issues for web socket error specification as well as any additional issues needed for future work involving the deadlines?

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Apr 6, 2023

I'll make then now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management
Development

Successfully merging a pull request may close this issue.

2 participants