-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-8566 Send gRPC heartbeats from signaling server to answerer #356
Conversation
5feb2cb
to
1ad92d9
Compare
@@ -140,6 +140,10 @@ message AnswerRequestErrorStage { | |||
google.rpc.Status status = 1; | |||
} | |||
|
|||
// AnswerRequestHeartbeatStage is sent periodically to verify liveness of answerer. | |||
message AnswerRequestHeartbeatStage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What forbidden pleasure is this? Being able to update proto and use new proto without a cross repo PR and bump of the release
rpc/wrtc_signaling_answerer.go
Outdated
for { | ||
// `client.Recv` waits, typically for a long time, for a caller to show up. Which is | ||
// when the signaling server will send a response saying someone wants to connect. It | ||
// can also receive heartbeats every 15s. Discard heartbeats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the useful side-effect of having the signaling server send heartbeats if we're just going to discard them?
I'm guessing the act of the signaling server calling SendMsg(heartbeat)
is what can now trigger an error that the server handles by cleaning up server-side resources? Or maybe we expect the server's internal gRPC logic handles that error in a way that makes our problem go away?
Whatever it is, can we document that? I think someone seeing this will see these heartbeats are intentional, but not know how to measure if it's having the intended effect. I think it's fine if we don't exactly know the magic here, but leaving as much detail as we do know can be useful.
I can see a developer coming across this and wanting to send a "heartbeat" in response, as that's the usual pattern. Documentation can help that developer understand if there would be an observable impact of such a change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing the act of the signaling server calling SendMsg(heartbeat) is what can now trigger an error that the server handles by cleaning up server-side resources?
This is my understanding: it's the error from heartbeating server-side that's really helping out here. I'll document a bit better.
if err := server.Send(&webrtcpb.AnswerRequest{ | ||
Stage: &webrtcpb.AnswerRequest_Heartbeat{}, | ||
}); err != nil { | ||
srv.logger.Debugw( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar remark to the other comment re: side-effects. I buy these changes fix the long delays before robots can be connected to again. But because we just ignore the errors, it's going to be hard for a reader to pick up our expectation of what's changing under the hood. I know our information here is limited, but sometimes it's important to communicate what we don't know just as much as it is communicating what we do know.
Code changes look perfectly fine to me. Just want to leave as much context in the form of code documentation on how we think these messages help for future readers. |
// The answerer does not respond to heartbeats. The signaling server is only | ||
// using heartbeats to ensure the answerer is reachable. If the answerer is | ||
// down, the heartbeat will error in the heartbeating goroutine below, the | ||
// stream's context will be canceled, and we will stop handling interactions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "the stream's context will be canceled, and we will stop handling interactions" refer to the ctx
used in line 380 RecvOffer(ctx, hosts)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed that context. More specifically, that ctx
value is really server.Context()
which is the context stored on the bidi stream passed into Answer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this makes a lot more sense, thanks for that explanation! Sorry for one (last?) ask: can we incorporate that information into the documentation? The context handling here deviates from common patterns (I think out of necessity), so it's not obvious that the line 365 server.Send
getting an error implies the context used on line 380 for a blocking operation gets canceled. Because of the connection on line 342. Understanding this deeply now, I can see your current documentation is touching all the important pieces. I just think some more hand-holding/babying of the reader in this specific case will be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure! Thanks for asking for more clarity here; some of this stuff is opaque, so I agree that it's important to document what's going on here. I added
// .... We stop handling interactions because the stream's
// context (`ctx` here and below) is used in the `RecvOffer` call below this
// goroutine that waits for a caller to attempt to establish a connection.
I only added that piece to the signaling server side, as I think that concept might be more difficult to explain from the answerer side, but hopefully future NetCoders will be able to see the relationship between the server and answerer heartbeat logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added that piece to the signaling server side
For sure. In hindsight, now knowing where the action is happening, I would have better placed my questions. I think duplicating some of the content as documentation is perfectly fine though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you mentioned tests, but approving for now. Happy to look again after tests are added if there's something interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approach makes sense per the context you gave IRL - @dgottlieb covered most of my questions and I think the rationale behind this change is well-captured in the code comments
rpc/wrtc_signaling_test.go
Outdated
webrtcServer.Stop() | ||
answerer.Stop() | ||
grpcServer.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[q]: does the order in which these stop matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A great q; I copied from another test above. Does the ordering matter? From what I can tell: sort of (I think there could be unexpected errors from sig server/answerer or machine if the ordering is particularly bad). I've changed the order here and in the test I copied to be the same order as the one in simpleServer.Stop
to mimic what happens in prod. Left a comment, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! one q mostly out of curiosity if you happen to know.
rpc/wrtc_signaling_server.go
Outdated
@@ -361,6 +361,8 @@ func (srv *WebRTCSignalingServer) Answer(server webrtcpb.SignalingService_Answer | |||
// goroutine that waits for a caller to attempt to establish a connection. | |||
if HeartbeatsAllowedFromCtx(ctx) { | |||
utils.PanicCapturingGo(func() { | |||
// Capture as tests can mutate this value. | |||
heartbeatInterval := heartbeatInterval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err, it's not obvious to me this fixes the race. I guess we spin up this goroutine once per test. And the test only exits when it sees the debug line for receiving a heartbeat. Which implies the server.Send
on line 369 went through.
But that assumes the test otherwise passes. I think hypothetically if the test fails because the log line is never seen, we could still have a data race (e.g: add a time.Sleep()
on line 364).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally correct; we discussed offline. Apologies for the errant fix there 😮💨 . Passed heartbeatInterval
to the signaling server constructor and used a const defaultHeartbeatInterval
value instead.
https://viam.atlassian.net/browse/RSDK-8566
Sends heartbeats from signaling server to signaling answerer.
(cc @npmenard)