Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-8566 Send gRPC heartbeats from signaling server to answerer #356

Merged
merged 9 commits into from
Oct 7, 2024

Conversation

benjirewis
Copy link
Member

@benjirewis benjirewis commented Sep 23, 2024

https://viam.atlassian.net/browse/RSDK-8566

Sends heartbeats from signaling server to signaling answerer.

(cc @npmenard)

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Sep 23, 2024
@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 2, 2024
@benjirewis benjirewis changed the title Signaling heartbeats RSDK-8566 Send gRPC heartbeats from signaling server to answerer Oct 2, 2024
@benjirewis benjirewis marked this pull request as ready for review October 2, 2024 14:05
@@ -140,6 +140,10 @@ message AnswerRequestErrorStage {
google.rpc.Status status = 1;
}

// AnswerRequestHeartbeatStage is sent periodically to verify liveness of answerer.
message AnswerRequestHeartbeatStage {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What forbidden pleasure is this? Being able to update proto and use new proto without a cross repo PR and bump of the release

for {
// `client.Recv` waits, typically for a long time, for a caller to show up. Which is
// when the signaling server will send a response saying someone wants to connect. It
// can also receive heartbeats every 15s. Discard heartbeats.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the useful side-effect of having the signaling server send heartbeats if we're just going to discard them?

I'm guessing the act of the signaling server calling SendMsg(heartbeat) is what can now trigger an error that the server handles by cleaning up server-side resources? Or maybe we expect the server's internal gRPC logic handles that error in a way that makes our problem go away?

Whatever it is, can we document that? I think someone seeing this will see these heartbeats are intentional, but not know how to measure if it's having the intended effect. I think it's fine if we don't exactly know the magic here, but leaving as much detail as we do know can be useful.

I can see a developer coming across this and wanting to send a "heartbeat" in response, as that's the usual pattern. Documentation can help that developer understand if there would be an observable impact of such a change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the act of the signaling server calling SendMsg(heartbeat) is what can now trigger an error that the server handles by cleaning up server-side resources?

This is my understanding: it's the error from heartbeating server-side that's really helping out here. I'll document a bit better.

if err := server.Send(&webrtcpb.AnswerRequest{
Stage: &webrtcpb.AnswerRequest_Heartbeat{},
}); err != nil {
srv.logger.Debugw(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar remark to the other comment re: side-effects. I buy these changes fix the long delays before robots can be connected to again. But because we just ignore the errors, it's going to be hard for a reader to pick up our expectation of what's changing under the hood. I know our information here is limited, but sometimes it's important to communicate what we don't know just as much as it is communicating what we do know.

@dgottlieb
Copy link
Member

Code changes look perfectly fine to me. Just want to leave as much context in the form of code documentation on how we think these messages help for future readers.

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 2, 2024
// The answerer does not respond to heartbeats. The signaling server is only
// using heartbeats to ensure the answerer is reachable. If the answerer is
// down, the heartbeat will error in the heartbeating goroutine below, the
// stream's context will be canceled, and we will stop handling interactions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "the stream's context will be canceled, and we will stop handling interactions" refer to the ctx used in line 380 RecvOffer(ctx, hosts)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed that context. More specifically, that ctx value is really server.Context() which is the context stored on the bidi stream passed into Answer.

Copy link
Member

@dgottlieb dgottlieb Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this makes a lot more sense, thanks for that explanation! Sorry for one (last?) ask: can we incorporate that information into the documentation? The context handling here deviates from common patterns (I think out of necessity), so it's not obvious that the line 365 server.Send getting an error implies the context used on line 380 for a blocking operation gets canceled. Because of the connection on line 342. Understanding this deeply now, I can see your current documentation is touching all the important pieces. I just think some more hand-holding/babying of the reader in this specific case will be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure! Thanks for asking for more clarity here; some of this stuff is opaque, so I agree that it's important to document what's going on here. I added

// .... We stop handling interactions because the stream's                          
// context (`ctx` here and below) is used in the `RecvOffer` call below this                                               
// goroutine that waits for a caller to attempt to establish a connection.  

I only added that piece to the signaling server side, as I think that concept might be more difficult to explain from the answerer side, but hopefully future NetCoders will be able to see the relationship between the server and answerer heartbeat logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added that piece to the signaling server side

For sure. In hindsight, now knowing where the action is happening, I would have better placed my questions. I think duplicating some of the content as documentation is perfectly fine though.

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 2, 2024
Copy link
Member

@dgottlieb dgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you mentioned tests, but approving for now. Happy to look again after tests are added if there's something interesting.

Copy link
Contributor

@maximpertsov maximpertsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approach makes sense per the context you gave IRL - @dgottlieb covered most of my questions and I think the rationale behind this change is well-captured in the code comments

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 3, 2024
Comment on lines 251 to 253
webrtcServer.Stop()
answerer.Stop()
grpcServer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[q]: does the order in which these stop matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great q; I copied from another test above. Does the ordering matter? From what I can tell: sort of (I think there could be unexpected errors from sig server/answerer or machine if the ordering is particularly bad). I've changed the order here and in the test I copied to be the same order as the one in simpleServer.Stop to mimic what happens in prod. Left a comment, too.

Copy link
Contributor

@maximpertsov maximpertsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! one q mostly out of curiosity if you happen to know.

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 3, 2024
@@ -361,6 +361,8 @@ func (srv *WebRTCSignalingServer) Answer(server webrtcpb.SignalingService_Answer
// goroutine that waits for a caller to attempt to establish a connection.
if HeartbeatsAllowedFromCtx(ctx) {
utils.PanicCapturingGo(func() {
// Capture as tests can mutate this value.
heartbeatInterval := heartbeatInterval
Copy link
Member

@dgottlieb dgottlieb Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, it's not obvious to me this fixes the race. I guess we spin up this goroutine once per test. And the test only exits when it sees the debug line for receiving a heartbeat. Which implies the server.Send on line 369 went through.

But that assumes the test otherwise passes. I think hypothetically if the test fails because the log line is never seen, we could still have a data race (e.g: add a time.Sleep() on line 364).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally correct; we discussed offline. Apologies for the errant fix there 😮‍💨 . Passed heartbeatInterval to the signaling server constructor and used a const defaultHeartbeatInterval value instead.

@viambot viambot added safe to test Mark as safe to test and removed safe to test Mark as safe to test labels Oct 4, 2024
@benjirewis benjirewis merged commit 598a0ed into viamrobotics:main Oct 7, 2024
6 checks passed
@benjirewis benjirewis deleted the signaling-heartbeats branch October 7, 2024 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test Mark as safe to test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants