RSDK-8566 Send gRPC heartbeats from signaling server to answerer #356

benjirewis · 2024-09-23T20:52:24Z

https://viam.atlassian.net/browse/RSDK-8566

Sends heartbeats from signaling server to signaling answerer.

dgottlieb · 2024-10-02T15:00:15Z

proto/rpc/webrtc/v1/signaling.proto

@@ -140,6 +140,10 @@ message AnswerRequestErrorStage {
 	google.rpc.Status status = 1;
 }

+// AnswerRequestHeartbeatStage is sent periodically to verify liveness of answerer.
+message AnswerRequestHeartbeatStage {


What forbidden pleasure is this? Being able to update the proto and use the new proto without a cross-repo PR and a release bump.

dgottlieb · 2024-10-02T15:05:19Z

rpc/wrtc_signaling_answerer.go

+			for {
+				// `client.Recv` waits, typically for a long time, for a caller to show up. Which is
+				// when the signaling server will send a response saying someone wants to connect. It
+				// can also receive heartbeats every 15s. Discard heartbeats.


What's the useful side-effect of having the signaling server send heartbeats if we're just going to discard them?

I'm guessing the act of the signaling server calling SendMsg(heartbeat) is what can now trigger an error that the server handles by cleaning up server-side resources? Or maybe we expect the server's internal gRPC logic handles that error in a way that makes our problem go away?

Whatever it is, can we document that? I think someone seeing this will see these heartbeats are intentional, but not know how to measure if it's having the intended effect. I think it's fine if we don't exactly know the magic here, but leaving as much detail as we do know can be useful.

I can see a developer coming across this and wanting to send a "heartbeat" in response, as that's the usual pattern. Documentation can help that developer understand if there would be an observable impact of such a change.

> I'm guessing the act of the signaling server calling SendMsg(heartbeat) is what can now trigger an error that the server handles by cleaning up server-side resources?

This is my understanding: it's the error from heartbeating server-side that's really helping out here. I'll document a bit better.

dgottlieb · 2024-10-02T15:25:05Z

rpc/wrtc_signaling_server.go

+					if err := server.Send(&webrtcpb.AnswerRequest{
+						Stage: &webrtcpb.AnswerRequest_Heartbeat{},
+					}); err != nil {
+						srv.logger.Debugw(


Similar remark to the other comment re: side-effects. I buy that these changes fix the long delays before robots can be connected to again. But because we just ignore the errors, it's going to be hard for a reader to pick up our expectation of what's changing under the hood. I know our information here is limited, but sometimes it's just as important to communicate what we don't know as what we do know.

dgottlieb · 2024-10-02T15:25:58Z

Code changes look perfectly fine to me. Just want to leave as much context in the form of code documentation on how we think these messages help for future readers.

dgottlieb · 2024-10-02T15:38:52Z

rpc/wrtc_signaling_server.go

+	// The answerer does not respond to heartbeats. The signaling server is only
+	// using heartbeats to ensure the answerer is reachable. If the answerer is
+	// down, the heartbeat will error in the heartbeating goroutine below, the
+	// stream's context will be canceled, and we will stop handling interactions


Does "the stream's context will be canceled, and we will stop handling interactions" refer to the ctx used in line 380 RecvOffer(ctx, hosts)?

It is indeed that context. More specifically, that ctx value is really server.Context() which is the context stored on the bidi stream passed into Answer.

Ok, this makes a lot more sense, thanks for that explanation! Sorry for one (last?) ask: can we incorporate that information into the documentation? The context handling here deviates from common patterns (I think out of necessity), so it's not obvious that the line 365 server.Send getting an error implies that the context used on line 380 for a blocking operation gets canceled, because of the connection on line 342. Understanding this deeply now, I can see your current documentation is touching all the important pieces. I just think some more hand-holding/babying of the reader in this specific case will be helpful.

For sure! Thanks for asking for more clarity here; some of this stuff is opaque, so I agree that it's important to document what's going on here. I added

// .... We stop handling interactions because the stream's
// context (`ctx` here and below) is used in the `RecvOffer` call below this
// goroutine that waits for a caller to attempt to establish a connection.

I only added that piece to the signaling server side, as I think that concept might be more difficult to explain from the answerer side, but hopefully future NetCoders will be able to see the relationship between the server and answerer heartbeat logic.

> I only added that piece to the signaling server side

For sure. In hindsight, now knowing where the action is happening, I would have better placed my questions. I think duplicating some of the content as documentation is perfectly fine though.

dgottlieb

I know you mentioned tests, but approving for now. Happy to look again after tests are added if there's something interesting.

maximpertsov

approach makes sense per the context you gave IRL - @dgottlieb covered most of my questions and I think the rationale behind this change is well-captured in the code comments

maximpertsov · 2024-10-03T18:16:45Z

rpc/wrtc_signaling_test.go

+	webrtcServer.Stop()
+	answerer.Stop()
+	grpcServer.Stop()


[q]: does the order in which these stop matter?

A great q; I copied from another test above. Does the ordering matter? From what I can tell: sort of (I think there could be unexpected errors from sig server/answerer or machine if the ordering is particularly bad). I've changed the order here and in the test I copied to be the same order as the one in simpleServer.Stop to mimic what happens in prod. Left a comment, too.

maximpertsov

looks good! one q mostly out of curiosity if you happen to know.

dgottlieb · 2024-10-03T19:56:26Z

rpc/wrtc_signaling_server.go

@@ -361,6 +361,8 @@ func (srv *WebRTCSignalingServer) Answer(server webrtcpb.SignalingService_Answer
 	// goroutine that waits for a caller to attempt to establish a connection.
 	if HeartbeatsAllowedFromCtx(ctx) {
 		utils.PanicCapturingGo(func() {
+			// Capture as tests can mutate this value.
+			heartbeatInterval := heartbeatInterval


Err, it's not obvious to me this fixes the race. I guess we spin up this goroutine once per test. And the test only exits when it sees the debug line for receiving a heartbeat. Which implies the server.Send on line 369 went through.

But that assumes the test otherwise passes. I think hypothetically if the test fails because the log line is never seen, we could still have a data race (e.g: add a time.Sleep() on line 364).

Totally correct; we discussed offline. Apologies for the errant fix there 😮‍💨 . Passed heartbeatInterval to the signaling server constructor and used a const defaultHeartbeatInterval value instead.

viambot added and removed the safe to test label Sep 23, 2024

benjirewis added 4 commits October 2, 2024 09:57

initial proto

e80e352

heartbeats

57cb638

fixes

baa85cf

more fixes

1ad92d9

benjirewis force-pushed the signaling-heartbeats branch from 5feb2cb to 1ad92d9 Compare October 2, 2024 13:59

viambot added and removed the safe to test label Oct 2, 2024

benjirewis changed the title ~~Signaling heartbeats~~ RSDK-8566 Send gRPC heartbeats from signaling server to answerer Oct 2, 2024

benjirewis requested review from dgottlieb and maximpertsov October 2, 2024 14:05

benjirewis marked this pull request as ready for review October 2, 2024 14:05

dgottlieb reviewed Oct 2, 2024

documentation

4ac606a

benjirewis requested a review from dgottlieb October 2, 2024 15:35

viambot added and removed the safe to test label Oct 2, 2024

dgottlieb reviewed Oct 2, 2024

more docs

b617ba4

viambot added and removed the safe to test label Oct 2, 2024

dgottlieb approved these changes Oct 2, 2024

maximpertsov reviewed Oct 3, 2024

Basic test

68478c5

viambot added and removed the safe to test label Oct 3, 2024

maximpertsov reviewed Oct 3, 2024

maximpertsov approved these changes Oct 3, 2024

maxim comment and data race

8b0b003

viambot added and removed the safe to test label Oct 3, 2024

benjirewis requested a review from dgottlieb October 3, 2024 19:52

dgottlieb approved these changes Oct 3, 2024

dgottlieb reviewed Oct 3, 2024

pass heartbeatInterval to NewWebrtcSignalingServer

3315124

viambot added and removed the safe to test label Oct 4, 2024

benjirewis requested a review from dgottlieb October 4, 2024 20:49

dgottlieb approved these changes Oct 5, 2024

benjirewis merged commit 598a0ed into viamrobotics:main Oct 7, 2024
6 checks passed

benjirewis deleted the signaling-heartbeats branch October 7, 2024 12:57
