RSDK-12369: TCP mode: don't get stuck on checkReady if module crashes after 100ms #5467
Conversation
module/modmanager/module.go
Outdated
    for {
        // 5000 is an arbitrarily high number of attempts (context timeout should hit long before)
        // If module is not yet ready, grpc (not the loop) internally retries the call with backoff of 1s * 1.6 multiplier up to 120s
        resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))
Basically all I'm trying to do is check `errors.Is(m.process.Status(), os.ErrProcessDone)` before each retry, and if true, cancel `ctxTimeout`. A background goroutine for this feels slightly excessive, but also seems to be the simplest option compared to:
1. manually retrying here in a nested loop (instead of using `grpc_retry.WithMax`), or
2. having `ManagedProcess` cancel the context being used here if the process exits abruptly.
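Roughly, the background-watcher version is something like the sketch below (illustrative only: the poll interval, the startupTimeout variable, and the exact wiring are placeholders rather than the code in this PR):

    // Sketch: poll the managed process while checkReady blocks on Ready(),
    // and cancel the shared context as soon as the process is seen to have exited.
    ctxTimeout, cancelFunc := context.WithTimeout(ctx, startupTimeout)
    defer cancelFunc()

    go func() {
        ticker := time.NewTicker(100 * time.Millisecond) // placeholder poll interval
        defer ticker.Stop()
        for {
            select {
            case <-ctxTimeout.Done():
                return
            case <-ticker.C:
                if errors.Is(m.process.Status(), os.ErrProcessDone) {
                    // Process is gone; stop waiting instead of retrying Ready() until the startup timeout.
                    cancelFunc()
                    return
                }
            }
        }
    }()

    // Ready() then unblocks with a context error shortly after the process exits,
    // instead of retrying until the ~5 minute startup timeout.
    resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))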
ya, I like 1 because then we wouldn't need an extra goroutine, but the responsiveness allowed by the current setup is better I think
Sure, I can change it to manually retry to get rid of the extra goroutine. I guess recreating grpc's internal backoff here is unnecessary, and I can just have it retry, say, every 2s?
It will be something like:
    for {
        for {
            if errors.Is(m.process.Status(), os.ErrProcessDone) {
                cancelFunc()
            }
            resp, err := m.client.Ready(ctxTimeout, req)
            // if err == "not yet available", sleep 2s and retry
        }
        ...
    }
module/modmanager/module.go
Outdated
    for {
        // 5000 is an arbitrarily high number of attempts (context timeout should hit long before)
        // If module is not yet ready, grpc (not the loop) internally retries the call with backoff of 1s * 1.6 multiplier up to 120s
        resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))
do you know if the Ready request retries at all? like if the tcp server just took a bit longer than expected to come online, would this code work or would we still stall out here?
The `m.client.Ready` with the `grpc_retry.WithMax(5000)` CallOption means it will internally retry 5000 times or until the context is cancelled. ~~The default number of retry attempts is 5 if you don't specify that CallOption.~~ Minor correction: this is go-grpc-middleware's retry, implemented via interceptors with a linear 50ms +/-10% backoff. By default it's deactivated (doesn't retry). gRPC-go internally retries 5x if dialing fails.
If the TCP server is fine, but just takes a moment to start up (e.g. if the module has a long setup process), there is no change to the existing behavior with this PR.
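For concreteness, this is roughly how go-grpc-middleware's retry is wired up (a sketch with a placeholder addr and insecure credentials, not the RDK's actual dial code):

    // The retry interceptor has to be installed on the client connection. With no
    // options it performs zero retries, which is why the per-call WithMax matters.
    conn, err := grpc.Dial(addr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(
            // library default: linear 50ms backoff with +/-10% jitter
            grpc_retry.WithBackoff(grpc_retry.BackoffLinearWithJitter(50*time.Millisecond, 0.10)),
        )),
    )
    if err != nil {
        return err
    }
    defer conn.Close()

    // client is the generated gRPC client built on conn. The per-call option enables
    // up to 5000 retries for this RPC only; it stops as soon as ctxTimeout is done.
    resp, err := client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))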
module/modmanager/module.go
Outdated
    // If running in TCP mode, use a background watcher to short circuit this check if the process has already exited
    // (OUE is waiting on the same modmanager lock, so it can't try to restart).
    // This is not needed in UNIX Socket mode, because we only get here once .sock is present (indicating that the module has started up).
    if m.isRunningInTCPMode() {
will this solve the issue reported? I thought part of the issue was that tcp mode wasn't getting picked up
The report was "tcp_mode taking a long time to be respected", not that it never gets picked up. At least for my repro on the Windows machine, toggling TCP mode on/off with this change now gets picked up asap.
chatted offline, if statement was removed and design/change makes sense
If statement in question is this: https://github.com/viamrobotics/rdk/compare/c1eb6b6d50e77ebe8a60da2339c7169844a7a52d..58cd3261b1a215efdc5b0407a36eb4409b744293
Initially missed this because `modlib.CheckSocketOwner` in UNIX socket mode on Linux causes it to wait until the .sock is created, so we're less likely to get stuck on `checkReady`; but on Windows it always returns err == nil, causing it to break early.
rdk/module/modmanager/module.go
Lines 270 to 278 in 1c3e1f9
        err = modlib.CheckSocketOwner(m.addr)
        if errors.Is(err, fs.ErrNotExist) {
            continue
        }
        if err != nil {
            return errors.WithMessage(err, "module startup failed")
        }
    }
    break
generally looks good, can you add a test?
cheukt left a comment
logic looks good, just have a question around how it's getting surfaced
module/modmanager/manager_test.go
Outdated
        test.That(t, err.Error(), test.ShouldContainSubstring, "error while waiting for module to be ready test: context canceled")
    }
    // should not wait for the full DefaultModuleStartupTimeout / ModuleStartupTimeoutEnvVar
    test.That(t, time.Since(startTime), test.ShouldBeLessThan, time.Second*10)
don't think we need this assertion (the assertion on the error should be enough). mostly concerned with the possibility of a flaky test even when the time given is generous (definitely have run into that before)
Is there a good way to test that the error is returned shortly after the process unexpectedly exits rather than after 5 mins (DefaultModuleStartupTimeout)? The module used for this test crashes after 2s, so I'd expect it to return with error <1s after that.
I don't think so - the issue is that the test process/scheduler could sometimes randomly pause execution of the test, which will cause flakes. The best assertion is that the error returned is different from one that would return after 5 mins.
Oh you're right; if it fails due to DefaultModuleStartupTimeout the error returned is `error while waiting for module to be ready test: context deadline exceeded`, so this assert is not needed.
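So the check can stay purely error-based, e.g. (a sketch; the exact substrings depend on the final error wording):

    // Distinguish the early-exit path ("context canceled") from the 5-minute
    // startup timeout ("context deadline exceeded") without timing the test.
    test.That(t, err, test.ShouldNotBeNil)
    test.That(t, err.Error(), test.ShouldContainSubstring, "context canceled")
    test.That(t, err.Error(), test.ShouldNotContainSubstring, "context deadline exceeded")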
    - if !rutils.TCPRegex.MatchString(m.addr) {
    + if !m.isRunningInTCPMode() {
    +     // note: we don't do this check in TCP mode because TCP addresses are not file paths and will fail check.
    +     // note: CheckSocketOwner on Windows only returns err, if any, from os.Stat.
thanks for adding this extra note!
module/modmanager/module.go
Outdated
    if errors.Is(m.process.Status(), os.ErrProcessDone) {
        m.logger.Info("Module process exited unexpectedly while waiting for ready.")
        parentCtxCancelFunc()
        return parentCtxTimeout.Err()
shouldn't we return the info log as an error instead of a context canceled error? seems like a bit of an obfuscation
`checkReady`'s two callers (`startModule` & `attemptRestart`) prefix the error with "error while waiting for module to be ready <name>: <err>", and the fact that it's due to the module process unexpectedly exiting is also already logged. Is it worth adding another ERROR log to repeat the same info?
2025-11-13T15:59:19.020Z INFO rdk impl/local_robot.go:1505 (Re)configuring robot
2025-11-13T15:59:19.020Z INFO rdk.modmanager.viam_rsdk12369-3 modmanager/manager.go:268 Now adding module {"module":"viam_rsdk12369-3"}
. . .
2025-11-13T15:59:29.023Z WARN rdk.modmanager.viam_rsdk12369-3 utils/ticker.go:22 Waiting for module to complete startup and registration {"module":"viam_rsdk12369-3","time_elapsed":"10s"}
(OUE) 2025-11-13T15:59:29.025Z ERROR rdk.modmanager.viam_rsdk12369-3 modmanager/manager.go:845 Module has unexpectedly exited. {"module":"viam_rsdk12369-3","exit_code":1}
(checkReady) 2025-11-13T15:59:29.175Z INFO rdk.modmanager.viam_rsdk12369-3 modmanager/module.go:150 Module process exited unexpectedly while waiting for ready.
2025-11-13T15:59:29.176Z INFO rdk.modmanager.viam_rsdk12369-3 modmanager/module.go:332 Stopping module: viam_rsdk12369-3 process
2025-11-13T15:59:29.176Z INFO rdk.modmanager.viam_rsdk12369-3.viam_rsdk12369-3 pexec/managed_process_unix.go:100 stopping process 336248 with signal terminated
2025-11-13T15:59:29.176Z WARN rdk.modmanager.viam_rsdk12369-3.viam_rsdk12369-3 pexec/managed_process.go:539 Stopped module did not exit cleanly {"err":"exit status 1"}
(checkReady's caller) 2025-11-13T15:59:29.176Z ERROR rdk.modmanager.viam_rsdk12369-3 modmanager/manager.go:271 Error adding module {"module":"viam_rsdk12369-3","error":"error while waiting for module to be ready viam_rsdk12369-3: context canceled","errorVerbose":"context canceled\nerror while waiting for module to be ready viam_rsdk12369-3"}
2025-11-13T15:59:29.176Z ERROR rdk.resource_manager impl/resource_manager.go:1246 error adding modules {"error":"error while waiting for module to be ready viam_rsdk12369-3: context canceled","errorVerbose":"context canceled\nerror while waiting for module to be ready viam_rsdk12369-3"}
2025-11-13T15:59:29.176Z INFO rdk.modmanager.viam_rsdk12369-3 modmanager/manager.go:886 Restart context canceled, abandoning restart attempt {"err":"context canceled"}
2025-11-13T15:59:29.177Z INFO rdk.modmanager modmanager/manager.go:807 Removing module data parent directory {"parent dir":"/home/alden/.viam/module-data/67d2e9f5-29f8-4b57-963b-c94780a53825"}
2025-11-13T15:59:29.177Z INFO rdk impl/local_robot.go:1561 Robot (re)configured
2025-11-13T15:59:29.226Z INFO rdk impl/local_robot.go:1505 (Re)configuring robot
2025-11-13T15:59:29.226Z INFO rdk.modmanager.viam_rsdk12369-3 modmanager/manager.go:268 Now adding module {"module":"viam_rsdk12369-3"}
IMO context canceled is technically not inaccurate, because we've cancelled the checkReady context early since we noticed the process no longer exists, and therefore it's not worth continuing to ping it.
Actually, this could probably even be at Debug level, because it more explains the checkReady implementation than anything.
I don't mean to log `m.logger.Info("Module process exited unexpectedly while waiting for ready.")` as an error-level log, more that the returned error should be "module process exited unexpectedly while waiting for ready." IMO, or "module process exited unexpectedly", so that the full message would be "error while waiting for module to be ready viam_rsdk12369-3: module process exited unexpectedly".
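A minimal sketch of that suggestion (exact wording, and whether to keep the info log, up to you):

    if errors.Is(m.process.Status(), os.ErrProcessDone) {
        m.logger.Info("Module process exited unexpectedly while waiting for ready.")
        parentCtxCancelFunc()
        // Return a descriptive error instead of the canceled context's error, so callers
        // surface "error while waiting for module to be ready <name>: module process exited unexpectedly".
        return errors.New("module process exited unexpectedly")
    }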
If a module starts up in TCP mode and crashes after 100ms (`module exited too quickly after attempted startup`), the startup process will proceed and get stuck on `checkReady`, which repeatedly makes a gRPC `Ready` call until it either succeeds or times out in ~5 minutes (it does not know that the process has already exited). OUE cannot restart the module at this stage, because it's waiting on the same ModManager lock that `checkReady` is holding (Add, Reconfigure).

This PR adds a background watcher to `checkReady` so that it exits early if the process exits early.