Skip to content

Conversation

@aldenh-viam
Copy link
Contributor

If a module starts up in TCP mode and crashes after 100ms (module exited too quickly after attempted startup), the startup process will proceed and get stuck on checkReady, which repeatedly tries making a grpc Ready call until it either succeeds or times out in ~5 minutes (it does not know that the process has already exited). OUE cannot restart the module at this stage, because it's waiting on the same ModManager lock that checkReady is holding: Add, Reconfigure

This PR adds a background watcher to checkReady that will help it exit early if the process exits early.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Nov 11, 2025
@aldenh-viam aldenh-viam added appimage-ignore-tests Build AppImage of PR and ignore tests static-ignore-tests Build static binaries from PR and ignore tests labels Nov 11, 2025
for {
// 5000 is an arbitrarily high number of attempts (context timeout should hit long before)
// If module is not yet ready, grpc (not the loop) internally retries the call with backoff of 1s * 1.6 multiplier up to 120s
resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all I'm trying to do is check errors.Is(m.process.Status(), os.ErrProcessDone) before each retry, and if true, cancel ctxTimeout. A background goroutine for this feels slightly excessive, but also seems to be the simplest option compared to:

  1. manually retrying here in a nested loop (instead of using grpc_retry.WithMax).
  2. having ManagedProcess cancel the context being used here if the process exits abruptly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I like 1 because then we wouldn't need an extra goroutine, but the responsiveness allowed by the current setup is better I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can change it to manually retry to get rid of the extra goroutine. I guess recreating grpc's internal backoff here is unnecessary, and I can just have it retry say every 2s?

It will be something like:

	for {
		for {
			if errors.Is(m.process.Status(), os.ErrProcessDone) {
				cancelFunc()
			}
			resp, err := m.client.Ready(ctxTimeout, req)
			// if err == not yet available { sleep 2s }
		}
	...

@aldenh-viam aldenh-viam requested a review from a team November 11, 2025 17:09
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 11, 2025
for {
// 5000 is an arbitrarily high number of attempts (context timeout should hit long before)
// If module is not yet ready, grpc (not the loop) internally retries the call with backoff of 1s * 1.6 multiplier up to 120s
resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if the Ready request retries at all? like if the tcp server just took a bit longer than expected to come online, would this code work or would we still stall out here?

Copy link
Contributor Author

@aldenh-viam aldenh-viam Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The m.client.Ready with the grpc_retry.WithMax(5000) CallOption means it will internally Retry 5000 times or until the context is cancelled. The default number of retry attempts is 5 if you don't specify that CallOption. minor correction: this is go-grpc-middleware's retry that's implemented via interceptors with a linear 50ms with +-10% backoff. By default it's deactivated (doesn't retry). gRPC-go internally retries 5x if dialing fails.

If the TCP server is fine, but just takes a moment to start up (e.g. if the module has a long setup process), there is no change to the existing behavior with this PR.

for {
// 5000 is an arbitrarily high number of attempts (context timeout should hit long before)
// If module is not yet ready, grpc (not the loop) internally retries the call with backoff of 1s * 1.6 multiplier up to 120s
resp, err := m.client.Ready(ctxTimeout, req, grpc_retry.WithMax(5000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I like 1 because then we wouldn't need an extra goroutine, but the responsiveness allowed by the current setup is better I think

// If running in TCP mode, use a background watcher to short circuit this check if the process has already exited
// (OUE is waiting on the same modmanager lock, so it can't try to restart).
// This is not needed in UNIX Socket mode, because we only get here once .sock is present (indicating that the module has started up).
if m.isRunningInTCPMode() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this solve the issue reported? I thought part of the issue was that tcp mode wasn't getting picked up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The report was "tcp_mode taking a long time to be respected", not that it it never gets picked up. At least for my repro on the Windows machine, toggling TCP mode on/off with this change now gets picked up asap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chatted offline, if statement was removed and design/change makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If statement in question is this: https://github.com/viamrobotics/rdk/compare/c1eb6b6d50e77ebe8a60da2339c7169844a7a52d..58cd3261b1a215efdc5b0407a36eb4409b744293

Initially missed because modlib.CheckSocketOwner in UNIX mode on Linux causes the it to wait until the .sock is created so we're less likely to get stuck on checkReady, but on Windows it always returns err==nil causing it to break early.

err = modlib.CheckSocketOwner(m.addr)
if errors.Is(err, fs.ErrNotExist) {
continue
}
if err != nil {
return errors.WithMessage(err, "module startup failed")
}
}
break

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 11, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 11, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 11, 2025
@aldenh-viam aldenh-viam removed the appimage-ignore-tests Build AppImage of PR and ignore tests label Nov 11, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 12, 2025
@aldenh-viam
Copy link
Contributor Author

aldenh-viam commented Nov 12, 2025

Trying to figure out why tests are flaking with the latest changes. They pass on my machine & changes look good on the Windows dev machines.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 12, 2025
@cheukt
Copy link
Member

cheukt commented Nov 12, 2025

generally looks good, can you add a test?

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 12, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 12, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 12, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 13, 2025
Copy link
Member

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic looks good, just have a question around how it's getting surfaced

test.That(t, err.Error(), test.ShouldContainSubstring, "error while waiting for module to be ready test: context canceled")
}
// should not wait for the full DefaultModuleStartupTimeout / ModuleStartupTimeoutEnvVar
test.That(t, time.Since(startTime), test.ShouldBeLessThan, time.Second*10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think we need this assertion (the assertion on the error should be enough). mostly concerned with the possibility of a flaky test even when the time given is generous (definitely have run into that before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good way to test that the error is returned shortly after the process unexpectedly exits rather than after 5 mins (DefaultModuleStartupTimeout)? The module used for this test crashes after 2s, so I'd expect it to return with error <1s after that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so - the issue is that the test process/scheduler could sometimes randomly pause execution of the test, which will cause flakes. The best assertion is that the error returned is different from one that would return after 5 mins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're right; if it fails due to DefaultModuleStartupTimeout the error returned is error while waiting for module to be ready test: context deadline exceeded, so this assert is not needed.

if !rutils.TCPRegex.MatchString(m.addr) {
if !m.isRunningInTCPMode() {
// note: we don't do this check in TCP mode because TCP addresses are not file paths and will fail check.
// note: CheckSocketOwner on Windows only returns err, if any, from os.Stat.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this extra note!

if errors.Is(m.process.Status(), os.ErrProcessDone) {
m.logger.Info("Module process exited unexpectedly while waiting for ready.")
parentCtxCancelFunc()
return parentCtxTimeout.Err()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we return the info log as an error instead of a context canceled error? seems like a bit of an obfuscation

Copy link
Contributor Author

@aldenh-viam aldenh-viam Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkReady's two callers (startModule & attemptRestart) prefix the error with error while waiting for module to be ready *name* *err* and the fact that it's due to the module process unexpectedly exiting is also already logged. Is it worth adding another ERROR log to repeat the same info?

2025-11-13T15:59:19.020Z	INFO	rdk	impl/local_robot.go:1505	(Re)configuring robot
2025-11-13T15:59:19.020Z	INFO	rdk.modmanager.viam_rsdk12369-3	modmanager/manager.go:268	Now adding module	{"module":"viam_rsdk12369-3"}
. . .
2025-11-13T15:59:29.023Z	WARN	rdk.modmanager.viam_rsdk12369-3	utils/ticker.go:22	Waiting for module to complete startup and registration	{"module":"viam_rsdk12369-3","time_elapsed":"10s"}
(OUE) 2025-11-13T15:59:29.025Z	ERROR	rdk.modmanager.viam_rsdk12369-3	modmanager/manager.go:845	Module has unexpectedly exited.	{"module":"viam_rsdk12369-3","exit_code":1}
(checkReady) 2025-11-13T15:59:29.175Z	INFO	rdk.modmanager.viam_rsdk12369-3	modmanager/module.go:150	Module process exited unexpectedly while waiting for ready.
2025-11-13T15:59:29.176Z	INFO	rdk.modmanager.viam_rsdk12369-3	modmanager/module.go:332	Stopping module: viam_rsdk12369-3 process
2025-11-13T15:59:29.176Z	INFO	rdk.modmanager.viam_rsdk12369-3.viam_rsdk12369-3	pexec/managed_process_unix.go:100	stopping process 336248 with signal terminated
2025-11-13T15:59:29.176Z	WARN	rdk.modmanager.viam_rsdk12369-3.viam_rsdk12369-3	pexec/managed_process.go:539	Stopped module did not exit cleanly	{"err":"exit status 1"}
(checkReady's caller) 2025-11-13T15:59:29.176Z	ERROR	rdk.modmanager.viam_rsdk12369-3	modmanager/manager.go:271	Error adding module	{"module":"viam_rsdk12369-3","error":"error while waiting for module to be ready viam_rsdk12369-3: context canceled","errorVerbose":"context canceled\nerror while waiting for module to be ready viam_rsdk12369-3"}
2025-11-13T15:59:29.176Z	ERROR	rdk.resource_manager	impl/resource_manager.go:1246	error adding modules	{"error":"error while waiting for module to be ready viam_rsdk12369-3: context canceled","errorVerbose":"context canceled\nerror while waiting for module to be ready viam_rsdk12369-3"}
2025-11-13T15:59:29.176Z	INFO	rdk.modmanager.viam_rsdk12369-3	modmanager/manager.go:886	Restart context canceled, abandoning restart attempt	{"err":"context canceled"}
2025-11-13T15:59:29.177Z	INFO	rdk.modmanager	modmanager/manager.go:807	Removing module data parent directory	{"parent dir":"/home/alden/.viam/module-data/67d2e9f5-29f8-4b57-963b-c94780a53825"}
2025-11-13T15:59:29.177Z	INFO	rdk	impl/local_robot.go:1561	Robot (re)configured
2025-11-13T15:59:29.226Z	INFO	rdk	impl/local_robot.go:1505	(Re)configuring robot
2025-11-13T15:59:29.226Z	INFO	rdk.modmanager.viam_rsdk12369-3	modmanager/manager.go:268	Now adding module	{"module":"viam_rsdk12369-3"}

IMO context canceled is technically not inaccurate, because we've cancelled the checkReady context early since we noticed the process no longer exists, and therefore is not worth continuing to ping it.

Actually, this could probably even be at Debug level, because it more explains the checkReady implementation than anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to log m.logger.Info("Module process exited unexpectedly while waiting for ready.") as an error level log, more that the returned error should be "module process exited unexpectedly while waiting for ready." IMO or "module process exited unexpectedly so that the full message would be error while waiting for module to be ready viam_rsdk12369-3: module process exited unexpectedly

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 13, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 13, 2025
@aldenh-viam aldenh-viam requested a review from cheukt November 13, 2025 18:40
@aldenh-viam aldenh-viam merged commit 88dd087 into viamrobotics:main Nov 13, 2025
24 checks passed
@aldenh-viam aldenh-viam deleted the push-tkqqywwqllkv branch November 13, 2025 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test This pull request is marked safe to test from a trusted zone static-ignore-tests Build static binaries from PR and ignore tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants