Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix RestartModule #4114

Merged
merged 11 commits into from
Jun 21, 2024
Merged

fix RestartModule #4114

merged 11 commits into from
Jun 21, 2024

Conversation

abe-winter
Copy link
Member

@abe-winter abe-winter commented Jun 20, 2024

What changed

  • added localRobot.localModuleVersions map where local module versions can be incremented per conversation with @dgottlieb (see discussion section below)
  • added Module.LocalVersion field (used only in-memory, not in API)
  • simplified restartSingleModule now that local tarballs support the normal restart flow
  • incidental: CLI error detection fix

Why

Restart broke at some point (probably when I added local tarball support).

Discussion

This change allows local tarball modules to use the normal module restart flow. Local tarball reloading matters because remote reloading will use local tarballs under the surface.

The normal module restart flow is:

  • unpack new version into new folder (packageManager.Sync at top of reconfigure)
  • stop old
  • start new
  • delete old version (packageManager.Cleanup at bottom of reconfigure)

That previously broke with local tarballs, because their version was always "" (empty string). It would unpack on top of the old unpack, which will break in edge cases where a file is removed from the tarball.

(This PR fixes that).

Also note: I added a test here, but it's not a very good test. Ideally we'd want to 1) test a local tarball so we could observe unpacking happening correctly and 2) test that the PID of the module rolls over. This state is hard to get so I cut some corners; if people have ideas on how to get this stuff in localRobot, would be interested to expand the new test.

Checklist

  • add a test for this (because it has already silently broken once)
  • check that I'm right about module restart flow; when is old folder deleted? before process is stopped?

Logs from manual test

With a local tarball module, it incremented the folder version and restarted the process as I restarted from CLI:

2024-06-21T04:44:24.999Z WARN robot_server.modmanager modmanager/manager.go:1066 VIAM_MODULE_ROOT was not passed to module. Defaulting to module's working directory {"module":"local-module-1","dir":"/home/USER/.viam/packages-local/data/module/synthetic-local-module-1-0_0_0/dist"}
2024-06-21T04:44:24.999Z INFO robot_server.modmanager modmanager/manager.go:1098 Starting up module {"module":"local-module-1"}

2024-06-21T04:45:01.973Z WARN robot_server.modmanager modmanager/manager.go:1066 VIAM_MODULE_ROOT was not passed to module. Defaulting to module's working directory {"module":"local-module-1","dir":"/home/USER/.viam/packages-local/data/module/synthetic-local-module-1-0_0_1/dist"}
2024-06-21T04:45:01.973Z INFO robot_server.modmanager modmanager/manager.go:1098 Starting up module {"module":"local-module-1"}

2024-06-21T04:45:14.719Z WARN robot_server.modmanager modmanager/manager.go:1066 VIAM_MODULE_ROOT was not passed to module. Defaulting to module's working directory {"module":"local-module-1","dir":"/home/USER/.viam/packages-local/data/module/synthetic-local-module-1-0_0_2/dist"}
2024-06-21T04:45:14.720Z INFO robot_server.modmanager modmanager/manager.go:1098 Starting up module {"module":"local-module-1"}

Follow ups

  • this map grows infinitely; ideally it would hook into module deletion and clean up map entries as local modules are removed

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Jun 20, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jun 21, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jun 21, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jun 21, 2024
@abe-winter abe-winter marked this pull request as ready for review June 21, 2024 15:55
// that state. 'no error' + 'version incremented' is a cheap proxy for that.
err := r.(*localRobot).restartSingleModule(ctx, mod)
test.That(t, err, test.ShouldBeNil)
test.That(t, mod.LocalVersion, test.ShouldResemble, "0.0.1")
Copy link
Member

@dgottlieb dgottlieb Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand the fuller lifecycle of a real viam-server here. If this test were to continue from here with:

  • Create a new config object from json: cfg, err := config.Read(ctx, cfgPath, logger)
  • Use that for a reconfigure: r.Reconfigure(cfg)

I expect that cfg.Modules[0].LocalVersion value to be the default value/empty string. And the robot would consider this a diff and restart the module? Presumably unpackaging it back into the "0.0.0" directory (or whatever gets used for the default scenario).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the state is stored in the localRobot, not in the config

so in your example, if r is reused, the fresh config.Read won't overwrite the version state; it will be restored by the r.applyLocalModuleVersions(newConfig) call

in reconfigure at L1168 here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense -- thanks for pointing me to the important missing link here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw it would be very nice if we could store it in config; would make it way easier to understand the state for reloading + local tarball

I often start these PRs by mutating the Config struct and then am confused for an hour bc it is getting overwritten or I am mutating a copy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'd like to have a better story here as well.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jun 21, 2024
@abe-winter abe-winter merged commit bbc9158 into viamrobotics:main Jun 21, 2024
17 checks passed
@abe-winter abe-winter deleted the fix-restart branch June 21, 2024 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants