Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠️ RSDK-7903 Include unhealthy remotes in resource list with status #4273

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
968b479
get resource names with resource statuses instead
maximpertsov Aug 6, 2024
282e057
TestStatusRemote: inject machine status
maximpertsov Aug 6, 2024
d837e20
filter remotes in client
maximpertsov Aug 7, 2024
07d9a72
update mocks to include machine status for resources
maximpertsov Aug 7, 2024
83cdc93
account for double log in test
maximpertsov Aug 7, 2024
cf85b6b
WIP: add machine status mocks to sessions remotes
maximpertsov Aug 7, 2024
f274e56
flyby: dedent session table tests
maximpertsov Aug 7, 2024
4754282
whitelist MachineStatus for sessions
maximpertsov Aug 8, 2024
b232589
Revert "flyby: dedent session table tests"
maximpertsov Aug 8, 2024
05a7329
remove commented code
maximpertsov Aug 8, 2024
eed4bed
remove commented code
maximpertsov Aug 8, 2024
c93b936
store statuses instead of names in client
maximpertsov Aug 8, 2024
a2d658d
lint
maximpertsov Aug 8, 2024
d5e9116
fixup client test
maximpertsov Aug 9, 2024
8cb9b03
debug
maximpertsov Aug 9, 2024
fd5da19
return fully-qualified remote names
maximpertsov Aug 9, 2024
58f555f
@dgottlieb: add offline remote test
maximpertsov Aug 9, 2024
6480d49
cache full machine status
maximpertsov Aug 21, 2024
673ea72
use machine status to decide if remotes need to be updated
maximpertsov Aug 21, 2024
2d18da7
remove log test
maximpertsov Aug 21, 2024
49cb0db
framesystem -> frame_system
maximpertsov Sep 16, 2024
278ad51
flyby: minor clean-up + remove conflicting mod/var names
maximpertsov Sep 17, 2024
3e190ef
mock machine status for dummy robot
maximpertsov Sep 17, 2024
c5cc1fd
lint
maximpertsov Sep 17, 2024
af977e5
update comment
maximpertsov Sep 17, 2024
5506d55
mark disconnected
maximpertsov Sep 23, 2024
426e873
REVERT: less logs
maximpertsov Sep 23, 2024
0c778b0
add Disconnected state
maximpertsov Sep 23, 2024
c64fe9d
CR@benjirewis+@cheukt: update remote lookup + tests + MachineStatus l…
maximpertsov Sep 23, 2024
b5c4898
lint
maximpertsov Sep 23, 2024
c32f367
exclude remote apis from machine status resource output
maximpertsov Sep 24, 2024
32d33fd
revert changes from bad rebase
maximpertsov Sep 24, 2024
40e0559
revert to using unhealthy state as disconnected
maximpertsov Sep 24, 2024
d79ebc4
remove log suppression
maximpertsov Sep 24, 2024
d2465ff
update machine status error handling in update remote names
maximpertsov Sep 24, 2024
e401dc0
remove TODO
maximpertsov Sep 24, 2024
c4c0b89
Revert "revert to using unhealthy state as disconnected"
maximpertsov Sep 24, 2024
864e1d2
nolint (temp)
maximpertsov Sep 24, 2024
c5cf943
generate new state
maximpertsov Sep 24, 2024
875332c
fixup one more test
maximpertsov Sep 24, 2024
022194a
remove RDK internal resources from test expectation
maximpertsov Sep 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions resource/graph_node.go
Copy link
Contributor Author

@maximpertsov maximpertsov Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative approach to the changes in this file is to define a new state (e.g. NodeStateDisconnected) - see the "Open questions" section in the PR description for more details and see the revert of this commit for a general idea of how this would look.

Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,9 @@ const (

// NodeStateUnhealthy denotes a resource is unhealthy.
NodeStateUnhealthy

// NodeStateDisconnected denotes a resource is disconnected.
NodeStateDisconnected
Comment on lines +39 to +41
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative to defining a new state would be to mark disconnected resource nodes as "unhealthy" with an appropriate error message. See "Open questions" in the description of this PR for more details and revert this commit to see how this alternative approach would look.

)

// A GraphNode contains the current state of a resource.
Expand Down Expand Up @@ -361,6 +364,13 @@ func (w *GraphNode) SetNeedsUpdate() {
w.setNeedsReconfigure(w.Config(), false, w.UnresolvedDependencies())
}

// SetDisconnected is used to mark a remote node as disconnected.
func (w *GraphNode) SetDisconnected() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, in the resource graph we have:

  • A node for each remote robot. Where the underlying resource is the robot client
  • A node for each resource on a robot robot. Where the underlying resource is the component client (e.g: camera client).

Does this SetDisconnected state only apply to nodes representing the remote robot client in the resource graph? Or does it also apply to the remote resource objects in the resource graph?

Second:
And when a remote goes from working -> disconnected -> working, what state transition will that take? Is it ready -> disconnected -> ready? Or do we jump through any of the unconfigured/configuring states?

I think the answers would work well in the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this SetDisconnected state only apply to nodes representing the remote robot client in the resource graph? Or does it also apply to the remote resource objects in the resource graph?

it applies to the remote resource objects. also note that resources representing full remote machine clients are excluded from the MachineStatus endpoint.

And when a remote goes from working -> disconnected -> working, what state transition will that take? Is it ready -> disconnected -> ready? Or do we jump through any of the unconfigured/configuring states?

thanks for drawing attention to this. the flow I want is for disconnected nodes to go directly back into the ready state (unless they are unhealthy for some other reason). however, I don't think the current logic actually transitions disconnected nodes back to ready, so let me address that.

I think the answers would work well in the documentation.

will do

w.mu.Lock()
defer w.mu.Unlock()
w.transitionTo(NodeStateDisconnected)
}

// setUnresolvedDependencies sets names that are yet to be resolved as
// dependencies for the node. Note that even an empty list will still
// set needsDependencyResolution to true. If no resolution is needed,
Expand Down Expand Up @@ -457,6 +467,7 @@ func (w *GraphNode) replace(other *GraphNode) error {
}

func (w *GraphNode) canTransitionTo(state NodeState) bool {
//nolint // TODO add NodeStateDisconnected if agreed to
switch w.state {
case NodeStateUnknown:
case NodeStateUnconfigured:
Expand Down
5 changes: 3 additions & 2 deletions resource/nodestate_string.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

76 changes: 49 additions & 27 deletions robot/client/client.go
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the high-level change here is that a robot client now caches the results of MachineStatus rather than just ResourceNames. As a result, multiple tests that result on injecting a mock function for ResourceNames now need to mock MachineStatus.

Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,7 @@ type RobotClient struct {
dialOptions []rpc.DialOption

mu sync.RWMutex
resourceNames []resource.Name
cachedMachineStatus robot.MachineStatus
resourceRPCAPIs []resource.RPCAPI
resourceClients map[resource.Name]resource.Resource
remoteNameMap map[resource.Name]resource.Name
Expand Down Expand Up @@ -415,8 +415,8 @@ func (rc *RobotClient) connectWithLock(ctx context.Context) error {
func (rc *RobotClient) updateResourceClients(ctx context.Context) error {
activeResources := make(map[resource.Name]bool)

for _, name := range rc.resourceNames {
activeResources[name] = true
for _, rs := range rc.cachedMachineStatus.Resources {
activeResources[rs.Name] = true
}

for resourceName, client := range rc.resourceClients {
Expand Down Expand Up @@ -464,7 +464,7 @@ func (rc *RobotClient) checkConnection(ctx context.Context, checkEvery, reconnec
return err
}
} else {
if _, _, err := rc.resources(ctx); err != nil {
if _, _, err := rc.machineStatusAndRPCAPIs(ctx); err != nil {
return err
}
}
Expand Down Expand Up @@ -514,7 +514,8 @@ func (rc *RobotClient) checkConnection(ctx context.Context, checkEvery, reconnec
}
}

// Close closes the underlying client connections to the machine and stops any periodic tasks running in the client.
// Close closes the underlying client connections to the machine and stops any
// periodic tasks running in the client.
//
// err := machine.Close(ctx.Background())
func (rc *RobotClient) Close(ctx context.Context) error {
Expand Down Expand Up @@ -586,8 +587,8 @@ func (rc *RobotClient) ResourceByName(name resource.Name) (resource.Resource, er
}

// finally, before adding a new resource, make sure this name exists and is known
for _, knownName := range rc.resourceNames {
if name == knownName {
for _, rs := range rc.cachedMachineStatus.Resources {
if name == rs.Name {
resourceClient, err := rc.createClient(name)
if err != nil {
return nil, err
Expand All @@ -608,7 +609,7 @@ func (rc *RobotClient) createClient(name resource.Name) (resource.Resource, erro
return apiInfo.RPCClient(rc.backgroundCtx, &rc.conn, rc.remoteName, name, logger)
}

func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resource.RPCAPI, error) {
func (rc *RobotClient) machineStatusAndRPCAPIs(ctx context.Context) (robot.MachineStatus, []resource.RPCAPI, error) {
// RSDK-5356 If we are in a testing environment, never apply
// defaultResourcesTimeout. Tests run in parallel, and if execution of a test
// pauses for longer than 5s, below calls to ResourceNames or
Expand All @@ -620,22 +621,29 @@ func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resour
defer cancel()
}

resp, err := rc.client.ResourceNames(ctx, &pb.ResourceNamesRequest{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, are you changing robots to use machine status instead of resource names to populate remote resources?

resp, err := rc.machineStatus(ctx)
if err != nil {
return nil, nil, err
return robot.MachineStatus{}, nil, err
}

var resTypes []resource.RPCAPI

resources := make([]resource.Name, 0, len(resp.Resources))
for _, name := range resp.Resources {
newName := rprotoutils.ResourceNameFromProto(name)
resources = append(resources, newName)
mStatus := resp
resources := make([]resource.Status, 0, len(resp.Resources))
for _, rs := range resp.Resources {
if rs.Name.API == RemoteAPI {
continue
}
if rs.Name.API.Type.Namespace == resource.APINamespaceRDKInternal {
continue
}
resources = append(resources, rs)
}
mStatus.Resources = resources

// resource has previously returned an unimplemented response, skip rpc call
if rc.rpcSubtypesUnimplemented {
return resources, resTypes, nil
return mStatus, resTypes, nil
}

typesResp, err := rc.client.ResourceRPCSubtypes(ctx, &pb.ResourceRPCSubtypesRequest{})
Expand All @@ -656,7 +664,7 @@ func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resour
}
svcDesc, ok := symDesc.(*desc.ServiceDescriptor)
if !ok {
return nil, nil, fmt.Errorf("expected descriptor to be service descriptor but got %T", symDesc)
return robot.MachineStatus{}, nil, fmt.Errorf("expected descriptor to be service descriptor but got %T", symDesc)
}
resTypes = append(resTypes, resource.RPCAPI{
API: rprotoutils.ResourceNameFromProto(resAPI.Subtype).API,
Expand All @@ -665,13 +673,13 @@ func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resour
}
} else {
if s, ok := status.FromError(err); !(ok && (s.Code() == codes.Unimplemented)) {
return nil, nil, err
return robot.MachineStatus{}, nil, err
}
// prevent future calls to ResourceRPCSubtypes
rc.rpcSubtypesUnimplemented = true
}

return resources, resTypes, nil
return mStatus, resTypes, nil
}

// Refresh manually updates the underlying parts of this machine.
Expand All @@ -686,13 +694,12 @@ func (rc *RobotClient) Refresh(ctx context.Context) (err error) {
func (rc *RobotClient) updateResources(ctx context.Context) error {
// call metadata service.

names, rpcAPIs, err := rc.resources(ctx)
mStatus, rpcAPIs, err := rc.machineStatusAndRPCAPIs(ctx)
if err != nil && status.Code(err) != codes.Unimplemented {
return fmt.Errorf("error updating resources: %w", err)
}

rc.resourceNames = make([]resource.Name, 0, len(names))
rc.resourceNames = append(rc.resourceNames, names...)
rc.cachedMachineStatus = mStatus
rc.resourceRPCAPIs = rpcAPIs

rc.updateRemoteNameMap()
Expand All @@ -703,17 +710,17 @@ func (rc *RobotClient) updateResources(ctx context.Context) error {
func (rc *RobotClient) updateRemoteNameMap() {
tempMap := make(map[resource.Name]resource.Name)
dupMap := make(map[resource.Name]bool)
for _, n := range rc.resourceNames {
if err := n.Validate(); err != nil {
for _, rs := range rc.cachedMachineStatus.Resources {
if err := rs.Name.Validate(); err != nil {
rc.Logger().Error(err)
continue
}
tempName := resource.RemoveRemoteName(n)
tempName := resource.RemoveRemoteName(rs.Name)
// If the short name already exists in the map then there is a collision and we make the long name empty.
if _, ok := tempMap[tempName]; ok {
dupMap[tempName] = true
} else {
tempMap[tempName] = n
tempMap[tempName] = rs.Name
}
}
for key := range dupMap {
Expand Down Expand Up @@ -759,8 +766,10 @@ func (rc *RobotClient) ResourceNames() []resource.Name {
}
rc.mu.RLock()
defer rc.mu.RUnlock()
names := make([]resource.Name, 0, len(rc.resourceNames))
names = append(names, rc.resourceNames...)
names := make([]resource.Name, 0, len(rc.cachedMachineStatus.Resources))
for _, rs := range rc.cachedMachineStatus.Resources {
names = append(names, rs.Name)
}
return names
}

Expand Down Expand Up @@ -1058,8 +1067,21 @@ func (rc *RobotClient) Shutdown(ctx context.Context) error {
return nil
}

// ErrDisconnected that a robot is disconnected.
var ErrDisconnected = errors.New("disconnected")

// MachineStatus returns the current status of the robot.
func (rc *RobotClient) MachineStatus(ctx context.Context) (robot.MachineStatus, error) {
if rc.checkConnected() != nil {
return robot.MachineStatus{}, ErrDisconnected
}

rc.mu.Lock()
defer rc.mu.Unlock()
return rc.cachedMachineStatus, nil
}

func (rc *RobotClient) machineStatus(ctx context.Context) (robot.MachineStatus, error) {
mStatus := robot.MachineStatus{}

req := &pb.GetMachineStatusRequest{}
Expand Down
1 change: 1 addition & 0 deletions robot/client/client_session.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ var exemptFromSession = map[string]bool{
"/proto.rpc.webrtc.v1.SignalingService/OptionalWebRTCConfig": true,
"/proto.rpc.v1.AuthService/Authenticate": true,
"/proto.rpc.v1.ExternalAuthService/AuthenticateTo": true,
"/viam.robot.v1.RobotService/GetMachineStatus": true,
"/viam.robot.v1.RobotService/ResourceNames": true,
"/viam.robot.v1.RobotService/ResourceRPCSubtypes": true,
"/viam.robot.v1.RobotService/StartSession": true,
Expand Down
18 changes: 15 additions & 3 deletions robot/client/client_session_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"go.viam.com/rdk/robot/client"
"go.viam.com/rdk/robot/web"
"go.viam.com/rdk/session"
rdktestutils "go.viam.com/rdk/testutils"
"go.viam.com/rdk/testutils/inject"
"go.viam.com/rdk/testutils/robottestutils"
)
Expand Down Expand Up @@ -101,8 +102,12 @@ func TestClientSessionOptions(t *testing.T) {

sessMgr := &sessionManager{}
arbName := resource.NewName(echoAPI, "woo")
injectResources := []resource.Name{arbName}
injectRobot := &inject.Robot{
ResourceNamesFunc: func() []resource.Name { return []resource.Name{arbName} },
ResourceNamesFunc: func() []resource.Name { return injectResources },
MachineStatusFunc: func(ctx context.Context) (robot.MachineStatus, error) {
return rdktestutils.ResourcesToMachineStatus(injectResources), nil
},
ResourceByNameFunc: func(name resource.Name) (resource.Resource, error) {
return &dummyEcho{Named: arbName.AsNamed()}, nil
},
Expand Down Expand Up @@ -284,8 +289,12 @@ func TestClientSessionExpiration(t *testing.T) {
arbName := resource.NewName(echoAPI, "woo")

var dummyEcho1 dummyEcho
injectResources := []resource.Name{arbName}
injectRobot := &inject.Robot{
ResourceNamesFunc: func() []resource.Name { return []resource.Name{arbName} },
ResourceNamesFunc: func() []resource.Name { return injectResources },
MachineStatusFunc: func(ctx context.Context) (robot.MachineStatus, error) {
return rdktestutils.ResourcesToMachineStatus(injectResources), nil
},
ResourceByNameFunc: func(name resource.Name) (resource.Resource, error) {
return &dummyEcho1, nil
},
Expand Down Expand Up @@ -478,7 +487,10 @@ func TestClientSessionResume(t *testing.T) {

sessMgr := &sessionManager{}
injectRobot := &inject.Robot{
ResourceNamesFunc: func() []resource.Name { return []resource.Name{} },
ResourceNamesFunc: func() []resource.Name { return []resource.Name{} },
MachineStatusFunc: func(ctx context.Context) (robot.MachineStatus, error) {
return robot.MachineStatus{}, nil
},
ResourceRPCAPIsFunc: func() []resource.RPCAPI { return nil },
LoggerFunc: func() logging.Logger { return logger },
SessMgr: sessMgr,
Expand Down
Loading
Loading