
Conversation

@aldenh-viam
Copy link
Contributor

@aldenh-viam aldenh-viam commented Nov 3, 2025

The Operation Manager is often used in the following manner, both internally and externally:

rdk/operation/opid.go

Lines 55 to 68 in deee8a2

func (o *Operation) CancelOtherWithLabel(label string) {
all := o.myManager.All()
for id, op := range all {
if op == nil {
// TODO(RSDK-12330): Remove this log once we've found how a nil operation can be
// encountered here.
o.myManager.logger.Errorw("nil operation encountered within CancelOtherWithLabel method", "id", id)
continue
}
if op == o {
continue
}
if op.HasLabel(label) {
op.Cancel()

In the above snippet:

  • o.myManager.All(): acquires the Operation Manager's lock, builds a []*Operation of all operations known to the manager, and releases the lock before returning.
  • The caller then iterates over that slice of pointers without holding the lock and calls op.Cancel() on each operation carrying the matching label.

There is no guarantee that the pointers returned still point to valid memory.

Users have reported seeing panic: runtime error: invalid memory address ... from the Operation Manager when running highly concurrent code like:

	for {
		_, err = utils.RunInParallel(ctx, []utils.SimpleFunc{
			func(ctx context.Context) error {
				return servo1.Move(ctx, uint32(random), nil)
			},
			func(ctx context.Context) error {
				return servo2.Move(ctx, uint32(random), nil)
			},
			func(ctx context.Context) error {
				return servo3.Move(ctx, uint32(random), nil)
			},
		})
		time.Sleep(time.Microsecond)
	}

With this PR, I've added new public Lock(), Unlock(), and AllWithoutLock() methods to the Operation Manager. AllWithoutLock() functions similarly to All(), but leaves it to the caller to lock/unlock the Operation Manager while processing the returned operations.
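
To make the intended usage concrete, here's a rough caller-side sketch. It assumes the PR's proposed Lock()/Unlock()/AllWithoutLock() names and that AllWithoutLock() returns the same []*Operation that All() does today; like the other snippets in this thread, the surrounding package and imports are omitted.

// Hypothetical caller of the proposed API (not code from this PR): the caller
// holds the Manager's lock for the entire traversal, so the operations it is
// inspecting cannot be mutated or removed underneath it.
func cancelEverythingWithLabel(m *operation.Manager, label string) {
	m.Lock()
	defer m.Unlock()
	for _, op := range m.AllWithoutLock() {
		if op == nil {
			continue
		}
		if op.HasLabel(label) {
			op.Cancel()
		}
	}
}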

This does appear to resolve the reported panic, but I realize this design of asking the caller to manage the locking may not be ideal, so let me know if you have any better suggestions.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Nov 3, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 3, 2025
@aldenh-viam aldenh-viam changed the title WIP: Allow caller to lock Operation Manager RSDK-12330: WIP: Allow caller to lock Operation Manager Nov 3, 2025
@aldenh-viam aldenh-viam requested a review from a team November 3, 2025 16:53
@aldenh-viam
Copy link
Contributor Author

op.Cancel is a context.CancelFunc, and in the current code it appears to come straight from context.WithCancel. But if a user ever supplies a custom CancelFunc that locks op.myManager.lock, this could deadlock.

cancel context.CancelFunc
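
To make the concern concrete, a hypothetical sketch (none of this is actual RDK code; innerCancel stands in for the context.CancelFunc returned by context.WithCancel):

// Hypothetical: an operation wired up with a custom cancel func that grabs the
// manager's lock. If a batch-cancel path invokes op.Cancel() while it already
// holds op.myManager.lock, this blocks forever, because sync.Mutex is not
// reentrant.
op.cancel = func() {
	op.myManager.lock.Lock() // deadlocks if the caller already holds this lock
	defer op.myManager.lock.Unlock()
	// ... cleanup that touches manager state ...
	innerCancel()
}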

@aldenh-viam aldenh-viam changed the title RSDK-12330: WIP: Allow caller to lock Operation Manager RSDK-12330: Allow caller to lock Operation Manager Nov 3, 2025
@jmatth
Copy link
Member

jmatth commented Nov 3, 2025

I'll leave it to the team whether this is a better solution but you could convert .All() to return an iter.Seq2 that locks during the iteration:

// All returns a [iter.Seq2] of OperationID/*Operation key/value pairs of all
// running operations known to the Manager. A lock is held on the manager for
// the duration of the iteration, so callers must not retain copies of the
// Operation pointers as there is no guarantee they will remain valid after
// iteration has ended.
func (m *Manager) All() iter.Seq2[string, *Operation] {
	return func(yield func(string, *Operation) bool) {
		m.lock.Lock()
		defer m.lock.Unlock()
		for i, o := range m.ops {
			if !yield(i, o) {
				break
			}
		}
	}
}

Callers can then mostly use this normally with for id, op := range mgr.All() and everything should be fine as long as they don't hold on to the operation pointers outside the loop.
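
For instance, the CancelOtherWithLabel loop from the description would barely change (a sketch, assuming All() becomes the iterator above and that nothing in the loop body takes the manager's lock again, since sync.Mutex is not reentrant):

// Sketch of CancelOtherWithLabel over the iterator-based All(). The manager's
// lock is held by the iterator for the duration of the loop, and the
// *Operation values are only used inside the loop body, never retained.
func (o *Operation) CancelOtherWithLabel(label string) {
	for _, op := range o.myManager.All() {
		if op == nil || op == o {
			continue
		}
		if op.HasLabel(label) {
			op.Cancel()
		}
	}
	o.labels = append(o.labels, label)
}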

@benjirewis
Copy link
Member

Thanks! I think the general solution is good, but I do like Josh's suggestion more, since it guarantees that the lock is held for the duration of the iteration.

@aldenh-viam
Copy link
Contributor Author

Thanks! Never used iter.Seq2 before, but will give it a try.

@dgottlieb
Copy link
Member

I'll leave it to the team whether this is a better solution but you could convert .All() to return an iter.Seq2 that locks during the iteration:

That feels dangerous, no? These mutexes are not reentrant. Without a strong, well-understood convention, it feels easy to write a deadlock.
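
For example (hypothetical, and assuming Manager.CancelAll takes the manager's lock the way the iterator-based All() would): any code inside the iteration that re-acquires that lock, directly or via another Manager method, blocks forever.

// Hypothetical deadlock with the iterator-based All(): the iterator already
// holds m.lock for the duration of the loop, and sync.Mutex is not reentrant,
// so the nested CancelAll call below never returns.
for _, op := range mgr.All() {
	if op.HasLabel("cleanup") {
		mgr.CancelAll() // also tries to take the manager's lock -> deadlock
	}
}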

// If providing a custom context.CancelFunc, do not lock Manager's lock, as it will
// conflict with other batch-cancel methods: CancelOtherWithLabel, Manager.CancelAll
cancel context.CancelFunc
labels []string
Copy link
Member


@benjirewis and I looked at this yesterday and I think what's actually happening is that multiple goroutines are accessing labels concurrently. Specifically:

What I think happened in the race: a slice header consists of a pointer to the underlying growable backing array plus length/capacity fields. An append (specifically one taking the nil empty slice to a non-empty slice of length 1) allocates a new backing array and updates both the pointer and the length. A racing reader can observe the new non-zero length while still seeing the old, still-nil pointer to the backing array, and then dereference nil.

Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.

Copy link
Member


That also sounds right to me; here's a toy program that can panic for, I think, the reason we're talking about:

package main

import (
	"sync"
)

type Operation struct {
	labels []string
}

// HasLabel returns true if this operation has a specific label.
func (o *Operation) HasLabel(label string) bool {
	for _, l := range o.labels {
		if l == label {
			return true
		}
	}
	return false
}

func main() {
	for {
		op := &Operation{}

		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()

			// Reader: iterates op.labels with no synchronization.
			op.HasLabel("benny")
		}()
		go func() {
			defer wg.Done()

			// Writer: appends to op.labels, racing with the reader above.
			op.labels = append(op.labels, "benny")
		}()

		wg.Wait()

		println("Finished run")
	}
}
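
For what it's worth, even on iterations that don't panic, running this under the race detector (go run -race) should report the concurrent read and write of op.labels, which is the same unsynchronized access being described here.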

But if we need labels, we should lock individual operations rather than the entire manager.

I'm a little confused about what you mean here, though, @dgottlieb. You mean using a (potentially separate) mutex around writes/reads of labels?

Copy link
Contributor Author


Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.

From what I can tell, a label is an arbitrary name used to refer to an operation, set lazily by CancelOtherWithLabel (it cancels any other operations with the same label before assigning that label to the calling Operation; also see the tests from the original commit [link]). I agree that locking individual operations is ideal, but the current design is a little messy, with single Operations being able to read and mutate the Manager that manages all Operations. I'm happy to spend more time looking into this if we feel it will be beneficial.

That feels dangerous no? These mutexes are not reentrant. Without a strong understood convention, it feels easy to write a deadlock.

FWIW, I tried migrating all existing uses of All() to the proposed iter.Seq2 version and didn't see any deadlocks.

Copy link
Member


You mean using a (potentially separate) mutex around writes/reads of labels?

Discussed offline: yes, that is what is meant.

@aldenh-viam
Copy link
Contributor Author

(Cont. from offline discussion): Adding a new lock around labels (instead of reusing the Operation Manager's lock) indeed appears to be enough to avoid the panic.

The issue described in the first comment still remains, though. Shall we keep these in-progress changes, and I'll make a new PR to add the labels lock?

type Operation struct {
	...
	labelsLock sync.Mutex
	labels     []string
}

func (o *Operation) HasLabel(label string) bool {
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	for _, l := range o.labels {
		...
	}
}

func (o *Operation) CancelOtherWithLabel(label string) {
	...
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	o.labels = append(o.labels, label)
}

@aldenh-viam aldenh-viam changed the title RSDK-12330: Allow caller to lock Operation Manager RSDK-12570: Allow caller to lock Operation Manager Nov 11, 2025
@jmatth
Copy link
Member

jmatth commented Nov 14, 2025

Can this be closed now that #5444 was merged?

@aldenh-viam
Copy link
Contributor Author

Can this be closed now that #5444 was merged?

I'll change it to Draft for now. I made a new ticket to track possible correctness issues (possibly from future misuse) with the current OpMgr: https://viam.atlassian.net/browse/RSDK-12570; the merged PR is just a quick fix for the panic.

@aldenh-viam aldenh-viam marked this pull request as draft November 14, 2025 20:37