
Conversation

@aldenh-viam
Copy link
Contributor

@aldenh-viam aldenh-viam commented Nov 3, 2025

The Operation Manager is often used in the following manner, both internally and externally:

rdk/operation/opid.go

Lines 55 to 68 in deee8a2

func (o *Operation) CancelOtherWithLabel(label string) {
all := o.myManager.All()
for id, op := range all {
if op == nil {
// TODO(RSDK-12330): Remove this log once we've found how a nil operation can be
// encountered here.
o.myManager.logger.Errorw("nil operation encountered within CancelOtherWithLabel method", "id", id)
continue
}
if op == o {
continue
}
if op.HasLabel(label) {
op.Cancel()

In the above snippet:

  • o.myManager.All(): acquires the Operation Manager's lock, builds a []*Operation of all operations known to the manager, and releases the lock before returning.
  • The caller then iterates over that slice of pointers without holding the lock and calls op.Cancel() on each operation carrying the matching label.

There is no guarantee that the pointers returned still point to valid memory.

Users have reported seeing panic: runtime error: invalid memory address ... from the Operation Manager when running highly concurrent code like:

	for {
		_, err = utils.RunInParallel(ctx, []utils.SimpleFunc{
			func(ctx context.Context) error {
				return servo1.Move(ctx, uint32(random), nil)
			},
			func(ctx context.Context) error {
				return servo2.Move(ctx, uint32(random), nil)
			},
			func(ctx context.Context) error {
				return servo3.Move(ctx, uint32(random), nil)
			},
		})
		time.Sleep(time.Microsecond)
	}

With this PR, I've added new public Lock(), Unlock(), and AllWithoutLock() methods to the Operation Manager. AllWithoutLock() functions similarly to All(), but leaves it to the caller to lock/unlock the Operation Manager while processing the returned operations.
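
To make the intended usage concrete, here's a rough caller-side sketch. It assumes the PR's proposed Lock()/Unlock()/AllWithoutLock() names and that AllWithoutLock() returns the same []*Operation that All() does today; like the other snippets in this thread, the surrounding package and imports are omitted.

// Hypothetical caller of the proposed API (not code from this PR): the caller
// holds the Manager's lock for the entire traversal, so the operations it is
// inspecting cannot be mutated or removed underneath it.
func cancelEverythingWithLabel(m *operation.Manager, label string) {
	m.Lock()
	defer m.Unlock()
	for _, op := range m.AllWithoutLock() {
		if op == nil {
			continue
		}
		if op.HasLabel(label) {
			op.Cancel()
		}
	}
}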

This does appear to resolve the reported panic, but I realize this design of asking the caller to manage the locking may not be ideal, so let me know if you have any better suggestions.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Nov 3, 2025
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Nov 3, 2025
@aldenh-viam aldenh-viam changed the title WIP: Allow caller to lock Operation Manager RSDK-12330: WIP: Allow caller to lock Operation Manager Nov 3, 2025
@aldenh-viam aldenh-viam requested a review from a team November 3, 2025 16:53
@aldenh-viam
Copy link
Contributor Author

op.Cancel is a context.CancelFunc, and in the current code it appears to come straight from context.WithCancel. But if a user ever supplies a custom CancelFunc that locks op.myManager.lock, this could deadlock.

cancel context.CancelFunc
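
To make the concern concrete, a hypothetical sketch (none of this is actual RDK code; innerCancel stands in for the context.CancelFunc returned by context.WithCancel):

// Hypothetical: an operation wired up with a custom cancel func that grabs the
// manager's lock. If a batch-cancel path invokes op.Cancel() while it already
// holds op.myManager.lock, this blocks forever, because sync.Mutex is not
// reentrant.
op.cancel = func() {
	op.myManager.lock.Lock() // deadlocks if the caller already holds this lock
	defer op.myManager.lock.Unlock()
	// ... cleanup that touches manager state ...
	innerCancel()
}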

@aldenh-viam aldenh-viam changed the title RSDK-12330: WIP: Allow caller to lock Operation Manager RSDK-12330: Allow caller to lock Operation Manager Nov 3, 2025
@jmatth
Copy link
Member

jmatth commented Nov 3, 2025

I'll leave it to the team whether this is a better solution but you could convert .All() to return an iter.Seq2 that locks during the iteration:

// All returns a [iter.Seq2] of OperationID/*Operation key/value pairs of all
// running operations known to the Manager. A lock is held on the manager for
// the duration of the iteration, so callers must not retain copies of the
// Operation pointers as there is no guarantee they will remain valid after
// iteration has ended.
func (m *Manager) All() iter.Seq2[string, *Operation] {
	return func(yield func(string, *Operation) bool) {
		m.lock.Lock()
		defer m.lock.Unlock()
		for i, o := range m.ops {
			if !yield(i, o) {
				break
			}
		}
	}
}

Callers can then mostly use this normally with for id, op := range mgr.All() and everything should be fine as long as they don't hold on to the operation pointers outside the loop.
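
For instance, the CancelOtherWithLabel loop from the description would barely change (a sketch, assuming All() becomes the iterator above and that nothing in the loop body takes the manager's lock again, since sync.Mutex is not reentrant):

// Sketch of CancelOtherWithLabel over the iterator-based All(). The manager's
// lock is held by the iterator for the duration of the loop, and the
// *Operation values are only used inside the loop body, never retained.
func (o *Operation) CancelOtherWithLabel(label string) {
	for _, op := range o.myManager.All() {
		if op == nil || op == o {
			continue
		}
		if op.HasLabel(label) {
			op.Cancel()
		}
	}
	o.labels = append(o.labels, label)
}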

@benjirewis
Copy link
Member

Thanks! I think the general solution is good, but I do like Josh's suggestion more, since it guarantees that the lock is held for the duration of the iteration.

@aldenh-viam
Copy link
Contributor Author

Thanks! Never used iter.Seq2 before, but will give it a try.

@dgottlieb
Copy link
Member

I'll leave it to the team whether this is a better solution but you could convert .All() to return an iter.Seq2 that locks during the iteration:

That feels dangerous, no? These mutexes are not reentrant. Without a strong, well-understood convention, it feels easy to write a deadlock.
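
For example (hypothetical, and assuming Manager.CancelAll takes the manager's lock the way the iterator-based All() would): any code inside the iteration that re-acquires that lock, directly or via another Manager method, blocks forever.

// Hypothetical deadlock with the iterator-based All(): the iterator already
// holds m.lock for the duration of the loop, and sync.Mutex is not reentrant,
// so the nested CancelAll call below never returns.
for _, op := range mgr.All() {
	if op.HasLabel("cleanup") {
		mgr.CancelAll() // also tries to take the manager's lock -> deadlock
	}
}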

// If providing a custom context.CancelFunc, do not lock Manager's lock, as it will
// conflict with other batch-cancel methods: CancelOtherWithLabel, Manager.CancelAll
cancel context.CancelFunc
labels []string
Copy link
Member


@benjirewis and I looked at this yesterday and I think what's actually happening is that multiple goroutines are accessing labels concurrently. Specifically:

What I think happened in the race: a slice header consists of a pointer to the underlying growable backing array plus length/capacity fields. An append (specifically one taking the nil empty slice to a non-empty slice of length 1) allocates a new backing array and updates both the pointer and the length. A racing reader can observe the new non-zero length while still seeing the old, still-nil pointer to the backing array, and then dereference nil.

Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.

Copy link
Member


That also sounds right to me; here's a toy program that can panic for, I think, the reason we're talking about:

package main

import (
	"sync"
)

type Operation struct {
	labels []string
}

// HasLabel returns true if this operation has a specific label.
func (o *Operation) HasLabel(label string) bool {
	for _, l := range o.labels {
		if l == label {
			return true
		}
	}
	return false
}

func main() {
	for {
		op := &Operation{}

		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()

			// Reader: iterates op.labels with no synchronization.
			op.HasLabel("benny")
		}()
		go func() {
			defer wg.Done()

			// Writer: appends to op.labels, racing with the reader above.
			op.labels = append(op.labels, "benny")
		}()

		wg.Wait()

		println("Finished run")
	}
}
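
For what it's worth, even on iterations that don't panic, running this under the race detector (go run -race) should report the concurrent read and write of op.labels, which is the same unsynchronized access being described here.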

But if we need labels, we should lock individual operations rather than the entire manager.

I'm a little confused about what you mean here, though, @dgottlieb. You mean using a (potentially separate) mutex around writes/reads of labels?

Copy link
Contributor Author


Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.

From what I can tell, a label is an arbitrary name used to refer to an operation, set lazily by CancelOtherWithLabel (it cancels any other operations with the same label before assigning that label to the calling Operation; also see the tests from the original commit [link]). I agree that locking individual operations is ideal, but the current design is a little messy, with single Operations being able to read and mutate the Manager that manages all Operations. I'm happy to spend more time looking into this if we feel it will be beneficial.

That feels dangerous no? These mutexes are not reentrant. Without a strong understood convention, it feels easy to write a deadlock.

FWIW, I tried migrating all existing uses of All() to the proposed iter.Seq2 version and didn't see any deadlocks.

Copy link
Member


You mean using a (potentially separate) mutex around writes/reads of labels?

Discussed offline: yes, that is what is meant.

@aldenh-viam
Copy link
Contributor Author

(Cont. from offline discussion): Adding a new lock around labels (instead of reusing the Operation Manager's lock) indeed appears to be enough to avoid the panic.

The issue described in the first comment still remains, though. Shall we keep these in-progress changes, and I'll make a new PR to add the labels lock?

type Operation struct {
	...
	labelsLock sync.Mutex
	labels     []string
}

func (o *Operation) HasLabel(label string) bool {
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	for _, l := range o.labels {
		...
	}
}

func (o *Operation) CancelOtherWithLabel(label string) {
	...
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	o.labels = append(o.labels, label)
}

@aldenh-viam aldenh-viam changed the title RSDK-12330: Allow caller to lock Operation Manager RSDK-12570: Allow caller to lock Operation Manager Nov 11, 2025
@jmatth
Copy link
Member

jmatth commented Nov 14, 2025

Can this be closed now that #5444 was merged?

@aldenh-viam
Copy link
Contributor Author

Can this be closed now that #5444 was merged?

I'll change it to Draft for now. I made a new ticket to track possible correctness issues (possibly from future misuse) with the current OpMgr: https://viam.atlassian.net/browse/RSDK-12570; the merged PR is just a quick fix for the panic.

@aldenh-viam aldenh-viam marked this pull request as draft November 14, 2025 20:37