
Deadlock in sotw v3 server #503

Closed
lobkovilya opened this issue Oct 20, 2021 · 10 comments
@lobkovilya

The problem occurred in our testing after updating to the latest go-control-plane version, but I believe it can also happen in real-world use cases.

The stack traces of the relevant goroutines look like this:

1 @ 0x103ca25 0x100759a 0x10072f5 0x1bb58af 0x1bb4507 0x1bbcdb9 0x1ea2695 0x1bb3ec6 0x1074121
#	0x1bb58ae	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).respond+0x12e		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:311
#	0x1bb4506	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).SetSnapshot+0x5a6	/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:168
#	0x1bbcdb8	github.com/kumahq/kuma/pkg/kds/reconcile.(*reconciler).Reconcile+0x1f8		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/reconcile/reconciler.go:46
#	0x1ea2694	github.com/kumahq/kuma/pkg/kds/server.newSyncTracker.func1.2+0x194		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/server/components.go:93
#	0x1bb3ec5	github.com/kumahq/kuma/pkg/util/watchdog.(*SimpleWatchdog).Start+0xe5		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/watchdog/watchdog.go:25

...

1 @ 0x103ca25 0x104e6c5 0x104e6ae 0x106fd67 0x107f225 0x1080990 0x1080922 0x1bb8a06 0x1bab955 0x1bad89a 0x1e9fecb 0x284e9b3 0x1074121
#	0x106fd66	sync.runtime_SemacquireMutex+0x46							/usr/local/Cellar/go/1.16.5/libexec/src/runtime/sema.go:71
#	0x107f224	sync.(*Mutex).lockSlow+0x104								/usr/local/Cellar/go/1.16.5/libexec/src/sync/mutex.go:138
#	0x108098f	sync.(*Mutex).Lock+0x8f									/usr/local/Cellar/go/1.16.5/libexec/src/sync/mutex.go:81
#	0x1080921	sync.(*RWMutex).Lock+0x21								/usr/local/Cellar/go/1.16.5/libexec/src/sync/rwmutex.go:111
#	0x1bb8a05	github.com/kumahq/kuma/pkg/util/xds/v3.(*snapshotCache).cancelWatch.func1+0x65		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/util/xds/v3/cache.go:283
#	0x1bab954	github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3.(*server).process+0x7b4	/Users/lobkovilya/go/src/github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3/server.go:418
#	0x1bad899	github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3.(*server).StreamHandler+0xb9	/Users/lobkovilya/go/src/github.com/envoyproxy/go-control-plane/pkg/server/sotw/v3/server.go:449
#	0x1e9feca	github.com/kumahq/kuma/pkg/kds/server.(*server).StreamKumaResources+0x8a		/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/kds/server/kds.go:30
#	0x284e9b2	github.com/kumahq/kuma/pkg/test/kds/setup.StartServer.func1+0x72			/Users/lobkovilya/go/src/github.com/Kong/kuma/pkg/test/kds/setup/server.go:60

While the first goroutine calls SetSnapshot and tries to update all watchers, the server goroutine receives a DiscoveryRequest and tries to call cancel. Both SetSnapshot and cancel call cache.mu.Lock().

The first goroutine, inside SetSnapshot, can't update the watchers: the values.responses channel (capacity 5) is full, so the send blocks until the server goroutine reads from the channel. But the server goroutine can't read from values.responses, because it is stuck in cancel waiting for cache.mu to be unlocked.
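For illustration, here is a minimal, self-contained Go reproduction of the pattern (the names cacheMu and responses are placeholders, not the actual go-control-plane fields):

```go
package main

import "sync"

func main() {
	var cacheMu sync.Mutex
	responses := make(chan int, 5) // bounded buffer, like values.responses
	locked := make(chan struct{})  // only to make the interleaving deterministic

	// Goroutine A plays the role of SetSnapshot: it holds the cache lock and
	// pushes one response per typeURL. Once the buffer is full, the send
	// blocks while the lock is still held.
	go func() {
		cacheMu.Lock()
		defer cacheMu.Unlock()
		close(locked)
		for i := 0; i < 19; i++ { // 19 typeURLs > capacity 5
			responses <- i // the 6th send blocks here
		}
	}()

	// The main goroutine plays the role of the sotw server handling a new
	// DiscoveryRequest: it must cancel the old watch (which needs the cache
	// lock) before it can get back to draining responses.
	<-locked
	cacheMu.Lock() // blocks forever: A holds the lock and is itself blocked on the send
	cacheMu.Unlock()
	<-responses // never reached; the Go runtime reports "all goroutines are asleep"
}
```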

cc: @jpeach

@lobkovilya

Probably worth adding that we have 19 different typeURLs, which is why the values.responses channel with capacity 5 can't guarantee that SetSnapshot won't block.


jpeach commented Oct 21, 2021

Not sure I'm seeing all the problems here yet, but cancelWatch holds too many locks. It only needs a read lock on the cache, then a write lock to modify the statusInfo to remove the watch. This doesn't fix the deadlock BTW.
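A rough, self-contained sketch of that lock scoping, with simplified stand-in types rather than the real snapshot cache API:

```go
package cache

import "sync"

// Simplified stand-ins for the real snapshotCache/statusInfo types.
type statusInfo struct {
	mu      sync.Mutex
	watches map[int64]chan struct{} // stand-in for the per-watch response channels
}

type snapshotCache struct {
	mu     sync.RWMutex
	status map[string]*statusInfo
}

// cancelWatch only removes one watch from one statusInfo, so a read lock on
// the cache keeps the status map stable, and the write lock is taken only on
// the statusInfo actually being modified.
func (c *snapshotCache) cancelWatch(nodeID string, watchID int64) func() {
	return func() {
		c.mu.RLock()
		defer c.mu.RUnlock()

		if info, ok := c.status[nodeID]; ok {
			info.mu.Lock()
			delete(info.watches, watchID)
			info.mu.Unlock()
		}
	}
}
```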


jpeach commented Oct 21, 2021

> This doesn't fix the deadlock BTW.

One possible fix is to drop the cache lock when sending responses. While holding the lock in SetSnapshot, generate all the watch responses into a slice, then drop the lock and send them all. That's simpler than trying to fix the channel management in the server IMHO (though that would be a more general fix).

Looking at SetSnapshot, it *mostly* doesn't need to hold the write mutex (except where it does). It only needs the write mutex when it adds the snapshot to the cache, and when it calls ConstructVersionMap. The first case is easy: we only need to hold the lock while we set the map entry. The second case we can refactor away, but that's a larger change.

Unfortunately, to fully fix the deadlock with this approach, all calls to cache.respond and cache.respondDelta need to be changed, which touches a number of places.
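A simplified sketch of the approach, again with placeholder types rather than the real cache API: pending responses are collected while holding the lock and only sent after it has been released.

```go
package cache

import "sync"

// Placeholder types; the real cache stores Snapshot values and ResponseWatch
// entries, but the shape of the fix is the same.
type watch struct {
	resp chan string
}

type sketchCache struct {
	mu      sync.RWMutex
	snaps   map[string]string
	watches map[string][]watch
}

// SetSnapshot restructured along the lines described above: build the list of
// pending responses while holding the lock, then release the lock before
// sending, so a full response channel can no longer block other cache calls
// such as cancelWatch.
func (c *sketchCache) SetSnapshot(node, snap string) {
	type pending struct {
		ch  chan string
		msg string
	}

	c.mu.Lock()
	c.snaps[node] = snap // the write lock is only really needed here

	var toSend []pending
	for _, w := range c.watches[node] {
		toSend = append(toSend, pending{ch: w.resp, msg: snap})
	}
	c.watches[node] = nil
	c.mu.Unlock()

	// Sends happen outside the lock, so a blocked send cannot deadlock
	// against a goroutine that needs cache.mu.
	for _, p := range toSend {
		p.ch <- p.msg
	}
}
```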

jpeach added a commit to jpeach/go-control-plane that referenced this issue Oct 21, 2021
Since canceling a watch only modifies the corresponding statusInfo, we
don't need to hold a write lock on the snapshot cache. Holding a read
lock is sufficient to ensure that the status entry is stable.

This updates envoyproxy#503.

Signed-off-by: James Peach <[email protected]>
alecholmez pushed a commit that referenced this issue Oct 25, 2021
Holding a read lock is sufficient to ensure that the status entry is stable when cancelling watches

This updates #503.

Signed-off-by: James Peach <[email protected]>
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Nov 20, 2021

jpeach commented Nov 22, 2021

not stale

github-actions bot removed the stale label Nov 22, 2021

rueian commented Dec 20, 2021

Hi @jpeach, I just ran into the same problem.
I think it would be better if the sotw server handled the <-reqCh case (https://github.com/envoyproxy/go-control-plane/blob/main/pkg/server/sotw/v3/server.go#L321) in another goroutine. Would that be possible?

I also suspect that overriding each chan cache.Response whenever something is received from <-reqCh would no longer be necessary if the <-reqCh case were handled in another goroutine.
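For what it's worth, a minimal sketch of that pattern with placeholder types (not the real sotw server code): requests are read and handled in a dedicated goroutine, so a blocked response send can no longer prevent request handling, and vice versa.

```go
package server

// Placeholder request/response types for illustration only.
type request struct{ typeURL string }
type response struct{ typeURL string }

// process reads requests in a dedicated goroutine while the main loop only
// sends responses; neither side can starve the other by blocking.
func process(reqCh <-chan *request, respCh <-chan *response, send func(*response) error) error {
	done := make(chan struct{})

	go func() {
		defer close(done)
		for req := range reqCh {
			// Create or replace the watch for req.typeURL here; this no
			// longer competes with the response-sending loop below.
			_ = req
		}
	}()

	for {
		select {
		case resp, ok := <-respCh:
			if !ok {
				return nil
			}
			if err := send(resp); err != nil {
				return err
			}
		case <-done:
			return nil
		}
	}
}
```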


jpeach commented Dec 20, 2021

@rueian Pretty sure that #451 will fix the deadlock.


rueian commented Dec 21, 2021

Hi @jpeach, thank you. PR #451 looks great.

However, I found another issue that might cause a deadlock. I opened a new issue here: #530

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Jan 20, 2022
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
