Description
While executing the MCAD e2e tests with the Go race detector turned on, the following race conditions were reported:
Sample 1
mcad-controller-c4f85dbb6-4246d mcad-controller WARNING: DATA RACE
mcad-controller-c4f85dbb6-4246d mcad-controller Write at 0x00c0005a82c0 by goroutine 131:
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1.(*AppWrapper).DeepCopyInto()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/apis/controller/v1beta1/zz_generated.deepcopy.go:47 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:945 +0xe98
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/eapache/go-resiliency/retrier.(*Retrier).Run.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:41 +0x34
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/eapache/go-resiliency/retrier.(*Retrier).RunCtx()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:53 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/eapache/go-resiliency/retrier.(*Retrier).Run()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:39 +0x6c
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:928 +0x2f0
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext-fm()
mcad-controller-c4f85dbb6-4246d mcad-controller <autogenerated>:1 +0x38
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func3()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1419 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Previous read at 0x00c0005a82c0 by goroutine 95:
mcad-controller-c4f85dbb6-4246d mcad-controller runtime.convT()
mcad-controller-c4f85dbb6-4246d mcad-controller /usr/lib/golang/src/runtime/iface.go:321 +0x0
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).backoff()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1401 +0xacc
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs.func3()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:377 +0x68
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Goroutine 131 (running) created at:
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1419 +0x36c
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/cmd/kar-controllers/app.Run()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/cmd/kar-controllers/app/server.go:67 +0xd0
mcad-controller-c4f85dbb6-4246d mcad-controller main.main()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/cmd/kar-controllers/main.go:52 +0xf8
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Goroutine 95 (finished) created at:
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:377 +0x1d04
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs-fm()
mcad-controller-c4f85dbb6-4246d mcad-controller <autogenerated>:1 +0x38
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-c4f85dbb6-4246d mcad-controller k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-c4f85dbb6-4246d mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-c4f85dbb6-4246d mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func4()
mcad-controller-c4f85dbb6-4246d mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1424 +0x50
mcad-controller-c4f85dbb6-4246d mcad-controller ==================
Sample 2
mcad-controller-55cdd74d67-wd87p mcad-controller WARNING: DATA RACE
mcad-controller-55cdd74d67-wd87p mcad-controller Write at 0x00c0000776d8 by goroutine 10:
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).manageQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1836 +0x25c8
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).syncQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1795 +0x22b8
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker.func2()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1694 +0x42c
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*FIFO).Pop()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/fifo.go:303 +0x2d8
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1674 +0x7c
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x38
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func9()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1437 +0x48
mcad-controller-55cdd74d67-wd87p mcad-controller
mcad-controller-55cdd74d67-wd87p mcad-controller Previous read at 0x00c0000776d8 by goroutine 74:
mcad-controller-55cdd74d67-wd87p mcad-controller runtime.convT()
mcad-controller-55cdd74d67-wd87p mcad-controller /usr/lib/golang/src/runtime/iface.go:321 +0x0
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).enqueue()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1595 +0x6b8
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).addQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1526 +0xc74
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).addQueueJob-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x48
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231 +0x60
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x24
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x74
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x60
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*processorListener).run.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:777 +0x108
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x70
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*processorListener).run()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:771 +0x1c
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*processorListener).run-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x38
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x70
mcad-controller-55cdd74d67-wd87p mcad-controller
mcad-controller-55cdd74d67-wd87p mcad-controller Goroutine 10 (running) created at:
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1437 +0x7b8
mcad-controller-55cdd74d67-wd87p mcad-controller github.com/project-codeflare/multi-cluster-app-dispatcher/cmd/kar-controllers/app.Run()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/cmd/kar-controllers/app/server.go:67 +0xd0
mcad-controller-55cdd74d67-wd87p mcad-controller main.main()
mcad-controller-55cdd74d67-wd87p mcad-controller /workdir/cmd/kar-controllers/main.go:52 +0xf8
mcad-controller-55cdd74d67-wd87p mcad-controller
mcad-controller-55cdd74d67-wd87p mcad-controller Goroutine 74 (running) created at:
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.(*Group).Start()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0xd8
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*sharedProcessor).run.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:623 +0x154
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*sharedProcessor).run()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:627 +0x30
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/client-go/tools/cache.(*sharedProcessor).run-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller <autogenerated>:1 +0x40
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithChannel.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:56 +0x40
mcad-controller-55cdd74d67-wd87p mcad-controller k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x70
Impact
There is a design (coding) flaw in the current implementation of MCAD: pointers to the arbv1.AppWrapper structure are shared between multiple goroutines. These goroutines modify the contents of the structure concurrently (they generally reload the state of the app wrapper CRD from etcd before they apply their logic) without any synchronisation mechanism.
This flaw can produce incorrect or inconsistent behaviour under heavy load (large number of app wrappers, memory pressure, etc.), which can cause app wrappers not to be dispatched or their status to be reported incorrectly. It is very hard to reason about the correctness of MCAD's behaviour while this condition persists. The correlation between the flaky e2e tests and this race condition has not been fully established.
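One common way to avoid this class of race is to never hand out the shared pointer: readers receive a private copy, and writers store a copy back under a lock. The sketch below only illustrates that idea and is not the MCAD code. `AppWrapper` here is a heavily simplified, hypothetical stand-in for arbv1.AppWrapper, and `Store`, `NewStore`, `Get` and `Put` are names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// AppWrapper is a hypothetical, simplified stand-in for arbv1.AppWrapper.
type AppWrapper struct {
	Name  string
	State string
}

// DeepCopy returns an independent copy of the object.
// A plain struct copy is enough here because all fields are values.
func (aw *AppWrapper) DeepCopy() *AppWrapper {
	c := *aw
	return &c
}

// Store is a hypothetical owner of the shared objects.
// It never exposes its internal pointers to callers.
type Store struct {
	mu    sync.RWMutex
	items map[string]*AppWrapper
}

func NewStore() *Store {
	return &Store{items: map[string]*AppWrapper{}}
}

// Get returns a private copy, so the caller can mutate it freely.
func (s *Store) Get(name string) (*AppWrapper, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	aw, ok := s.items[name]
	if !ok {
		return nil, false
	}
	return aw.DeepCopy(), true
}

// Put stores a copy of the caller's object while holding the write lock.
func (s *Store) Put(aw *AppWrapper) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[aw.Name] = aw.DeepCopy()
}

func main() {
	s := NewStore()
	s.Put(&AppWrapper{Name: "aw-1", State: "Pending"})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			aw, _ := s.Get("aw-1") // private copy, no shared pointer
			aw.State = fmt.Sprintf("Running-%d", i)
			s.Put(aw) // last writer wins, but no memory is shared
		}(i)
	}
	wg.Wait()

	aw, _ := s.Get("aw-1")
	fmt.Println(aw.Name, aw.State)
}
```

Running this with `go run -race` should report no data race, because no two goroutines ever touch the same AppWrapper memory without holding the lock. Note that this removes only the memory race: concurrent read-modify-write cycles can still overwrite each other's updates, so a real fix would also need to decide how conflicting updates are resolved.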
How to reproduce the output:
make images GO_BUILD_ARGS=-race
make run-e2e
# While the end-to-end tests are running, the MCAD log can be captured with the stern tool. Adjust the log file path to suit your needs.
stern -n kube-system mcad-controller --color never | tee ~/work/mcad/logs/mcad-controller.user.log.1
Environment
- macOS (Apple Silicon)
- Git branch / hash:
586eb13351efe3cfac68cdb007ac5ab4aec2be02 refs/heads/main