[core] Race conditions in main logic of MCAD. #513

Open
@z103cb

Description

While executing the MCAD e2e tests with the Go race detector turned on, the following race conditions were reported:

Sample 1

mcad-controller-c4f85dbb6-4246d mcad-controller WARNING: DATA RACE
mcad-controller-c4f85dbb6-4246d mcad-controller Write at 0x00c0005a82c0 by goroutine 131:
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/apis/controller/v1beta1.(*AppWrapper).DeepCopyInto()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/apis/controller/v1beta1/zz_generated.deepcopy.go:47 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:945 +0xe98
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/eapache/go-resiliency/retrier.(*Retrier).Run.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:41 +0x34
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/eapache/go-resiliency/retrier.(*Retrier).RunCtx()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:53 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/eapache/go-resiliency/retrier.(*Retrier).Run()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/github.com/eapache/[email protected]/retrier/retrier.go:39 +0x6c
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:928 +0x2f0
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).ScheduleNext-fm()
mcad-controller-c4f85dbb6-4246d mcad-controller       <autogenerated>:1 +0x38
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func3()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1419 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Previous read at 0x00c0005a82c0 by goroutine 95:
mcad-controller-c4f85dbb6-4246d mcad-controller   runtime.convT()
mcad-controller-c4f85dbb6-4246d mcad-controller       /usr/lib/golang/src/runtime/iface.go:321 +0x0
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).backoff()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1401 +0xacc
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs.func3()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:377 +0x68
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Goroutine 131 (running) created at:
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1419 +0x36c
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/cmd/kar-controllers/app.Run()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/cmd/kar-controllers/app/server.go:67 +0xd0
mcad-controller-c4f85dbb6-4246d mcad-controller   main.main()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/cmd/kar-controllers/main.go:52 +0xf8
mcad-controller-c4f85dbb6-4246d mcad-controller
mcad-controller-c4f85dbb6-4246d mcad-controller Goroutine 95 (finished) created at:
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:377 +0x1d04
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).PreemptQueueJobs-fm()
mcad-controller-c4f85dbb6-4246d mcad-controller       <autogenerated>:1 +0x38
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-c4f85dbb6-4246d mcad-controller   k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-c4f85dbb6-4246d mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-c4f85dbb6-4246d mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func4()
mcad-controller-c4f85dbb6-4246d mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1424 +0x50
mcad-controller-c4f85dbb6-4246d mcad-controller ==================

Sample 2

mcad-controller-55cdd74d67-wd87p mcad-controller WARNING: DATA RACE
mcad-controller-55cdd74d67-wd87p mcad-controller Write at 0x00c0000776d8 by goroutine 10:
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).manageQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1836 +0x25c8
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).syncQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1795 +0x22b8
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker.func2()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1694 +0x42c
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*FIFO).Pop()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/fifo.go:303 +0x2d8
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1674 +0x7c
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).worker-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x38
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x44
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run.func9()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1437 +0x48
mcad-controller-55cdd74d67-wd87p mcad-controller 
mcad-controller-55cdd74d67-wd87p mcad-controller Previous read at 0x00c0000776d8 by goroutine 74:
mcad-controller-55cdd74d67-wd87p mcad-controller   runtime.convT()
mcad-controller-55cdd74d67-wd87p mcad-controller       /usr/lib/golang/src/runtime/iface.go:321 +0x0
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).enqueue()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1595 +0x6b8
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).addQueueJob()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1526 +0xc74
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).addQueueJob-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x48
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:231 +0x60
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x24
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/controller.go:264 +0x74
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnAdd()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x60
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*processorListener).run.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:777 +0x108
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x4c
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x94
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.JitterUntil()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x114
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.Until()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x70
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*processorListener).run()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:771 +0x1c
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*processorListener).run-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x38
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x70
mcad-controller-55cdd74d67-wd87p mcad-controller 
mcad-controller-55cdd74d67-wd87p mcad-controller Goroutine 10 (running) created at:
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/queuejob.(*XController).Run()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/pkg/controller/queuejob/queuejob_controller_ex.go:1437 +0x7b8
mcad-controller-55cdd74d67-wd87p mcad-controller   github.com/project-codeflare/multi-cluster-app-dispatcher/cmd/kar-controllers/app.Run()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/cmd/kar-controllers/app/server.go:67 +0xd0
mcad-controller-55cdd74d67-wd87p mcad-controller   main.main()
mcad-controller-55cdd74d67-wd87p mcad-controller       /workdir/cmd/kar-controllers/main.go:52 +0xf8
mcad-controller-55cdd74d67-wd87p mcad-controller 
mcad-controller-55cdd74d67-wd87p mcad-controller Goroutine 74 (running) created at:
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.(*Group).Start()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:71 +0xd8
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*sharedProcessor).run.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:623 +0x154
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*sharedProcessor).run()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/tools/cache/shared_informer.go:627 +0x30
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/client-go/tools/cache.(*sharedProcessor).run-fm()
mcad-controller-55cdd74d67-wd87p mcad-controller       <autogenerated>:1 +0x40
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.(*Group).StartWithChannel.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:56 +0x40
mcad-controller-55cdd74d67-wd87p mcad-controller   k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
mcad-controller-55cdd74d67-wd87p mcad-controller       /opt/app-root/src/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:73 +0x70

Impact

There is a design (coding) flaw in the current implementation of MCAD: pointers to the arbv1.AppWrapper structure are shared between multiple goroutines. These goroutines modify the contents of the structure concurrently (they generally reload the state of the AppWrapper CRD from etcd before applying their logic), without any synchronisation mechanism.
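One possible way to avoid the sharing described above is to keep a single owner of each object and hand out deep copies, with a lock around in-place updates. The sketch below only illustrates that idea and is not MCAD code: `AppWrapper` is a simplified stand-in for arbv1.AppWrapper, and `SafeStore`, `Get` and `Update` are hypothetical names. The real type already has generated deep-copy code (the trace in Sample 1 goes through `DeepCopyInto`), so a fix in MCAD would presumably use that instead of the hand-written `DeepCopy` here.

```go
package main

import (
	"fmt"
	"sync"
)

// AppWrapper is a simplified, hypothetical stand-in for arbv1.AppWrapper.
type AppWrapper struct {
	Name   string
	State  string
	Labels map[string]string
}

// DeepCopy returns an independent copy, including the Labels map.
func (a *AppWrapper) DeepCopy() *AppWrapper {
	out := &AppWrapper{
		Name:   a.Name,
		State:  a.State,
		Labels: make(map[string]string, len(a.Labels)),
	}
	for k, v := range a.Labels {
		out.Labels[k] = v
	}
	return out
}

// SafeStore (hypothetical) owns the canonical objects and never leaks
// a pointer to them: readers get copies, writers go through Update.
type SafeStore struct {
	mu    sync.RWMutex
	items map[string]*AppWrapper
}

func NewSafeStore() *SafeStore {
	return &SafeStore{items: make(map[string]*AppWrapper)}
}

// Get returns a private copy, so two goroutines never share a pointer.
func (s *SafeStore) Get(name string) (*AppWrapper, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	aw, ok := s.items[name]
	if !ok {
		return nil, false
	}
	return aw.DeepCopy(), true
}

// Update applies mutate to the stored object while holding the write lock,
// creating the object first if it does not exist yet.
func (s *SafeStore) Update(name string, mutate func(*AppWrapper)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	aw, ok := s.items[name]
	if !ok {
		aw = &AppWrapper{Name: name, Labels: make(map[string]string)}
		s.items[name] = aw
	}
	mutate(aw)
}

func main() {
	store := NewSafeStore()
	store.Update("aw-1", func(aw *AppWrapper) { aw.State = "Pending" })

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Writes to the shared object are serialised by the store.
			store.Update("aw-1", func(aw *AppWrapper) {
				aw.Labels[fmt.Sprintf("worker-%d", i)] = "seen"
			})
			// Changes to a copy stay local to this goroutine.
			if c, ok := store.Get("aw-1"); ok {
				c.State = "local-only"
			}
		}(i)
	}
	wg.Wait()

	aw, _ := store.Get("aw-1")
	fmt.Println(aw.State, len(aw.Labels)) // prints "Pending 8"
}
```

Running this with `go run -race` should report no races, because every access to the stored object happens under the lock and callers only ever hold copies. The trade-off is extra allocation on each read, which may or may not be acceptable for MCAD's queue sizes.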

This flaw can produce incorrect or inconsistent behaviour under heavy load (a large number of AppWrappers, memory pressure, etc.), which can cause AppWrappers not to be dispatched or their status to be reported incorrectly. It is very hard to reason about the correctness of MCAD's behaviour while this condition persists. The correlation between the flaky e2e tests and this race condition has not been fully established.

How to reproduce the output:

make images GO_BUILD_ARGS=-race
make run-e2e
# While the end-to-end tests are running, the MCAD log can be captured with the stern tool. Adjust the log file path to suit your needs.
stern -n kube-system mcad-controller --color never | tee ~/work/mcad/logs/mcad-controller.user.log.1

Environment

  • MacOS Apple Silicon
  • Git branch / hash: refs/heads/main 586eb13351efe3cfac68cdb007ac5ab4aec2be02

Metadata

Assignees: No one assigned

Labels: bug (Something isn't working)