
Is it possible to deploy opentelemetry-operator without clusterrole, using role instead #3042

Open
chunglun opened this issue Jun 17, 2024 · 12 comments
Labels
area:rbac · enhancement · question

Comments

@chunglun

Component(s)

No response

Describe the issue you're reporting

Is it possible to deploy opentelemetry-operator without clusterrole, using role instead?
My company doesn't want to grant me cluster-admin privileges and asked me to deploy this operator using only a Role and RoleBinding; however, the first error appeared:

[email protected]/tools/cache/reflector.go:229: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:..." cannot list resource "persistentvolumes" in API group "" at the cluster scope

How could I solve it?

@swiatekm
Contributor

swiatekm commented Jun 19, 2024

You mean for the operator itself? You should be able to, but since a Role only covers RBAC in a specific namespace, an operator deployed this way will only be able to affect resources in that namespace. You'll want to set the WATCH_NAMESPACE env variable to that namespace as well.
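
With the Helm chart, that would look roughly like the following values (a sketch only; my-namespace is a placeholder for the namespace your Role and RoleBinding cover):

# Sketch: skip the chart's ClusterRole and restrict the operator to one namespace.
clusterRole:
  create: false
manager:
  env:
    WATCH_NAMESPACE: my-namespace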

@swiatekm added the question label and removed the needs triage label on Jun 19, 2024
@pavolloffay
Member

Let us know if that worked. I have seen this request before and it would be great to document it.

@chunglun
Author

chunglun commented Jun 20, 2024

In fact, I moved most of the ClusterRole rules into a Role, but kept get/list/watch on persistentvolumes in the ClusterRole and removed delete/create/update/patch on persistentvolumes due to the privilege limitations at my company.

The operator was deployed successfully, and the central Collector was deployed successfully too.

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.
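
Roughly, the split I ended up with looks like this (the names and namespace here are placeholders, the Role rules are abbreviated, and the corresponding ClusterRoleBinding/RoleBinding are omitted):

# Read-only cluster-scoped access for persistentvolumes only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-operator-pv-reader    # placeholder name
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch"]          # no create/update/patch/delete
---
# Everything else moved into a namespaced Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: opentelemetry-operator               # placeholder name
  namespace: my-namespace                    # placeholder namespace
rules:
  # ...remaining rules copied from the chart's ClusterRole, for example:
  - apiGroups: ["opentelemetry.io"]
    resources: ["opentelemetrycollectors", "instrumentations"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]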

@iblancasa
Contributor

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

The Instrumentation CR doesn't deploy any pods. It adds an init container to the pods you annotate for auto-instrumentation injection.
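
For example, something like this on your workload (a sketch assuming the Java auto-instrumentation; the workload name and image are placeholders, and the annotation value can also be the name of a specific Instrumentation CR):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Asks the operator's webhook to inject the Java auto-instrumentation
        # init container into this pod.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:latest    # placeholder image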

@chunglun
Author

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

The Instrumentation CR doesn't deploy any pods. It adds an init container to the pods you annotate for auto-instrumentation injection.

!!! I will try again tomorrow to see if there is an InitContainer. Thanks so much for this helpful information.

@alita1991

alita1991 commented Jun 27, 2024

Hi, I created a Role + RoleBinding to cover the RBAC issues, but the operator cannot run in namespaced mode, even though WATCH_NAMESPACE is configured. Here are my Helm values and the startup logs:

# Custom values for the helm chart
crds:
  create: false
clusterRole:
  create: false
imagePullSecrets: []
manager:
  leaderElection:
    enabled: false
  env:
    WATCH_NAMESPACE: argocd-openshift
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"watching namespace(s)","namespaces":"argocd-openshift"}
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"Prometheus CRDs are installed, adding to scheme."}
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"Openshift CRDs are not installed, skipping adding to scheme."}

W0627 13:45:55.996662       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:argocd-openshift:opentelemetry-operator" cannot list resource "persistentvolumes" in API group "" at the cluster scope

In addition to this, after some time the operator crashes and becomes unstable, which causes a lot of stream errors to be printed to stderr:

{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"Could not wait for Cache to sync","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","error":"failed to wait for opampbridge caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.OpAMPBridge","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for non leader election runnables"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"Could not wait for Cache to sync","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","error":"failed to wait for opentelemetrycollector caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.OpenTelemetryCollector","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for leader election runnables"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for caches"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"error received after stop sequence was engaged","error":"failed to wait for opentelemetrycollector caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.OpenTelemetryCollector","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:490"}
W0627 13:57:43.283512       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283711       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283830       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.DaemonSet ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283899       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PodMonitor ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283967       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.PersistentVolume Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56"}
W0627 13:57:43.284018       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PodDisruptionBudget ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.284216       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Ingress ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.284312       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1alpha1.Instrumentation ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for webhooks"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for HTTP servers"}
W0627 13:57:43.284703       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","logger":"controller-runtime.metrics","message":"Shutting down metrics server with timeout of 1 minute"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"shutting down server","kind":"health probe","addr":"[::]:8081"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Wait completed, proceeding to shutdown the manager"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","logger":"setup","message":"problem running manager","error":"failed to wait for opampbridge caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.OpAMPBridge","stacktrace":"main.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/main.go:431\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/proc.go:267"}

@alita1991

alita1991 commented Jun 27, 2024

After spending a little time on it, I found out that even though the operator is configured to watch specific namespace(s), there is logic in the code where the manager sets up a watch on PersistentVolume resources, which are cluster-scoped.

func (r *OpenTelemetryCollectorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	builder := ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.OpenTelemetryCollector{}).
		Owns(&corev1.ConfigMap{}).
		Owns(&corev1.ServiceAccount{}).
		Owns(&corev1.Service{}).
		Owns(&appsv1.Deployment{}).
		Owns(&appsv1.DaemonSet{}).
		Owns(&appsv1.StatefulSet{}).
		// Owns(&corev1.PersistentVolume{}). <-- I commented out this line
		Owns(&corev1.PersistentVolumeClaim{}).
		Owns(&networkingv1.Ingress{}).
		Owns(&autoscalingv2.HorizontalPodAutoscaler{}).
		Owns(&policyV1.PodDisruptionBudget{})

After I removed that specific line, the operator had no runtime issues. Because I have never done any development on the otel-operator, I don't know the impact; any feedback/solution would be highly appreciated.

@pavolloffay
Member

If only a single namespace is watched, the cluster-scoped objects could be excluded. But it does seem that this would change the operator's feature set.

@chunglun
Author

In fact, I moved most of the ClusterRole rules into a Role, but kept get/list/watch on persistentvolumes in the ClusterRole and removed delete/create/update/patch on persistentvolumes due to the privilege limitations at my company.

The operator was deployed successfully, and the central Collector was deployed successfully too.

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

In fact, auto-instrumentation still fails for me: no init container appears on my application. I'm half inclined to drop the operator and instead use a plain YAML Collector deployment plus an init container in the application's Deployment YAML that injects the client agent JAR, so the application can be instrumented without rebuilding the application image.
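
The manual approach I have in mind would look roughly like this (all image names, paths, and the Collector endpoint are placeholders, not verified defaults):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                        # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      volumes:
        - name: otel-agent
          emptyDir: {}
      initContainers:
        # Copies the Java agent JAR from an agent image into a shared volume.
        - name: copy-otel-agent
          image: my-registry/otel-javaagent:latest    # placeholder image containing the agent JAR
          command: ["cp", "/javaagent.jar", "/otel/javaagent.jar"]
          volumeMounts:
            - name: otel-agent
              mountPath: /otel
      containers:
        - name: my-app
          image: my-registry/my-app:latest            # placeholder application image
          env:
            # Loads the agent at startup without rebuilding the application image.
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/otel/javaagent.jar"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://my-collector:4318"       # placeholder Collector endpoint
          volumeMounts:
            - name: otel-agent
              mountPath: /otel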

@swiatekm
Contributor

swiatekm commented Jul 2, 2024

After spending a little time on it, I found out that even though the operator is configured to watch specific namespace(s), there is logic in the code where the manager sets up a watch on PersistentVolume resources, which are cluster-scoped.

func (r *OpenTelemetryCollectorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	builder := ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.OpenTelemetryCollector{}).
		Owns(&corev1.ConfigMap{}).
		Owns(&corev1.ServiceAccount{}).
		Owns(&corev1.Service{}).
		Owns(&appsv1.Deployment{}).
		Owns(&appsv1.DaemonSet{}).
		Owns(&appsv1.StatefulSet{}).
		// Owns(&corev1.PersistentVolume{}). <-- I commented out this line
		Owns(&corev1.PersistentVolumeClaim{}).
		Owns(&networkingv1.Ingress{}).
		Owns(&autoscalingv2.HorizontalPodAutoscaler{}).
		Owns(&policyV1.PodDisruptionBudget{})

After I removed that specific line, the operator had no runtime issues. Because I have never done any development on the otel-operator, I don't know the impact; any feedback/solution would be highly appreciated.

This is very puzzling to me. There isn't any difference between PersistentVolume and the other resources on that list in how they're set up, and I can't see any reason why it specifically would lead to a problem.

On a separate note, @pavolloffay @jaronoff97 do you know why we try to own Volumes and Claims here? I don't think we create any.

@jaronoff97
Contributor

good question... initially I thought it was related to statefulsets, but any PVs and PVCs would be under the statefulset management tree. I don't think there's a reason we need to keep that in there...

@pavolloffay
Member

Yes, it's perhaps because of the StatefulSet. +1 on removing it if it is not required.
