
Is it possible to deploy opentelemetry-operator without clusterrole, using role instead #3042

Open
chunglun opened this issue Jun 17, 2024 · 12 comments
Labels
area:rbac · enhancement · question

Comments

@chunglun

Component(s)

No response

Describe the issue you're reporting

Is it possible to deploy opentelemetry-operator without clusterrole, using role instead?
My company doesn't want to grant me cluster-admin privileges and asked me to deploy this operator using only a Role and RoleBinding; however, the first error appeared:

[email protected]/tools/cache/reflector.go:229: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:..." cannot list resource "persistentvolumes" in API group "" at the cluster scope

How could I solve it?

@swiatekm
Contributor

swiatekm commented Jun 19, 2024

You mean for the operator itself? You should be able to, but since a Role only covers RBAC in a specific namespace, an operator deployed this way will only be able to affect resources in that namespace. You'll want to set the WATCH_NAMESPACE env variable to that namespace as well.
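
With the Helm chart, that would look roughly like the following values (a sketch only; my-namespace is a placeholder for the namespace your Role and RoleBinding cover):

# Sketch: skip the chart's ClusterRole and restrict the operator to one namespace.
clusterRole:
  create: false
manager:
  env:
    WATCH_NAMESPACE: my-namespace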

@swiatekm added the question label and removed the needs triage label on Jun 19, 2024
@pavolloffay
Member

Let us know if that worked. I have seen this request before and it would be great to document it.

@chunglun
Author

chunglun commented Jun 20, 2024

In fact, I moved most of the ClusterRole rules into a Role, but kept get/list/watch on persistentvolumes in the ClusterRole and removed delete/create/update/patch on persistentvolumes due to the privilege limitations at my company.

The operator was deployed successfully, and the central Collector was deployed successfully too.

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.
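
Roughly, the split I ended up with looks like this (the names and namespace here are placeholders, the Role rules are abbreviated, and the corresponding ClusterRoleBinding/RoleBinding are omitted):

# Read-only cluster-scoped access for persistentvolumes only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-operator-pv-reader    # placeholder name
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch"]          # no create/update/patch/delete
---
# Everything else moved into a namespaced Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: opentelemetry-operator               # placeholder name
  namespace: my-namespace                    # placeholder namespace
rules:
  # ...remaining rules copied from the chart's ClusterRole, for example:
  - apiGroups: ["opentelemetry.io"]
    resources: ["opentelemetrycollectors", "instrumentations"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]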

@iblancasa
Contributor

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

The Instrumentation CR doesn't deploy any pods. It adds an init container to the pods you annotate for auto-instrumentation injection.
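
For example, something like this on your workload (a sketch assuming the Java auto-instrumentation; the workload name and image are placeholders, and the annotation value can also be the name of a specific Instrumentation CR):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Asks the operator's webhook to inject the Java auto-instrumentation
        # init container into this pod.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:latest    # placeholder image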

@chunglun
Author

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

The Instrumentation CR doesn't deploy any pods. It adds an init container to the pods you annotate for auto-instrumentation injection.

!!! I will try again tomorrow to see if there is an InitContainer. Thanks so much for this helpful information.

@alita1991

alita1991 commented Jun 27, 2024

Hi, I created a Role + RoleBinding to cover the RBAC issues, but the operator cannot run in namespaced mode, even though WATCH_NAMESPACE is configured. Here are my Helm values and the startup logs:

# Custom values for the helm chart
crds:
  create: false
clusterRole:
  create: false
imagePullSecrets: []
manager:
  leaderElection:
    enabled: false
  env:
    WATCH_NAMESPACE: argocd-openshift
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"watching namespace(s)","namespaces":"argocd-openshift"}
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"Prometheus CRDs are installed, adding to scheme."}
{"level":"INFO","timestamp":"2024-06-27T13:45:53Z","logger":"setup","message":"Openshift CRDs are not installed, skipping adding to scheme."}

W0627 13:45:55.996662       1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:serviceaccount:argocd-openshift:opentelemetry-operator" cannot list resource "persistentvolumes" in API group "" at the cluster scope

In addition to this, after some time the operator crashes and becomes unstable, which causes a lot of stream errors to be printed to stderr:

{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"Could not wait for Cache to sync","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","error":"failed to wait for opampbridge caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.OpAMPBridge","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for non leader election runnables"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"Could not wait for Cache to sync","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","error":"failed to wait for opentelemetrycollector caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.OpenTelemetryCollector","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for leader election runnables"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for caches"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","message":"error received after stop sequence was engaged","error":"failed to wait for opentelemetrycollector caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.OpenTelemetryCollector","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:490"}
W0627 13:57:43.283512       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283711       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283830       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.DaemonSet ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283899       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PodMonitor ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.283967       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.PersistentVolume Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:68\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56"}
W0627 13:57:43.284018       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.PodDisruptionBudget ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.284216       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.Ingress ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0627 13:57:43.284312       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1alpha1.Instrumentation ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for webhooks"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Stopping and waiting for HTTP servers"}
W0627 13:57:43.284703       1 reflector.go:462] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","logger":"controller-runtime.metrics","message":"Shutting down metrics server with timeout of 1 minute"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"shutting down server","kind":"health probe","addr":"[::]:8081"}
{"level":"INFO","timestamp":"2024-06-27T13:57:43Z","message":"Wait completed, proceeding to shutdown the manager"}
{"level":"ERROR","timestamp":"2024-06-27T13:57:43Z","logger":"setup","message":"problem running manager","error":"failed to wait for opampbridge caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.OpAMPBridge","stacktrace":"main.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/main.go:431\nruntime.main\n\t/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/proc.go:267"}

@alita1991

alita1991 commented Jun 27, 2024

After spending a little time on it, I found out that even though the operator is configured to watch specific namespace(s), there is logic in the code where the manager sets up a watch on PersistentVolume resources, which are cluster-scoped.

func (r *OpenTelemetryCollectorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	builder := ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.OpenTelemetryCollector{}).
		Owns(&corev1.ConfigMap{}).
		Owns(&corev1.ServiceAccount{}).
		Owns(&corev1.Service{}).
		Owns(&appsv1.Deployment{}).
		Owns(&appsv1.DaemonSet{}).
		Owns(&appsv1.StatefulSet{}).
		// Owns(&corev1.PersistentVolume{}). <-- I commented out this line
		Owns(&corev1.PersistentVolumeClaim{}).
		Owns(&networkingv1.Ingress{}).
		Owns(&autoscalingv2.HorizontalPodAutoscaler{}).
		Owns(&policyV1.PodDisruptionBudget{})

After I removed that specific line, the operator had no runtime issues. Because I have never done any development on the otel-operator, I don't know the impact; any feedback/solution would be highly appreciated.

@pavolloffay
Member

If only a single namespace is watched, the cluster-scoped objects could be excluded. But it does seem that this would change the operator's feature set.

@chunglun
Author

In fact, I moved most of the ClusterRole rules into a Role, but kept get/list/watch on persistentvolumes in the ClusterRole and removed delete/create/update/patch on persistentvolumes due to the privilege limitations at my company.

The operator was deployed successfully, and the central Collector was deployed successfully too.

Currently I'm troubleshooting auto-instrumentation: the Instrumentation resource is deployed, but no pods are generated. I'm not sure yet what is preventing auto-instrumentation from working; still working on it.

In fact, auto-instrumentation still fails for me: no init container appears on my application. I'm half inclined to drop the operator and instead use a plain YAML Collector deployment plus an init container in the application's Deployment YAML that injects the client agent JAR, so the application can be instrumented without rebuilding the application image.
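
The manual approach I have in mind would look roughly like this (all image names, paths, and the Collector endpoint are placeholders, not verified defaults):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                                        # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      volumes:
        - name: otel-agent
          emptyDir: {}
      initContainers:
        # Copies the Java agent JAR from an agent image into a shared volume.
        - name: copy-otel-agent
          image: my-registry/otel-javaagent:latest    # placeholder image containing the agent JAR
          command: ["cp", "/javaagent.jar", "/otel/javaagent.jar"]
          volumeMounts:
            - name: otel-agent
              mountPath: /otel
      containers:
        - name: my-app
          image: my-registry/my-app:latest            # placeholder application image
          env:
            # Loads the agent at startup without rebuilding the application image.
            - name: JAVA_TOOL_OPTIONS
              value: "-javaagent:/otel/javaagent.jar"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://my-collector:4318"       # placeholder Collector endpoint
          volumeMounts:
            - name: otel-agent
              mountPath: /otel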

@swiatekm
Contributor

swiatekm commented Jul 2, 2024

After spending a little time on it, I found out that even though the operator is configured to watch specific namespace(s), there is logic in the code where the manager sets up a watch on PersistentVolume resources, which are cluster-scoped.

func (r *OpenTelemetryCollectorReconciler) SetupWithManager(mgr ctrl.Manager) error {
	builder := ctrl.NewControllerManagedBy(mgr).
		For(&v1beta1.OpenTelemetryCollector{}).
		Owns(&corev1.ConfigMap{}).
		Owns(&corev1.ServiceAccount{}).
		Owns(&corev1.Service{}).
		Owns(&appsv1.Deployment{}).
		Owns(&appsv1.DaemonSet{}).
		Owns(&appsv1.StatefulSet{}).
		// Owns(&corev1.PersistentVolume{}). <-- I commented out this line
		Owns(&corev1.PersistentVolumeClaim{}).
		Owns(&networkingv1.Ingress{}).
		Owns(&autoscalingv2.HorizontalPodAutoscaler{}).
		Owns(&policyV1.PodDisruptionBudget{})

After I removed that specific line, the operator had no runtime issues. Because I have never done any development on the otel-operator, I don't know the impact; any feedback/solution would be highly appreciated.

This is very puzzling to me. There isn't any difference between PersistentVolume and the other resources on that list in how they're set up, and I can't see any reason why it specifically would lead to a problem.

On a separate note, @pavolloffay @jaronoff97 do you know why we try to own Volumes and Claims here? I don't think we create any.

@jaronoff97
Contributor

good question... initially I thought it was related to statefulsets, but any PVs and PVCs would be under the statefulset management tree. I don't think there's a reason we need to keep that in there...

@pavolloffay
Member

Yes, it's perhaps because of the StatefulSet. +1 on removing it if it is not required.
