Distributed collector configuration #1906
Based on our products, I feel this would be a much-needed feature. Our setup has 30 Kubernetes clusters as of today, with more than 4000 nodes and 70K pods.
Our teams have a growing need to forward logs to their own destinations for analysis and reporting, and to filter out logs. They need to frequently add and remove destinations from the pipeline, so dynamic configuration is really required to make this work at large scale.
This feature would be very advantageous to us. As we grow as a company, we want to move away from a central team needing to know about the many hundreds of other services running on our clusters. Each team that writes a service is responsible for deploying it and exposing any custom metrics or logs they want to pull off-cluster. We want a central team to manage the pipeline of how those metrics and logs get pushed to our central observability platform, but we do not want the owner of that pipeline to have to know which endpoints, logs, or metrics should be forwarded off cluster and which should not, or even what services exist in the first place. As stated in the initial problem statement of this issue, this is very similar to how the Prometheus Operator works today, and in fact that is what we use today. In order to move to an OTel-based solution and replace Prometheus as a forwarding agent, we really require this decentralization ability.
Thanks everyone for your feedback here. I've come around to this idea and think it would be beneficial to the community. @swiatekm-sumo I'm going to self-assign and work on this after #1876 is complete. Do you want to collaborate on the design?
I totally support this initiative and agree with the use cases already mentioned above. Another use case that I'd like to add is the ability for developers to manage Tail Sampling configuration. We run hundreds of applications in the cluster, with all observability data collected into a centralized platform. We want application developers to be able to configure Tail Sampling policies for their applications without touching the centrally managed collector configuration.
@lsolovey Could you give an example of what kind of configuration you would expect? I am working on a proposal.
In summary, a good first step would be to separate the configuration of exporters from the collector configuration.

```mermaid
graph TD;
    OpenTelemetryKafkaExporter-->OpenTelemetryExporter;
    OpenTelemetryOtlpExporter-->OpenTelemetryExporter;
    OpenTelemetryExporter-->OpenTelemetryGateway;
    OpenTelemetryExporter-->OpenTelemetryAgent;
    OpenTelemetryAgent-->OpenTelemetryCollector;
    OpenTelemetryGateway-->OpenTelemetryCollector;
```
Since all of these CRDs would be based on the OpenTelemetryCollector definition, supporting a native YAML configuration seems like a prerequisite. Once that is done, we can start prototyping the gateway and exporter CRDs.
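None of the CRDs in the diagram exist today, so as a purely hypothetical sketch of the shape such a split could take, an OTLP exporter defined in its own resource might look something like the following; the kind, API group/version, and field names are all assumptions for illustration.

```yaml
# Hypothetical resource, not an existing API: an OTLP exporter split out of
# the collector config so it can be referenced by gateway/agent collectors.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryOtlpExporter
metadata:
  name: vendor-backend
spec:
  endpoint: otlp.vendor.example.com:4317
  tls:
    insecure: false
```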
My attempts so far at setting up and configuring the OTel Collector Operator have led me to thoughts similar to those mentioned here and in #1477. The Prometheus Operator has the correct idea here, I believe. There are basically two or three concerns here that would be useful to separate.
I would like to restart this thread with a very simple proposal. The foundation for distributed collector configuration is the collector's config merging feature. However, merging currently overrides arrays (see the proposal for an append-merge flag in open-telemetry/opentelemetry-collector#8754), and merging of configuration is order dependent (e.g. the order of processors in a pipeline matters). Therefore the proposal is to introduce a new CRD:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: CollectorGroup
metadata:
  name: simplest
spec:
  root: platform-collector
  collectors:
    - name: receivers
    - name: pii-remove-users
    - name: pii-remove-credit-cards
    - name: export-to-vendor
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: platform-collector
spec:
  collectorGroup: true
  config:
```
The operator could do some validation of the collector configs to make sure each config contains only unique components, so that merging does not silently override anything.
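As a minimal sketch of how this could play out, assuming the CollectorGroup above, one of the referenced fragments (here pii-remove-users) might contribute only a single processor. The concrete processor and attribute key below are illustrative assumptions, not part of the proposal.

```yaml
# Hypothetical fragment owned by a single team; the operator would merge it
# into the root platform-collector config in the order given by the group.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: pii-remove-users
spec:
  config:
    processors:
      attributes/remove-users:
        actions:
          - key: enduser.id
            action: delete
```

The assembled configuration could then rely on the collector's existing merging behaviour, conceptually equivalent to passing each fragment as a separate --config flag in the order defined by the CollectorGroup.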
I like the idea, but I've a few open questions and thoughts.
Currently, configuration for a single Collector CR is monolithic. I'd like to explore the idea of allowing it to be defined in a distributed way, possibly by different users. It would be the operator's job to collect and assemble the disparate configuration CRs and create an equivalent collector configuration - much like how prometheus-operator creates a Prometheus configuration based on ServiceMonitors.
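To make the prometheus-operator analogy concrete, here is a hedged sketch of how assembly could look: a platform-owned collector selects team-owned configuration fragments by label, much as a Prometheus resource selects ServiceMonitors. The CollectorConfig kind, the label, and the filter processor contents below are hypothetical illustrations, not an existing or proposed API.

```yaml
# Hypothetical team-owned fragment, selected by label rather than listed
# explicitly; analogous to how a ServiceMonitor is picked up by Prometheus.
apiVersion: opentelemetry.io/v1alpha1
kind: CollectorConfig
metadata:
  name: checkout-service-logs
  labels:
    otel.example.com/pipeline: platform
spec:
  processors:
    filter/drop-debug-logs:
      logs:
        log_record:
          - 'severity_text == "DEBUG"'
```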
Prior art for similar solutions includes the prometheus-operator with its Monitor CRs, and the logging-operator.
Broadly speaking, the benefits of doing this could be:

- Application developers would only need to write the piece of configuration relevant to their application, while a platform team would be responsible for running the collector.

Potential problems with doing this that are unique to the otel operator:
Somewhat related issues regarding new CRs for collector configuration: #1477
I'd like to request that anyone who would be interested in this kind of feature post a comment in this issue describing their use case.