
Proposal: Opinionated Operator CRDs #1477

Open
frzifus opened this issue Feb 16, 2023 · 4 comments
Labels: area:collector (Issues for deploying collector), question (Further information is requested)

Comments

@frzifus
Member

frzifus commented Feb 16, 2023

Based on feedback we received at FOSDEM, the barrier to getting started with the OpenTelemetry operator seems to be quite high.

Therefore, I would like to discuss the current status of the v1alpha1 API here. The OpenTelemetryCollectorSpec provides a lot of configuration options, but from my point of view it is quite easy to misconfigure.

I would be interested in a few things:

  • What is the OpenTelemetryCollector CR mainly used for?
  • Has anyone else already received feedback on the current API?
  • Would we benefit from offering simpler CRDs for simple use cases?

An example of what I think is quite "challenging": in order to use the target allocator, an image must be specified that contains the Prometheus receiver (this is not present in the default image). [That's wrong, see here] There are 4 modes available, but only one supports the desired function. The collector configuration is specified as a string, and we cannot ensure that the configuration works with the selected image. Finally, we end up building something like this.
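For illustration, here is a rough sketch of the kind of CR this tends to end up as (not the linked example; the image name and scrape config are made up):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom-metrics
spec:
  mode: statefulset                  # the only mode that supports the target allocator
  # custom image assumed to bundle the prometheus receiver
  # (per the correction above, this may not actually be necessary)
  image: ghcr.io/example/otelcol-with-prometheus:0.71.0
  targetAllocator:
    enabled: true
  config: |                          # the whole collector config is an opaque string
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector
              scrape_interval: 30s
              static_configs:
                - targets: ["0.0.0.0:8888"]
    exporters:
      logging: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging]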

Depending on how the collector is used, I thought it would be useful to provide easy-to-configure CRDs for certain use cases such as metrics collection.

Here is an example of what this could look like. (I am still a little unsatisfied that the problem here is partly moved to the destination resource).

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollectorDestination
metadata:
  name: dst1
spec:
  processor:
    batch:
      ...
    resourcedetection/test:
      detector:
        ...
  exporter:
    ...
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollectorMetrics
metadata:
  name: something
spec:
  replicas: 2
  destination:
    - dst1
  prometheusCR:
    serviceMonitorSelector:
      abc: def
  static_configs: ...

Update

This topic was discussed during the 2023/02/16, 9 AM PT / 6 PM CET SIG call. [Notes]

tl;dr

  • Agreed that collector config is hard.

  • We need better docs with examples.

  • Outline common use cases and start with an opinionated Helm chart. [example]

  • Config as a string isn't nice: values cannot be modified with kustomize (see the illustrative patch below).
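To illustrate the kustomize point: structured fields can be patched, but anything inside the opaque spec.config string cannot be targeted. A minimal sketch, assuming a collector named like the example above:

# kustomization.yaml (illustrative)
resources:
  - collector.yaml
patches:
  - target:
      kind: OpenTelemetryCollector
      name: something
    patch: |-
      - op: replace
        path: /spec/replicas       # structured field: easy to patch
        value: 3
      # there is no JSON pointer like /spec/config/exporters/...,
      # because the pipeline lives inside a single string; the only
      # option is to replace /spec/config wholesale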

@frzifus added the question (Further information is requested) label on Feb 16, 2023
@pavolloffay added the area:collector (Issues for deploying collector) label on Feb 17, 2023
@frzifus
Member Author

frzifus commented Apr 18, 2023

cc @puckpuck

@puckpuck

I like the approach that the Collector's Helm chart has taken with presets. When combined with sensible defaults, just a few lines let me configure a collector for multiple common scenarios.

spec:
  mode: daemonset
  presets:
    logsCollection:
      enabled: true
    hostMetrics:
      enabled: true
    k8sAttributes:
      enabled: true

Using presets, I could have the above spec, which delivers a collector configured as a DaemonSet with a sensible default configuration, including functionality to collect logs from all pods on the node, collect kubelet stats metrics, and enrich all telemetry with Kubernetes attributes.
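For reference, a simplified sketch of roughly what those presets expand to under the hood (the actual generated configuration in the Helm chart is more involved, and the backend endpoint here is hypothetical):

receivers:
  filelog:                       # logsCollection: tail container logs on the node
    include: [/var/log/pods/*/*/*.log]
  hostmetrics:                   # hostMetrics: CPU, memory, disk, network scrapers
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
processors:
  k8sattributes: {}              # k8sAttributes: enrich telemetry with pod/namespace metadata
exporters:
  otlp:
    endpoint: backend.example:4317
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes]
      exporters: [otlp]
    metrics:
      receivers: [hostmetrics]
      processors: [k8sattributes]
      exporters: [otlp]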

@swiatekm
Contributor

swiatekm commented Apr 25, 2023

I see three distinct problems here:

  1. Existing CRDs are difficult to use for common use cases, and require the user to understand details they shouldn't have to.
  2. Otel configuration as a string is unfriendly, and may need to be unbundled into smaller CRDs.
  3. (Mine) Configurations are monolithic, which is unwieldy for large clusters with a lot of users.

I'd expect that 2 and 3 would eventually need their own separate issues, although they're all related to some extent. For now I'm going to focus on 1.

Use cases

I'm aware of three major common use cases for users wanting to collect telemetry data in K8s:

Container logs

This is pretty obvious and intuitive from a user perspective - they want to ingest logs from all of the containers running in their cluster, with the right metadata.

Right now, actually doing this requires an understanding of how the kubelet and the container runtime manage logs, the respective log formats, mounting Node filesystem directories as volumes into a DaemonSet with the right permissions, and a somewhat gnarly filelog receiver configuration.
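To make the "gnarly" part concrete, here is a trimmed-down sketch of what this looks like today (paths and the parsing operator are assumptions; real setups handle multiple runtimes and log formats):

spec:
  mode: daemonset
  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config: |
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
        start_at: beginning
        operators:
          # parse the containerd/CRI log line format: "<time> <stream> <flags> <log>"
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'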

Prometheus metrics

Unlike the logs case, this is quite complicated and has additional dependencies depending on what the user wants. The metrics we could collect here may include:

  • container metrics from cAdvisor
  • kube-state-metrics
  • Node metrics from node-exporter
  • control plane metrics
  • application metrics as defined by ServiceMonitors and PodMonitors

Intuitively, I think it should be easy to collect the "infrastructure" metrics, in the same way that the hostmetrics receiver makes this easy on a single host. When it comes to replacing prometheus-operator, I think the choice for users with existing CRDs is straightforward, but I'm not 100% sure if we should also recommend this for others.

Right now, doing any of this requires installing additional dependencies, adding (fairly complex) Prometheus config directly to the receiver, enabling target allocator and giving it the right RBAC permissions, and manually translating from Prometheus metric schema to otel schema.
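A sketch of the moving parts this currently involves (RBAC manifests omitted; names and the backend endpoint are illustrative):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: ta-sa            # needs RBAC to read ServiceMonitors/PodMonitors
    prometheusCR:
      enabled: true
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:            # raw Prometheus config pasted into the receiver
            - job_name: kubernetes-cadvisor
              kubernetes_sd_configs:
                - role: node
              scheme: https
              metrics_path: /metrics/cadvisor
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: backend.example:4317
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlp]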

Traces from instrumentation

I think this is reasonably well-covered right now. We might benefit from better documentation, and maybe some way to directly link Instrumentation to OpenTelemetryCollector to avoid having to manually configure the exporter/receiver pair, but this in my opinion is much less of a problem than the logs and metrics case.
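For context, the manual pairing looks roughly like this today (endpoints are illustrative): the Instrumentation resource points at the collector Service by name, and the collector must expose a matching OTLP receiver, with nothing linking the two objects.

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-apps
spec:
  exporter:
    endpoint: http://my-collector-collector:4317   # has to match the Service created for the collector below
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-collector
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}        # kept in sync with the Instrumentation endpoint by hand
    exporters:
      logging: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [logging]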

We could later consider a higher-level CRD for a tail sampling setup, where you need to make sure all spans for a given trace are delivered to the same collector.
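If we ever go there, the underlying collector pattern would presumably be a two-tier setup, sketched below with assumed Service names: a first tier routes spans by trace ID with the loadbalancing exporter, so a second tier can run the tail_sampling processor with all spans of a trace on one instance.

# tier 1: route spans by trace ID (the loadbalancing exporter's default)
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: sampling-collector-headless.observability.svc.cluster.local
---
# tier 2: sample complete traces
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-only
        type: status_code
        status_code:
          status_codes: [ERROR]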

Solutions

Your proposed new CRDs are roughly the way to go @frzifus, though it does feel a bit backwards that the highest-level resource is really just a receiver + mode, and the rest of the pipeline sits in a subresource. Intuitively it would make more sense to me if the receiver + mode were the subresource, kind of like:

---
apiVersion: opentelemetry.io/v1alpha1
kind: Source
metadata:
  name: source1
spec:
  type: metrics
  replicas: 2
  destination:
    - dst1
  prometheusCR:
    serviceMonitorSelector:
      abc: def
  static_configs: ...
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-collector
spec:
  source: source1
  processors:
    batch:
      ...
    resourcedetection/test:
      detector:
        ...
  exporters:
    ...

I don't like overloading the Collector resource even more, so maybe this should be a new CR, but composing things this way - replacing the receiver with a receiver + mode pair, and keeping everything roughly the same - makes most sense to me as an approach.

@frzifus changed the title from "Opinionated Operator CRDs" to "Proposal: Opinionated Operator CRDs" on Jun 21, 2023
@joshdover

I've been thinking about how to solve this usability problem as well and I'm curious about leveraging the ongoing work to add a Template converter to the Collector core to help solve part of this problem: open-telemetry/opentelemetry-collector#8507

The Operator could provide OOTB templates for container logs and the common metrics that need to be collected. That said, there's more to this issue than just simplifying the collector configuration, since we also need to automate the deployment mode based on what is being collected (e.g. logs need to be collected with a DaemonSet collector on each node, but control plane metrics need to be collected by a Deployment for the whole cluster).

So the total solution is probably some combination of templated collector configuration and operator-level logic for selecting the right deployment mode of a collector. This is somewhat similar to the Helm chart's presets option mentioned above, but I think an advantage of using templates is that the user could override the template or chain templates together.
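Purely to illustrate the shape of this (the syntax in open-telemetry/opentelemetry-collector#8507 is still under design, and the templates field below is hypothetical, not an existing API), a user-facing CR might shrink to something like:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: cluster-telemetry
spec:
  # hypothetical field: each entry references an operator-shipped template,
  # and the operator picks the deployment mode it implies
  templates:
    - containerLogs          # expands to a filelog pipeline, deployed as a DaemonSet
    - controlPlaneMetrics    # expands to a prometheus pipeline, deployed as a Deployment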

For example, in combination with #1906, a user could define a log processing template for containers that match specific metadata, and the collectors deployed by the operator for log collection could be updated by the operator to route logs from these containers through a pipeline that applies this processing.
