
Proposal: Opinionated Operator CRDs #1477

Open
frzifus opened this issue Feb 16, 2023 · 4 comments
Labels: area:collector (Issues for deploying collector), question (Further information is requested)

Comments

@frzifus
Member

frzifus commented Feb 16, 2023

Based on feedback we received at FOSDEM, the barrier to getting started with the OpenTelemetry operator seems to be quite high.

Therefore, I would like to discuss the current status of the v1alpha1 API here. The OpenTelemetryCollectorSpec provides a lot of configuration options, but from my point of view it is quite easy to misconfigure.

I would be interested in a few things:

  • What is the OpenTelemetryCollector CR mainly used for?
  • Has anyone else already received feedback on the current API?
  • Would we benefit from offering simpler CRDs for simple use cases?

An example of what I think is quite "challenging": in order to use the target allocator, an image must be specified that contains the Prometheus receiver (this is not present in the default image). [That's wrong, see here] There are 4 modes available, but only one supports the desired function. The collector configuration is specified as a string, and we cannot ensure that the configuration works with the selected image. Finally, we end up building something like this.
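For illustration, here is a rough sketch of the kind of CR this tends to end up as (not the linked example; the image name and scrape config are made up):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom-metrics
spec:
  mode: statefulset                  # the only mode that supports the target allocator
  # custom image assumed to bundle the prometheus receiver
  # (per the correction above, this may not actually be necessary)
  image: ghcr.io/example/otelcol-with-prometheus:0.71.0
  targetAllocator:
    enabled: true
  config: |                          # the whole collector config is an opaque string
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: otel-collector
              scrape_interval: 30s
              static_configs:
                - targets: ["0.0.0.0:8888"]
    exporters:
      logging: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [logging]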

Depending on how the collector is used, I thought it would be useful to provide easy-to-configure CRDs for certain use cases such as metrics collection.

Here is an example of what this could look like. (I am still a little unsatisfied that the problem here is partly moved to the destination resource).

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollectorDestination
metadata:
  name: dst1
spec:
  processor:
    batch:
      ...
    resourcedetection/test:
      detector:
        ...
  exporter:
    ...
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollectorMetrics
metadata:
  name: something
spec:
  replicas: 2
  destination:
    - dst1
  prometheusCR:
    serviceMonitorSelector:
      abc: def
  static_configs: ...

Update

This topic was discussed during the 2023/02/16, 9 AM PT / 6 PM CET SIG call. [Notes]

tl;dr

  • Agreed that collector config is hard.

  • We need better docs with examples.

  • Outline common use cases and start with an opinionated Helm chart. [example]

  • Config as a string isn't nice: values cannot be modified with kustomize (see the illustrative patch below).
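To illustrate the kustomize point: structured fields can be patched, but anything inside the opaque spec.config string cannot be targeted. A minimal sketch, assuming a collector named like the example above:

# kustomization.yaml (illustrative)
resources:
  - collector.yaml
patches:
  - target:
      kind: OpenTelemetryCollector
      name: something
    patch: |-
      - op: replace
        path: /spec/replicas       # structured field: easy to patch
        value: 3
      # there is no JSON pointer like /spec/config/exporters/...,
      # because the pipeline lives inside a single string; the only
      # option is to replace /spec/config wholesale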

@frzifus added the question (Further information is requested) label on Feb 16, 2023
@pavolloffay added the area:collector (Issues for deploying collector) label on Feb 17, 2023
@frzifus
Member Author

frzifus commented Apr 18, 2023

cc @puckpuck

@puckpuck

I like the approach that the Collector's Helm chart has taken with presets. When combined with sensible defaults, just a few lines let me configure a collector for multiple common scenarios.

spec:
  mode: daemonset
  presets:
    logsCollection:
      enabled: true
    hostMetrics:
      enabled: true
    k8sAttributes:
      enabled: true

Using presets, I could have the above spec, which delivers a collector configured as a DaemonSet with a sensible default configuration, including functionality to collect logs from all pods on the node, collect kubelet stats metrics, and enrich all telemetry with Kubernetes attributes.
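For reference, a simplified sketch of roughly what those presets expand to under the hood (the actual generated configuration in the Helm chart is more involved, and the backend endpoint here is hypothetical):

receivers:
  filelog:                       # logsCollection: tail container logs on the node
    include: [/var/log/pods/*/*/*.log]
  hostmetrics:                   # hostMetrics: CPU, memory, disk, network scrapers
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
processors:
  k8sattributes: {}              # k8sAttributes: enrich telemetry with pod/namespace metadata
exporters:
  otlp:
    endpoint: backend.example:4317
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes]
      exporters: [otlp]
    metrics:
      receivers: [hostmetrics]
      processors: [k8sattributes]
      exporters: [otlp]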

@swiatekm
Contributor

swiatekm commented Apr 25, 2023

I see three distinct problems here:

  1. Existing CRDs are difficult to use for common use cases, and require the user to understand details they shouldn't have to.
  2. Otel configuration as a string is unfriendly, and may need to be unbundled into smaller CRDs.
  3. (Mine) Configurations are monolithic, which is unwieldy for large clusters with a lot of users.

I'd expect that 2 and 3 would eventually need their own separate issues, although they're all related to some extent. For now I'm going to focus on 1.

Use cases

I'm aware of three major common use cases for users wanting to collect telemetry data in K8s:

Container logs

This is pretty obvious and intuitive from a user perspective - they want to ingest logs from all of the containers running in their cluster, with the right metadata.

Right now, actually doing this requires an understanding of how the kubelet and the container runtime manage logs, the respective log formats, mounting Node filesystem directories as volumes into a DaemonSet with the right permissions, and a somewhat gnarly filelog receiver configuration.
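To make the "gnarly" part concrete, here is a trimmed-down sketch of what this looks like today (paths and the parsing operator are assumptions; real setups handle multiple runtimes and log formats):

spec:
  mode: daemonset
  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config: |
    receivers:
      filelog:
        include: [/var/log/pods/*/*/*.log]
        start_at: beginning
        operators:
          # parse the containerd/CRI log line format: "<time> <stream> <flags> <log>"
          - type: regex_parser
            regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'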

Prometheus metrics

Unlike the logs case, this is quite complicated and has additional dependencies depending on what the user wants. The metrics we could collect here may include:

  • container metrics from cAdvisor
  • kube-state-metrics
  • Node metrics from node-exporter
  • control plane metrics
  • application metrics as defined by ServiceMonitors and PodMonitors

Intuitively, I think it should be easy to collect the "infrastructure" metrics, in the same way that the hostmetrics receiver makes this easy on a single host. When it comes to replacing prometheus-operator, I think the choice for users with existing CRDs is straightforward, but I'm not 100% sure if we should also recommend this for others.

Right now, doing any of this requires installing additional dependencies, adding (fairly complex) Prometheus config directly to the receiver, enabling target allocator and giving it the right RBAC permissions, and manually translating from Prometheus metric schema to otel schema.
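A sketch of the moving parts this currently involves (RBAC manifests omitted; names and the backend endpoint are illustrative):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: prom
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    serviceAccount: ta-sa            # needs RBAC to read ServiceMonitors/PodMonitors
    prometheusCR:
      enabled: true
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:            # raw Prometheus config pasted into the receiver
            - job_name: kubernetes-cadvisor
              kubernetes_sd_configs:
                - role: node
              scheme: https
              metrics_path: /metrics/cadvisor
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    processors:
      batch: {}
    exporters:
      otlp:
        endpoint: backend.example:4317
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [otlp]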

Traces from instrumentation

I think this is reasonably well-covered right now. We might benefit from better documentation, and maybe some way to directly link Instrumentation to OpenTelemetryCollector to avoid having to manually configure the exporter/receiver pair, but this in my opinion is much less of a problem than the logs and metrics case.
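For context, the manual pairing looks roughly like this today (endpoints are illustrative): the Instrumentation resource points at the collector Service by name, and the collector must expose a matching OTLP receiver, with nothing linking the two objects.

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-apps
spec:
  exporter:
    endpoint: http://my-collector-collector:4317   # has to match the Service created for the collector below
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: my-collector
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc: {}        # kept in sync with the Instrumentation endpoint by hand
    exporters:
      logging: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [logging]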

We could later consider a higher-level CRD for a tail sampling setup, where you need to make sure all spans for a given trace are delivered to the same collector.
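If we ever go there, the underlying collector pattern would presumably be a two-tier setup, sketched below with assumed Service names: a first tier routes spans by trace ID with the loadbalancing exporter, so a second tier can run the tail_sampling processor with all spans of a trace on one instance.

# tier 1: route spans by trace ID (the loadbalancing exporter's default)
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: sampling-collector-headless.observability.svc.cluster.local
---
# tier 2: sample complete traces
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-only
        type: status_code
        status_code:
          status_codes: [ERROR]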

Solutions

Your proposed new CRDs are roughly the way to go @frzifus, though it does feel a bit backwards that the highest-level resource is really just a receiver + mode, and the rest of the pipeline sits in a subresource. Intuitively it would make more sense to me if the receiver + mode were the subresource, kind of like:

---
apiVersion: opentelemetry.io/v1alpha1
kind: Source
metadata:
  name: source1
spec:
  type: metrics
  replicas: 2
  destination:
    - dst1
  prometheusCR:
    serviceMonitorSelector:
      abc: def
  static_configs: ...
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: metrics-collector
spec:
  source: source1
  processors:
    batch:
      ...
    resourcedetection/test:
      detector:
        ...
  exporters:
    ...

I don't like overloading the Collector resource even more, so maybe this should be a new CR, but composing things this way - replacing the receiver with a receiver + mode pair, and keeping everything roughly the same - makes most sense to me as an approach.

@frzifus changed the title from "Opinionated Operator CRDs" to "Proposal: Opinionated Operator CRDs" on Jun 21, 2023
@joshdover

I've been thinking about how to solve this usability problem as well and I'm curious about leveraging the ongoing work to add a Template converter to the Collector core to help solve part of this problem: open-telemetry/opentelemetry-collector#8507

The Operator could provide OOTB templates for container logs and the common metrics that need to be collected. That said, there's more to this issue than just simplifying the collector configuration, since we also need to automate the deployment mode based on what is being collected (e.g. logs need to be collected with a DaemonSet collector on each node, but control plane metrics need to be collected by a Deployment for the whole cluster).

So the total solution is probably some combination of templated collector configuration and operator-level logic for selecting the right deployment mode of a collector. This is somewhat similar to the Helm chart's presets option mentioned above, but I think an advantage of using templates is that the user could override the template or chain templates together.
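Purely to illustrate the shape of this (the syntax in open-telemetry/opentelemetry-collector#8507 is still under design, and the templates field below is hypothetical, not an existing API), a user-facing CR might shrink to something like:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: cluster-telemetry
spec:
  # hypothetical field: each entry references an operator-shipped template,
  # and the operator picks the deployment mode it implies
  templates:
    - containerLogs          # expands to a filelog pipeline, deployed as a DaemonSet
    - controlPlaneMetrics    # expands to a prometheus pipeline, deployed as a Deployment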

For example, in combination with #1906, a user could define a log processing template for containers that match specific metadata, and the collectors deployed by the operator for log collection could be updated by the operator to route logs from these containers through a pipeline that applies this processing.
