
Alloy DaemonSet stopped working properly with "was collected before with the same name and label values" #1179

Open
alex-berger opened this issue Jul 1, 2024 · 3 comments
Labels: bug, needs-attention

What's wrong?

A few days ago, some of my Alloy DaemonSet Pods suddenly stopped working properly, resulting in server errors (500) when accessing their /metrics endpoint. The DaemonSet was still running and there were no container crashes or restarts (the readinessProbe was doing fine), but the Pods were evidently no longer working properly.

It looks like this was caused by errors of the kind "was collected before with the same name and label values". I had not observed this before, and after restarting the Pods the problem disappeared. Therefore, I suspect there is some kind of race-condition bug in Alloy. Here is an example of the output from the /metrics endpoint:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "workload_log_lines_total" { label:{name:"app" value:"loki"} label:{name:"component" value:"gateway"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"nginx"} label:{name:"controller" value:"loki-gateway-58bdc5cfc9"} label:{name:"controller_kind" value:"ReplicaSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"loki-system-loki-distributed"} label:{name:"ip" value:"10.176.44.205"} label:{name:"job" value:"loki-system/loki-gateway-58bdc5cfc9-xt256"} label:{name:"namespace" value:"loki-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"loki-gateway-58bdc5cfc9-xt256"} label:{name:"stream" value:"stderr"} counter:{value:1682 created_timestamp:{seconds:1719579313 nanos:848869559}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"cilium-agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"cilium-agent"} label:{name:"controller" value:"cilium"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"ip" value:"10.176.45.209"} label:{name:"job" value:"kube-system/cilium-9gcgw"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"part_of" value:"cilium"} label:{name:"pod" value:"cilium-9gcgw"} label:{name:"stream" value:"stderr"} counter:{value:104 created_timestamp:{seconds:1719579314 nanos:850782828}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:582 created_timestamp:{seconds:1719579319 nanos:108932331}}} was collected before with the same name and label values
* collected metric "workload_log_lines_total" { label:{name:"app" value:"datadog"} label:{name:"component" value:"agent"} label:{name:"component_id" value:"loki.process.pod_logs"} label:{name:"component_path" value:"/"} label:{name:"container" value:"trace-agent"} label:{name:"controller" value:"datadog"} label:{name:"controller_kind" value:"DaemonSet"} label:{name:"flags" value:"F"} label:{name:"host_ip" value:"10.176.45.209"} label:{name:"instance" value:"datadog"} label:{name:"ip" value:"10.176.32.131"} label:{name:"job" value:"kube-system/datadog-7wj5j"} label:{name:"namespace" value:"kube-system"} label:{name:"node_name" value:"ip-10-176-45-209.eu-central-1.compute.internal"} label:{name:"pod" value:"datadog-7wj5j"} label:{name:"stream" value:"stdout"} counter:{value:55 created_timestamp:{seconds:1719579370 nanos:107981085}}} was collected before with the same name and label values
...

What immediately caught my attention is that the affected metrics all come from stage.metrics blocks (inside a loki.process component). See the configuration below for more details.

Steps to reproduce

This happens sporadically and I have not yet figured out how to reproduce it reliably. That's also why I suspect some kind of race-condition bug.

System information

Linux 6.1.92, amd64 and arm64, Bottlerocket OS 1.20.3 (aws-k8s-1.29)

Software version

Grafana Alloy 1.2.0

Configuration

logging {
  level  = "warn"
  format = "json"
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki-system.svc.cluster.local/loki/api/v1/push"
    remote_timeout = "5s"
    batch_wait = "10s"
  }
  wal {
    enabled = true
    max_segment_age = "15m"
  }
}

prometheus.remote_write "default" {
    endpoint {
      url = "http://cortex-nginx.cortex-system/api/v1/push"
      remote_timeout = "5s"
      // Disable exemplars as work around for https://github.com/grafana/alloy/issues/1117
      send_exemplars = false
      metadata_config {
        send = true
      }
      queue_config {
        capacity = 10000
        max_samples_per_send = 10000
        batch_send_deadline = "5s"
        min_backoff = "30ms"
        max_backoff = "5s"
        retry_on_http_429 = true
      }
    }
    wal {
      truncate_frequency = "15m"
      max_keepalive_time = "30m"
    }
}

prometheus.exporter.self "alloy" {}

prometheus.scrape "alloy" {
  targets    = prometheus.exporter.self.alloy.targets
  forward_to = [prometheus.relabel.alloy.receiver]
  honor_labels = true
  scrape_interval = "60s"
}

prometheus.relabel "alloy" {
  forward_to = [prometheus.remote_write.default.receiver]
  rule {
    source_labels = ["instance"]
    target_label = "node_name"
  }
  rule {
    replacement = env("POD_NAME")
    target_label = "pod"
  }
  rule {
    replacement = env("POD_NAMESPACE")
    target_label = "namespace"
  }
  rule {
    action = "labeldrop"
    regex = "instance"
  }
  rule {
    source_labels = ["instance"]
    replacement = "alloy-daemonset/" + env("POD_NAME")
    target_label = "job"
  }
}

discovery.kubelet "local_pods" {
  url = "https://" + env("HOSTNAME") + ":10250"
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_config {
    ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    insecure_skip_verify = false
  }
}

discovery.relabel "pod_logs" {
  targets = discovery.kubelet.local_pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
    separator     = "/"
    target_label  = "job"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    action        = "replace"
    replacement   = "/var/log/pods/*$1/*.log"
    target_label  = "__path__"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_container_id"]
    regex = "^(\\w+):\\/\\/.+$"
    replacement = "$1"
    target_label = "tmp_container_runtime"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_ip"]
    target_label = "ip"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_host_ip"]
    target_label = "host_ip"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label = "node_name"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_controller_kind"]
    target_label = "controller_kind"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_controller_name"]
    target_label = "controller"
  }
  rule {
    action = "replace"
    regex = "^(.+?)(-[a-z0-9]+){0,2}$"
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "^(.+?)(-[a-z0-9]+){0,1}$"
    source_labels = ["__meta_kubernetes_pod_controller_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    target_label = "app"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_component"]
    target_label = "component"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component"]
    target_label = "component"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_version"]
    target_label = "version"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_part_of"]
    target_label = "part_of"
  }
  rule {
    action = "replace"
    regex = "(.+)"
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance"]
    target_label = "instance"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.process.pod_logs.receiver]
}

// basic processing to parse the container format. You can add additional processing stages
// to match your application logs.
loki.process "pod_logs" {
  stage.match {
    selector = "{tmp_container_runtime=\"containerd\"}"
    // the cri processing stage extracts the following k/v pairs: log, stream, time, flags
    stage.cri {}
    // Set the extract flags and stream values as labels
    stage.labels {
      values = {
        flags   = "",
        stream  = "",
      }
    }
  }

  // drop the temporary container runtime label as it is no longer needed
  stage.label_drop {
    values = ["tmp_container_runtime", "filename"]
  }

  // Remove the trailing newlines from JSON.
  // While trailing newlines are stripped by logcli, they would lead to intermittent empty lines for multiline logs.
  stage.replace {
    expression = "(\n)$"
    replace = ""
  }

  stage.multiline {
    firstline = "^\\x{200B}\\x{200D}\\x{200B}"
    max_wait_time = "3s"
  }

  stage.metrics {
    metric.counter {
        name        = "log_lines_total"
        description = "total number of log lines"
        prefix      = "workload_"
        match_all         = true
        action            = "inc"
        max_idle_duration = "1h"
    }
    metric.counter {
        name        = "raw_log_bytes_total"
        description = "total bytes of log lines before truncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  /* //TODO enable this later
  // Truncate log lines longer than 256KiB
  stage.replace {
    expression = ".+"
    replace = "{{if gt (.Value | len) 262144 }}{{slice .Value 0 262144}}...(truncated after 262144 of {{(.Value | len)}} characters){{else}}{{.Value}}{{end}}"
  }
  */

  stage.metrics {
    metric.counter {
        name        = "truncated_log_bytes_total"
        description = "total bytes of log lines after tuncating long lines"
        prefix      = "workload_"
        match_all         = true
        count_entry_bytes = true
        action            = "add"
        max_idle_duration = "1h"
    }
  }

  forward_to = [loki.write.default.receiver]
}

Logs

<<empty log, no errors or warnings in the logs>>
alex-berger added the bug label on Jul 1, 2024
alex-berger (Author) commented Jul 1, 2024

Not 100% sure yet, but it looks like I can provoke this by updating the configuration (ConfigMap); the resulting config reload seems to trigger the behavior described above. In my case I simply change logging.level back and forth between info and warn to make sure the config actually changes. It's still racy: not all Pods of the DaemonSet become nonfunctional, but some do.
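
To make that concrete, here is a minimal sketch of such a ConfigMap toggle; the ConfigMap name, namespace, and data key are assumptions and have to match the actual deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy              # assumed ConfigMap name
  namespace: monitoring    # assumed namespace
data:
  config.alloy: |          # assumed data key
    logging {
      level  = "info"      // flip between "info" and "warn" to force a reload
      format = "json"
    }
    // ... rest of the configuration unchanged ...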

alex-berger (Author) commented:

As a workaround we added a livenessProbe to the Alloy DaemonSet, which makes sure the alloy container is restarted if the metrics endpoint becomes unhealthy (starts responding with status code 5xx):

    containers:
    - name: alloy
      livenessProbe:
        httpGet:
          path: /metrics
          port: 12345
          scheme: HTTP
        initialDelaySeconds: 30
        timeoutSeconds: 2
        periodSeconds: 30
        successThreshold: 1
        failureThreshold: 3
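
Probing /metrics rather than the readiness endpoint is what catches this failure mode, since the readinessProbe kept passing while /metrics was already returning 500s. The flip side is that three consecutive failed probes of /metrics, whatever the cause, will restart the container.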

github-actions bot (Contributor) commented Aug 2, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
