Weird problem related to memory usage #320

vitobotta · 2024-06-09T08:48:29Z

Hi! We have been using this lovely gem for a while now and it made it easier to export relevant metrics that we can consume with Grafana dashboards.

One problem we are having is that our pods (we host our app in Kubernetes) get randomly OOMKilled and from reading past issues I am wondering if the problem could be that we push many values for a particular metric.

So basically we use the request queue time as metric used for the autoscaling, since it's more accurate than autoscaling based on resources usage.

This is the code I am using in a middleware to determine the request queue time with Puma:

require 'prometheus_exporter/client'

class MetricsMiddleware
  X_REQUEST_START_HEADER_KEY = "HTTP_X_REQUEST_START".freeze
  NGINX_REQUEST_START_PREFIX = "t=".freeze
  PUMA_REQUEST_BODY_WAIT_KEY = "puma.request_body_wait".freeze
  EMPTY_STRING = "".freeze

  def initialize(app)
    @prometheus_exporter_client = PrometheusExporter::Client.default
    @app = app
  end

  def call(env)
    start = env[X_REQUEST_START_HEADER_KEY].
      to_s.
      gsub(NGINX_REQUEST_START_PREFIX, EMPTY_STRING).
      to_f * 1000

    wait = env[PUMA_REQUEST_BODY_WAIT_KEY] || 0
    current = Time.now.to_f * 1000

    queue_time = (current - wait - start).to_i

    env["start_time"] = start
    env["wait_time"] = wait
    env["current_time"] = current
    env["queue_time"] = queue_time

    if start != 0 && Rails.env.production?
      @prometheus_exporter_client.send_json(
        type: "queue_time",
        queue_time:,
      )
    end

    @app.call(env)
  end
end

As you can see, we are pushing the value for this metric on each web request. Since we handle around 450K requests per hour average, I am suspecting - again from reading past issues here - that this may be a problem with the prometheus client using too many resources.

Can this a problem as I suspect or is it something we don't need to worry about? If it is a problem, are there any workarounds apart from reducing the number of values by sampling the requests instead of exporting the metric for all the requests?

Thanks in advance!

Edit: forgot to add the code for the collector for this custom metric:

require "prometheus_exporter/server/type_collector"

module PrometheusCollectors
  class QueueTimeCollector < PrometheusExporter::Server::TypeCollector
    LATENCY_BUCKETS = [
      2.5,
      5,
      10,
      15,
      20,
      25,
      30,
      35,
      40,
      45,
      50,
      75,
      100,
      200,
      300,
      500,
      1000,
      60_000,
    ].freeze

    def initialize
      super
      @queue_time = PrometheusExporter::Metric::Histogram.new(
        "queue_time",
        "Time requests waited before Rails service began",
        buckets: LATENCY_BUCKETS,
      )
    end

    def type
      "queue_time"
    end

    def collect(obj)
      if (latency = obj["queue_time"])
        @queue_time.observe(latency)
      end
    end

    def metrics
      [@queue_time]
    end
  end
end

vitobotta · 2024-06-09T09:09:47Z

Adding a little more info. I just did a request to /metrics for a pod and got this: https://dpaste.org/yuWy1

Is it too much data?

SamSaffron · 2024-06-11T00:30:23Z

Looks pretty enormous to me. maybe only collect detailed metrics on critical routes?

…

On Sun, Jun 9, 2024 at 7:10 PM Vito Botta ***@***.***> wrote: Adding a little more info. I just did a request to /metrics for a pod and got this: https://dpaste.org/yuWy1 Is it too much data? — Reply to this email directly, view it on GitHub <#320 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAABIXIBWPYLCBU5H64TUCLZGQLXDAVCNFSM6AAAAABJAWRQ5OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJWGQYDINZSGI> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

vitobotta · 2024-06-11T05:55:59Z

Hi @SamSaffron , thanks for confirming. I have made a change now to sample 20% of the requests instead of collecting that metric for all requests. Let's see if it helps. Thanks!

SamSaffron · 2024-06-11T06:17:13Z

The amount of data is not the problem, it is the amount of cardinality, you need less metrics, sampling will not resolve this imo

…

On Tue, 11 Jun 2024 at 3:56 PM, Vito Botta ***@***.***> wrote: Hi @SamSaffron <https://github.com/SamSaffron> , thanks for confirming. I have made a change now to sample 20% of the requests instead of collecting that metric for all requests. Let's see if it helps. Thanks! — Reply to this email directly, view it on GitHub <#320 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAABIXPUKIBXIGCOVQKL6CTZG2GQLAVCNFSM6AAAAABJAWRQ5OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJZHA2TKNJYGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

vitobotta · 2024-06-11T09:23:57Z

Are you referring to the metrics collected on a per controller/action basis? How do I limit those?

vitobotta · 2024-06-11T09:48:05Z

It seems that for each combination of controller/action there are metrics for different percentiles. How do I limit these to only p95?

SamSaffron · 2024-06-12T23:27:56Z

pretty sure you can by setting options on the metric, this is configurable.

this all should be documented in the readme I think?

vitobotta · 2024-06-13T07:41:39Z

Hi Sam, in the meantime I have reduced the number of buckets for my custom metric (queue time) and it seems to be working without any issues even without sampling the requests.

But the other metrics that have a lot of controller/action combinations are not custom metrics, they are some metrics exported by default with the gem. Do you mean this section of the README on overriding labels? https://github.com/discourse/prometheus_exporter?tab=readme-ov-file#metrics-collected-by-rails-integration-middleware

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Weird problem related to memory usage #320

Weird problem related to memory usage #320

vitobotta commented Jun 9, 2024 •

edited

Loading

vitobotta commented Jun 9, 2024

SamSaffron commented Jun 11, 2024 via email

vitobotta commented Jun 11, 2024

SamSaffron commented Jun 11, 2024 via email

vitobotta commented Jun 11, 2024

vitobotta commented Jun 11, 2024

SamSaffron commented Jun 12, 2024

vitobotta commented Jun 13, 2024

Weird problem related to memory usage #320

Weird problem related to memory usage #320

Comments

vitobotta commented Jun 9, 2024 • edited Loading

vitobotta commented Jun 9, 2024

SamSaffron commented Jun 11, 2024 via email

vitobotta commented Jun 11, 2024

SamSaffron commented Jun 11, 2024 via email

vitobotta commented Jun 11, 2024

vitobotta commented Jun 11, 2024

SamSaffron commented Jun 12, 2024

vitobotta commented Jun 13, 2024

vitobotta commented Jun 9, 2024 •

edited

Loading