
Error in getting metrics & prometheus plugin after bumping to the 3.7.1 release #14160

Open

rodolfobrunner opened this issue Jan 14, 2025 · 12 comments

@rodolfobrunner

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.7.1 / 3.9.0

Current Behavior

I am having problems with metrics and the Prometheus plugin after bumping to the 3.7.1 release. (I have since upgraded Kong up to 3.9.0 and the issue still persists.)

I have the following entry in my logs:
[lua] prometheus.lua:1020: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, client: 10.145.40.1, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.12.54:8100"

Interesting facts:

  • It is always the same service and route that hit this error; in my case it is always the same two routes for the same bucket.
  • When we revert back to 3.6.1, the problem goes away.
  • After a few months we bumped Kong to 3.9.0 and the problem started happening again after a couple of hours, for the same routes and buckets.
  • The error goes away with a pod rotation but comes back after a while.

I already tried:

  • Setting nginx_http_lua_shared_dict: 'prometheus_metrics 15m'; memory usage now stands at roughly 20% (a sketch of how this can be checked is below).
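For reference, a minimal sketch of how that usage figure can be checked from inside Kong (an illustration only; it assumes the shared dict name prometheus_metrics from the config above, and uses the standard ngx.shared.DICT free_space() and capacity() methods):

-- hedged sketch: report how full the prometheus_metrics shared dict is
local dict = ngx.shared.prometheus_metrics
local free = dict:free_space()   -- bytes still available
local cap  = dict:capacity()     -- configured size, here 15m
ngx.log(ngx.INFO, "prometheus_metrics usage: ",
        string.format("%.1f%%", (1 - free / cap) * 100))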

One pod contains:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="80"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 2

While another pod is missing the le="80" bucket:

kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="50"} 1
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="100"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="250"} 2
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="400"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="700"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="1000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="2000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="5000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000"} 3
kong_request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="+Inf"} 3

We are running Kong in AWS EKS, upgraded from 3.6.1.

Expected Behavior

The bucket should not disappear, but if it does for any reason I would expect Kong to be able to recover from the inconsistent state (maybe via a metrics reset?).
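For reference, one way such a reset could be done by hand is sketched below (my assumption only, not an API Kong exposes; flush_all() and flush_expired() are standard ngx.shared.DICT methods, and prometheus_metrics is the shared dict name from our configuration):

-- hedged sketch: manually wipe the Prometheus shared dict so metrics start clean
local dict = ngx.shared.prometheus_metrics
dict:flush_all()       -- marks every stored item as expired
dict:flush_expired()   -- frees the memory held by the expired items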

Steps To Reproduce

No response

Anything else?

No response

@ProBrian (Contributor)

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

@brunomiguelsantos

@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.

Hey @ProBrian, I am part of the same team as @rodolfobrunner. Yes, it's the same deployment.

@jmadureira

Hello @ProBrian, some additional info on what we're seeing. At some point we added some debug instructions to figure out what was being stored. Something like:

-- Adapted from the prometheus metric_data function; exporter, node_id, buffer,
-- DATA_BUFFER_SIZE_HINT and table_sort are upvalues from the surrounding module
local function collect()
  ngx.header["Content-Type"] = "text/plain; charset=UTF-8"
  ngx.header["Kong-NodeId"] = node_id

  local prometheus = exporter.get_prometheus()

  local write_fn = ngx.print

  local keys = prometheus.dict:get_keys(0)  -- this is the ngx.shared["prometheus_metrics"] dict
  local count = #keys

  table_sort(keys)

  local output = buffer.new(DATA_BUFFER_SIZE_HINT)
  local output_count = 0

  local function buffered_print(fmt, ...)
    if fmt then
      output_count = output_count + 1
      output:putf(fmt, ...)
    end

    if output_count >= 100 or not fmt then
      write_fn(output:get())  -- consume the whole buffer
      output_count = 0
    end
  end

  for i = 1, count do
    local key = keys[i]
    local value = prometheus.dict[key]

    buffered_print("%s: %s\n", key, value)
  end

  buffered_print(nil)

  output:free()
end

... which outputs (when the error occurs):

request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00050.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00250.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00400.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00700.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="01000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="02000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="05000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="Inf"} null

How can a dictionary support duplicate keys? Even if it's a shared dictionary?

@ProBrian (Contributor)

request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} null <----------
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00050.0"} null
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"} null

That's weird; judging from the print logs even the table_sort behaves abnormally: "00080.0" appears before "00050.0".

@jmadureira

jmadureira commented Feb 17, 2025

Hello @ProBrian, some more information. I added more detail to the introspection and got this:

request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"} - local_metrics: nil shared_dict: nil  <---- the important dict
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00250.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00400.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00700.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="01000.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="02000.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="05000.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="10000.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="30000.0"} - local_metrics: nil shared_dict: 1
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="60000.0"} - local_metrics: nil shared_dict: 1

Which explains the error described earlier (the value is indeed nil).

[lua] prometheus.lua:1020: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, client: 10.145.40.1, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.12.54:8100"

The disturbing part here is the code that dumps the counters to the nginx shared dict:

local function sync(_, self)
  local err, _
  local ok = true
  for k, v in pairs(self.increments) do
    _, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
  end

  clear_tab(self.increments)
  return ok
end

The self.dict:incr(k, v, 0) call, according to the documentation:

Increments the (numerical) value for key in the shm-based dictionary ngx.shared.DICT by the step value value. Returns the new resulting number if the operation is successfully completed or nil and an error message otherwise.

When the key does not exist or has already expired in the shared dictionary,

  • if the init argument is not specified or takes the value nil, this method will return nil and the error string "not found", or
  • if the init argument takes a number value, this method will create a new key with the value init + value.

Under which scenarios will this operation write nil?
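For reference, a minimal standalone sketch of the documented incr semantics (an illustration only; it assumes a dict declared as lua_shared_dict test_dict 1m and run under OpenResty, and test_dict/missing_key are made-up names):

local dict = ngx.shared.test_dict

-- key absent and no init given: incr fails and nothing is written
local new_val, err = dict:incr("missing_key", 1)
-- new_val == nil, err == "not found"

-- key absent but init = 0 given: the key is created with init + value
new_val, err = dict:incr("missing_key", 1, 0)
-- new_val == 1, err == nil

-- the stored value is always a number; incr itself never stores nil
local value = dict:get("missing_key")   -- value == 1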

@ProBrian (Contributor)

ProBrian commented Feb 17, 2025

@jmadureira For ngx.shared.DICT.incr(k, v), v only takes a number, so there should be no way for incr to write a nil; I'm starting to doubt whether there is an issue in this API.
By the way, in your debug function the dict value lookup looks incorrect. Could you fix it and also check whether an err message is returned when getting a nil value from the shared dict?


for i = 1, count do
    local key = keys[i]
 - local value = prometheus.dict[key]
 + local value, err = prometheus.dict:get(key)
  -- if `err` is not nil, try to print out the err here

    buffered_print("%s: %s\n", key, value)
  end

@ProBrian (Contributor)

@jmadureira Could you do some more debugging to clarify:

  1. Are there warning logs indicating a write error when doing the self.dict:incr(k, v, 0) operation?
  2. Could you use ngx.shared.DICT.free_space to check whether the shared dict still has free memory for the metrics?
    What I'm wondering is whether some key (like the le=80 slot) has been evicted because of insufficient memory and the LRU logic. That is: keys = get_keys() --> keys[i] evicted due to LRU --> get(keys[i]) returns nil. So we can try another get_keys right after the get-value iteration, like:
local keys = prometheus.dict:get_keys(0)  -- this is the ngx.shared["prometheus_metrics"] dict
...
 --- the loop of get value
  for i = 1, count do
    local key = keys[i]
    local value = prometheus.dict:get(key)

    buffered_print("%s: %s\n", key, value)
  end

--- get the keys again and verify that they are identical; this helps us figure out whether any keys were evicted during the get-value loop
keys = prometheus.dict:get_keys(0)   
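A rough helper for that comparison could look like this (a sketch only; missing_keys, keys_before and keys_after are names made up for the illustration):

-- returns the keys present in the first snapshot but absent from the second
local function missing_keys(keys_before, keys_after)
  local after_set = {}
  for _, k in ipairs(keys_after) do
    after_set[k] = true
  end
  local missing = {}
  for _, k in ipairs(keys_before) do
    if not after_set[k] then
      missing[#missing + 1] = k
    end
  end
  return missing
end

local keys_before = prometheus.dict:get_keys(0)
-- ... the get-value loop from above runs here ...
local keys_after = prometheus.dict:get_keys(0)
ngx.log(ngx.INFO, "free_space: ", prometheus.dict:free_space(),
        ", evicted during loop: ",
        table.concat(missing_keys(keys_before, keys_after), ", "))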

@jmadureira

jmadureira commented Feb 18, 2025

@ProBrian you are probably on to something. I changed the sync logic a little bit:

local function sync(_, self)
  local err, _
  local ok = true
  local count = 0
  local counter_logs = {}

  for k, v in pairs(self.increments) do
    new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1

    -- Only log counter names that contain "request_latency_ms_bucket" along with their values
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      -- also print the current value from the shared dictionary
      table.insert(counter_logs, k .. "=old_value+" .. v .. "=" .. new_val .. " current value=" .. self.dict:get(k))
    end
  end

  if count > 0 then
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id, ": ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id)
    end
  end

  clear_tab(self.increments)
  return ok
end

The error no longer shows up, most likely because the value is read back immediately and so isn't affected by the LRU logic.

On the other hand, if the le=80 slot had been evicted due to memory constraints, the Prometheus exporter wouldn't pick it up and no error would be logged; the metric_data function in prometheus.lua only loops through the keys currently in the shared dict.

@ProBrian (Contributor)

ProBrian commented Feb 19, 2025

The error no longer shows up, most likely because the value is read back immediately and so isn't affected by the LRU logic.

I think incr and get both refresh a key, so if we have already incremented a specific key "le=xx" in the current sync it should not be evicted (so the immediate get(le=xx) afterwards is not the reason the key survives). I feel it is more likely a concurrency + timing issue, like:

coroutine1 (metric_data): get_keys() returns [le=80, le=100, ...]  -->  loop over keys: get(le=**) ok, ..., get(le=100) ok, get(le=80) nil
coroutine2 (sync):        --------------------------------------------  incr of other keys may evict the old "le=80" entry

So when the eviction of key le=80 happens right between the get_keys and the get(le=80) in metric_data, you get a nil. When you add something to sync, that may simply change the timing, e.g. the eviction now happens before get_keys, so the "le=80" key never appears in keys, no get(le=80) is invoked in the loop, and no error is reported.
That's why I suggested adding another get_keys right after the dict loop in metric_data and comparing it with the get_keys taken before the loop; missing keys would confirm that this is the problem.
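If that turns out to be the cause, one possible mitigation (a sketch only, not the actual prometheus.lua code) would be to treat a nil value with no error as "evicted after get_keys" and skip it instead of logging an error:

for i = 1, count do
  local key = keys[i]
  local value, err = self.dict:get(key)   -- on failure the second return is an error string
  if value ~= nil then
    buffered_print("%s %s\n", key, value)  -- normal case: emit the metric line
  elseif err then
    ngx.log(ngx.WARN, "error reading shared dict key ", key, ": ", err)
  end
  -- value == nil and err == nil: the key expired or was evicted after get_keys();
  -- skipping it silently avoids the spurious log_error()
end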

@jmadureira

@ProBrian another round of debugging produced the following messages:

2025/02/20 13:56:56 ... counter.lua:65: Synced 46 counters from worker nil: request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}=old_value+1=1 current value=nil, ...
	
2025/02/20 13:57:04 ... prometheus.lua:937: metric_data(): Going to export 5559 shared metrics, ... request: "GET /metrics ...

2025/02/20 13:57:04 ... prometheus.lua:1027: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, ... request: "GET /metrics ...

2025/02/20 13:57:04 ... prometheus.lua:1008: metric_data(): Expected to have exported 5559 shared metrics, ... request: "GET /metrics ...

The current_value logged on the first line is supposed to be the value in the shared dict right after changing it, so I wouldn't expect it to be nil; instead it should be 1, since that is the value returned by self.dict:incr(k, v, 0).
The rest is your suggestion to check the number of keys before and after the dict loop in the metric_data function, which remains the same.

Since this is a custom log message, I'm including the code as well.

In counter.lua I have:

local function sync(_, self)
  local err, _
  local ok = true
  local count = 0
  local counter_logs = {}

  for k, v in pairs(self.increments) do
    new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1

    -- Only log counter names that contain "request_latency_ms_bucket" along with their values
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      local k_str = k or "nil"
      local v_str = v or "nil"
      local new_val_str = new_val or "nil"
      local current_value = self.dict:get(k) or "nil"
      table.insert(counter_logs, k_str .. "=old_value+" .. v_str .. "=" .. new_val_str .." current value=" .. current_value)
    end
  end

  if count > 0 then
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id, ": ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id)
    end
  end

  clear_tab(self.increments)
  return ok
end

In prometheus.lua I made the following changes:

function Prometheus:metric_data(write_fn, local_only)
  -- ...
  local count = #keys
  ngx_log(ngx.INFO, "Going to export " .. count .. " shared metrics")
  for k, v in pairs(self.local_metrics) do
    keys[count+1] = k
    count = count + 1
  end

  -- ...

  for i = 1, count do
    yield()

    -- ...
  end

  keys = self.dict:get_keys(0)
  local count = #keys
  ngx_log(ngx.INFO, "Expected to have exported " .. count .. " shared metrics")

  buffered_print(nil)

  output:free()
end

@ProBrian (Contributor)

ProBrian commented Feb 21, 2025

2025/02/20 13:56:56 ... counter.lua:65: Synced 46 counters from worker nil: request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}=old_value+1=1 current value=nil, ...

Hmm... that makes me suspicious about the ngx.shared.DICT.incr API. If that API works correctly, a get should not return nil right after the incr. Could you also record the err from get, if there is one? According to the OpenResty documentation:

Retrieving the value in the dictionary ngx.shared.DICT for the key key. If the key does not exist or has expired, then nil will be returned.
In case of errors, nil and a string describing the error will be returned.

So for the nil get, we could confirm whether the newly incremented key was evicted again, or whether something else went wrong in the get call. The change could look like:

      local new_val_str = new_val or "nil"
      local current_value, err = self.dict:get(k)
      table.insert(counter_logs, k_str .. "=old_value+" .. v_str .. "=" .. new_val_str .. " current value=" .. tostring(current_value) .. " err: " .. tostring(err))

By the way, what memory size did you set for the shm? As @rodolfobrunner mentioned:

I already tried:

Setting nginx_http_lua_shared_dict: 'prometheus_metrics 15m'; memory usage now stands at roughly 20%.

Are you still using that memory size config in current test?

@jmadureira

@ProBrian here is another round of tests:

2025/02/25 16:23:41 ... counter.lua:71: Worker: nil, initial_free_memory: 13557760, final_free_memory: 13557760 bytes, counters_synced: 93, counters:
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}=old_value+1=1 current value=nil err=nil,
request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00100.0"}=old_value+1=1 current value=1 err=nil,...
2025/02/25 16:23:43 ... prometheus.lua:937: metric_data(): Going to export 7686 shared metrics, ..., request: "GET /metrics HTTP/1.1", ...
2025/02/25 16:23:43 ... prometheus.lua:1027: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, ..., request: "GET /metrics HTTP/1.1", ...
2025/02/25 16:23:43 ... prometheus.lua:1008: metric_data(): Expected to have exported 7686 shared metrics, ..., request: "GET /metrics HTTP/1.1", ..."

No change seems to occur in the shared dict's free space.

The code that produced this log entry:

local function sync(_, self)
  local err, _
  local ok = true
  local count = 0
  local counter_logs = {}

  local initial_free_memory = self.dict:free_space()
  for k, v in pairs(self.increments) do
    local new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1

    local current_value, current_err
    -- Only log counter names that contain "request_latency_ms_bucket" along with their values
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      local k_str = k or "nil"
      local v_str = v or "nil"
      local new_val_str = new_val or "nil"
      current_value, current_err = self.dict:get(k)
      current_value = current_value or "nil"
      current_err = current_err or "nil"
      table.insert(counter_logs, k_str .. "=old_value+" .. v_str .. "=" .. new_val_str .." current value=" .. current_value .. " err=" .. current_err)
    end
  end

  if count > 0 then
    local worker_id = self.id
    local available_memory = self.dict:free_space()
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Worker: ", worker_id, ", initial_free_memory: ", initial_free_memory, ", final_free_memory: ", available_memory, " bytes, counters_synced: ", count, ", counters: ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Worker: ", worker_id, ", initial_free_memory: ", initial_free_memory, ", final_free_memory: ", available_memory, " bytes, counters_synced: ", count)
    end
  end

  clear_tab(self.increments)
  return ok
end

Are you still using that memory size config in current test?

Yes, it remains the same.
