Error in getting metrics & prometheus plugin after bumping to the 3.7.1 release #14160
Comments
@rodolfobrunner Does this issue happen while using the same deployment as #14144? I'm trying to reproduce it.
Hey @ProBrian, I am part of the same team as @rodolfobrunner. Yes, it's the same deployment.
Hello @ProBrian, some additional info on what we're seeing. At some point we added some debug instructions to figure out what was being stored. Something like:

-- Adapted from the prometheus metric_data function.
-- node_id, exporter and DATA_BUFFER_SIZE_HINT come from the surrounding plugin code.
local buffer = require("string.buffer")
local table_sort = table.sort

local function collect()
  ngx.header["Content-Type"] = "text/plain; charset=UTF-8"
  ngx.header["Kong-NodeId"] = node_id
  local prometheus = exporter.get_prometheus()
  local write_fn = ngx.print
  local keys = prometheus.dict:get_keys(0)  -- prometheus.dict is ngx.shared["prometheus_metrics"]
  local count = #keys
  table_sort(keys)
  local output = buffer.new(DATA_BUFFER_SIZE_HINT)
  local output_count = 0

  local function buffered_print(fmt, ...)
    if fmt then
      output_count = output_count + 1
      output:putf(fmt, ...)
    end
    if output_count >= 100 or not fmt then
      write_fn(output:get())  -- consume the whole buffer
      output_count = 0
    end
  end

  for i = 1, count do
    local key = keys[i]
    local value = prometheus.dict:get(key)
    buffered_print("%s: %s\n", key, tostring(value))
  end

  buffered_print(nil)
  output:free()
end

... which outputs (when the error occurs):
How can a dictionary support duplicate keys? Even if it's a shared dictionary?
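As a side note, ngx.shared.DICT:get_keys() cannot return the same key twice, since the shared dict stores each key exactly once; apparent duplicates in a dump are more likely the same metric name with different label sets. A minimal standalone check, assuming a resty CLI recent enough to support --shdict (the dict name, size and keys below are arbitrary test values):

-- run with: resty --shdict 'prometheus_metrics 1m' dump_keys.lua
local dict = ngx.shared.prometheus_metrics

-- store a few counter-like keys
for i = 1, 5 do
  dict:incr(('request_latency_ms_bucket{le="%05.1f"}'):format(i * 20), 1, 0)
end

local keys = dict:get_keys(0)  -- 0 means no limit on the number of returned keys
local seen = {}
for _, k in ipairs(keys) do
  assert(not seen[k], "duplicate key returned: " .. k)
  seen[k] = true
  print(k .. " = " .. tostring(dict:get(k)))
end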
That's weird, from the print logs even the ...
Hello @ProBrian, some more information. We added more detail to the introspection and got this:

Which explains the error described earlier (the value is indeed nil).
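For reference, the ": nil" tail of the logged error matches how a shared dict behaves for a key that is simply gone: get() returns nil with no error string. A tiny standalone sketch, assuming the resty CLI; the dict name and key are arbitrary, and delete() merely stands in for whatever removes the key in production:

-- run with: resty --shdict 'm 1m' get_nil.lua
local dict = ngx.shared.m
dict:incr("request_latency_ms_bucket_test", 1, 0)

local keys = dict:get_keys(0)
dict:delete(keys[1])  -- stand-in for an LRU eviction

local value, err = dict:get(keys[1])
print("value=" .. tostring(value) .. ", err=" .. tostring(err))  -- value=nil, err=nil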
The disturbing part here is the code that dumps the counters to the nginx shared dict:

local function sync(_, self)
  local err, _
  local ok = true
  for k, v in pairs(self.increments) do
    _, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
  end
  clear_tab(self.increments)
  return ok
end

The only write is the incr with an init value of 0. Under which scenarios will this operation write nil?
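One thing that may be worth checking (my reading of the ngx.shared.DICT API, not something taken from Kong's code): incr(key, value, init) may need to allocate a new node, and when the zone is short on memory it can forcibly evict other valid keys via LRU. That condition is reported in incr's third return value, which the sync() above discards. A sketch of the same loop that logs it; clear_tab is assumed to be the same helper used by the module above:

local function sync_with_eviction_log(_, self)
  local ok = true
  for k, v in pairs(self.increments) do
    local newval, err, forcible = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    elseif forcible then
      -- other valid entries were evicted to make room for this one;
      -- those evicted counters will later read back as nil
      ngx.log(ngx.WARN, "shdict evicted other entries while storing key: ", k,
              ", new value: ", newval)
    end
  end
  clear_tab(self.increments)
  return ok
end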
@jmadureira For the ...
@jmadureira Could you do more debug to clarify that: ...
@ProBrian you are probably on to something. I changed the sync logic a little bit:

local function sync(_, self)
  local err, new_val, _
  local ok = true
  local count = 0
  local counter_logs = {}
  for k, v in pairs(self.increments) do
    new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1
    -- Only log counter names that contain "request_latency_ms_bucket" along with their values,
    -- also printing the current value on the shared dictionary
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      table.insert(counter_logs, k .. "=old_value+" .. v .. "=" .. new_val .. " current value=" .. self.dict:get(k))
    end
  end
  if count > 0 then
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id, ": ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id)
    end
  end
  clear_tab(self.increments)
  return ok
end

The error no longer shows up, most likely because the value gets read back immediately, so it is not affected by the LRU logic. On the other hand, if the ...
I think the LRU eviction could explain it. So when eviction of a key happens between the get_keys() call that lists it and the later get() that reads it back, the exporter ends up logging nil for that key.
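To make that scenario concrete, here is a small standalone sketch (again assuming the resty CLI with --shdict; the dict name, size and keys are arbitrary) that lists keys with get_keys(), keeps writing until LRU eviction kicks in, and then re-reads the listed keys the way metric_data() would:

-- run with: resty --shdict 'tiny 100k' lru_race.lua
local dict = ngx.shared.tiny

-- fill the zone until incr() starts forcibly evicting older entries
for i = 1, 10000 do
  local _, _, forcible = dict:incr("counter_" .. i, 1, 0)
  if forcible then break end
end

local keys = dict:get_keys(0)

-- keep writing: this evicts some of the keys we just listed
for i = 10001, 20000 do
  dict:incr("counter_" .. i, 1, 0)
end

-- now re-read the listed keys, exactly like an exporter would
local missing = 0
for _, k in ipairs(keys) do
  if dict:get(k) == nil then
    missing = missing + 1
  end
end
print(#keys .. " keys listed, " .. missing .. " already gone by the time they were read back")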
@ProBrian another round of debugging produced the following messages:
The messages come from custom debug logging, so I'm including the code as well. On the sync side:

local function sync(_, self)
  local err, new_val, _
  local ok = true
  local count = 0
  local counter_logs = {}
  for k, v in pairs(self.increments) do
    new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1
    -- Only log counter names that contain "request_latency_ms_bucket" along with their values
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      local k_str = k or "nil"
      local v_str = v or "nil"
      local new_val_str = new_val or "nil"
      local current_value = self.dict:get(k) or "nil"
      table.insert(counter_logs, k_str .. "=old_value+" .. v_str .. "=" .. new_val_str .. " current value=" .. current_value)
    end
  end
  if count > 0 then
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id, ": ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Synced ", count, " counters from worker ", self.id)
    end
  end
  clear_tab(self.increments)
  return ok
end

On the Prometheus:metric_data side:

function Prometheus:metric_data(write_fn, local_only)
  -- ...
  local count = #keys
  ngx_log(ngx.INFO, "Going to export " .. count .. " shared metrics")
  for k, v in pairs(self.local_metrics) do
    keys[count + 1] = k
    count = count + 1
  end
  -- ...
  for i = 1, count do
    yield()
    -- ...
  end
  keys = self.dict:get_keys(0)
  local count = #keys
  ngx_log(ngx.INFO, "Expected to have exported " .. count .. " shared metrics")
  buffered_print(nil)
  output:free()
end
Emm... that makes me feel suspicious about the shm running low on space. So for the next round, could you also log the dict's free space before and after the sync?

By the way, how about the memory size you set for the shm? As @rodolfobrunner mentioned ... Are you still using that memory size config in the current test?
@ProBrian another round of tests.

No changes seem to occur on the shared dict size. The code that originated this entry:

local function sync(_, self)
  local err, _
  local ok = true
  local count = 0
  local counter_logs = {}
  local initial_free_memory = self.dict:free_space()
  for k, v in pairs(self.increments) do
    local new_val, err, _ = self.dict:incr(k, v, 0)
    if err then
      ngx.log(ngx.WARN, "error increasing counter in shdict key: ", k, ", err: ", err)
      ok = false
    end
    count = count + 1
    local current_value, current_err
    -- Only log counter names that contain "request_latency_ms_bucket" along with their values
    if string.find(k, "request_latency_ms_bucket", 1, true) then
      local k_str = k or "nil"
      local v_str = v or "nil"
      local new_val_str = new_val or "nil"
      current_value, current_err = self.dict:get(k)
      current_value = current_value or "nil"
      current_err = current_err or "nil"
      table.insert(counter_logs, k_str .. "=old_value+" .. v_str .. "=" .. new_val_str .. " current value=" .. current_value .. " err=" .. current_err)
    end
  end
  if count > 0 then
    local worker_id = self.id
    local available_memory = self.dict:free_space()
    if #counter_logs > 0 then
      ngx.log(ngx.INFO, "Worker: ", worker_id, ", initial_free_memory: ", initial_free_memory, ", final_free_memory: ", available_memory, " bytes, counters_synced: ", count, ", counters: ", table.concat(counter_logs, ", "))
    else
      ngx.log(ngx.INFO, "Worker: ", worker_id, ", initial_free_memory: ", initial_free_memory, ", final_free_memory: ", available_memory, " bytes, counters_synced: ", count)
    end
  end
  clear_tab(self.increments)
  return ok
end
Yes, they remain the same.
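One caveat worth keeping in mind when reading those numbers (my understanding of the ngx.shared.DICT docs, not something verified against this deployment): free_space() only reports bytes sitting in completely free pages, so a zone that is effectively full can report a constant value, or even 0, while LRU eviction keeps replacing entries inside the already-allocated pages. Logging the total zone size next to it makes the figure easier to interpret; a sketch meant to be dropped into the same debug code:

-- requires lua-resty-core, which OpenResty loads by default
local dict = ngx.shared.prometheus_metrics

local capacity   = dict:capacity()    -- total size of the zone, in bytes
local free_pages = dict:free_space()  -- bytes in completely free pages only

ngx.log(ngx.INFO, "prometheus_metrics shm: capacity=", capacity,
        ", free_pages=", free_pages,
        " (entries can still be evicted while free_space() stays constant)")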
Is there an existing issue for this?
Kong version ($ kong version)
3.7.1 / 3.9.0
Current Behavior
I am having problems with metrics & the prometheus plugin after bumping to the 3.7.1 release. (I have since upgraded Kong up to 3.9.0 and the issue still persists.)
I have the following entry in my logs:
[lua] prometheus.lua:1020: log_error(): Error getting 'request_latency_ms_bucket{service="customer-support",route="customer-support_getcards",workspace="default",le="00080.0"}': nil, client: 10.145.40.1, server: kong_status, request: "GET /metrics HTTP/1.1", host: "10.145.12.54:8100"
Interesting facts:
I already tried:
One pod contains:
While another is missing the le "80" bucket (le="00080.0" in the metric labels).
We are running our Kong in AWS EKS, upgraded from 3.6.1
Expected Behavior
The bucket should not disappear, but if it does for any reason, I would expect Kong to be able to recover from the inconsistent state (maybe by resetting the metric?).
Steps To Reproduce
No response
Anything else?
No response