To start, please note that this issue has not led to any cluster/node instability. It describes a periodic increase in allocated objects on the heap, which makes it a good candidate for optimization.
About a year ago, we migrated our histogram library away from HDR histograms to histograms backed by the Prometheus client (see #85990). In short, this was done because the HDR histograms generated hundreds of histogram buckets, which strained downstream systems such as Prometheus. The Prometheus-backed histogram uses a static number of buckets.
More recently, a bug was found with the windowed Prometheus-backed histograms causing periodic gaps in reported data (see #98266). This was fixed in #104088, which updated the logic that merges histogram windows.
It was recently discovered that the migration from HDR to Prometheus-backed histograms increased the number of heap-allocated objects. This is triggered every 10 seconds, when the internal TSDB polls all metrics registries to write the latest values to the TSDB (it also happens periodically to generate the NodeStatus report). These polls of the metric data require a data transformation from prometheus.Histogram (used in our internal representation) to the prometheusgo.Metric type. It's the call to (*prometheus.Histogram).Write(prometheusgo.Metric) in this process that triggers the allocations, something that happens within the vendored library, not in our own code.
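For reference, that transformation looks roughly like the standalone sketch below (the metric name is hypothetical; prometheusgo is CockroachDB's import alias for the client_model dto package):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // imported as prometheusgo in CockroachDB
)

func main() {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_request_latency_seconds", // hypothetical metric name
		Buckets: prometheus.DefBuckets,             // static bucket layout, unlike HDR
	})
	h.Observe(0.42)

	// This is the transformation each poll performs: Write fills a
	// prometheusgo (dto) Metric, allocating the histogram proto and its
	// bucket slice internally on every call.
	var m dto.Metric
	if err := h.Write(&m); err != nil {
		panic(err)
	}
	fmt.Println(m.GetHistogram().GetSampleCount(), m.GetHistogram().GetSampleSum())
}
```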
This originally appeared when the migration from HDR to Prometheus first happened. However, the issue was exacerbated by the fixes made to how histogram windows are merged, which added an additional call to (*prometheus.Histogram).Write(prometheusgo.Metric), effectively doubling the number of allocated objects.
The offending code is in cockroach/pkg/util/metric/metric.go, lines 425 to 435 at 407017d; note the two calls to .Write():
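The shape of that merge, as a hedged sketch rather than the actual metric.go code (the function name below is invented for illustration), is: write both the current and previous windows out, then fold the previous window's counts into the current snapshot.

```go
package metricsketch

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // prometheusgo in CockroachDB
)

// mergeWindowedHistogram is a sketch of the pattern, not the code in
// metric.go. Each poll writes both windows into fresh prometheusgo (dto)
// structs, which is where the allocations occur. It assumes both windows
// share the same static bucket layout.
func mergeWindowedHistogram(cur, prev prometheus.Histogram) (*dto.Metric, error) {
	curM := &dto.Metric{}
	if err := cur.Write(curM); err != nil { // first allocation-heavy Write
		return nil, err
	}
	prevM := &dto.Metric{}
	if err := prev.Write(prevM); err != nil { // second Write, roughly doubling the allocations
		return nil, err
	}
	// Fold the previous window's counts into the current snapshot.
	curH, prevH := curM.GetHistogram(), prevM.GetHistogram()
	count := curH.GetSampleCount() + prevH.GetSampleCount()
	sum := curH.GetSampleSum() + prevH.GetSampleSum()
	curH.SampleCount, curH.SampleSum = &count, &sum
	for i, b := range curH.GetBucket() {
		merged := b.GetCumulativeCount() + prevH.GetBucket()[i].GetCumulativeCount()
		b.CumulativeCount = &merged
	}
	return curM, nil
}
```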
In the extreme case, in heap profiles that were timed to overlap with the TSDB polling that happens every 10s, these (*histogram).Write() calls accounted for nearly 30% of cumulative allocated objects on the heap. For example:
$ for i in $(ls ./memprof*.pprof | sort -R | head -n 30); do go tool pprof -top -alloc_objects -nodecount=1 "$i" | tail -n2 ; done
flat flat% sum% cum cum%
13237168657 27.51% 27.51% 13237168657 27.51% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
591896349 30.39% 30.39% 591896349 30.39% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
12157289072 27.69% 27.69% 12157289072 27.69% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
13236705516 27.51% 27.51% 13236705516 27.51% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
11887529425 27.76% 27.76% 11887529425 27.76% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
46995371 24.05% 24.05% 46995371 24.05% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
319519281 28.70% 28.70% 319519281 28.70% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
594184736 30.40% 30.40% 594184736 30.40% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
12968362073 27.90% 27.90% 12968362073 27.90% github.com/prometheus/client_golang/prometheus.(*histogram).Write
While the allocations happen within a vendored library, there is still potential for optimization here. We could, for example, rewrite the .Write() logic ourselves to be more conservative with allocations.
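One adjacent direction, sketched below with hypothetical names and not a settled design (a full rewrite of Write would also need access to the histogram's internal bucket counts), is to at least reuse the destination struct across polls:

```go
package metricsketch

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // prometheusgo in CockroachDB
)

// cachedSnapshot is a hypothetical wrapper that keeps one dto.Metric per
// histogram and reuses it across polls instead of building a fresh one every 10s.
type cachedSnapshot struct {
	mu   sync.Mutex
	dest dto.Metric // reused across polls
}

// snapshot fills the cached dto.Metric from h. Caveat: the upstream
// (*histogram).Write still allocates a new *dto.Histogram and bucket slice
// into dest on every call, so the full win requires a custom Write that
// fills preallocated buckets; this only avoids reallocating the outer struct.
func (c *cachedSnapshot) snapshot(h prometheus.Histogram) (*dto.Metric, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := h.Write(&c.dest); err != nil {
		return nil, err
	}
	return &c.dest, nil
}
```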
Jira issue: CRDB-33029