To start, please note that this issue has not led to any cluster/node instability. It describes a periodic increase in allocated objects on the heap, which makes it a good candidate for optimization.
About a year ago, we migrated our histogram library away from HDR histograms to histograms backed by the Prometheus client (see #85990). In short, this was done because the HDR histograms generated hundreds of histogram buckets, which strained downstream systems such as Prometheus. The Prometheus-backed histogram uses a static number of buckets.
More recently, a bug was found with the windowed Prometheus-backed histograms causing periodic gaps in reported data (see #98266). This was fixed in #104088, which updated the logic that merges histogram windows.
It was recently discovered that the migration from HDR to Prometheus-backed histograms increased the number of heap-allocated objects. This is triggered every 10 seconds, when the internal TSDB polls all metrics registries to write the latest values to the TSDB (it also happens periodically to generate the NodeStatus report). These polls of the metric data require a data transformation from prometheus.Histogram (used in our internal representation) to the prometheusgo.Metric type. It's the call to (*prometheus.Histogram).Write(prometheusgo.Metric) in this process that triggers the allocations, something that happens within the vendored library, not in our own code.
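For reference, that transformation looks roughly like the standalone sketch below (the metric name is hypothetical; prometheusgo is CockroachDB's import alias for the client_model dto package):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // imported as prometheusgo in CockroachDB
)

func main() {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_request_latency_seconds", // hypothetical metric name
		Buckets: prometheus.DefBuckets,             // static bucket layout, unlike HDR
	})
	h.Observe(0.42)

	// This is the transformation each poll performs: Write fills a
	// prometheusgo (dto) Metric, allocating the histogram proto and its
	// bucket slice internally on every call.
	var m dto.Metric
	if err := h.Write(&m); err != nil {
		panic(err)
	}
	fmt.Println(m.GetHistogram().GetSampleCount(), m.GetHistogram().GetSampleSum())
}
```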
This originally appeared when the migration from HDR to Prometheus first happened. However, the issue was exacerbated by the fixes made to how histogram windows are merged, which added an additional call to (*prometheus.Histogram).Write(prometheusgo.Metric), effectively doubling the number of allocated objects.
The offending code is in cockroach/pkg/util/metric/metric.go, lines 425 to 435 at 407017d; note the two calls to .Write():
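The shape of that merge, as a hedged sketch rather than the actual metric.go code (the function name below is invented for illustration), is: write both the current and previous windows out, then fold the previous window's counts into the current snapshot.

```go
package metricsketch

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // prometheusgo in CockroachDB
)

// mergeWindowedHistogram is a sketch of the pattern, not the code in
// metric.go. Each poll writes both windows into fresh prometheusgo (dto)
// structs, which is where the allocations occur. It assumes both windows
// share the same static bucket layout.
func mergeWindowedHistogram(cur, prev prometheus.Histogram) (*dto.Metric, error) {
	curM := &dto.Metric{}
	if err := cur.Write(curM); err != nil { // first allocation-heavy Write
		return nil, err
	}
	prevM := &dto.Metric{}
	if err := prev.Write(prevM); err != nil { // second Write, roughly doubling the allocations
		return nil, err
	}
	// Fold the previous window's counts into the current snapshot.
	curH, prevH := curM.GetHistogram(), prevM.GetHistogram()
	count := curH.GetSampleCount() + prevH.GetSampleCount()
	sum := curH.GetSampleSum() + prevH.GetSampleSum()
	curH.SampleCount, curH.SampleSum = &count, &sum
	for i, b := range curH.GetBucket() {
		merged := b.GetCumulativeCount() + prevH.GetBucket()[i].GetCumulativeCount()
		b.CumulativeCount = &merged
	}
	return curM, nil
}
```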
In the extreme case, in heap profiles that were timed to overlap with the TSDB polling that happens every 10s, these (*histogram).Write() calls accounted for nearly 30% of cumulative allocated objects on the heap. For example:
$ for i in $(ls ./memprof*.pprof | sort -R | head -n 30); do go tool pprof -top -alloc_objects -nodecount=1 "$i" | tail -n2 ; done
flat flat% sum% cum cum%
13237168657 27.51% 27.51% 13237168657 27.51% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
591896349 30.39% 30.39% 591896349 30.39% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
12157289072 27.69% 27.69% 12157289072 27.69% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
13236705516 27.51% 27.51% 13236705516 27.51% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
11887529425 27.76% 27.76% 11887529425 27.76% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
46995371 24.05% 24.05% 46995371 24.05% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
319519281 28.70% 28.70% 319519281 28.70% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
594184736 30.40% 30.40% 594184736 30.40% github.com/prometheus/client_golang/prometheus.(*histogram).Write
flat flat% sum% cum cum%
12968362073 27.90% 27.90% 12968362073 27.90% github.com/prometheus/client_golang/prometheus.(*histogram).Write
While the allocations happen within a vendored library, there is still potential for optimization here. We could, for example, rewrite the .Write() logic ourselves to be more conservative with allocations.
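One adjacent direction, sketched below with hypothetical names and not a settled design (a full rewrite of Write would also need access to the histogram's internal bucket counts), is to at least reuse the destination struct across polls:

```go
package metricsketch

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go" // prometheusgo in CockroachDB
)

// cachedSnapshot is a hypothetical wrapper that keeps one dto.Metric per
// histogram and reuses it across polls instead of building a fresh one every 10s.
type cachedSnapshot struct {
	mu   sync.Mutex
	dest dto.Metric // reused across polls
}

// snapshot fills the cached dto.Metric from h. Caveat: the upstream
// (*histogram).Write still allocates a new *dto.Histogram and bucket slice
// into dest on every call, so the full win requires a custom Write that
// fills preallocated buckets; this only avoids reallocating the outer struct.
func (c *cachedSnapshot) snapshot(h prometheus.Histogram) (*dto.Metric, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err := h.Write(&c.dest); err != nil {
		return nil, err
	}
	return &c.dest, nil
}
```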
Jira issue: CRDB-33029