I've found a PeriodicReader shutdown timing bug that can result in the final export of delta metrics accidentally reporting the cumulative value of a metric instead of the delta value since the previous export.
Here is a simplified example of what I'm seeing:
Create a Meter backed by a PeriodicReader with delta temporality.
Register an observable counter that always reports the value 10 in its callback.
First PeriodicReader export will report a value of 10.
Second and subsequent PeriodicReader exports will report values of 0 as expected.
Call Shutdown on the MeterProvider. This shuts down the PeriodicReader which forces one final collect+export.
Final PeriodicReader export will almost always report a value of 0 but in rare cases will report a value of 10.
In other words, the final export will sometimes report the cumulative total value instead of the delta since the last export.
In practice, this bug is hit when Shutdown is called during an ongoing collect+export, specifically when this or this error check is hit. Shutdown cancels the background context that these checks inspect, which is what triggers those branches.
Environment
OS: seen on multiple (Linux and macOS)
Architecture: seen on multiple (arm64 and amd64)
Go Version: 1.24 (also seen on 1.23)
opentelemetry-go version: v1.34.0
Steps To Reproduce
In the happy path for collecting sum metrics:
Metric measurement from callback is added to valueMap here.
With the example above, this would put a measurement of 10 in the map.
Not sure why this does v.n += value instead of v.n = value for cumulative types, btw.
Subsequent agg computation uses that valueMap measurement here to compute the delta and then clears that valueMap a few lines later.
This would pull 10 from the map and calculate the delta as 10 - lastReported.
After the first export, lastReported is set to 10 and this never changes.
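The happy path above can be sketched as a small stdlib-only model. This is not the SDK's code: `precomputedSum`, `measure`, `delta`, and `happyPathDeltas` are hypothetical names, and the model only assumes the two behaviors described above (measurements accumulate with `+=`, and aggregation computes `value - lastReported` and then clears the pending map).

```go
package main

import "fmt"

// precomputedSum is a hypothetical, simplified stand-in for the aggregator
// state behind an observable counter exported with delta temporality.
type precomputedSum struct {
	values   map[string]int64 // pending measurements (the "valueMap")
	reported map[string]int64 // last cumulative value per attribute set ("lastReported")
}

func newPrecomputedSum() *precomputedSum {
	return &precomputedSum{values: map[string]int64{}, reported: map[string]int64{}}
}

// measure mirrors the `v.n += value` accumulation described in the issue.
func (s *precomputedSum) measure(attrs string, value int64) {
	s.values[attrs] += value
}

// delta computes value - lastReported for each attribute set, records the new
// cumulative value, and clears the pending measurements.
func (s *precomputedSum) delta() map[string]int64 {
	out := make(map[string]int64, len(s.values))
	for attrs, v := range s.values {
		out[attrs] = v - s.reported[attrs]
		s.reported[attrs] = v
	}
	s.values = map[string]int64{}
	return out
}

// happyPathDeltas runs n complete collect cycles in which the callback always
// observes 10, and returns the delta exported by each cycle.
func happyPathDeltas(n int) []int64 {
	s := newPrecomputedSum()
	deltas := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		s.measure("ordinal=0", 10)
		deltas = append(deltas, s.delta()["ordinal=0"])
	}
	return deltas
}

func main() {
	fmt.Println(happyPathDeltas(3)) // prints [10 0 0]
}
```

Because every cycle clears the pending map, each new measurement starts from zero and the delta settles at 0 after the first export.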
In the buggy shutdown path:
Metric measurement from callback is added to valueMap during an export cycle as part of a periodic reader background loop.
With the example above, this would put a measurement of 10 in the map.
Shutdown is called while getting measurements so instead of doing any agg computations, the export ends early here.
The valueMap still has a value of 10 since it's not until aggregations run later that the value map is cleared.
The PeriodicReader kicks off one final collect+export...
Metric measurement from callback is added to non-empty valueMap during the export cycle
This puts a measurement of 20 in the map.
v.n += value is actually v.n = 10 + 10 in this case.
Subsequent agg computation uses the incorrect valueMap measurement.
This would pull 20 from the map and calculate the delta as 20 - lastReported (that is, 20 - 10 = 10).
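The buggy sequence can be modeled with the standard library alone. The sketch below is an assumption-laden simplification, not the SDK's implementation: `sumState` and `collect` are hypothetical, and the context is canceled before the aborted cycle starts rather than partway through it, which is enough to take the same early-return branch.

```go
package main

import (
	"context"
	"fmt"
)

// sumState is a hypothetical single-attribute-set model of the aggregator.
type sumState struct {
	pending      int64 // stand-in for the valueMap entry
	lastReported int64 // last cumulative value used for delta computation
}

// collect models one cycle: record the callback's observation, return early
// if ctx is canceled, otherwise compute the delta and clear pending.
func (s *sumState) collect(ctx context.Context, observed int64) (int64, error) {
	s.pending += observed // mirrors v.n += value
	if err := ctx.Err(); err != nil {
		// Early return: pending is NOT cleared, which is the stale state.
		return 0, err
	}
	delta := s.pending - s.lastReported
	s.lastReported = s.pending
	s.pending = 0
	return delta, nil
}

// buggyShutdownDeltas returns the deltas exported by two normal cycles, one
// aborted cycle (which exports nothing), and the final shutdown cycle.
func buggyShutdownDeltas() []int64 {
	s := &sumState{}
	var deltas []int64
	for i := 0; i < 2; i++ {
		d, _ := s.collect(context.Background(), 10)
		deltas = append(deltas, d)
	}

	// Shutdown cancels the context used by the in-flight cycle.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	if _, err := s.collect(ctx, 10); err == nil {
		panic("expected the canceled cycle to abort")
	}

	// Final collect+export triggered by Shutdown: pending becomes 10 + 10.
	d, _ := s.collect(context.Background(), 10)
	return append(deltas, d)
}

func main() {
	fmt.Println(buggyShutdownDeltas()) // prints [10 0 10]
}
```

The last element is 10 rather than 0 because the aborted cycle left 10 in `pending`, so the final cycle computes 20 - 10.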
I can reliably reproduce this with this unit test:
package tenantmetricsotel

import (
	"context"
	"fmt"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func TestRepro(t *testing.T) {
	exp := &testExporter{}
	start := time.Now()
	reader := sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(5*time.Millisecond))
	meterProvider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	meter := meterProvider.Meter("otel-test-metrics")

	var wg sync.WaitGroup
	cs := &staticCounters{meter: meter}

	t.Log("creating counters")
	for range 100 {
		cs.add(t)
	}
	require.Equal(t, int64(0), exp.exportCount.Load())

	t.Log("shutting down")
	wg.Add(1)
	go func() {
		defer wg.Done()
		dur := 200 * time.Millisecond
		time.Sleep(dur - time.Since(start))
		assert.NoError(t, meterProvider.Shutdown(context.Background()))
	}()
	wg.Wait()

	// Each counter always reports 10 so the ultimate total count is 10*n, where n is the number of counters.
	// Since the sum of exported deltas is ultimately the cumulative total,
	// the exported deltas should add up to 10*n too.
	assert.Equal(t, cs.numCounters.Load()*10, exp.exportTotal.Load())
}

type staticCounters struct {
	meter       metric.Meter
	numCounters atomic.Int64
}

// add registers one more callback that always observes 10 for a new "ordinal" attribute value.
func (c *staticCounters) add(t testing.TB) {
	ordinal := int(c.numCounters.Add(1) - 1)
	attrSet := metric.WithAttributeSet(attribute.NewSet(attribute.String("ordinal", strconv.Itoa(ordinal))))
	oc, err := c.meter.Int64ObservableCounter("foo")
	require.NoError(t, err)
	_, err = c.meter.RegisterCallback(func(ctx context.Context, observer metric.Observer) error {
		observer.ObserveInt64(oc, 10, attrSet)
		return nil
	}, oc)
	require.NoError(t, err)
}

type testExporter struct {
	exportCount atomic.Int64
	exportTotal atomic.Int64
}

var _ sdkmetric.Exporter = (*testExporter)(nil)

func (e *testExporter) Temporality(kind sdkmetric.InstrumentKind) metricdata.Temporality {
	return metricdata.DeltaTemporality // metricdata.CumulativeTemporality
}

func (e *testExporter) Aggregation(kind sdkmetric.InstrumentKind) sdkmetric.Aggregation {
	return sdkmetric.AggregationSum{}
}

func (e *testExporter) Export(ctx context.Context, rm *metricdata.ResourceMetrics) error {
	e.exportCount.Add(1)
	sumTotal := int64(0)
	for _, sm := range rm.ScopeMetrics {
		for _, m := range sm.Metrics {
			for _, dp := range m.Data.(metricdata.Sum[int64]).DataPoints {
				sumTotal += dp.Value
			}
		}
	}
	if e.Temporality(sdkmetric.InstrumentKindCounter) == metricdata.DeltaTemporality {
		e.exportTotal.Add(sumTotal)
		if sumTotal > 0 {
			fmt.Println(">> total delta is non-zero", e.exportCount.Load(), sumTotal)
		}
	} else {
		e.exportTotal.Store(sumTotal)
	}
	return nil
}

func (e *testExporter) ForceFlush(ctx context.Context) error {
	return nil
}

func (e *testExporter) Shutdown(ctx context.Context) error {
	return nil
}

Expected behavior
Exported values of delta metrics should always be the delta since the last export, not the cumulative value since the process started. The latter results in wildly incorrect metrics.
More generally, I would not expect a context being canceled during a collect+export cycle to corrupt state.
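The expectation can be stated as an invariant: for a monotonic counter, the exported deltas should add up to the latest cumulative value. The helper below is a hypothetical, stdlib-only illustration of that check (`deltasMatchCumulative` is not an SDK function), applied to the example sequences from this issue.

```go
package main

import "fmt"

// deltasMatchCumulative reports whether the exported deltas add up to the
// cumulative total that the instrument reports.
func deltasMatchCumulative(deltas []int64, cumulative int64) bool {
	var sum int64
	for _, d := range deltas {
		sum += d
	}
	return sum == cumulative
}

func main() {
	// Expected: first export 10, then zeros, including the final export.
	fmt.Println(deltasMatchCumulative([]int64{10, 0, 0, 0}, 10)) // prints true

	// Observed in the buggy shutdown path: the final export reports 10 again.
	fmt.Println(deltasMatchCumulative([]int64{10, 0, 0, 10}, 10)) // prints false
}
```

This is the same property the unit test asserts when it compares the summed export total against 10 times the number of counters.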