[Bug] TraceQL Metrics seems to be incorrect when using status=error #4608

Open
g3david405 opened this issue Jan 24, 2025 · 8 comments

@g3david405

Describe the bug
I use TraceQL metrics to get the count of failed DB query spans.

I found that when I count using span attributes, the numbers match correctly.

However, when using status = error, the numbers are incorrect.

I used the following query and retrieved a lot of spans (around 500), but in TraceQL metrics the count is far lower than the number of spans retrieved.

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error }

Image

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | count_over_time() by (resource.service.name)

Image

And there are indeed a lot of spans with status = error in Tempo.

Image

Is this a bug, or am I using metrics incorrectly?

Expected behavior
The metrics numbers should match the span count.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm (tempo-distributed 1.31.0, tempo version 2.7.0)
@joe-elliott
Member

I'm sorry but I'm not really able to use these screenshots to debug this issue. I'm not saying there's not a bug, but nothing in your issue seems inconsistent to me.

count_over_time() by (resource.service.name)

count_over_time() is a bit odd because it's bound to a hidden step: it calculates the number of spans that match your query over the past step duration. I would recommend using rate(), which always shows a per-second rate of span occurrences.
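
For reference, the same filter expressed with rate(), grouped the same way, would look something like this (shown for illustration only):

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | rate() by (resource.service.name)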

@g3david405
Author

@joe-elliott Thanks for the reply!

Sorry, let me rephrase my question.

Between 10:00 and 11:00, I used the same query:
{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error }
and found more than 1000 spans, including around 30 different resource.service.name values.

Image

However, when I tried using:
{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | count_over_time() by (resource.service.name)
to check which resource.service.name caused the most database query errors, I found that only three resource.service.name values had metrics recorded, and the span counts in the metrics did not match the actual span count.

Image

This issue does not occur when I only use span.{attribute} to search.
But if my query includes conditions like { duration > 30000 } or { status = error }, the metrics numbers are significantly different from the span count.
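
For illustration, an attribute-only variant of the query above, presumably something like the following, is the kind of query where the counts do match:

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" } | count_over_time() by (resource.service.name)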

And the reason I use count_over_time instead of rate is that I want to calculate the exact number of errors each service encountered per minute. In fact, the graphs generated by rate and count_over_time are the same, except for the units.

Image Image

@joe-elliott
Member

Service name discrepancy

The service name you have blocked out is the root service name of the trace, not the service name on the span identified by your filters. If you expand those traces and look at the service names on the spans, are you still seeing 30?

Rate discrepancy

The final image shows a rate of .33 spans / second over a 1 hour period. .33 * 3600 = 1188 which is over 1k.

@g3david405
Author

@joe-elliott
Yes, I blocked out the root service name, but the child spans in these traces have more than 30 different service names.

@g3david405
Author

g3david405 commented Jan 28, 2025

The final image shows a rate of .33 spans / second over a 1 hour period. .33 * 3600 = 1188 which is over 1k.

The calculation is wrong.

The rate is 0.13

Image

My default time interval is 15 s, so the calculation should be 0.13 * 15 (seconds) ≈ 2, which matches my count_over_time function.

But not every 15 s interval has spans; otherwise the graph would be denser.

For example, if two spans occurred between 15 and 30 seconds within a minute, the rate for 15–30 seconds would be 0.13. However, you cannot assume that the rate for the entire minute would also be 0.13.

We can see that count_over_time shows only 5 bars between 09:30 and 09:35 (about 2 spans per bar), so roughly 10 spans occurred during this period.

Let me restate: my calculation method can perfectly verify the number of spans when filtering only with span.{attributes}. However, as soon as the filter conditions include something like { duration > 30000 } or { status = error }, the numbers will completely fail to match.

@g3david405
Author

@joe-elliott
Any ideas?
Thanks.

@joe-elliott
Member

joe-elliott commented Feb 7, 2025

You are correct. I was moving fast and misread .13 as .33. Thank you for pointing that out. So I have tried to recreate this w/o luck. I'm using the following k6 script:

k6 script
import {sleep} from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
    vus: 1,
    duration: "3h", 
};

const endpoint = __ENV.ENDPOINT || "otel-collector:4317"
const client = new tracing.Client({
    endpoint,
    exporter: tracing.EXPORTER_OTLP_HTTP,
    tls: {
        insecure: true,
    },
    headers: {
        'X-Scope-OrgID': '3',
    },
});

const traceDefaults = {
    attributeSemantics: tracing.SEMANTICS_HTTP,
}

const traceTemplates = [
    // fast internal calls
    {
        defaults: traceDefaults,
        spans: [
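            // 1 of the 3 templates: the http.status_code 500 attribute gives this span status = error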
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 500, "application.version": "0.3"}},
        ]
    },
    {
        defaults: traceDefaults,
        spans: [
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 200, "application.version": "0.3"}},
        ]    
    },
    {
        defaults: traceDefaults,
        spans: [
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 200, "application.version": "0.3"}},
        ]
    },
]

let idx = 0;

export default function () {
    idx = (idx + 1) % traceTemplates.length;

    const gen = new tracing.TemplatedGenerator(traceTemplates[idx])
    client.push(gen.traces())

    sleep(.1);
}

export function teardown() {
    client.shutdown();
}

This is loaded and pointed at a Tempo instance something like this:

docker-compose
version: '3.0'

services:
  grafana:
    image: grafana/grafana:11.5.1
    ports:
      - 3000:3000/tcp

  tempo:
    image: grafana/tempo:main-d7b1655
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - ./devenv/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"   # tempo

  k6-tracing:
    image: ghcr.io/grafana/xk6-client-tracing:latest # requires locally built xk6-client-tracing
    command: [ "run", "/var/k6/template.js" ]
    environment:
      - ENDPOINT=http://tempo:4318
    restart: always
    depends_on:
      - tempo
    volumes:
      - ./devenv/template.js:/var/k6/template.js

If you review the script, it creates a single-span trace and then sleeps 100 ms, so it should generate roughly 10 spans per second. 1 out of 3 will have status = error.
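
The exact queries behind the two screenshots below aren't shown; presumably they are something along the lines of:

{} | rate()

{ status = error } | rate()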

Plain rate is showing about that:

Image

Rate of status = error shows about 3.33 as expected:

Image

I do not know what is happening in your case.

@g3david405
Author

Thanks, I will try it later!
