[Bug] TraceQL Metrics seems to be incorrect when using status=error #4608

Open
g3david405 opened this issue Jan 24, 2025 · 8 comments

@g3david405

Describe the bug
I use TraceQL metrics to get the count of failed DB query spans.

I found that when I count using span attributes, the numbers match correctly.

However, when using status = error, the numbers are incorrect.

I used the following query and retrieved a lot of spans (around 500), but in TraceQL metrics the count is far lower than the number of spans retrieved.

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error }

Image

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | count_over_time() by (resource.service.name)

Image

And there are indeed a lot of spans with status = error in Tempo.

Image

Is this a bug, or am I using metrics incorrectly?

Expected behavior
The metrics numbers should match the span count.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm (tempo-distributed 1.31.0, tempo version 2.7.0)
@joe-elliott
Member

I'm sorry but I'm not really able to use these screenshots to debug this issue. I'm not saying there's not a bug, but nothing in your issue seems inconsistent to me.

count_over_time() by (resource.service.name)

count_over_time() is a bit odd because it's bound to a hidden step: it calculates the number of spans that match your query over the past step duration. I would recommend using rate(), which always shows a per-second rate of span occurrences.
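
For reference, the same filter expressed with rate(), grouped the same way, would look something like this (shown for illustration only):

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | rate() by (resource.service.name)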

@g3david405
Author

@joe-elliott Thanks for the reply!

Sorry, let me rephrase my question.

Between 10:00 and 11:00, I used the same query:
{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error }
and found more than 1000 spans, including around 30 different resource.service.name values.

Image

However, when I tried using:
{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" && status = error } | count_over_time() by (resource.service.name)
to check which resource.service.name caused the most database query errors, I found that only three resource.service.name values had metrics recorded, and the span counts in the metrics did not match the actual span count.

Image

This issue does not occur when I only use span.{attribute} to search.
But if my query includes conditions like { duration > 30000 } or { status = error }, the metrics numbers are significantly different from the span count.
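
For illustration, an attribute-only variant of the query above, presumably something like the following, is the kind of query where the counts do match:

{ resource.service.name =~ "$Job" && span.db.system =~ "$DB_SYSTEM" } | count_over_time() by (resource.service.name)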

And the reason I use count_over_time instead of rate is that I want to calculate the exact number of errors each service encountered per minute. In fact, the graphs generated by rate and count_over_time are the same, except for the units.

Image Image

@joe-elliott
Member

Service name discrepancy

The service name you have blocked out is the root service name of the trace, not the service name on the span identified by your filters. If you expand those traces and look at the service names on the spans, are you still seeing 30?

Rate discrepancy

The final image shows a rate of .33 spans / second over a 1 hour period. .33 * 3600 = 1188 which is over 1k.

@g3david405
Author

@joe-elliott
Yes, I blocked out the root service name, but the child spans in these traces have more than 30 different service names.

@g3david405
Author

g3david405 commented Jan 28, 2025

The final image shows a rate of .33 spans / second over a 1 hour period. .33 * 3600 = 1188 which is over 1k.

The calculation is wrong.

The rate is 0.13

Image

My default time interval is 15 s, so the calculation should be 0.13 * 15 (seconds) ≈ 2, which matches my count_over_time function.

But not every 15 s interval has spans; otherwise the graph would be denser.

For example, if two spans occurred between 15 and 30 seconds within a minute, the rate for 15–30 seconds would be 0.13. However, you cannot assume that the rate for the entire minute would also be 0.13.

We can see that count_over_time shows only 5 bars between 09:30 and 09:35 (about 2 spans per bar), so roughly 10 spans occurred during this period.

Let me restate: my calculation method can perfectly verify the number of spans when filtering only with span.{attributes}. However, as soon as the filter conditions include something like { duration > 30000 } or { status = error }, the numbers will completely fail to match.

@g3david405
Author

@joe-elliott
Any ideas?
Thanks.

@joe-elliott
Member

joe-elliott commented Feb 7, 2025

You are correct. I was moving fast and misread .13 as .33. Thank you for pointing that out. So I have tried to recreate this w/o luck. I'm using the following k6 script:

k6 script
import {sleep} from 'k6';
import tracing from 'k6/x/tracing';

export const options = {
    vus: 1,
    duration: "3h", 
};

const endpoint = __ENV.ENDPOINT || "otel-collector:4317"
const client = new tracing.Client({
    endpoint,
    exporter: tracing.EXPORTER_OTLP_HTTP,
    tls: {
        insecure: true,
    },
    headers: {
        'X-Scope-OrgID': '3',
    },
});

const traceDefaults = {
    attributeSemantics: tracing.SEMANTICS_HTTP,
}

const traceTemplates = [
    // fast internal calls
    {
        defaults: traceDefaults,
        spans: [
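            // 1 of the 3 templates: the http.status_code 500 attribute gives this span status = error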
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 500, "application.version": "0.3"}},
        ]
    },
    {
        defaults: traceDefaults,
        spans: [
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 200, "application.version": "0.3"}},
        ]    
    },
    {
        defaults: traceDefaults,
        spans: [
            {service: "inventory-service", name: "/item", duration: {min: 100, max: 300}, attributes: {"http.status_code": 200, "application.version": "0.3"}},
        ]
    },
]

let idx = 0;

export default function () {
    idx = (idx + 1) % traceTemplates.length;

    const gen = new tracing.TemplatedGenerator(traceTemplates[idx])
    client.push(gen.traces())

    sleep(.1);
}

export function teardown() {
    client.shutdown();
}

This is loaded and pointed at a Tempo instance something like this:

docker-compose
version: '3.0'

services:
  grafana:
    image: grafana/grafana:11.5.1
    ports:
      - 3000:3000/tcp

  tempo:
    image: grafana/tempo:main-d7b1655
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - ./devenv/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"   # tempo

  k6-tracing:
    image: ghcr.io/grafana/xk6-client-tracing:latest # requires locally built xk6-client-tracing
    command: [ "run", "/var/k6/template.js" ]
    environment:
      - ENDPOINT=http://tempo:4318
    restart: always
    depends_on:
      - tempo
    volumes:
      - ./devenv/template.js:/var/k6/template.js

If you review the script, it creates a single-span trace and then sleeps 100 ms, so it should generate roughly 10 spans per second. 1 out of 3 will have status = error.
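
The exact queries behind the two screenshots below aren't shown; presumably they are something along the lines of:

{} | rate()

{ status = error } | rate()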

Plain rate is showing about that:

Image

Rate of status = error shows about 3.33 as expected:

Image

I do not know what is happening in your case.

@g3david405
Author

Thanks, I will try it later!
