Skip to content

Conversation

@mosche
Copy link
Contributor

@mosche mosche commented Nov 11, 2025

This PR fixes various issues causing an excessive volume of APM events and incorrectly reported transaction & span durations as described in further detail:

  • Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might
    be set by external transactions as well. If present, it caused APMTracer to report a transaction
    for every span when in fact being sampled out.

  • Correctly report the duration of transactions if not recording by not discarding the root span
    immediately. These transactions might still get reported, but without spans.

  • Discard transient trace start time in newTraceContext when a parent APM trace context
    already exists. If propagated, all spans would start at the same time.

Relates to ES-13389

Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might
be set by external transactions as well. If present, it caused APMTracer to report a transaction
for every span when in fact being sampled out.

Correctly report the duration of transactions if not recording by not discarding the root span
immediately. These transactions might still get reported, but without spans.

Discard transient trace start time in `newTraceContext` when a parent APM trace context
already exists. If propagated, all spans would start at the same time.

Relates to ES-13389
@mosche mosche requested a review from a team as a code owner November 11, 2025 14:39
@mosche mosche added >refactoring auto-backport Automatically create backport pull requests when merged :Core/Infra/Metrics Metrics and metering infrastructure branch:9.2 branch:9.1 branch:8.19 labels Nov 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Nov 12, 2025
@mosche mosche requested a review from jdconrad November 13, 2025 16:54
@mosche
Copy link
Contributor Author

mosche commented Nov 13, 2025

Correctly report the duration of transactions if not recording by not discarding the root span
immediately. These transactions might still get reported, but without spans.

Turns out this might be unnecessary (for serverless at least), the APM agent behavior depends on the apm-server version. For recent apm-server versions such transactions are not reported of unsampled.

Copy link
Contributor

@jdconrad jdconrad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple of questions.

final Object previousTraceContext = newTransientHeaders.remove(Task.APM_TRACE_CONTEXT);
if (previousTraceContext != null) {
newTransientHeaders.put(Task.PARENT_APM_TRACE_CONTEXT, previousTraceContext);
// Remove the trace start time override for a previous context if such a context already exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naive question - it's guaranteed there will never be overlap in this case? (If we remove we are sure the prior trace is completed?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TRACE_START_TIME is a rather unfortunate hack. It should only be used for the root span (and previously it wasn't possible to set this again, but it could be set to late). Overlapping shouldn't ever be an issue. But I'll add an assert to make sure this is never set after a trace context already exists catching the too late case.

Comment on lines +976 to +978
assert key != Task.TRACE_START_TIME || (hasApmTraceContext() || hasParentApmTraceContext()) == false
: "trace.starttime cannot be set after a trace context is present";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jdconrad this will prevent wrong usage of trace.starttime

@mosche mosche merged commit 928e584 into elastic:main Nov 14, 2025
34 checks passed
mosche added a commit to mosche/elasticsearch that referenced this pull request Nov 14, 2025
- Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might
be set by external transactions as well. If present, it caused APMTracer to report a transaction
for every span when in fact being sampled out.

- Correctly report the duration of transactions if not recording by not discarding the root span
immediately. These transactions might still get reported, but without spans.

- Discard transient trace start time in `newTraceContext` when a parent APM trace context
already exists. If propagated, all spans would start at the same time.

Relates to ES-13389
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.2
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 137908

elasticsearchmachine pushed a commit that referenced this pull request Nov 14, 2025
- Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might
be set by external transactions as well. If present, it caused APMTracer to report a transaction
for every span when in fact being sampled out.

- Correctly report the duration of transactions if not recording by not discarding the root span
immediately. These transactions might still get reported, but without spans.

- Discard transient trace start time in `newTraceContext` when a parent APM trace context
already exists. If propagated, all spans would start at the same time.

Relates to ES-13389
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged backport pending :Core/Infra/Metrics Metrics and metering infrastructure >refactoring serverless-linked Added by automation, don't add manually Team:Core/Infra Meta label for core/infra team v8.19.8 v9.1.8 v9.2.2 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants