Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Automatic propagation of peer.service #247

Conversation

carlosalberto
Copy link
Contributor

@carlosalberto carlosalberto commented Jan 8, 2024

Knowing the service name on the other side of a remote call is valuable troubleshooting information. The semantic conventions represent this via peer.service, which needs to be manually populated.

This information can be effectively derived in the backend using the Resource of the parent Span, but is otherwise not available at Collector processing time, where it could be used for sampling and transformation purposes.

Defining (optional) automated population of peer.service will greatly help adoption of this attribute by users and vendors explicitly interested in this scenario.

Based on open-telemetry/semantic-conventions#439

@carlosalberto carlosalberto requested review from a team January 8, 2024 16:59
Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This OTEP does not discuss the rollout risks. This new field has a very specific semantics that it must be modified or erased before calling a downstream service, neither of which is the default behavior (default is to propagate state unchanged). So if this behavior is introduced into some SDKs, the data in a company for this specific attribute will be a complete mess until every service deploys the updated SDK version.

In general, I am not a fan of using baggage for things that are supposed to change on every hop, it's a very clear abuse of the mechanism.

text/trace/0247-peer-service-propagation.md Outdated Show resolved Hide resolved

Automatic propagation of `peer.service` through `TraceState`.

## Motivation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation does not read convincing to me. Why does a service need to know who called it? If this is part of a transitive root cause isolation workflow, then you just use distributed traces for that. If this is about some business-specific behavior depending on who called you, e.g. multi-tenant behavior. then I think this mechanism is quite inappropriate - relying on a deployment/infra name of the caller is a pretty narrow use case, not suitable for general purpose multi-tenancy. So please describe Users and Jobs To Be Done of this feature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see this being helpful in two scenarios in my work:

  1. Sampling rule where you look at a given event and can see that its caller is X. For example, filtering out a noisy branches in a trace but that caller is something you want to keep.
  2. Folks who want to get this information eventually via tracing, but can't today, and so if there's an easier way to "add otel" without fully adopting tracing and getting this caller info, that'd be helpful for them.

@jmacd
Copy link
Contributor

jmacd commented Jan 18, 2024

Related work, cc @bogdandrutu @kalyanaj
w3c/trace-context#550

@carlosalberto carlosalberto force-pushed the peer-service-propagation branch from 3bf006e to c8b500a Compare January 22, 2024 16:47
@carlosalberto
Copy link
Contributor Author

Hey @yurishkuro

Added sampling scenarios that may throw light into how useful this feature could be. Please review.


### Use scenarios

Sampling can benefit from knowing the calling service, specifically:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As sampling is mentioned as a primary usage scenario for this feature, I wonder if would make sense to bundle this together with other sampling related values for which there's a proposal to propagate them via trace state: #235

All those values could then be consistently used and populated by samplers, one wouldn't need to invent a new configurable mechanism in SDKs (at least for the client side).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value makes sense to propagate with the sampling-related attributes in the tracestate, which are now covered in open-telemetry/semantic-conventions#793. Still, I see it as an independent property.

@tedsuo tedsuo added the triaged label Jan 29, 2024
@jmacd jmacd requested a review from kentquirk March 7, 2024 16:42
@jmacd
Copy link
Contributor

jmacd commented Mar 7, 2024

@carlosalberto I think we should try to build in more protection against accidental propagation of the peer service information. Also, I'm afraid "upstream" can lead to confusion--although unless we do something to avoid accidental propagation, it's literal and true-- the upstream service name would be the nearest ancestor context that happened to set it.

I want us to consider a mechanism that helps us scope tracestate variables to limit their impact in the future. I'm thinking of an entirely new tracestate vendor code for information (e.g, to for "Transient OpenTelemetry") that would purposefully terminate after a propagation event. (I think of this approach as complimentary to the idea in #207, which is for scoping state until the following propagation event.)

That said, this seems like an opportunity to allow peers to exchange more than only their service name. If we had a tracestate field for exchange of arbitrary peer-related variables, then the new SDK configuration knobs would be:

  • which attributes to Inject
  • which variables to apply as "context-scoped" (in the sense of Propose Context-scoped attributes. #207) in such a way they apply to the span (e.g., or metrics or logs) following Extract

Then, a tracestate could be formed to exchange arbitrary attributes, like:

tracestate: to=peer.service:myservice;some.property:somevalue;etc:etc

Receivers would apply these variables to the context and drop them from the tracestate before creating a new context.

@carlosalberto
Copy link
Contributor Author

Closing this as there doesn't seem to be enough interest and I don't plan to work on it myself

(It can always be "resurrected" by copying the OTEP file and opening it as a new instance, etc)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants