Question about tracecontext in load-balancer scenarios #575
Now, for (1.) and (2.), one could argue that every cloud provider load balancer should simply always behave like this (restore the original trace context for outbound requests), on the off chance that (1.) and (2.) are true. But that would also mean that our imaginary cloud provider would need to spend the extra complexity and computation cycles on this behavior for every request, when in reality only a small subset of users would benefit. (In most cases where a cloud provider load balancer is involved, A and C really belong to different parties and send telemetry to different observability backends, or to no backend at all.) To me this sounds like a somewhat unreasonable expectation.

And for (3.), you could argue that since all of B is within the scope of control of one party, they could know which of their services need to save the original context to trace state and which services need to restore it for downstream propagation. But what if the cloud provider uses general-purpose implementations for trace context handling (say, an OTel SDK with auto-instrumentation)? Would you have them interfere with the automatic context propagation? Do custom tracing? This sounds like a high bar to me as well.

That being said, it is worth noting that the observability backend that A and C are talking to can still see that all spans belong to one trace, since the trace ID is identical throughout the whole distributed transaction. But it is true that, without further help, it cannot restore the correct tree structure because the direct parent-child relation is lost.
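To illustrate that last point, here is a minimal sketch of what the backend shared by A and C would see. The span dictionaries and the `find_orphans` helper are hypothetical and only model the idea; they are not taken from any real backend.

```python
# Hypothetical span records as the backend used by A and C might store them.
# B's spans went to a different backend, so they are missing here.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

spans = [
    # A's outgoing request span: root of the trace, no parent.
    {"trace_id": TRACE_ID, "span_id": "00f067aa0ba902b7", "parent_id": None},
    # C's incoming request span: its parent is a span created inside B.
    {"trace_id": TRACE_ID, "span_id": "53995c3f42cd8ad8", "parent_id": "b7ad6b7169203331"},
]


def find_orphans(spans):
    """Return spans whose parent ID does not match any span known to this backend."""
    known_ids = {s["span_id"] for s in spans}
    return [
        s for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known_ids
    ]


# Both spans can be grouped into one trace via the shared trace ID ...
same_trace = len({s["trace_id"] for s in spans}) == 1
# ... but C's span cannot be attached to A's span in the tree.
orphans = find_orphans(spans)
```

Here `same_trace` is true, yet `orphans` contains C's span, which is exactly the "one trace, broken tree" situation described above.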
An alternative approach could be this, assuming A and C are under the control of the same party: if A knows that it is talking to a load balancer which will break the direct parent-child relationship, it could store the span ID of the span representing the outgoing request in

In a former life (while still at Instana) I implemented this latter strategy quite successfully to deal with this type of situation: requests going through third-party services that send spans elsewhere. Unfortunately, these types of situations (distributed transactions going through some infrastructure that is monitored by a different observability solution) are quite common in my experience, and to the best of my knowledge there is no ideal one-size-fits-all solution. (Not unless we specify an API where observability backends can talk to each other to get those missing spans.)
Hello all,
Wanted to get your feedback on the below use case related to tracecontext:
Let's say there's a global load balancer service (say, an L7 load balancer offered by a cloud provider) that consists of multiple microservices. Let's say it wants to use distributed tracing (DT) to improve life for its on-call engineers.
Let's say an application uses the above load balancer service to efficiently route requests to the right backend. For example, let's say there are two parts of this application-level code:
The request goes as:
A (client application) -> B (cloud provider offered load balancer service (with multiple internal microservices)) -> C (backend service)
Now, if B is not just propagating the original context, but is actively participating in the trace, then the parent-child relationship between A and C will be broken. This is because the spans emitted by A and C might be going to a different observability backend than those of B.
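A minimal sketch of why this happens, using the `traceparent` header layout from the W3C Trace Context specification (`version-traceid-parentid-flags`). The `participate` function is a hypothetical stand-in for any hop that creates its own span, and the header value is only an example.

```python
import secrets


def parse_traceparent(header):
    """Split a traceparent header into its four dash-separated fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def participate(header):
    """Simulate a hop (such as B) that starts its own span and forwards
    a traceparent whose parent ID is that new span's ID."""
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


# Header sent by A; the parent ID is A's outgoing request span.
from_a = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
# Header received by C after B participated in the trace.
to_c = participate(from_a)
```

The trace ID in `to_c` is unchanged, but its parent ID now refers to a span inside B. If B's spans are not in the backend used by A and C, C's span points to a parent that backend has never seen.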
In such a situation, it seems to me that B should restore the original trace context (the one it got from A) before finally calling C. For this, it will likely need to store and propagate that original trace context (at least A's span ID and A's trace flags) in tracestate, so that it can use it when it reaches the point where it needs to restore the context (and then clear it before actually forwarding the request to C).
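The idea above could be sketched roughly as follows. This is only an illustration under assumptions: the tracestate key `lb-orig` and both helper functions are made up for this example and are not part of the W3C specification or any SDK. The sketch relies only on the specified header layouts (`traceparent` as `version-traceid-parentid-flags`, `tracestate` as a comma-separated list of `key=value` members with new entries added on the left).

```python
ORIG_KEY = "lb-orig"  # hypothetical tracestate key chosen for this sketch


def split_tracestate(tracestate):
    """Split a tracestate header into its non-empty list members."""
    return [m.strip() for m in tracestate.split(",") if m.strip()]


def save_original(traceparent, tracestate=""):
    """At B's entry point: remember A's span ID and trace flags in tracestate."""
    _version, _trace_id, parent_id, flags = traceparent.split("-")
    members = [m for m in split_tracestate(tracestate)
               if not m.startswith(ORIG_KEY + "=")]
    # New entries are added to the left of the list.
    return ",".join([f"{ORIG_KEY}={parent_id}-{flags}"] + members)


def restore_original(traceparent, tracestate):
    """At B's exit point: rebuild A's context and drop the saved entry."""
    version, trace_id, _parent_id, _flags = traceparent.split("-")
    members = split_tracestate(tracestate)
    saved = next((m for m in members if m.startswith(ORIG_KEY + "=")), None)
    if saved is None:
        return traceparent, tracestate  # nothing was saved, forward unchanged
    orig_parent_id, orig_flags = saved.split("=", 1)[1].split("-")
    remaining = [m for m in members if m is not saved]
    return f"{version}-{trace_id}-{orig_parent_id}-{orig_flags}", ",".join(remaining)


# B receives this from A (example values):
from_a = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
state_in_b = save_original(from_a, "vendor=abc")

# Inside B the parent ID has been replaced by B's own spans:
inside_b = "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"

# Just before calling C, B restores A's context and clears its entry:
to_c, state_to_c = restore_original(inside_b, state_in_b)
```

With these example values, `to_c` equals `from_a` and `state_to_c` is back to `vendor=abc`, so C's span would be recorded as a direct child of A's outgoing request span.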
Wanted to get your feedback on the above thinking - and if you can think of any problems with this approach, or if there's a better approach.