-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[azservicebus] Enable distributed tracing #23860
base: main
Are you sure you want to change the base?
Conversation
karenychen
commented
Dec 11, 2024
•
edited
Loading
edited
- The purpose of this PR is explained in this or a referenced issue.
- The PR does not update generated files.
- These files are managed by the codegen framework at Azure/autorest.go.
- Tests are included and/or updated for code changes.
- Updates to module CHANGELOG.md are included.
- MIT license headers are included in each file.
Thank you for your contribution @karenychen! We will review the pull request and get back to you soon. |
Hi @lmolkova ! I had a small question regarding diagnostic-id in https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-end-to-end-tracing?tabs=net-standard-sdk-2 The .NET SDK seems to be hooking them up to a ReceiveMessages trace when the users set the diagnostic-id and linking the list of diagnostic ids from the messages to the Receive trace (code here). I might be misunderstanding how .NET is doing it, but I am wondering what we enable with the diagnostic-ids? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great so far, got some questions for the other experts in the crowd, but from an SB perspective it looks great.
} | ||
|
||
func getSpanAttributesForMessage(message *Message) []tracing.Attribute { | ||
attrs := []tracing.Attribute{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attrs := []tracing.Attribute{} | |
var attrs []tracing.Attribute |
I swear there's some linter that complains if you pre-init the slice and it's not technically needed anyways.
) | ||
defer func() { endSpan(err) }() | ||
|
||
err = s.links.Retry(ctx, EventSender, "SendMessageBatch", func(ctx context.Context, lwid *internal.LinksWithID, args *utils.RetryFnArgs) error { | ||
return lwid.Sender.Send(ctx, batch.toAMQPMessage(), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lmolkova, in cases like this where I have retries, should the span reporting be in the retry loop, or outside, like we have here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way it works in our HTTP SDKs is that the method span is above the retry layer, and its child HTTP span is below the retry layer. So you'd have spans like this.
Some method call span
HTTP span retry 1
HTTP span retry 2
I presume we'd want to do the same thing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhendrixMSFT I did a bit of digging -- the semantic conventions for HTTP has examples for how the spans should look like for retries: https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-authorization-retry-examples. However, for messaging systems there is not much information besides this chunk:
I am not sure if this implies the messaging spans are 1 per operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinging @lmolkova, the authority for this kind of thing :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I'd say that we'd do it the same way as what @jhendrixMSFT outlined above - start a span when the whole retry() function starts, and then a new sub-span within each retry operation.
sdk/messaging/azservicebus/sender.go
Outdated
ctx, endSpan := s.startSpan(ctx, "ScheduleAMQPAnnotatedMessages", tracing.ScheduleOperationName, | ||
tracing.Attribute{Key: tracing.BatchMessageCount, Value: int64(len(messages))}, | ||
) | ||
defer func() { endSpan(err) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhendrixMSFT, would it be worth building this pattern (via a callback, probably) into the tracing library? It can be internal, but it seems like everyone's going to do the "last error gets passed to endSpan before block ends" pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, @karenychen, we can build a helper function - maybe we'd stick it right in the retry function to make things easier since we're passing very similar information to both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it goes in the sdk/internal
module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Richard offline -- we are moving this to the Retry() layer :)
@@ -186,10 +187,11 @@ func deleteSubscription(t *testing.T, ac *admin.Client, topicName string, subscr | |||
// and fails tests otherwise. | |||
func peekSingleMessageForTest(t *testing.T, receiver *Receiver) *ReceivedMessage { | |||
var msg *ReceivedMessage | |||
// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question: is it possible for me to test the traces in outside of the local unit tests too? Are there instructions on how I can run the live tests (and potentially the stress tests)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can run the same suite of live tests. I need to update the sample.env to list them properly, but for now you can see the ones we use here: https://github.com/karenychen/azure-sdk-for-go/blob/bd50b2a1e4c72b1d22ba11314d315d939796c201/sdk/messaging/azservicebus/internal/test/test_helpers.go#L81
Just create a .env file and place it in the azservicebus
folder, and run go test
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can add /azp run go - azservicebus
as a comment on your PR to run it as part of your CI.
sdk/messaging/azservicebus/sender.go
Outdated
ctx, endSpan := s.startSpan(ctx, "ScheduleAMQPAnnotatedMessages", tracing.ScheduleOperationName, | ||
tracing.Attribute{Key: tracing.BatchMessageCount, Value: int64(len(messages))}, | ||
) | ||
defer func() { endSpan(err) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Richard offline -- we are moving this to the Retry() layer :)
) | ||
defer func() { endSpan(err) }() | ||
|
||
err = s.links.Retry(ctx, EventSender, "SendMessageBatch", func(ctx context.Context, lwid *internal.LinksWithID, args *utils.RetryFnArgs) error { | ||
return lwid.Sender.Send(ctx, batch.toAMQPMessage(), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jhendrixMSFT I did a bit of digging -- the semantic conventions for HTTP has examples for how the spans should look like for retries: https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client-authorization-retry-examples. However, for messaging systems there is not much information besides this chunk:
I am not sure if this implies the messaging spans are 1 per operation?
Tracer() tracing.Tracer | ||
|
||
// SetTracer sets the tracer for the AMQPLinks instance. | ||
SetTracer(tracing.Tracer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does AMQPLinks needs to own a Tracer object or should it just be passed in as an argument to each function call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this I am open to your preference :) Currently the tracer starts a span at the Retry()
function level. So we can either have it in the amqpLink layer, or keep it in the Sender/Receiver layer and passing it as an argument to each function call all the way down to the Retry()
function level. Which option do you think is more appropriate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
passing it as an argument to each function call all the way down to the Retry() function level
I'd prefer this, just to eliminate any potential race conditions with state.
I know the argument list is getting pretty gnarly with Retry(), and we can work on that (separately).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if it gets too gnarly though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to move the tracers 1 level up. Now they live in Sender
, Receiver
and Namespace
@@ -433,7 +434,7 @@ func (ns *Namespace) startNegotiateClaimRenewer(ctx context.Context, | |||
return | |||
case <-time.After(nextClaimAt): | |||
for { | |||
err := utils.Retry(refreshCtx, exported.EventAuth, "NegotiateClaimRefresh", func(ctx context.Context, args *utils.RetryFnArgs) error { | |||
err := utils.Retry(refreshCtx, tracing.NewNoOpTracer(), exported.EventAuth, "NegotiateClaimRefresh", func(ctx context.Context, args *utils.RetryFnArgs) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the NoOpTracer here temporary or is there a reason we shouldn't trace this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The claim negotiation is not outlined in the Otel conventions for SB, so I left it to a no-op tracer https://opentelemetry.io/docs/specs/semconv/messaging/azure-messaging/
Definitely open to adding it if we want to support it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I think we could consider it a request, just not user initiated. We can figure it out later, but maybe we could file a separate issue so we pick it up later.
It's definitely useful info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Just noticed we can use tracing.SpanKind
to denote that it is an internal span.
Updated the code to include this everywhere: Sender spans will have Producer
span kind, and Receiver spans will have Consumer
span kind. The NegotiateClaim
function has the Internal
span kind.
|
||
type SetAttributesFn func([]Attribute) []Attribute | ||
|
||
func NewSpanConfig(name SpanName, options ...SetAttributesFn) *SpanConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this indirection (for SetAttributesFn) is a bit odd - it looks like attributes ...Attribute
would cover what we use it for, and helps a bit in showing that's the only thing we use it for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched this one up a bit to directly return list of attributes, let me know what you think!
@@ -361,6 +369,12 @@ func (r *Receiver) DeadLetterMessage(ctx context.Context, message *ReceivedMessa | |||
} | |||
|
|||
func (r *Receiver) receiveMessagesImpl(ctx context.Context, maxMessages int, options *ReceiveMessagesOptions) ([]*ReceivedMessage, error) { | |||
var err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you move this span code into ReceiveMessages
it'll be less tricky since you'll only have a single spot where you need to log the error, and we don't have to be wary of assigning err
in all the code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's coming along really well. I left some comments, but I'm really liking it.