Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing context extraction from SNS-SQS Events does not work with Raw Message Delivery #579

Open
mahwy opened this issue Oct 1, 2024 · 9 comments

Comments

@mahwy
Copy link

mahwy commented Oct 1, 2024

Expected Behavior

Tracing context injected to MessageAttributes._datadog.BinaryValue by dd-trace-js should be extracted.

Actual Behavior

TraceContextExtractor.getTraceEventExtractor() is unable to extract tracing context from SNS-SQS Events with Raw Message Delivery

Steps to Reproduce the Problem

  1. Publish a SNS message with raw message delivery enabled
  2. Push the SNS message to a SQS topic
  3. Subscribe the SQS topic in a AWS lambda function

Root Causes

When publishing a SNS message, dd-trace-js injects the tracing context under MessageAttributes._datadog -> BinaryValue

messageAttributes._datadog = {
  DataType: 'Binary',
  BinaryValue: ddInfo
}

Problem 1

Since an event from raw message delivery is missing Type and TopicArn, isSNSSQSEvent returns false and isSQSEvent returns true. So it picks SQSEventTraceExtractor, which only checks stringValue.

Problem 2

Manually setting Type and TopicArn in the event body would force [SNSSQSEventTraceExtractor] (https://github.com/DataDog/datadog-lambda-js/blob/main/src/trace/context/extractors/sns-sqs.ts) to be picked, which supports both String and Binary data type. But the extractor is reading from MessageAttributes._datadog.Value instead of MessageAttributes._datadog.BinaryValue so no context is extracted.

Specifications

  • Datadog Lambda Layer version: 64
  • Node version: 18
@astuyve
Copy link
Contributor

astuyve commented Oct 2, 2024

Hey @mahwy! Thanks for the note.

We intentionally don't support Raw Message Delivery because, although we can pass trace context, we rely on the SNS and SQS metadata to provide the timestamps used for when the message was first sent to SNS and again when SNS delivered it to SQS.
Without that, we're not able to determine when the message was originally published, only that it showed up to SQS at the time in the SQS wrapper.

Instead we've asked people to disable raw message delivery if they want to trace through both SNS and SQS. We could certainly enable trace context extraction if that's something critical for you, but it won't create a coherent trace the way it does without raw message delivery.

Can you help me understand why raw message delivery is critical to your workflow?

Thanks!
AJ

@mahwy
Copy link
Author

mahwy commented Oct 2, 2024

@astuyve I'm receiving trace context even with raw message delivery. (Screenshot attached)

Image
messageAttributes._datadog was injected by dd-trace-js

I think the issue can be fixed by letting the SQSEventTraceExtractor read binaryValue in addition to stringValue

@astuyve
Copy link
Contributor

astuyve commented Oct 2, 2024

Hi @mahwy yes trace context is there but that is not enough. We need the timestamps for tracing to fully work, see: https://github.com/DataDog/datadog-lambda-js/blob/main/src/trace/span-inferrer.ts#L295

which isn't present when raw message delivery is enabled.

Can you help me understand why it's required for your use case?

@mahwy
Copy link
Author

mahwy commented Oct 2, 2024

We have existing AWS Lambda functions subscribing a SQS topic. Events are first published to a SNS topic and the SQS topic is subscribing the SNS topic. These SQS subscriber lambda functions are unable extract the trace contexts from the events creating two separate traces for each request chain.
As a workaround, I've added some code to inject messageAttributes._datadog with StringValue when publishing messages to the SNS topic, which fixes the problem.

const params: SNS.Types.PublishInput = {...};
  const activeSpan = tracer.scope()?.active();
  if (activeSpan) {
    const ddInfo = {};
    tracer.inject(activeSpan, 'text_map', ddInfo);
    if (Object.keys(ddInfo).length !== 0) {
      if (!params.MessageAttributes) {
        params.MessageAttributes = {};
      }
      params.MessageAttributes._datadog = {
        DataType: 'String',
        StringValue: JSON.stringify(ddInfo),
      };
    }
  }
 await sns.publish(params)

@onyxraven
Copy link

onyxraven commented Oct 2, 2024

I just ran across this same issue. We have a number of instances where we desire raw message delivery for the ease of use and efficiency (especially if we have messages that near the message size limit), plus the ability to use message filtering.

The timestamps in the message are only the SQS ones, and the SNS ones are left out? is that the main issue? I wonder if theres a way we could 'opt in' to still being able to pull these in?

The side problem is I tried the workaround proposed by @mahwy, but the auto instrumentation injection is overriding it I think, and making it still a binaryValue. Is there a way I can opt out of that conditionally on a call? Or another way to work around this?

@mahwy
Copy link
Author

mahwy commented Oct 3, 2024

@onyxraven right, it doesn't work with dd-trace auto-instrumentation. I had to exclude aws-sdk from esbuild external config. You might be able to disable aws-sdk instrumentation via DD_TRACE_DISABLED_PLUGINS env var.

@astuyve
Copy link
Contributor

astuyve commented Oct 8, 2024

Hi @onyxraven, thanks for the note

I just ran across this same issue. We have a number of instances where we desire raw message delivery for the ease of use and efficiency (especially if we have messages that near the message size limit), plus the ability to use message filtering.

As far as I know the SNS message attributes are fixed length and quite small, and you still can use SNS based message filtering, but I may be mistaken! Please correct me if I am.

We can add support to extract trace context from raw message delivery, but again the spans won't line up timing wise (because the metadata from SNS is missing – we're blind to it), and the connection would be from an SDK call to SNS -> [gap] -> SQS.

If that's something which would work for you all, I can help prioritize this effort.

@onyxraven
Copy link

IMO, thats a reasonable thing to opt into if possible? a config flag or env var we can set?

@astuyve
Copy link
Contributor

astuyve commented Oct 9, 2024

Yeah I think it's totally reasonable. I think we'll try to make it automatic and then when customers ask why SNS is missing we can point them to this discussion and explain that raw message deliver needs to be disabled to fully trace SNS. But that seems like a fine tradeoff to unblock you folks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants