
Large messages in the audit queue cannot be processed #4185

Closed
Jokelab opened this issue May 21, 2024 · 16 comments · Fixed by #4374

@Jokelab

Jokelab commented May 21, 2024

NOTE: issue edited for release

Symptoms

When a large message is sent from an audited endpoint, ingestion of the message fails and a message similar to the following is logged:

2024-05-21 08:03:46.8118|16|Fatal|ServiceControl.Audit.Auditing.AuditIngestion|OnCriticalError. 'Failed to import too many times'|System.InvalidOperationException: The link 'G6:14740510:amqps://**namespace removed**.servicebus.windows.net/-dd910ad4;0:7:8' is force detached by the broker because publisher(link3082) received a batch message with no data in it. Detach origin: Publisher.

Who's affected

You are affected if you are using ServiceControl 5.0.0 or later and you send large messages from an audited endpoint.

Root cause


This bug is caused by Particular/NServiceBus.Transport.AzureServiceBus#994.

Description

After upgrading from ServiceControl 5.0.4 to 5.2.0, audit message ingestion fails and messages are not forwarded to the configured audit log queue.

Expected behavior

Audit messages are ingested and forwarded to the configured audit log queue.

Actual behavior

  • The audit instance logs an exception in the FailedImports folder. (See log output)
  • An event is logged in ServicePulse: Audit Message Ingestion Process: Failed to import too many times
  • The audit queue is not processed anymore.
  • The log queue doesn't receive any messages.

Versions

ServiceControl 5.2.0
Transport: Azure ServiceBus (premium). I explicitly mention this because our system sometimes needs to process messages slightly bigger than 1 MB.

Our endpoints are Azure Functions using the following versions:
NServiceBus 9.0.0
NServiceBus.Transport.AzureServiceBus 4.0.0
NServiceBus.AzureFunctions.Worker.ServiceBus 5.0.0

Steps to reproduce

  • Deploy a premium Azure ServiceBus namespace
  • Install a ServiceControl and Audit instance with version 5.0.4.
    • Configure the audit instance to forward messages to an audit log queue.
    • Use Azure ServiceBus transport and point the connection string to the premium Azure ServiceBus namespace
  • Set the max message size to 100MB
  • Make sure everything runs correctly by sending some messages.
  • Upgrade the ServiceControl and audit instance to 5.2.0
  • Send some messages

Expected: messages are ingested from the audit queue and forwarded to the audit log queue
Actual: messages stay in the audit queue and are not forwarded to the audit log queue.

Relevant log output

In the audit instance log:
2024-05-21 08:03:46.8118|16|Fatal|ServiceControl.Audit.Auditing.AuditIngestion|OnCriticalError. 'Failed to import too many times'|System.InvalidOperationException: The link 'G6:14740510:amqps://**namespace removed**.servicebus.windows.net/-dd910ad4;0:7:8' is force detached by the broker because publisher(link3082) received a batch message with no data in it. Detach origin: Publisher.
For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(IReadOnlyCollection`1 messages, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.<>c.<<SendBatchAsync>b__21_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.<>c__22`1.<<RunOperation>b__22_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ServiceBusSender.SendMessagesAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.DispatchBatchForDestination(String destination, ServiceBusClient client, Transaction transaction, Queue`1 messagesToSend, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 198
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 101
   at ServiceControl.Audit.Auditing.AuditIngestor.Ingest(List`1 contexts) in /_/src/ServiceControl.Audit/Auditing/AuditIngestor.cs:line 72
   at ServiceControl.Audit.Auditing.AuditIngestion.Loop() in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 218
   at ServiceControl.Audit.Auditing.AuditIngestion.OnMessage(MessageContext messageContext, CancellationToken cancellationToken) in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 197
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ProcessMessage(ServiceBusReceivedMessage message, ProcessMessageEventArgs processMessageEventArgs, String messageId, Dictionary`2 headers, BinaryData body, CancellationToken messageProcessingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 285

Sample of an error message in the FailedImports folder:
Exception:
The link 'G6:14740510:amqps://**namespace removed**.servicebus.windows.net/-dd910ad4;0:7:8' is force detached by the broker because publisher(link3082) received a batch message with no data in it. Detach origin: Publisher.
For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
StackTrace:
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(AmqpMessage batchMessage, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchInternalAsync(IReadOnlyCollection`1 messages, TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.<>c.<<SendBatchAsync>b__21_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.<>c__22`1.<<RunOperation>b__22_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpSender.SendBatchAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ServiceBusSender.SendMessagesAsync(ServiceBusMessageBatch messageBatch, CancellationToken cancellationToken)
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.DispatchBatchForDestination(String destination, ServiceBusClient client, Transaction transaction, Queue`1 messagesToSend, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 198
   at NServiceBus.Transport.AzureServiceBus.MessageDispatcher.Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, CancellationToken cancellationToken) in /_/src/Transport/Sending/MessageDispatcher.cs:line 101
   at ServiceControl.Audit.Auditing.AuditIngestor.Ingest(List`1 contexts) in /_/src/ServiceControl.Audit/Auditing/AuditIngestor.cs:line 72
   at ServiceControl.Audit.Auditing.AuditIngestion.Loop() in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 218
   at ServiceControl.Audit.Auditing.AuditIngestion.OnMessage(MessageContext messageContext, CancellationToken cancellationToken) in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 197
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ProcessMessage(ServiceBusReceivedMessage message, ProcessMessageEventArgs processMessageEventArgs, String messageId, Dictionary`2 headers, BinaryData body, CancellationToken messageProcessingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 285
Source:
Azure.Messaging.ServiceBus
TargetSite:
Void MoveNext()

Additional Information

Workarounds

Reinstalling everything with ServiceControl version 5.0.4 resolved the issue.

Additional information

I tried re-importing the messages as suggested in the documentation (https://docs.particular.net/servicecontrol/import-failed-messages), but this keeps triggering the same exception.
To make sure the issue is related to this version and doesn't occur in 5.0.4, I installed 5.0.4 again and verified that everything was OK. I then upgraded to 5.2.0 again and the issue reappeared.

@Jokelab Jokelab added the Bug label May 21, 2024
@adamralph
Contributor

hi @Jokelab thanks for raising this. We'll try and reproduce the issue and then post an update here.

@adamralph
Contributor

adamralph commented May 23, 2024

@Jokelab I followed your repro steps but I couldn't reproduce the problem. After the upgrade to 5.2.0, messages were ingested from the audit queue and forwarded to the audit log queue, as expected.

FYI I made each message approximately 2MB in size.

Is there anything missing from your repro steps?

@Jokelab
Author

Jokelab commented May 23, 2024

@adamralph Thank you very much for taking the time to try to reproduce the issue. I must be overlooking something.
I tried to reproduce the issue myself today with a fresh Azure ServiceBus namespace and a new Function App. The issue didn't occur this time.

Therefore I tried to update our actual instance again, hoping it was just a fluke, but now the issue reappears :( .
I noticed that not all messages fail to be forwarded, just a few of them, but I cannot find anything specific about the failing messages when I investigate them in the RavenDB database. They have different types and sizes. I will try to pin down the differences between my repro steps and the actual situation and come back here once I figure it out.

@Jokelab
Author

Jokelab commented May 27, 2024

So I investigated the issue a little more and have found a way to consistently reproduce it. It turns out that it's not related to the upgrade; it also occurs with a fresh install of ServiceControl 5.2.0.

To reproduce the issue, I slightly modified the ASBWorker sample code for Azure Functions (isolated) on your website. The code is attached to this comment, and a rough sketch of the modified function follows the parameter list below. The HttpSender function now has the following parameters:

  • followup: the number of follow up messages to send
  • sizekb: a dummy string of this size in kilobytes will be added to the message.
  • immediate: a boolean that indicates whether or not to send the message with immediate dispatch.
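
For reference, here is a minimal sketch of what the modified HttpSender might look like, reconstructed from the parameter list above and the ASBWorker sample. The message type, its Padding property, and the exact IFunctionEndpoint.Send overload are assumptions for illustration only; the actual code is in the attached audit_forward_error.zip.

```csharp
// Sketch only: reconstructed from the parameter description above, not copied from the attached sample.
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using NServiceBus;

public class HttpSender
{
    readonly IFunctionEndpoint functionEndpoint;

    public HttpSender(IFunctionEndpoint functionEndpoint) => this.functionEndpoint = functionEndpoint;

    [Function("HttpSender")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get")] HttpRequestData request,
        FunctionContext executionContext)
    {
        var query = System.Web.HttpUtility.ParseQueryString(request.Url.Query);
        var followup = int.Parse(query["followup"] ?? "1");        // number of follow-up messages to send
        var sizeKb = int.Parse(query["sizekb"] ?? "1");            // size of the dummy string payload in KB
        var immediate = bool.Parse(query["immediate"] ?? "false"); // immediate dispatch vs. batched dispatch

        for (var i = 0; i < followup; i++)
        {
            var message = new TriggerMessage { Padding = new string('x', sizeKb * 1024) };

            var sendOptions = new SendOptions();
            if (immediate)
            {
                // Send each message on its own, bypassing batched dispatch.
                sendOptions.RequireImmediateDispatch();
            }

            await functionEndpoint.Send(message, sendOptions, executionContext);
        }

        return request.CreateResponse(HttpStatusCode.OK);
    }
}

// Hypothetical message type carrying the dummy payload.
public class TriggerMessage : ICommand
{
    public string Padding { get; set; }
}
```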

Steps to reproduce

  • Deploy a premium Azure ServiceBus namespace
  • Install a ServiceControl and Audit instance with version 5.2.0
    • Configure the audit instance to forward messages to an audit log queue (audit.log by default)
    • Use Azure ServiceBus transport and point the connection string to the premium Azure ServiceBus namespace
    • Wait until the services are installed successfully
  • Create a queue for the endpoint named 'asbworkerendpoint'
  • Set the max message size for all queues to 102,400 KB (100 MB)
  • Set the queue size for all queues to 80GB
  • STOP the Audit service instance
  • Download my sample and change the connection string in the local.settings.json file: audit_forward_error.zip
  • Run the code and trigger the HttpSender function with the following parameters:
    /api/HttpSender?followup=1000&sizekb=100&immediate=true
  • Wait until there are approximately 1000 messages in the audit queue.
  • Stop the code.
  • START the Audit service instance.
  • Wait a few minutes while it ingests the audit queue.
  • Inspect the audit service log and the failed messages folder. There should be plenty of exceptions here (similar to the ones in the original post).

Another simple way to get the same error message, but now on the endpoint side:

This works fine:
/api/HttpSender?followup=10&sizekb=100&immediate=false

It also works fine when sent immediately:
/api/HttpSender?followup=10&sizekb=100&immediate=true

But when I increase the message size to 110Kb, I get the error message:
/api/HttpSender?followup=10&sizekb=110&immediate=true

I hope this helps to investigate the issue. Best regards!

@adamralph
Contributor

@Jokelab thanks for the detailed repro steps. I've managed to reproduce this now in the SC Audit instance. I'll continue to investigate and I'll comment back here when I have more info.


BTW, this alternative repro didn't work for me:

But when I increase the message size to 110Kb, I get the error message:
/api/HttpSender?followup=10&sizekb=110&immediate=true

The messages were both sent and received without any errors.

@Jokelab
Author

Jokelab commented May 27, 2024

Thank you for looking into this!

BTW, this alternative repro didn't work for me:

But when I increase the message size to 110Kb, I get the error message:
/api/HttpSender?followup=10&sizekb=110&immediate=true

The messages were both sent and received without any errors.

I think this depends on the batch size. For me, 110 was enough to trigger the exception. Maybe increasing it to a larger number (e.g. 500) triggers it for you as well.

@adamralph
Contributor

adamralph commented May 27, 2024

@Jokelab I found the problem with the alternate repro. It has to be set to immediate=false, not immediate=true.

I.e. this reproduces the problem:

/api/HttpSender?followup=10&sizekb=110&immediate=false

@adamralph
Contributor

@Jokelab we've found the source of the problem.

By default, outgoing messages from a message handler are sent in batches, after the handler exits (that can be suppressed with SendOptions.RequireImmediateDispatch as your repro code demonstrates).
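
To make the two dispatch modes concrete, here is a minimal sketch of a handler that sends a follow-up message both ways; IncomingMessage and FollowUpMessage are placeholder types, not types from the repro:

```csharp
using System.Threading.Tasks;
using NServiceBus;

public class IncomingMessage : ICommand { }
public class FollowUpMessage : ICommand { }

public class IncomingMessageHandler : IHandleMessages<IncomingMessage>
{
    public async Task Handle(IncomingMessage message, IMessageHandlerContext context)
    {
        // Default behavior: this send is collected and dispatched as part of a batch
        // together with any other outgoing messages after the handler completes.
        await context.Send(new FollowUpMessage());

        // Immediate dispatch: this send bypasses the batch and goes out on its own,
        // which is what the repro's immediate=true path exercises.
        var options = new SendOptions();
        options.RequireImmediateDispatch();
        await context.Send(new FollowUpMessage(), options);
    }
}
```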

We've discovered that, if the size of the entire batch is more than the allowed size per message (which we are setting to 102,400 KB in this example), the Azure SDK client throws this rather misleading exception.
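
For context on where that limit applies, the Azure SDK's own batching API exposes it directly: ServiceBusMessageBatch.TryAddMessage returns false once a message would push the batch past the maximum size, so a sender can split the work across multiple batches instead of sending an oversized one. The sketch below only illustrates that limit using the public SDK API; it is not the transport's actual code or the eventual fix.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

static class BatchedSendExample
{
    // Sends the given messages in as many batches as needed to stay under the
    // broker's maximum batch size. Illustration only, not the transport's code.
    public static async Task SendInBatches(ServiceBusSender sender, IEnumerable<ServiceBusMessage> messages, CancellationToken cancellationToken)
    {
        var batch = await sender.CreateMessageBatchAsync(cancellationToken);

        foreach (var message in messages)
        {
            // TryAddMessage returns false when adding this message would exceed
            // the maximum batch size negotiated with the namespace.
            if (!batch.TryAddMessage(message))
            {
                await sender.SendMessagesAsync(batch, cancellationToken);
                batch.Dispose();

                batch = await sender.CreateMessageBatchAsync(cancellationToken);
                if (!batch.TryAddMessage(message))
                {
                    throw new InvalidOperationException("A single message exceeds the maximum allowed message size.");
                }
            }
        }

        if (batch.Count > 0)
        {
            await sender.SendMessagesAsync(batch, cancellationToken);
        }

        batch.Dispose();
    }
}
```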

This manifests as a bug in our Azure Service Bus Transport, which in turn manifests as a bug in ServiceControl, because the transport is used internally in ServiceControl.

We will follow up with the Azure SDK team to see what can be done about this and get back to you.

I'm not sure if there's another workaround available in the meantime other than sticking with an older version of ServiceControl, but if we do identify one then we'll let you know.

@Jokelab
Author

Jokelab commented May 28, 2024

@adamralph Thank you for clarifying! Is it safe to say that the issue that we can observe on the endpoint side (when triggering /api/HttpSender?followup=10&sizekb=110&immediate=false) is the same as on the ServiceControl side?
For this reason we sometimes use immediate dispatch instead of the preferred method (in batch), with all the risk that entails.
For now we'll stick with our current version (5.0.4).

@adamralph
Contributor

Is it safe to say that the issue that we can observe on the endpoint side (when triggering /api/HttpSender?followup=10&sizekb=110&immediate=false) is the same as on the ServiceControl side?

@Jokelab yes, it's exactly the same problem.

@adamralph
Contributor

@Jokelab FYI I've raised a corresponding bug for the transport — Particular/NServiceBus.Transport.AzureServiceBus#994

@adamralph
Contributor

@Jokelab please follow both this issue and Particular/NServiceBus.Transport.AzureServiceBus#994 for updates on the status of the bug fixes.

If at any point the bug fixes become critical for you, please request critical support via https://particular.net/support

@adamralph
Contributor

@Jokelab FYI we've prioritized this near the top of our list and will work on it as soon as we can.

@danielmarbach
Contributor

@Jokelab The underlying issue has been addressed in Azure.Messaging.ServiceBus 7.18. We have already bumped the minimum required version of the transport and have added an update on our end to improve the story. The situation should soon be improved.

@PhilBastian
Contributor

release scheduled for 5.6.1

@PhilBastian PhilBastian changed the title Audit message ingestion fails and doesn't forward messages to audit log queue (version 5.2.0) Large messages in the audit queue cannot be processed Aug 16, 2024
@Jokelab
Author

Jokelab commented Aug 17, 2024

@Jokelab The underlying issue has been addressed in Azure.Messaging.ServiceBus 7.18. We have already bumped the minimum required version of the transport and have added an update on our end to improve the story. The situation should soon be improved.

Great! Thank you for letting me know!
