Pub-sub topology limits performance for systems with many event types #888

Open
mikeminutillo opened this issue Oct 5, 2023 · 4 comments

Comments

@mikeminutillo
Member

mikeminutillo commented Oct 5, 2023

Azure Service Bus transport publishes all events to a single topic (by default), and each endpoint adds subscriptions to it. These subscriptions include filters to only route the messages that the endpoint can handle. These filters are implemented as SQL filters using LIKE statements against the EnclosedMessageTypes header. This is done to support scenarios (consumer-driven contracts, event polymorphism) where the header contains multiple message types.
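To make the routing described above concrete, here is a minimal sketch that models a `LIKE '%<type name>%'` filter as a plain substring test. The full header key `NServiceBus.EnclosedMessageTypes`, the type names, and the subscription names are assumptions for illustration, and the real SQL filter evaluation in Azure Service Bus is more involved than a substring check.

```python
# Simplified model of topic subscriptions that filter on the enclosed message types header.
# The header key and all type/subscription names below are illustrative assumptions.
ENCLOSED_TYPES_HEADER = "NServiceBus.EnclosedMessageTypes"


def like_contains(header_value, type_name):
    """Emulate `header LIKE '%type_name%'` as a substring test."""
    return type_name in header_value


def matching_subscriptions(headers, subscriptions):
    """Return the subscriptions whose filter matches the message headers."""
    value = headers.get(ENCLOSED_TYPES_HEADER, "")
    return [
        name
        for name, type_name in subscriptions.items()
        if like_contains(value, type_name)
    ]


# One published message that carries a concrete type and an interface it implements.
headers = {
    ENCLOSED_TYPES_HEADER: (
        "Sales.OrderPlaced, Sales.Messages;Sales.IOrderEvent, Sales.Messages"
    )
}

# Each subscription filters on the one event type its endpoint handles.
subscriptions = {
    "billing": "Sales.OrderPlaced",
    "audit": "Sales.IOrderEvent",
    "shipping": "Sales.OrderShipped",
}

routed = matching_subscriptions(headers, subscriptions)
print(routed)
```

Because the header holds several types, both the subscriber to the concrete event and the subscriber to the interface receive the message, which is what makes event polymorphism work with this topology.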

Unfortunately, SQL-LIKE filters are operationally expensive and consume a lot of CPU. When CPU usage (measured across the Azure Service Bus namespace) gets too high, the service throttles all operations. This means that under high load, a complex system with many event types slows down significantly. The exact level of CPU usage that triggers throttling is not documented, but we have observed it happening in the 60-70% range. This slowdown increases critical time, which saturates the namespace and keeps it throttled.

Azure Service Bus reports being able to handle 2,000 SQL filters on a topic, but we have observed this throttling being applied with as few as 450. Reducing the number of filters to ~350 seems to have disabled the throttle but this is not an easy task on a system that has organically grown over time. The length of event type names (including namespace) may be a factor, as longer names require more CPU to filter.

This means that users building complex systems on Azure Service Bus with NServiceBus are not getting the performance numbers advertised by Microsoft. When such a user opens a support case with Microsoft, they are advised that the use of so many SQL LIKE filters is inhibiting performance and that scaling out will not help.

Users are unlikely to be aware of the need to monitor CPU usage on the Azure Service Bus namespace, and they have very little recourse when they are throttled for the first time.

The current topology design exists to support features (consumer-driven contracts and event polymorphism) that the user may not need.

@mikeminutillo
Member Author

mikeminutillo commented Oct 5, 2023

Potential solution - Use correlation filters

We can keep the topology exactly the same and switch to correlation filters on the subscriptions. Correlation filters are significantly less CPU intensive and should increase the number of subscriptions we can add to the system before throttling kicks in.

There is a spike showing this approach:

Unfortunately, this will prevent the use of several features and there may not be an easy way to detect that they are in use. If a message contract changes (adding an interface that is identified as a message type by convention, moving to a new namespace or assembly, etc.) then the subscription will silently stop working, leading to message loss. This cannot easily be detected because a change in the publisher might break an existing subscriber (and vice-versa).
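The failure mode described above can be sketched with a simplified model in which a correlation filter is an exact equality check on a property and the existing SQL filter is a substring test. The header key and type names are assumptions for illustration; this is not the code from the spike.

```python
# Simplified comparison of an exact-match (correlation style) filter and a
# substring (LIKE style) filter. Header key and type names are assumptions.
HEADER = "NServiceBus.EnclosedMessageTypes"


def sql_like_match(type_name, properties):
    """Emulate `[header] LIKE '%type_name%'` as a substring test."""
    return type_name in properties.get(HEADER, "")


def correlation_match(expected, properties):
    """A correlation style filter matches only when every property is equal."""
    return all(
        key in properties and properties[key] == value
        for key, value in expected.items()
    )


# The subscriber's filter was created against the original contract.
subscriber_filter = {HEADER: "Sales.OrderPlaced, Sales.Messages"}

# Header sent by the publisher before and after an interface is added to the contract.
before = {HEADER: "Sales.OrderPlaced, Sales.Messages"}
after = {
    HEADER: "Sales.OrderPlaced, Sales.Messages;Sales.IOrderEvent, Sales.Messages"
}

results = {
    "correlation_before": correlation_match(subscriber_filter, before),
    "correlation_after": correlation_match(subscriber_filter, after),
    "like_after": sql_like_match("Sales.OrderPlaced", after),
}
print(results)
```

In this model the exact-match filter stops matching as soon as the publisher's header value changes, while the substring filter keeps matching. Nothing on the subscriber side signals the difference, which is the silent message loss described above.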

We might be able to mitigate this a little by putting each message type in a separate header with a simple value (a single character or an empty string may be sufficient). With that in place, a correlation filter could match on the presence of the key for a specific message type. This increases the number of headers, although in most cases it probably adds only a single header. The spike does not demonstrate this approach, and it would break wire compatibility with older versions.
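A minimal sketch of the per-type header idea, using the same simplified equality model for correlation filters. The key prefix `MessageType.`, the marker value `"1"`, and the type names are all hypothetical; the transport does not define them.

```python
# Sketch of "one header per message type" with a constant marker value.
# TYPE_KEY_PREFIX and MARKER are hypothetical names chosen for illustration.
TYPE_KEY_PREFIX = "MessageType."
MARKER = "1"


def add_type_headers(headers, enclosed_types):
    """Return a copy of the headers with one marker header per message type."""
    result = dict(headers)
    for type_name in enclosed_types:
        result[TYPE_KEY_PREFIX + type_name] = MARKER
    return result


def correlation_filter_for(type_name):
    """Build the property set a subscriber would filter on for one type."""
    return {TYPE_KEY_PREFIX + type_name: MARKER}


def correlation_match(expected, properties):
    """A correlation style filter matches only when every property is equal."""
    return all(
        key in properties and properties[key] == value
        for key, value in expected.items()
    )


# The publisher stamps the concrete type and the interface as separate headers.
message = add_type_headers({}, ["Sales.OrderPlaced", "Sales.IOrderEvent"])

matches_concrete = correlation_match(correlation_filter_for("Sales.OrderPlaced"), message)
matches_interface = correlation_match(correlation_filter_for("Sales.IOrderEvent"), message)
matches_other = correlation_match(correlation_filter_for("Sales.OrderShipped"), message)
print(matches_concrete, matches_interface, matches_other)
```

Adding another type to the contract only adds another header, so existing equality filters keep matching in this model. Publishers that do not stamp the extra headers would not match at all, which reflects the wire compatibility concern.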

@mikeminutillo
Member Author

mikeminutillo commented Oct 5, 2023

Workaround - Replace the NSB created filters with correlation filters

NServiceBus is designed to operate in a minimal access mode. With this in mind, it does not need to create filters and will only process the messages that arrive in its input queue. This means that you can replace the subscription filters with correlation filters. Once these correlation filters are in place, the system will operate as usual, with all of the caveats that apply to that solution.

On v2 of the transport (core v7), the transport will update the rule, so you need to disable autosubscription for the impacted event type on all subscribers before changing the filter. This also means that new subscriptions would need to be set up by hand. This cannot be done with the command line, as that would create a SQL-LIKE filter (replacing the correlation filter). You also cannot manually subscribe to events, or the filter will be replaced.

On v3 of the transport (core v8), the transport will attempt to create the rule and quietly swallow exceptions if the rule already exists. This means that you could allow the transport to create the SQL-LIKE rule and swap them for Correlation filters when the CPU usage gets above a threshold.
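The v3 behaviour can be illustrated with a small in-memory model. It assumes that the rule name stays the same across the swap and that the transport's create attempt fails with an "already exists" error, which it swallows. The classes, function names, header key, and filter expression format are hypothetical stand-ins, not the transport's or the Azure SDK's API.

```python
# In-memory model of swapping a SQL rule for a correlation rule on one subscription.
# All names here are hypothetical; nothing in this sketch talks to Azure Service Bus.
HEADER = "NServiceBus.EnclosedMessageTypes"  # assumed header key


class RuleAlreadyExists(Exception):
    """Raised when a rule with the same name is already present."""


class Subscription:
    def __init__(self):
        self.rules = {}

    def create_rule(self, name, rule_filter):
        if name in self.rules:
            raise RuleAlreadyExists(name)
        self.rules[name] = rule_filter

    def delete_rule(self, name):
        del self.rules[name]


def transport_subscribe(subscription, event_type):
    """Model of the v3 behaviour: try to create the SQL rule, ignore 'already exists'."""
    try:
        expression = f"[{HEADER}] LIKE '%{event_type}%'"
        subscription.create_rule(event_type, ("sql", expression))
    except RuleAlreadyExists:
        pass  # an existing rule, whatever its filter type, is left untouched


def swap_to_correlation(subscription, event_type, header_value):
    """Manual step: replace the rule with a correlation filter under the same name."""
    subscription.delete_rule(event_type)
    subscription.create_rule(event_type, ("correlation", {HEADER: header_value}))


subscription = Subscription()
transport_subscribe(subscription, "Sales.OrderPlaced")  # endpoint starts, SQL rule created
kind_after_start = subscription.rules["Sales.OrderPlaced"][0]

swap_to_correlation(subscription, "Sales.OrderPlaced", "Sales.OrderPlaced, Sales.Messages")
transport_subscribe(subscription, "Sales.OrderPlaced")  # endpoint restarts
kind_after_restart = subscription.rules["Sales.OrderPlaced"][0]
print(kind_after_start, kind_after_restart)
```

In this model the correlation rule survives an endpoint restart because the create attempt is a no-op once a rule with that name exists.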

@mikeminutillo mikeminutillo changed the title from "Pub-sub topology inhibits performance for complex systems" to "Pub-sub topology limits performance for systems with many event types" Oct 5, 2023
@danielmarbach
Contributor

Potential solution - Combine Correlation Filter with Event Mapping

We have already proven with SQS that a mapping approach can associate events with meta information (SNS topics, in the SQS/SNS case) to support message inheritance for those who need it.

It might be worthwhile investigating whether we can use a similar approach with correlation filters if we need to support inheritance.
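One way such a mapping could look, purely as a sketch: the subscriber declares which concrete events stand behind a subscribed base type, and one equality rule is generated per concrete event. The map contents, the `MessageType` property name, and the function name are hypothetical and are not taken from the SQS transport or from any existing API.

```python
# Sketch of expanding a subscribed base type into one correlation rule per concrete event.
# EVENT_MAP, the "MessageType" property, and all type names are hypothetical.
EVENT_MAP = {
    "Sales.IOrderEvent": ["Sales.OrderPlaced", "Sales.OrderShipped"],
}


def rules_for_subscription(subscribed_type, event_map):
    """Return {rule name: correlation properties} for a subscribed type.

    A type without a mapping is treated as a concrete event and gets a single rule.
    """
    concrete_types = event_map.get(subscribed_type, [subscribed_type])
    return {name: {"MessageType": name} for name in concrete_types}


interface_rules = rules_for_subscription("Sales.IOrderEvent", EVENT_MAP)
concrete_rules = rules_for_subscription("Sales.OrderPlaced", EVENT_MAP)
print(sorted(interface_rules), sorted(concrete_rules))
```

The trade-off in this sketch is that inheritance becomes explicit configuration on the subscriber: a new concrete event implementing the interface is not received until the map is updated.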

@mauroservienti
Member

mauroservienti commented Jul 31, 2024

I had a conversation with a customer who mentioned that SqlFilter severely impacts their system's performance, sometimes adding seconds of delay to message dispatch, depending on the overall system load.

Extracting from the conversation:

The first issue is about receiving performance. This is likely caused by the heavy use of SqlFilters within the bundle-1 topic. This leads to strong delays (up to 5 minutes today, for example) until an event is forwarded from the topic to the respective destination queues.

The second issue concerns send/publisher performance. Sometimes, there are strong spikes in send latency. The 99th percentile clusters around 20 seconds.

@Particular Particular deleted a comment from mikeminutillo Aug 14, 2024
@Particular Particular deleted a comment from danielmarbach Aug 14, 2024
@Particular Particular deleted a comment from johnsimons Aug 14, 2024