Description
Describe the bug
As part of an integration project that was recently deployed to production and is still in the pilot phase, we have several Java applications that connect to Azure Service Bus to consume messages from various topics and subscriptions. For message consumption, we are using the ServiceBusProcessorClient.
However, we frequently observe that all our consumers stop consuming messages at the same time, even though messages remain in the various subscriptions. The frequency is irregular: a week or even 10 days can go by without an interruption, and at other times all our clients stop consuming two or three times in a single day.
To resume message consumption, we are forced to restart our containers. A few days ago we also put a dirty fix in place, which is described below.
Environment:
We have 6 environments, including production. We have three service bus servers and three gateways. The production environment has its own dedicated service bus server and application gateway.
Service Bus configuration / Namespace environment
Namespace: Premium
Message entity: Topics, some with sessions and some without.
Average message size: 80 KB - 100 KB
Machine Spec:
Our applications run in Docker using Docker Swarm. The JVM used is openjdk:17-jdk-debian with the following JVM options:
-XX:InitialRAMPercentage=25 -XX:MinRAMPercentage=75 -XX:MaxRAMPercentage=75
The resource specifications for our Docker containers, defined in the Docker Compose YAML file, are as follows:
resources:
  limits:
    cpus: "2"
    memory: 1500M
  reservations:
    cpus: "0.05"
    memory: 700M
Additionally, some applications run with a single replica, while others run with two replicas to handle varying load and ensure high availability.
ServiceBusProcessorClient Configuration:
Following the standards and recommendations of the documentation, we configure our clients as follows:
public ServiceBusProcessorClient getOrCreateClaimProcessorClient() {
    return new ServiceBusClientBuilder()
        .fullyQualifiedNamespace(XXXX.getFullyQualifiedName())
        .credential(new ClientSecretCredentialBuilder()
            .tenantId(XXXX.getTenantId())
            .clientId(XXXX.getStId())
            .clientSecret(XXXX.getStSecret())
            .build())
        .customEndpointAddress(XXXX.getApplicationGatewayEndpointUrl())
        .transportType(AmqpTransportType.AMQP_WEB_SOCKETS)
        .processor()
        .topicName(YYYY.getTopic())
        .subscriptionName(YYYY.getSubscription())
        .receiveMode(ServiceBusReceiveMode.PEEK_LOCK)
        .disableAutoComplete()
        .processMessage(YYYY.processMessage())
        .processError(YYYY.processError())
        .buildProcessorClient();
}
As you can see, auto-complete is disabled and no prefetch is configured. This particular application does not use sessions, although some of our other applications do.
Traffic Pattern:
Since we are in the pilot phase, we experience varying traffic patterns. There are periods with several messages per minute and long periods without messages. Sometimes, we receive only one message every 12 hours. This traffic pattern is due to the pilot phase, and we expect a significant increase in message volume after the pilot phase.
To Reproduce / Exception or Stack Trace:
We have attempted multiple times to reproduce this issue, which we have named "zombie mode": a state where our applications are up and running but not consuming messages. However, we have been unable to reproduce it. It has only occurred in production; in the other 5 environments, we have never encountered the zombie mode.
Regarding logs, there are no errors or crashes reported. As a result, we do not have logs to provide for this issue.
To mitigate the problem, we have implemented a workaround. This "dirty fix" closes and restarts the processor if the client has not processed any messages for 5 minutes. This forces the connection, the session and the links to be closed and re-created. We do not yet know whether this workaround is effective.
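The workaround described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the exact production code: IdleProcessorWatchdog and RestartableProcessor are hypothetical names introduced here, and the processor is hidden behind a small interface so that the sketch only depends on the JDK. With the Azure SDK, the interface could presumably be implemented by delegating to the start() and close() methods of ServiceBusProcessorClient.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Restarts a processor when no message has been handled for a given idle period. */
class IdleProcessorWatchdog implements AutoCloseable {

    /** Hypothetical abstraction over the processor (not part of the Azure SDK). */
    interface RestartableProcessor {
        void start();
        void close();
    }

    private final RestartableProcessor processor;
    private final long idleLimitNanos;
    private final LongSupplier nanoClock;      // injectable clock, e.g. System::nanoTime
    private final AtomicLong lastActivityNanos;
    private ScheduledExecutorService scheduler;

    IdleProcessorWatchdog(RestartableProcessor processor, Duration idleLimit, LongSupplier nanoClock) {
        this.processor = processor;
        this.idleLimitNanos = idleLimit.toNanos();
        this.nanoClock = nanoClock;
        this.lastActivityNanos = new AtomicLong(nanoClock.getAsLong());
    }

    /** Call this from the message handler each time a message has been processed. */
    void recordActivity() {
        lastActivityNanos.set(nanoClock.getAsLong());
    }

    /** Closes and restarts the processor if it has been idle too long; returns true on restart. */
    synchronized boolean checkOnce() {
        long now = nanoClock.getAsLong();
        if (now - lastActivityNanos.get() < idleLimitNanos) {
            return false;
        }
        processor.close();
        processor.start();
        lastActivityNanos.set(now); // begin a new idle window after the restart
        return true;
    }

    /** Runs checkOnce() periodically on a daemon thread. */
    synchronized void schedule(Duration checkInterval) {
        if (scheduler != null) {
            return; // already scheduled
        }
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "idle-processor-watchdog");
            t.setDaemon(true);
            return t;
        });
        long periodMillis = checkInterval.toMillis();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("Watchdog restart failed: " + e);
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public synchronized void close() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

In this sketch, the message handler would call recordActivity() after each message, and the application would create the watchdog with Duration.ofMinutes(5) and System::nanoTime and then call schedule(...) with a check interval of its choosing. Note that a watchdog of this kind cannot tell a stalled link from a subscription that is simply empty, so with our current traffic it also restarts healthy processors during quiet periods.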
What we have done to attempt to reproduce the Zombie mode:
We have taken several steps to try to reproduce the zombie mode:
- We lowered the timeout of the application gateway to its minimum value, but the consumers continued to consume messages without issues.
- We placed our consumers behind a forward proxy, which we then disconnected for several minutes. After reconnecting, our consumers were able to recover the connection and continue consuming messages.
Despite these efforts, we have not been able to reproduce the zombie mode.
We need your support to resolve this issue.
Thank you for your assistance.
PS: A side question: if we create two ServiceBusProcessorClient instances from the same ServiceBusClientBuilder, then according to the documentation they share the same connection. If we close one of them by calling close() and then restart it, is the shared connection closed and re-created for both?