Accuracy of inactive threshold for ephemeral consumers #778

erdemiru · 2022-10-25T21:08:57Z

erdemiru
Oct 25, 2022

Defect

Included a [Minimal, Complete, and Verifiable example](https://github.com/erdemiru/nats-java-examples/blob/master/src/test/java/org/examples/nats/NatsEphemeralConsumerTests.java)

Versions of `io.nats:jnats` and `nats-server`:

nats-server: 2.9.3
io.nats:jnats: 2.16.1

OS/Container environment:

macOS, running local NATS server.
Windows, running local NATS server.

Steps or code to reproduce the issue:

NATS documentation indicates that the default value of the inactive threshold is 5 seconds. When I add a processing delay (e.g. 3 s) to each message, then after several pull requests nextMessage() always returns null, even though the subscription is active and there are still more messages on the server.

The example project contains two test methods with different test parameters.

shouldConsumeAllMessagesWithBatchPull method

Publishes 7 messages, and tries to pull those messages with a batch size of 3.
Each message is acknowledged synchronously as soon as it is received. Then the consumer thread sleeps 3 seconds.
After the 6th message, the consumer becomes inactive and nextMessage() always returns null.

shouldConsumeAllMessages is a slightly more complex test method

I parameterized the number of messages, batch size, and processing delay so that different behaviours can be investigated more easily.
If the batch size is 1 or 2, the consumer can pull all messages without any problem.
If the batch size is 3, the consumer becomes inactive and cannot consume all messages.

Some additional information:

Acknowledging messages synchronously or asynchronously does not change the result.
Acknowledging before or after the processing delay also does not change the result.
You can see the timing information in console output. For example:

0s 7ms: Received message #1 content: 1, consumer information...
3s 38ms: Received message #2 content: 2, consumer information...

Expected result:

Consumer should not be removed if it is active within the inactive threshold limit.
If the consumer is removed, pull requests should throw an exception to notify the client.

Actual result:

Consumer is removed, even though it is active within the inactive threshold limit.
Subsequent pull requests have no effect, nextMessage always returns null.

scottf · 2022-10-25T22:26:48Z

scottf
Oct 25, 2022
Maintainer

@erdemiru I will look at this in detail, but off the top of my head

processingDelayInMillis = 2000;

subscription.pull(3);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);

2000 + 2000 + 2000 = 6000, i.e. 6 seconds. The subscription is already inactive because it did not get another pull. I'm pretty sure reading and acking do not reset the threshold (I'll verify). I'm surprised you got the second set of 3.

Either way it's not exact. The server is doing lots of work. Because of that, the subscription may stay active longer than the threshold, but never shorter.
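The arithmetic above can be written down as a small check. This is only a sketch under the assumption that nothing but a new pull request resets the inactivity timer; the class and method names are made up for illustration, and as noted the server's timing is not exact.

```java
import java.time.Duration;

public class PullGapCheck {

    // Time that passes between two consecutive pull requests when every
    // message of a batch is processed before the next pull is issued.
    static Duration gapBetweenPulls(int batchSize, Duration perMessageDelay) {
        return perMessageDelay.multipliedBy(batchSize);
    }

    // True when the gap between pulls is longer than the inactive threshold.
    static boolean exceedsThreshold(Duration gap, Duration inactiveThreshold) {
        return gap.compareTo(inactiveThreshold) > 0;
    }

    public static void main(String[] args) {
        // 5 seconds is the default mentioned in the original post.
        Duration threshold = Duration.ofSeconds(5);
        Duration gap = gapBetweenPulls(3, Duration.ofMillis(2000));
        System.out.println(gap.toMillis() + " ms between pulls, exceeds threshold: "
                + exceedsThreshold(gap, threshold));
    }
}
```

With a batch size of 3 and a 2000 ms delay the gap is 6000 ms, which is longer than the 5 second threshold.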


scottf · 2022-10-25T22:29:23Z

scottf
Oct 25, 2022
Maintainer

We are working on improvements for handling inactive consumers, but it IS VERY difficult to know. You can ask the server for consumer info, but that's a round trip to the server. There are now heartbeats on pulls, so that's one way we will try to address inactivity.


erdemiru · 2022-10-26T12:47:14Z

erdemiru
Oct 26, 2022
Author

Hi @scottf,

Thanks for your quick response. It is interesting to hear that acknowledging a message does not reset the threshold, as I would expect an ack to be a clear indication that the consumer is active.

I also tried some other examples:

numberOfMessages:9, batch size:5, processing delay:2000 -> test OK, it receives 9 out of 9 messages.
numberOfMessages:12, batch size:5, processing delay:2000 -> test FAILS, it receives 10 out of 12 messages.

In both cases, the interval between pulls is 10 seconds.

As a workaround, we can set the inactive threshold to something longer than max. processing time x batch size.
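A rough sketch of that sizing rule follows. The helper name and the extra safety margin are assumptions added for illustration and are not part of the test project.

```java
import java.time.Duration;

public class InactiveThresholdSizing {

    // Smallest threshold that covers processing a whole batch, plus a margin
    // to absorb network and server timing jitter (the margin is an assumption).
    static Duration minimumThreshold(Duration maxProcessingTime, int batchSize, Duration margin) {
        return maxProcessingTime.multipliedBy(batchSize).plus(margin);
    }

    public static void main(String[] args) {
        Duration threshold = minimumThreshold(Duration.ofMillis(2000), 5, Duration.ofSeconds(5));
        System.out.println("inactive threshold should be at least " + threshold.toMillis() + " ms");
    }
}
```

The resulting value would then go into the consumer configuration used for the pull subscription. In jnats I believe that is `ConsumerConfiguration.builder().inactiveThreshold(...)`, but please check it against the client version you are using.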


erdemiru · 2022-10-26T15:21:21Z

erdemiru
Oct 26, 2022
Author

I don't know if that is directly related to this issue, but I also see duplicate messages in some scenarios. Let's say we set inactive threshold to 10 minutes to isolate the inactive threshold limit issue.

number of messages:57, batch size:100 (greater than the number of available messages), processing delay:1000

In that case, the consumer receives duplicate messages:

56s 355ms : Received message #57 content: 57
57s 361ms : Received message #58 content: 31 // 31st message is duplicate
58s 367ms : Received message #59 content: 32 // 32nd message is duplicate
....
82s 530ms : Received message #83 content: 56 // 56th message is duplicate
83s 538ms : Received message #84 content: 57 // 57th message is duplicate

Calling msg.ack() or msg.ackSync() before the sleep doesn't fix it. It seems related to ack_wait (which is 30 seconds by default). Setting it to a larger value avoids the duplicates.

Are the acknowledgements somehow delayed for batch pull requests?


scottf · 2022-10-26T16:44:55Z

scottf
Oct 26, 2022
Maintainer

Is it taking you 30 seconds to ack something? Maybe you need to reduce your batch size and/or increase your ack wait. Ack wait and redelivery are fundamental server features. I'd be surprised if they're broken, but I suppose they could be.

I'm moving this issue to a discussion.

4 replies

erdemiru Oct 26, 2022
Author

> Is it taking you 30 seconds to ack something?

No, in the test code I ack each message as soon as nextMessage() returns it; the timestamps in the console log also verify that. It behaves as if acknowledgements are sent only once all messages are processed (or retrieved via sub.nextMessage()).

I have just added a new test, testDuplicatesMessagesWithLargeBatchSize, with two different parameter sets:

numberOfMessages:57, batch size:100, processing delay:500 -> test OK
numberOfMessages:57, batch size:100, processing delay:1000 -> test FAILS, it receives duplicate messages.

I have created a test project here: https://github.com/erdemiru/nats-java-examples, if you would like to take a look. It contains only the related test code.

scottf Oct 26, 2022
Maintainer

So there are 57 messages on the server and you ask for 100. The server gives you all 57 it has. They are "in-flight", which means the clock is ticking on every single one of them. With 1 second of "processing" time on each, by the time you get to message 31 it has not been ack'ed within the 30 second ack wait. This matches your observation that message 31 got re-delivered.
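A simplified timeline of that reasoning, as a sketch. It assumes the ack-wait clock starts for all messages when the batch is delivered, that each message is acked as soon as the client reaches it, and that an ack landing exactly on the deadline counts as late (real runs add a few milliseconds per message, as the log shows). The class and method names are hypothetical.

```java
import java.time.Duration;

public class AckWaitTimeline {

    // Returns the 1-based number of the first message whose ack arrives at or
    // after the ack-wait deadline, or -1 if every message is acked in time.
    static int firstExpiredMessage(int messageCount, Duration processingDelay, Duration ackWait) {
        for (int k = 1; k <= messageCount; k++) {
            // Message k is reached only after the k-1 earlier messages were processed.
            Duration ackTime = processingDelay.multipliedBy(k - 1);
            if (ackTime.compareTo(ackWait) >= 0) {
                return k;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Duration ackWait = Duration.ofSeconds(30); // default mentioned earlier in the thread
        System.out.println("delay 1000 ms: first late message = "
                + firstExpiredMessage(57, Duration.ofMillis(1000), ackWait));
        System.out.println("delay 500 ms: first late message = "
                + firstExpiredMessage(57, Duration.ofMillis(500), ackWait));
    }
}
```

Under these assumptions the 1000 ms case yields message 31 and the 500 ms case yields no late message, which lines up with the two test results reported above.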

erdemiru Oct 26, 2022
Author

Yes, that is right, so in conclusion:

Ack wait should be longer than (max. processing time x batch size), and/or message duplicates should be detected by the client code.
As sending an acknowledgement does not reset the inactivity limit for that consumer, for the time being the inactive threshold should be longer than (max. processing time x batch size).
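For the duplicate detection part, one possible client-side approach is to remember which stream sequences have already been processed. This is a minimal sketch with a hypothetical class name. It assumes every message carries a unique stream sequence (in jnats I believe this is available through `msg.metaData().streamSequence()`), and it keeps all sequences in memory, which a real consumer would need to bound.

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateFilter {

    // Stream sequences that have already been handed to the processing code.
    private final Set<Long> seen = new HashSet<>();

    // Returns true the first time a sequence is offered and false for repeats.
    public boolean firstDelivery(long streamSequence) {
        return seen.add(streamSequence);
    }

    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        long[] deliveries = {30, 31, 32, 31, 32}; // 31 and 32 arrive twice
        for (long seq : deliveries) {
            if (filter.firstDelivery(seq)) {
                System.out.println("processing " + seq);
            } else {
                System.out.println("skipping duplicate " + seq);
            }
        }
    }
}
```

A skipped duplicate would presumably still need to be acked so that the server stops redelivering it.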

scottf Oct 26, 2022
Maintainer

Yeah, that says it very well!
