[Bug] Msg backlog & unack msg remains when using acknowledgeAsync #21958
Comments
@semistone:
Our test originally used acknowledgeAsync and seemed to hit an issue, so we replaced it with code that forces all acknowledgements to be synchronized through a ReentrantLock. I could try to write test code later if needed.
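For illustration, a minimal sketch of the kind of workaround described above, assuming an existing `Consumer<byte[]>` instance; the class and method names here are hypothetical, not the original code:

```java
import java.util.concurrent.locks.ReentrantLock;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClientException;

// Hypothetical wrapper that serializes all acknowledgements through a single lock,
// replacing direct acknowledgeAsync() calls.
public class SynchronizedAcker {
    private final ReentrantLock ackLock = new ReentrantLock();
    private final Consumer<byte[]> consumer;

    public SynchronizedAcker(Consumer<byte[]> consumer) {
        this.consumer = consumer;
    }

    public void ack(Message<byte[]> msg) throws PulsarClientException {
        ackLock.lock();
        try {
            // Synchronous acknowledge while holding the lock, so acks never race each other.
            consumer.acknowledge(msg);
        } finally {
            ackLock.unlock();
        }
    }
}
```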
I'd recommend using pulsar-client-reactive with Project Reactor or another Reactive Streams implementation. Acknowledgement / negative acknowledgement is handled as a value (instead of as a side effect); example: https://github.com/apache/pulsar-client-reactive/tree/main?tab=readme-ov-file#consuming-messages
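As a rough sketch of that approach, based on the linked README (the service URL, topic, and subscription names are placeholders; exact builder/method names should be verified against the pulsar-client-reactive documentation):

```java
import java.time.Duration;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.reactive.client.adapter.AdaptedReactivePulsarClientFactory;
import org.apache.pulsar.reactive.client.api.MessageResult;
import org.apache.pulsar.reactive.client.api.ReactiveMessageConsumer;
import org.apache.pulsar.reactive.client.api.ReactivePulsarClient;

public class ReactiveConsumeExample {
    public static void main(String[] args) throws Exception {
        PulsarClient pulsarClient = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")          // placeholder service URL
                .build();
        ReactivePulsarClient reactiveClient = AdaptedReactivePulsarClientFactory.create(pulsarClient);

        ReactiveMessageConsumer<String> consumer = reactiveClient
                .messageConsumer(Schema.STRING)
                .topic("my-topic")                              // placeholder topic
                .subscriptionName("my-sub")                     // placeholder subscription
                .build();

        // The acknowledgement is expressed as a returned value (MessageResult),
        // not as a side-effecting acknowledgeAsync() call.
        consumer.consumeMany(messages ->
                        messages.map(message ->
                                MessageResult.acknowledge(message.getMessageId(), message.getValue())))
                .take(Duration.ofSeconds(10))                   // bounded run for the sketch
                .blockLast();

        pulsarClient.close();
    }
}
```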
We published around 1M messages and are able to reproduce with this code.
The unacked message count keeps increasing and the available permits become negative, which means the consumer can't poll more events unless we restart it so that the events get redelivered to the consumer.
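For reference, one way to observe this symptom (backlog, unacked count, and negative available permits) is through the admin stats API; a sketch assuming a locally reachable broker, with placeholder topic/subscription names:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.SubscriptionStats;
import org.apache.pulsar.common.policies.data.TopicStats;

public class CheckSubscriptionStats {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")        // placeholder admin URL
                .build()) {
            TopicStats stats = admin.topics().getStats("persistent://public/default/my-topic");
            SubscriptionStats sub = stats.getSubscriptions().get("my-sub");   // placeholder subscription
            System.out.println("msgBacklog=" + sub.getMsgBacklog()
                    + " unackedMessages=" + sub.getUnackedMessages());
            // availablePermits is reported per consumer; a negative value matches the symptom above.
            sub.getConsumers().forEach(c ->
                    System.out.println("consumer=" + c.getConsumerName()
                            + " availablePermits=" + c.getAvailablePermits()));
        }
    }
}
```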
Is there a chance to use 3.0.2? A lot of bugs have been fixed in 3.0.1 and 3.0.2. This applies to both the broker and the client.
Thanks for sharing the repro app.
Yes, I ran it again. The original report was from our application using the 3.0.0 client with a 3.1.2 broker; we are going to upgrade the client to 3.0.2 for now. The reproduce code above was run from the server, so both client & broker are 3.1.2.
We tested it again and it still happens. We found that by default the Pulsar client doesn't wait for the ack response before completing, so we turned on isAckReceiptEnabled: when that option is enabled, the acknowledgement path takes a read lock, so the concurrency issue disappears; when we test without that option enabled, the problem is still there. So maybe the concurrency issue remains, or maybe it is just too many currentIndividualAckFuture instances piling up, but at least we can enable isAckReceiptEnabled to fix this issue.
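For illustration, a sketch of enabling ack receipts on the consumer builder so that acknowledgeAsync futures only complete after the broker confirms the ack (the service URL, topic, and subscription names are placeholders):

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;

public class AckReceiptConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> consumer = client.newConsumer(Schema.BYTES)
                .topic("persistent://public/default/my-topic")       // placeholder topic
                .subscriptionName("my-sub")                           // placeholder subscription
                .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                .isAckReceiptEnabled(true)   // wait for the broker's ack response before completing the future
                .subscribe();

        Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
        if (msg != null) {
            // With ack receipts enabled, this future completes only after the broker confirms the ack.
            consumer.acknowledgeAsync(msg).join();
        }

        consumer.close();
        client.close();
    }
}
```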
Regarding unacked message counts, #22657 is possibly related; see #22657 (comment).
There has been an ack issue with batch index acknowledgements, #22353. That must be a different issue.
I made an attempt to reproduce this using the Pulsar client directly. The problem didn't reproduce with https://github.com/lhotari/pulsar-playground/blob/master/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioAckIssue.java. (The test code is fairly complex due to the counters used to validate behavior and because I attempted to increase the chances of race conditions.)
I'll attempt to reproduce with the provided changes to pulsar-perf.
Rebased it over master in master...lhotari:pulsar:lh-issue21958-flux-test
I'm not able to reproduce. Steps used:
compiling the Pulsar branch with the rebased flux-test patch for pulsar-perf
running Pulsar
running the consumer
running the producer
@semistone @pqab Are you able to reproduce with the master branch version of Pulsar? How about other branches/releases? Is this issue resolved?
One possible variation to this scenario would be to test together with topic unloading events.
Related issue: #22709
Let me check tomorrow.
@semistone thanks, that would be helpful. If it reproduces only within a cluster with multiple nodes and other traffic, that could mean that a load-balancing event is triggering the problem. Currently, in-flight acknowledgements could get lost when this happens. Usually this gets recovered, but it's possible that there's a race condition where the acknowledgements get lost and the message doesn't get redelivered during the unload/reconnection event triggered by load balancing. It should be possible to simulate this scenario by triggering topic unloads with the admin API.
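A minimal sketch of triggering a topic unload through the Java admin client to simulate that load-balancing scenario (the admin URL and topic name are placeholders):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class UnloadTopic {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")        // placeholder admin URL
                .build()) {
            // Unloading forces consumers/producers to reconnect, similar to a load-balancing event.
            admin.topics().unload("persistent://public/default/my-topic");
        }
    }
}
```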
Because our cluster is already in production, I can't test it on our cluster. It seems more difficult to reproduce this issue compared to the first time we reported it.
After retrying many times, I saw the consumer stop only once, but I could still see backlog, and I found that it once stopped consuming for about 1 minute and recovered later. I will try to test the master branch and double-check for any mistakes in my testing later.
@semistone there are multiple issues contributing to this; here's one update: #22709 (comment)
I have created a proposal "PIP-377: Automatic retry for failed acknowledgements", #23267 (rendered doc). Discussion thread: https://lists.apache.org/thread/7sg7hfv9dyxto36dr8kotghtksy1j0kr
Search before asking
Version
3.0.0
Minimal reproduce step
Publish 600k messages
Start 2 consumers with different subscription names and subscribe from Earliest (sketched below):
one with async ack
another one with sync ack
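An illustrative sketch of the two consumers described above (this is not the original reproduce app; the service URL, topic, and subscription names are placeholders):

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;

public class AckComparison {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> asyncAckConsumer = client.newConsumer(Schema.BYTES)
                .topic("persistent://public/default/my-topic")   // placeholder topic
                .subscriptionName("sub-async-ack")
                .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                .subscribe();

        Consumer<byte[]> syncAckConsumer = client.newConsumer(Schema.BYTES)
                .topic("persistent://public/default/my-topic")
                .subscriptionName("sub-sync-ack")
                .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                .subscribe();

        // Drain both subscriptions; one acknowledges asynchronously, the other synchronously.
        while (true) {
            Message<byte[]> asyncMsg = asyncAckConsumer.receive(5, TimeUnit.SECONDS);
            Message<byte[]> syncMsg = syncAckConsumer.receive(5, TimeUnit.SECONDS);
            if (asyncMsg == null && syncMsg == null) {
                break;
            }
            if (asyncMsg != null) {
                asyncAckConsumer.acknowledgeAsync(asyncMsg);     // fire-and-forget async ack
            }
            if (syncMsg != null) {
                syncAckConsumer.acknowledge(syncMsg);            // blocking sync ack
            }
        }

        asyncAckConsumer.close();
        syncAckConsumer.close();
        client.close();
    }
}
```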
What did you expect to see?
Msg backlog & unacked messages should be 0 for both the acknowledge and acknowledgeAsync subscriptions.
What did you see instead?
There are a few messages left in the backlog & unacked messages even though we received the ack callback when using acknowledgeAsync; acknowledge is working fine. Topic stats for the acknowledgeAsync subscription for reference.
Anything else?
We ran it multiple times, and every time there are a few backlog & unacked messages left for the acknowledgeAsync subscription.
Are you willing to submit a PR?