[Bug Report]: Paused consumer stuck after rebalance #581

Closed
1 task done
massada opened this issue Jul 20, 2024 · 0 comments · Fixed by #582
Labels
bug Something isn't working

massada commented Jul 20, 2024

Prerequisites

  • I have searched issues to ensure it has not already been reported

Description

As reported by @SonicGD in the retry extensions repo (see issue 151), if a rebalance is triggered while a message is in the retry loop, the message is not retried after the partitions are reassigned.

I've pinpointed the issue to the revoke handler in the consumer and will open a PR soon.

Retry loop flow on error:

  1. Call next middleware
  2. Pause the consumer if needed
  3. Go to step 1 if a handled exception is caught, or
  4. If an unhandled exception is caught or the worker is stopped, resume the consumer if needed
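
To make the loop concrete, here is a minimal Python model of the four steps above. It is not KafkaFlow's actual C# code: `HandledError`, `FakeConsumer`, `retry_forever` and `is_stopped` are hypothetical names, and resuming on success as well as on failure is an assumption of this sketch.

```python
class HandledError(Exception):
    """Hypothetical marker for errors the retry policy is configured to handle."""


class FakeConsumer:
    """Hypothetical stand-in for the consumer; records pause/resume transitions."""

    def __init__(self):
        self.paused = False
        self.events = []

    def pause(self):
        # "if needed": only pause when not already paused
        if not self.paused:
            self.paused = True
            self.events.append("pause")

    def resume(self):
        # "if needed": only resume when currently paused
        if self.paused:
            self.paused = False
            self.events.append("resume")


def retry_forever(next_middleware, consumer, is_stopped):
    try:
        while True:
            try:
                next_middleware()      # step 1: call next middleware
                return
            except HandledError:
                if is_stopped():       # step 4: worker stopped, leave the loop
                    return
                consumer.pause()       # step 2: pause the consumer if needed
                # step 3: fall through and go back to step 1
    finally:
        # step 4: runs on unhandled exceptions and on stop
        # (assumption: it also runs after a successful retry)
        consumer.resume()
```

The point that matters for this bug is the `finally` branch: when the worker is stopped, the loop relies on the resume call actually reaching the consumer.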

Pause flow:

  • ConsumerFlowManager saves the topic+partition internally, pauses the consumer and starts a heartbeat task to keep the consumer alive

Resume flow:

  • ConsumerFlowManager resumes the consumer if the topic+partition is known and stops the heartbeat task
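
The pause and resume flows can be sketched as a small Python model. This is an illustration only, not the real `ConsumerFlowManager`: the class name `FlowManagerModel`, its fields, and the choice to stop the heartbeat only once no partition remains paused are assumptions.

```python
class FlowManagerModel:
    """Hypothetical model of the pause/resume bookkeeping described above."""

    def __init__(self):
        self.paused = set()            # known (topic, partition) pairs
        self.heartbeat_running = False
        self.driver_calls = []         # what would be sent to the Kafka driver

    def pause(self, topic, partition):
        # Save the topic+partition, pause the consumer, start the heartbeat.
        self.paused.add((topic, partition))
        self.driver_calls.append(("pause", topic, partition))
        self.heartbeat_running = True

    def resume(self, topic, partition):
        # Resume only if the topic+partition is known; otherwise do nothing.
        if (topic, partition) not in self.paused:
            return False
        self.paused.discard((topic, partition))
        self.driver_calls.append(("resume", topic, partition))
        # Assumption: the heartbeat stops once nothing is left paused.
        if not self.paused:
            self.heartbeat_running = False
        return True
```

Note that `resume` silently does nothing for an unknown topic+partition, which is the behavior the rest of this report depends on.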

Revoke handling flow:

  • Kafka driver calls the Consumer revoke handler
  • Consumer revoke handler cleans up and stops the ConsumerFlowManager
    • ConsumerFlowManager clears the topic+partition list and stops the heartbeat
  • Consumer calls any registered revoke handlers, one of which is in ConsumerManager, which stops the worker pool

The issue lies in stopping the ConsumerFlowManager before the worker pool. Stopping the worker pool cancels the retry loop, which then tries to resume the consumer. By that time the ConsumerFlowManager's topic+partition list has already been cleared, so the resume does nothing.
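
The ordering problem can be reproduced in a few lines of Python. This is a simplified model under stated assumptions, not KafkaFlow code: `FlowManagerModel`, `revoke`, `simulate` and the `("orders", 0)` partition are hypothetical, and swapping the order is shown only to illustrate the cause, not as a description of the actual fix.

```python
class FlowManagerModel:
    """Hypothetical model: tracks known partitions and the paused state."""

    def __init__(self):
        self.known = set()
        self.consumer_paused = False

    def pause(self, tp):
        self.known.add(tp)
        self.consumer_paused = True

    def resume(self, tp):
        # Resume is a no-op when the topic+partition is unknown.
        if tp in self.known:
            self.known.discard(tp)
            self.consumer_paused = False

    def stop(self):
        # Clears the topic+partition list (heartbeat omitted in this model).
        self.known.clear()


def revoke(flow, stop_worker_pool, flow_manager_first):
    """Model of the revoke handler with a switchable stop order."""
    if flow_manager_first:
        flow.stop()            # reported order: list is cleared first...
        stop_worker_pool()     # ...so the retry loop's resume does nothing
    else:
        stop_worker_pool()     # retry loop resumes while the list is intact
        flow.stop()


def simulate(flow_manager_first):
    flow = FlowManagerModel()
    tp = ("orders", 0)
    flow.pause(tp)             # the retry loop has paused the consumer
    # Stopping the worker pool cancels the retry loop, which calls resume.
    revoke(flow, lambda: flow.resume(tp), flow_manager_first)
    return flow.consumer_paused


print(simulate(flow_manager_first=True))   # True: consumer stays paused
print(simulate(flow_manager_first=False))  # False: consumer is resumed
```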

Steps to reproduce

See the issue in the retry extensions repo for code reproducing the problem.

Expected behavior

A new message should be consumed after partition reassignment.

Actual behavior

The consumer stays paused and no message is consumed after partition reassignment.

KafkaFlow version

v3.0.9
