Trigger maybeIncrementLeaderHW in the alterISR request callback #477
Conversation
LGTM modulo the possible deadlock issue--if you could just double-check that to be sure?
Can't argue with the results, though--2x speedup in cert, potentially 8x in regular clusters? Nice.
@groelofs Exactly. 8X is what I would expect from regular clusters.
I did some research: upstream tried a similar fix for KAFKA-13091 and ended up with a deadlock issue, KAFKA-13254. Though the code differs a lot, we may still want to check whether we run into a similar case. Also, we should include the relevant EXIT_CRITERIA here^.
Nice find! Unfortunately, the upstream fix looks sufficiently lengthy and complex that I'm not sure I'd trust it to carry over even if the code were more similar to ours. But I do like the extra validation that Hao's fix is needed, even if the implementation needs a bit more finesse.
Oh wow... I didn't realize Kafka has a ticket system. When I investigated the issue, I tried searching GitHub issues and KIPs, but failed to find anything related. If I had known this earlier, it might have saved me two days! Thanks very much for bringing up the reference. Let me take a look at the upstream fix.
Actually, the only functional difference is that upstream triggers tryCompleteDelayedRequest after incrementHW, while I'm relying on other code paths that trigger tryCompleteDelayedRequest. However, on a more recent look, I couldn't find any tryCompleteDelayedRequest that would be triggered if there is no new Produce request. In that case, I don't even know how my fix actually solved the issue. Another mystery... That said, I may end up bringing in the upstream changes. The upside is that it takes us one step closer to upstream, which is potentially beneficial if we decide to merge upstream for KRaft in the future. Let me think about it and test it a little more.
Tested with Nurse disabled (broker not brought back immediately after a hard-kill): it turns out the produce latency can still be high. The reason should be that the DelayedProduce is never attempted for completion, exactly as Huilin mentioned. So essentially maybeIncrementWatermark should always be followed by tryCompleteDelayedRequests; otherwise the new HWM may not help complete the DelayedProduce. Will consider opening another PR bringing in the open-source change.
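(A minimal sketch of that invariant, using purely hypothetical names — `HighWatermarkInvariantSketch`, `advanceHighWatermark`, and the pending-request list below are illustrative stand-ins, not the real purgatory code. The point is only that every HW advance needs a matching purgatory check, or parked DelayedProduce requests sit until they time out even though the data is committed.)

```scala
// Hypothetical sketch of the invariant discussed above; names are illustrative only.
object HighWatermarkInvariantSketch {

  @volatile private var highWatermark = 0L
  private var pendingDelayedProduce = List(5L, 8L) // required offsets of parked requests

  // Stand-in for the purgatory check: complete every DelayedProduce whose
  // required offset is now covered by the high watermark.
  private def tryCompleteDelayedRequests(): Unit = {
    val (done, still) = pendingDelayedProduce.partition(_ <= highWatermark)
    done.foreach(o => println(s"completed DelayedProduce waiting for offset $o"))
    pendingDelayedProduce = still
  }

  // Any HW advance must be paired with a purgatory check, otherwise the parked
  // requests never get another completion attempt.
  def advanceHighWatermark(newHw: Long): Unit = {
    highWatermark = newHw
    tryCompleteDelayedRequests()
  }

  def main(args: Array[String]): Unit = advanceHighWatermark(9L)
}
```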
Force-pushed from 1f1b63b to 74630f7
One cosmetic nit, but the fix looks solid. Thanks!
Force-pushed from 74630f7 to ede787a
This PR should fix the very high Produce latency seen during uncontrolled broker death.
Description
Today, maybeIncrementLeaderHW is called after shrinkISR() is called:
kafka/core/src/main/scala/kafka/cluster/Partition.scala, line 1006 (at commit 3475c3a)
The problem is that shrinkISR() internally sends the AlterISR request to the controller asynchronously, so by the time maybeIncrementLeaderHW is called, the ISR state on the current broker has most likely not been updated yet.
This PR invokes maybeIncrementLeaderHW in the callback of the AlterISR request, making sure the ISR state has been updated before the leader HW is incremented. It also calls tryCompleteDelayedRequests after the HW is incremented. To avoid a deadlock, tryCompleteDelayedRequests is called outside the leaderIsrUpdateLock, with the help of a completable future.
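A minimal sketch of that flow, under simplified assumptions — `IsrState`, `onAlterIsrResponse`, `tryCompleteDelayedRequests`, and the lock handling below are illustrative stand-ins, not the actual Partition internals. The AlterISR response callback applies the new ISR and attempts the HW bump under leaderIsrUpdateLock, records whether the HW advanced in a CompletableFuture, and only tries to complete delayed requests once the lock has been released.

```scala
// Hypothetical, simplified sketch of the approach described above.
import java.util.concurrent.CompletableFuture
import java.util.concurrent.locks.ReentrantReadWriteLock

object AlterIsrCallbackSketch {

  final case class IsrState(isr: Set[Int])

  private val leaderIsrUpdateLock = new ReentrantReadWriteLock()
  @volatile private var isrState = IsrState(Set(1, 2, 3))
  @volatile private var highWatermark = 0L

  // Stand-in for the purgatory check that completes DelayedProduce requests.
  private def tryCompleteDelayedRequests(): Unit =
    println(s"trying to complete delayed requests at HW=$highWatermark")

  // Stand-in for the HW bump that becomes possible once the shrunk ISR is applied.
  private def maybeIncrementLeaderHW(): Boolean = {
    highWatermark += 1
    true
  }

  // Callback invoked when the controller acknowledges the AlterISR request.
  def onAlterIsrResponse(newIsr: Set[Int]): Unit = {
    val hwIncremented = new CompletableFuture[Boolean]()

    val lock = leaderIsrUpdateLock.writeLock()
    lock.lock()
    try {
      isrState = IsrState(newIsr) // the ISR state is now up to date
      hwIncremented.complete(maybeIncrementLeaderHW())
    } finally {
      lock.unlock()
    }

    // Complete delayed requests outside leaderIsrUpdateLock to avoid the kind of
    // deadlock seen upstream (KAFKA-13254).
    if (hwIncremented.join()) tryCompleteDelayedRequests()
  }

  def main(args: Array[String]): Unit =
    onAlterIsrResponse(Set(1, 2))
}
```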
With this change, the Produce delay should be capped at the ShrinkISR duration (potentially plus time spent in the controller queue), compared to the previously unbounded wait time.
Testing
In the cert-candidate cluster, before the change, many requests timed out on a broker hard-kill.
After the change, the produce delay is capped at replicaLagMaxMs * 1.5 = 15 seconds