
Trigger maybeIncrementLeaderHW in the alterISR request callback #477

Merged
7 commits merged into 3.0-li from 20230901_fix_high_produce_latency on Sep 11, 2023

Conversation

@CCisGG commented Sep 1, 2023

This PR should fix the super high Produce latency during uncontrolled broker death.

Description

Today, maybeIncrementLeaderHW is called after shrinkISR() is called:

maybeIncrementLeaderHW(leaderLog)

The problem is that shrinkISR() internally sends the AlterISR request to the controller asynchronously. By the time maybeIncrementLeaderHW is called, the ISR state on the current broker has likely not been updated yet.

This PR invokes maybeIncrementLeaderHW in the callback of the AlterISR request, ensuring the ISR state has already been updated when we try to increment the leader HW. It also calls tryCompleteDelayedRequests after the HW is incremented. To avoid deadlock, tryCompleteDelayedRequests is called outside the leaderIsrUpdateLock, with the help of a CompletableFuture. A sketch of this flow is shown below.
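Below is a minimal Scala sketch of the intended flow. maybeIncrementLeaderHW, tryCompleteDelayedRequests, and leaderIsrUpdateLock mirror names in Partition.scala, but the signatures, the onAlterIsrResponse/updateIsrAndMaybeIncrementHw entry points, and the lock handling are simplified stand-ins rather than the actual implementation:

```scala
import java.util.concurrent.CompletableFuture
import java.util.concurrent.locks.ReentrantReadWriteLock

// Illustrative sketch only; the real logic lives in kafka.cluster.Partition.
class PartitionSketch {
  private val leaderIsrUpdateLock = new ReentrantReadWriteLock()

  // Stand-in: returns true if the leader high watermark advanced.
  private def maybeIncrementLeaderHW(): Boolean = true

  // Stand-in: re-checks DelayedProduce/DelayedFetch in purgatory.
  // Must not run while leaderIsrUpdateLock is held.
  private def tryCompleteDelayedRequests(): Unit = ()

  // Invoked when the controller's AlterISR response arrives, i.e. once the ISR
  // shrink is committed, instead of right after shrinkISR() enqueues the request.
  def onAlterIsrResponse(): Unit = {
    val hwIncremented = new CompletableFuture[Boolean]()
    updateIsrAndMaybeIncrementHw(hwIncremented)
    // The locked section has returned by this point, so completing the delayed
    // requests here cannot deadlock on leaderIsrUpdateLock.
    if (hwIncremented.join()) tryCompleteDelayedRequests()
  }

  private def updateIsrAndMaybeIncrementHw(hwIncremented: CompletableFuture[Boolean]): Unit = {
    val writeLock = leaderIsrUpdateLock.writeLock()
    writeLock.lock()
    try {
      // The local ISR state now reflects the shrink, so the HW can safely advance.
      hwIncremented.complete(maybeIncrementLeaderHW())
    } finally {
      writeLock.unlock()
    }
  }
}
```

In the actual change, the CompletableFuture is what lets the code running under leaderIsrUpdateLock hand the "HW advanced" result back to a caller that completes the delayed requests only after the lock is released; the sketch above only mirrors that shape.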

With this change, the Produce delay should be capped at the ShrinkISR duration (potentially plus the controller queue delay), compared to the previously unbounded wait time.

Testing
In the cert-candidate cluster, before the change, many produce requests timed out on broker hard-kill:
[Screenshot 2023-09-07 at 10 06 14 PM]

After the change, the produce delay is capped at replicaLagMaxMs * 1.5 = 15 seconds

[Screenshot 2023-09-07 at 7 51 59 PM]

@groelofs left a comment


LGTM modulo the possible deadlock issue--if you could just double-check that to be sure?

Can't argue with the results, though--2x speedup in cert, potentially 8x in regular clusters? Nice.

core/src/main/scala/kafka/cluster/Partition.scala (outdated review thread, resolved)
@CCisGG (author) commented Sep 1, 2023

> Can't argue with the results, though--2x speedup in cert, potentially 8x in regular clusters? Nice.

@groelofs Exactly. 8X is what I would expect from regular clusters.

@lmr3796 commented Sep 2, 2023

I did some research:

Upstream tried a similar fix for KAFKA-13091:
apache#11245

and ended up with a deadlock issue, KAFKA-13254:
apache#11289

Though the code differs a lot, we may still want to check whether we run into a similar case.

Also, we should include the relevant EXIT_CRITERIA here^.

@groelofs commented Sep 2, 2023

> I did some research:
>
> Upstream tried a similar fix for KAFKA-13091: apache#11245
>
> and ended up with a deadlock issue, KAFKA-13254: apache#11289
>
> Though the code differs a lot, we may still want to check whether we run into a similar case.
>
> Also, we should include the relevant EXIT_CRITERIA here^.

Nice find! Unfortunately, the upstream fix looks sufficiently lengthy and complex that I'm not sure I'd trust it to carry over even if the code were more similar to ours. But I do like the extra validation that Hao's fix is needed, even if the implementation needs a bit more finesse.

@CCisGG (author) commented Sep 5, 2023

> Upstream tried a similar fix for KAFKA-13091:
> apache#11245

Oh wow... I didn't realize Kafka has a ticket system. When I investigated the issue, I tried searching GitHub issues and KIPs but failed to find anything related. If I had known this earlier, it might have saved me two days! Thanks very much for bringing up the reference. Let me take a look at the upstream fix.

@CCisGG (author) commented Sep 5, 2023

> Nice find! Unfortunately, the upstream fix looks sufficiently lengthy and complex that I'm not sure I'd trust it to carry over even if the code were more similar to ours. But I do like the extra validation that Hao's fix is needed, even if the implementation needs a bit more finesse.

Actually, the only functional difference is that upstream triggers tryCompleteDelayedRequests after incrementing the HW, while I'm relying on other code paths to trigger tryCompleteDelayedRequests. However, on a more recent look, I couldn't find any path that would trigger tryCompleteDelayedRequests if there is no new Produce request. In that case, I don't even know how my fix actually solved the issue. Another mystery... That said, I may end up bringing in the upstream changes. The upside is that it takes us one step closer to upstream, which is potentially beneficial if we decide to merge upstream for KRaft in the future.

Let me think about it more and test it a little bit more.

@CCisGG (author) commented Sep 5, 2023

Tested with Nurse disabled (broker not brought back immediately after hard-kill):

[Screenshot 2023-09-05 at 12 47 13 PM]

It turns out the produce latency can still be high. The reason should be that the DelayedProduce is never re-checked for completion, exactly as Huilin mentioned. So essentially maybeIncrementLeaderHW should always be followed by tryCompleteDelayedRequests; otherwise the new HWM may not help complete the DelayedProduce (a sketch of this pairing is shown below).
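A minimal Scala sketch of that pairing; LeaderPartition and advanceHwAndUnblockProducers are hypothetical stand-ins for the real Partition / purgatory code, not the actual API:

```scala
// Illustrative only: the invariant that a HW advance must be followed by a
// purgatory re-check, otherwise DelayedProduce stays parked.
object HwInvariantSketch {
  trait LeaderPartition {
    def maybeIncrementLeaderHW(): Boolean   // true if the high watermark advanced
    def tryCompleteDelayedRequests(): Unit  // re-checks DelayedProduce in purgatory
  }

  def advanceHwAndUnblockProducers(partition: LeaderPartition): Unit = {
    if (partition.maybeIncrementLeaderHW()) {
      // Without this call, acks=all producers whose DelayedProduce waits on the
      // old HW sit in purgatory until some unrelated event (e.g. a new Produce
      // request) re-checks them or the request times out.
      partition.tryCompleteDelayedRequests()
    }
  }
}
```

Running the purgatory check every time the HW actually moves should keep the produce latency bounded by the ISR shrink time rather than the request timeout.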

Will consider opening another PR to bring in the open-source change.

@CCisGG (author) commented Sep 8, 2023

With the new updates (tryCompleteDelayedRequests after incrementing the HW), the max latency is now capped at ~15 seconds even if Nurse does not bring the dead broker back immediately.

[Screenshot 2023-09-07 at 7 45 01 PM]

@CCisGG force-pushed the 20230901_fix_high_produce_latency branch 5 times, most recently from 1f1b63b to 74630f7 on September 8, 2023 19:04
@groelofs left a comment


One cosmetic nit, but the fix looks solid. Thanks!

@CCisGG force-pushed the 20230901_fix_high_produce_latency branch from 74630f7 to ede787a on September 8, 2023 23:00
@CCisGG merged commit d144fda into 3.0-li on Sep 11, 2023
25 checks passed
@CCisGG deleted the 20230901_fix_high_produce_latency branch on September 11, 2023 17:47