Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix unloadNamespaceBundlesGracefully can be stuck with extensible load manager #23349

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

BewareMyPower
Copy link
Contributor

Motivation

I observed an issue that broker was stuck at close for a long time. It's stuck at BrokerService#unloadNamespaceBundlesGracefully, which calls disableBroker once and unloadNamespaceBundleAsync for all owned namespace bundles synchronously. Most issues happen when the broker is the last broker.

Issue 1: Free events won't be sent in overrideOwnership

In overrideOwnership, if no broker is available, a Free event will be created.

However, since the dstBroker and sourceBroker fields are null in the Free event, an exception will be thrown so that the Free event won't be created and sent.

if (StringUtils.isBlank(dstBroker) && StringUtils.isBlank(sourceBroker)) {

Issue 2: Free events could be skipped due to the same version id

The Free event is created in overrideOwnership based on a previous event on the same bundle from the table view. However, there might be inflight events that are not in the table view yet. In ServiceUnitStateDataConflictResolver#shouldKeepLeft, if the version id is the same, the Free event will be skipped. Then, if the last event is the Owned event whose target broker is the current broker in close, waitForCleanups will wait until the timeout exceeds.

if (data.state() == Owned && broker.equals(data.dstBroker())) {
cleaned = false;
break;

Issue 3: __change_events topic prevents waitForCleanups from finishing

The __change_events topic's reader, which is managed by the system topic based topic policies service, will try acquire the ownership. So that in waitForCleanups, there will always be a Owned event for this topic's bundle. If the target broker is the broker itself, waitForCleanups will never have a chance to exit until the timeout exceeds.

Issue 4: unloadNamespaceBundleAsync will be stuck at getOwnershipAsync if there is no available broker

Broker unregisters itself in disableBroker, if it's the last broker, then no brokers will be available after that. However, unloadNamespaceBundleAsync needs to publish a Unloaded event in unloadAsync.

CompletableFuture<Void> future = serviceUnitStateChannel.publishUnloadEventAsync(unload);
return unloadManager.waitAsync(future, unload.serviceUnit(), unloadDecision, timeout, timeoutUnit)

Modifications

  • issue 1: change the constructor of ServiceUnitStateData to all null dstBroker and sourceBroker for Free events.
  • issue 2: call TableView#refreshAsync to refresh the entry set in ServiceUnitStateTableViewImpl#flush and call it before overrideOwnership. Besides, skip the version id check for the Free event.
  • issue 3: close the topic policies service at the beginning of PulsarService#closeAsync.
  • issue 4: close the extensible load manager if there is no available broker after unregister

Add testLookup to cover the changes above.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower BewareMyPower self-assigned this Sep 25, 2024
@BewareMyPower BewareMyPower added this to the 4.0.0 milestone Sep 25, 2024
@BewareMyPower BewareMyPower added the type/bug The PR fixed a bug or issue reported a bug label Sep 25, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 25, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BewareMyPower
Copy link
Contributor Author

This fix might change some behaviors, I will try to fix the root cause

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants