[fix][broker] Fix unloadNamespaceBundlesGracefully can be stuck with extensible load manager #23349
+208
−50
Motivation
I observed that a broker could be stuck while closing for a long time. It was stuck in `BrokerService#unloadNamespaceBundlesGracefully`, which calls `disableBroker` once and then calls `unloadNamespaceBundleAsync` for all owned namespace bundles synchronously. Most of the issues happen when the broker is the last broker.
Issue 1: Free events won't be sent in `overrideOwnership`
In `overrideOwnership`, if no broker is available, a `Free` event will be created.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Line 1363 in 4ce0c75
However, since the `dstBroker` and `sourceBroker` fields are null in the `Free` event, an exception is thrown, so the `Free` event is never created or sent.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateData.java
Line 37 in 4ce0c75
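
The validation problem in Issue 1 can be sketched as follows. This is a minimal illustration, not the actual `ServiceUnitStateData` code: the class `FreeEventValidationSketch`, the reduced `State` enum, the `StateData` record and the `isBlank` helper are all hypothetical names. It shows a constructor check that only requires a broker for states other than `Free`, which is the behavior the Modifications section asks for.

```java
import java.util.Objects;

public class FreeEventValidationSketch {
    // Hypothetical, reduced set of states; the real enum has more values.
    enum State { Owned, Free }

    // Simplified stand-in for the state data; field names follow the text above.
    record StateData(State state, String dstBroker, String sourceBroker) {
        StateData {
            Objects.requireNonNull(state, "state");
            // Relaxed rule: only non-Free states must name at least one broker.
            if (state != State.Free && isBlank(dstBroker) && isBlank(sourceBroker)) {
                throw new IllegalArgumentException("Empty broker");
            }
        }
    }

    // Hypothetical helper: true for null, empty, or whitespace-only strings.
    static boolean isBlank(String s) {
        return s == null || s.isBlank();
    }
}
```

With a check of this shape, a `Free` event with both broker fields null can be constructed, while an `Owned` event without any broker is still rejected.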
Issue 2: Free events could be skipped due to the same version id
The `Free` event is created in `overrideOwnership` based on the previous event for the same bundle in the table view. However, there might be in-flight events that are not in the table view yet. In `ServiceUnitStateDataConflictResolver#shouldKeepLeft`, if the version id is the same, the `Free` event will be skipped. Then, if the last event is an `Owned` event whose target broker is the broker being closed, `waitForCleanups` will wait until it times out.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Lines 1394 to 1396 in 4ce0c75
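
The version id conflict in Issue 2 can be illustrated with a reduced model. The sketch below is not the real `ServiceUnitStateDataConflictResolver`: the class `VersionCheckSketch`, the `Event` record and the `skipVersionCheckForFree` flag are hypothetical, and the real resolver applies more rules than a plain version comparison. It only shows why a `Free` event that carries the same version id as an in-flight `Owned` event gets dropped, and how skipping the version check for `Free` lets it through.

```java
public class VersionCheckSketch {
    // Hypothetical, reduced set of states.
    enum State { Owned, Free }

    // Hypothetical event model: state, target broker, and version id.
    record Event(State state, String dstBroker, long versionId) {}

    /** Returns true when the current entry is kept and the incoming event is skipped. */
    static boolean shouldKeepLeft(Event current, Event incoming, boolean skipVersionCheckForFree) {
        if (current == null) {
            return false; // nothing to keep, accept the incoming event
        }
        if (skipVersionCheckForFree && incoming.state() == State.Free) {
            return false; // Free events bypass the version comparison
        }
        // An incoming event must carry a newer version id to replace the current one.
        return current.versionId() >= incoming.versionId();
    }
}
```

For example, if the table view missed an in-flight `Owned` event with version id 2, the `Free` event built from the stale entry also gets version id 2. With the plain comparison it is skipped; with the flag enabled it is accepted.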
Issue 3: `__change_events` topic prevents `waitForCleanups` from finishing
The `__change_events` topic's reader, which is managed by the system-topic-based topic policies service, will try to acquire ownership of the topic. As a result, in `waitForCleanups` there will always be an `Owned` event for this topic's bundle. If the target broker is the broker itself, `waitForCleanups` cannot exit until it times out.
Issue 4: unloadNamespaceBundleAsync will be stuck at getOwnershipAsync if there is no available broker
The broker unregisters itself in `disableBroker`. If it is the last broker, no brokers will be available after that. However, `unloadNamespaceBundleAsync` needs to publish an `Unloaded` event in `unloadAsync`.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerImpl.java
Lines 700 to 701 in 4ce0c75
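
The waiting behavior that Issues 2 and 3 run into can be sketched as a polling loop with a deadline. This is an assumption-laden illustration rather than the real `waitForCleanups`: the class `WaitForCleanupsSketch`, the `ownerByBundle` map and the timeout parameters are hypothetical. It shows that as long as any bundle is still owned by the closing broker, the loop can only end by timing out.

```java
import java.util.Map;

public class WaitForCleanupsSketch {
    /**
     * Polls until no bundle in ownerByBundle is owned by the given broker,
     * or until the timeout elapses. Returns true if the cleanup finished in time.
     */
    static boolean waitForCleanups(Map<String, String> ownerByBundle, String broker,
                                   long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (!ownerByBundle.containsValue(broker)) {
                return true; // no bundle is owned by the closing broker any more
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out while ownership was still present
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

In this model, a bundle that keeps being re-acquired by the closing broker (as with the `__change_events` reader) makes every call run for the full timeout.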
Modifications
- Allow null `dstBroker` and `sourceBroker` in `ServiceUnitStateData` for `Free` events.
- Refresh the entry set with `TableView#refreshAsync` in `ServiceUnitStateTableViewImpl#flush` and call it before `overrideOwnership`. In addition, skip the version id check for the `Free` event.
- `PulsarService#closeAsync`.
- `unregister`
Add `testLookup` to cover the changes above.
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: