[fix][broker] Fix unloadNamespaceBundlesGracefully can be stuck with extensible load manager #23349
LGTM
This fix might change some behaviors; I will try to fix the root cause.
Thank you for fixing these issues. Regarding the last broker shutdown issue: since there is no broker available to transfer ownerships to, we could simply shut down the last broker without waiting too long (after trying to clean up the ownerships). After the load balancer is shut down, no new assignment will happen during shutdown either. Also, even if some orphan ownerships remain in the channel, the first broker (leader) to start will fix them immediately.
Regarding the skip message issue: I think the current skip logic can return lookups too soon, and I don't see a good reason to keep it. For example, when there are concurrent Assign events, the skip message logic could return deferred lookups before the Own event. I think it can just wait for the final Own event. Ideally, the channel logic shouldn't rely on skipped messages for its state changes.
I reran the |
It seems there are some failed tests in |
…extensible load manager (apache#23349) (cherry picked from commit e91574a)
…extensible load manager (#23349) (#23496) Co-authored-by: Yunze Xu <[email protected]>
Motivation
I observed an issue where the broker was stuck at close for a long time. It was stuck at `BrokerService#unloadNamespaceBundlesGracefully`, which calls `disableBroker` once and `unloadNamespaceBundleAsync` for all owned namespace bundles synchronously. Most issues happen when the broker is the last broker.

Issue 1: Free events won't be sent in `overrideOwnership`
In `overrideOwnership`, if no broker is available, a `Free` event will be created.

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Line 1363 in 4ce0c75
However, since the `dstBroker` and `sourceBroker` fields are null in the `Free` event, an exception is thrown, so the `Free` event is never created or sent.

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateData.java
Line 37 in 4ce0c75
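
To illustrate Issue 1, here is a minimal sketch of a constructor check that tolerates missing brokers only for `Free` events. `SketchState` and `SketchStateData` are hypothetical, simplified stand-ins written for this example; they are not the real `ServiceUnitState` and `ServiceUnitStateData`, and the actual validation in Pulsar may differ.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for ServiceUnitState (only a few states).
enum SketchState { Free, Owned, Assigning, Releasing }

// Hypothetical, simplified stand-in for ServiceUnitStateData.
final class SketchStateData {
    final SketchState state;
    final String dstBroker;
    final String sourceBroker;
    final long versionId;

    SketchStateData(SketchState state, String dstBroker, String sourceBroker, long versionId) {
        this.state = Objects.requireNonNull(state, "state");
        // A Free event describes a bundle without an owner, so both broker
        // fields may be empty. Every other state still needs at least one broker.
        if (state != SketchState.Free && isBlank(dstBroker) && isBlank(sourceBroker)) {
            throw new IllegalArgumentException("Empty broker");
        }
        this.dstBroker = dstBroker;
        this.sourceBroker = sourceBroker;
        this.versionId = versionId;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

With this check, `new SketchStateData(SketchState.Free, null, null, 1L)` succeeds, while the same call with `SketchState.Owned` throws `IllegalArgumentException`.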
Issue 2: Free events could be skipped due to the same version id
The `Free` event is created in `overrideOwnership` based on a previous event on the same bundle from the table view. However, there might be inflight events that are not in the table view yet. In `ServiceUnitStateDataConflictResolver#shouldKeepLeft`, if the version id is the same, the `Free` event will be skipped. Then, if the last event is the `Owned` event whose target broker is the current broker being closed, `waitForCleanups` will wait until the timeout expires.

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Lines 1394 to 1396 in 4ce0c75
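
The version race in Issue 2 can be sketched as follows. `SketchEvent`, `SketchConflictResolver`, and `freeAfter` are hypothetical names introduced for this example. The rule that the current entry wins unless the incoming event is strictly newer is an assumption used for illustration, not the exact logic of `ServiceUnitStateDataConflictResolver#shouldKeepLeft`.

```java
// Hypothetical, simplified event: only the fields needed to show the version race.
final class SketchEvent {
    final String state;
    final long versionId;

    SketchEvent(String state, long versionId) {
        this.state = state;
        this.versionId = versionId;
    }
}

final class SketchConflictResolver {
    // Assumption: the current ("left") entry is kept unless the incoming
    // event carries a strictly newer version id.
    static boolean shouldKeepLeft(SketchEvent current, SketchEvent incoming) {
        return current != null && incoming.versionId <= current.versionId;
    }

    // Builds the Free event that follows the latest event known to the caller.
    static SketchEvent freeAfter(SketchEvent latestKnown) {
        return new SketchEvent("Free", latestKnown.versionId + 1);
    }
}
```

Suppose the table view still shows an `Owned` event with version 5 while an inflight `Owned` event with version 6 has not arrived yet. A `Free` event built from the stale view gets version 6 and is dropped once the inflight event lands first. A `Free` event built after the view has caught up gets version 7 and is accepted.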
Issue 3: `__change_events` topic prevents `waitForCleanups` from finishing
The `__change_events` topic's reader, which is managed by the system-topic-based topic policies service, will try to acquire the ownership. As a result, in `waitForCleanups`, there will always be an `Owned` event for this topic's bundle. If the target broker is the broker itself, `waitForCleanups` will never have a chance to exit until the timeout expires.

Issue 4: `unloadNamespaceBundleAsync` will be stuck at `getOwnershipAsync` if there is no available broker
The broker unregisters itself in `disableBroker`. If it's the last broker, no brokers will be available after that. However, `unloadNamespaceBundleAsync` needs to publish an `Unloaded` event in `unloadAsync`.

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerImpl.java
Lines 700 to 701 in 4ce0c75
Modifications
- `ServiceUnitStateData` to allow null `dstBroker` and `sourceBroker` for `Free` events.
- `TableView#refreshAsync` to refresh the entry set in `ServiceUnitStateTableViewImpl#flush` and call it before `overrideOwnership`.
- `PulsarService#closeAsync`.
- `unregister`
In addition, in `disableBroker`, cancel the load data report tasks and shut down the `LoadDataStore` objects to avoid being affected by the producers and readers on these two non-persistent topics.

Since `LoadDataStore#get` is still used in `LeastResourceUsageWithWeight#select`, don't throw an exception in `get`; return an empty value instead. Also handle the case where `select` might throw an exception in `ExtensibleLoadManagerImpl#selectAsync`.

To handle the specific case when the broker is the last broker, close the broker in advance if there is no available broker in the metadata store. Then any namespace bundle's unload will also be skipped because the state was restored to INIT.
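
The `LoadDataStore#get` and `selectAsync` changes described above can be sketched roughly as follows. `SketchLoadDataStore` and `SketchSelector` are hypothetical classes written for this example, not the real Pulsar types. Turning a thrown exception into a failed future is one possible way to handle it and is an assumption here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical store: after close(), get() returns empty instead of throwing.
final class SketchLoadDataStore<T> {
    private final Map<String, T> data = new ConcurrentHashMap<>();
    private volatile boolean closed;

    void put(String broker, T value) {
        if (!closed) {
            data.put(broker, value);
        }
    }

    Optional<T> get(String broker) {
        if (closed) {
            return Optional.empty();
        }
        return Optional.ofNullable(data.get(broker));
    }

    void close() {
        closed = true;
        data.clear();
    }
}

final class SketchSelector {
    // Runs a selection strategy and converts a thrown exception into a failed
    // future, so callers always get a future back instead of a synchronous throw.
    static CompletableFuture<Optional<String>> selectAsync(Supplier<Optional<String>> strategy) {
        CompletableFuture<Optional<String>> future = new CompletableFuture<>();
        try {
            future.complete(strategy.get());
        } catch (RuntimeException e) {
            future.completeExceptionally(e);
        }
        return future;
    }
}
```

In this sketch, a strategy that reads from a closed store simply sees no load data, and a strategy that throws surfaces the error through the returned future.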
Add `testLookup` to cover the changes above.

Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: