Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix unloadNamespaceBundlesGracefully can be stuck with extensible load manager #23349

Merged

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Sep 25, 2024

Motivation

I observed an issue that broker was stuck at close for a long time. It's stuck at BrokerService#unloadNamespaceBundlesGracefully, which calls disableBroker once and unloadNamespaceBundleAsync for all owned namespace bundles synchronously. Most issues happen when the broker is the last broker.

Issue 1: Free events won't be sent in overrideOwnership

In overrideOwnership, if no broker is available, a Free event will be created.

However, since the dstBroker and sourceBroker fields are null in the Free event, an exception will be thrown so that the Free event won't be created and sent.

if (StringUtils.isBlank(dstBroker) && StringUtils.isBlank(sourceBroker)) {

Issue 2: Free events could be skipped due to the same version id

The Free event is created in overrideOwnership based on a previous event on the same bundle from the table view. However, there might be inflight events that are not in the table view yet. In ServiceUnitStateDataConflictResolver#shouldKeepLeft, if the version id is the same, the Free event will be skipped. Then, if the last event is the Owned event whose target broker is the current broker in close, waitForCleanups will wait until the timeout exceeds.

if (data.state() == Owned && broker.equals(data.dstBroker())) {
cleaned = false;
break;

Issue 3: __change_events topic prevents waitForCleanups from finishing

The __change_events topic's reader, which is managed by the system topic based topic policies service, will try acquire the ownership. So that in waitForCleanups, there will always be a Owned event for this topic's bundle. If the target broker is the broker itself, waitForCleanups will never have a chance to exit until the timeout exceeds.

Issue 4: unloadNamespaceBundleAsync will be stuck at getOwnershipAsync if there is no available broker

Broker unregisters itself in disableBroker, if it's the last broker, then no brokers will be available after that. However, unloadNamespaceBundleAsync needs to publish a Unloaded event in unloadAsync.

CompletableFuture<Void> future = serviceUnitStateChannel.publishUnloadEventAsync(unload);
return unloadManager.waitAsync(future, unload.serviceUnit(), unloadDecision, timeout, timeoutUnit)

Modifications

  • issue 1: change the constructor of ServiceUnitStateData to all null dstBroker and sourceBroker for Free events.
  • issue 2: call TableView#refreshAsync to refresh the entry set in ServiceUnitStateTableViewImpl#flush and call it before overrideOwnership.
  • issue 3: close the topic policies service at the beginning of PulsarService#closeAsync.
  • issue 4: close the extensible load manager if there is no available broker after unregister

In addition, in disableBroker, cancel the load data report tasks and shutdown the LoadDataStore objects to avoid being affected by the producers and readers on these two non-persistent topics.

Since LoadDataStore#get is still used in LeastResourceUsageWithWeight#select, don't throw an exception and return an empty in get. And handle the case that select might throw an exception in ExtensibleLoadManagerImpl#selectAsync.

To handle the specific case when the broker is the last broker, close the broker in advance if there is no available broker in the metadata store. Then any namespace bundle's unload will also be skipped because the state was restored to INIT.

Add testLookup to cover the changes above.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower BewareMyPower self-assigned this Sep 25, 2024
@BewareMyPower BewareMyPower added this to the 4.0.0 milestone Sep 25, 2024
@BewareMyPower BewareMyPower added the type/bug The PR fixed a bug or issue reported a bug label Sep 25, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 25, 2024
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BewareMyPower
Copy link
Contributor Author

This fix might change some behaviors, I will try to fix the root cause

@heesung-sn
Copy link
Contributor

heesung-sn commented Sep 25, 2024

Thank you for fixing these issues.

Regarding the last broker shutdown issue,

Since there is no broker available to transfer ownerships, we could just simply shutdown the last broker without waiting too long(after trying to clean the ownerships) -- after the load balancer is shutdown, no new assignment will happen during shutdown too. Also, even if there are some orphan ownerships in the channel, when the fist broker(leader) starts, it will fix any orphan ones immediately.

Regarding the skip message issue,

I think the current skip logic can return lookups too soon, and I dont see a good reason to keep it. For example, when there are concurrent Assign events, it could return deferred lookups too soon by the skip msg logic, before Own event. I think it can just wait for the final Own event. Ideally, the channel logic shouldn't rely on skipped messages for its state changes.

@BewareMyPower BewareMyPower marked this pull request as draft September 26, 2024 02:15
@BewareMyPower BewareMyPower marked this pull request as draft September 26, 2024 13:31
@BewareMyPower BewareMyPower marked this pull request as ready for review September 26, 2024 14:28
@BewareMyPower
Copy link
Contributor Author

I reran the ExtensibleLoadManagerCloseTest for 20 times locally on the latest commit
d429420 and it never failed. Some unnecessary changes are reverted. PTAL again @heesung-sn

@BewareMyPower
Copy link
Contributor Author

It seems there are some failed tests in ExtensibleLoadManagerImplTest, please don't merge and let me take a look

@BewareMyPower BewareMyPower marked this pull request as draft September 27, 2024 04:57
@BewareMyPower BewareMyPower marked this pull request as ready for review September 27, 2024 05:28
@BewareMyPower
Copy link
Contributor Author

image

ExtensibleLoadManagerImplTest passed locally now. Let's wait if it would fail.

@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 50.73529% with 67 lines in your changes missing coverage. Please review.

Project coverage is 74.52%. Comparing base (bbc6224) to head (5ad8309).
Report is 606 commits behind head on master.

Files with missing lines Patch % Lines
...dbalance/extensions/ExtensibleLoadManagerImpl.java 48.43% 17 Missing and 16 partials ⚠️
...xtensions/channel/ServiceUnitStateChannelImpl.java 62.96% 7 Missing and 3 partials ⚠️
.../service/SystemTopicBasedTopicPoliciesService.java 46.66% 6 Missing and 2 partials ⚠️
...ensions/channel/ServiceUnitStateTableViewImpl.java 50.00% 4 Missing and 1 partial ⚠️
...rg/apache/pulsar/broker/service/BrokerService.java 25.00% 1 Missing and 2 partials ⚠️
...e/extensions/filter/BrokerMaxTopicCountFilter.java 50.00% 2 Missing ⚠️
...e/extensions/store/TableViewLoadDataStoreImpl.java 60.00% 1 Missing and 1 partial ⚠️
...a/org/apache/pulsar/client/impl/TableViewImpl.java 33.33% 2 Missing ⚠️
...n/java/org/apache/pulsar/broker/PulsarService.java 50.00% 0 Missing and 1 partial ⚠️
...lance/extensions/ExtensibleLoadManagerWrapper.java 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23349      +/-   ##
============================================
+ Coverage     73.57%   74.52%   +0.94%     
- Complexity    32624    33939    +1315     
============================================
  Files          1877     1934      +57     
  Lines        139502   145107    +5605     
  Branches      15299    15863     +564     
============================================
+ Hits         102638   108138    +5500     
+ Misses        28908    28669     -239     
- Partials       7956     8300     +344     
Flag Coverage Δ
inttests 27.72% <23.52%> (+3.13%) ⬆️
systests 24.51% <2.94%> (+0.19%) ⬆️
unittests 73.88% <50.73%> (+1.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...lance/extensions/channel/ServiceUnitStateData.java 88.88% <100.00%> (-11.12%) ⬇️
.../channel/ServiceUnitStateDataConflictResolver.java 76.00% <ø> (ø)
...n/java/org/apache/pulsar/broker/PulsarService.java 83.61% <50.00%> (+1.24%) ⬆️
...lance/extensions/ExtensibleLoadManagerWrapper.java 60.60% <0.00%> (+1.98%) ⬆️
...e/extensions/filter/BrokerMaxTopicCountFilter.java 72.72% <50.00%> (-14.78%) ⬇️
...e/extensions/store/TableViewLoadDataStoreImpl.java 84.25% <60.00%> (-3.85%) ⬇️
...a/org/apache/pulsar/client/impl/TableViewImpl.java 89.83% <33.33%> (+4.52%) ⬆️
...rg/apache/pulsar/broker/service/BrokerService.java 82.94% <25.00%> (+2.16%) ⬆️
...ensions/channel/ServiceUnitStateTableViewImpl.java 69.51% <50.00%> (ø)
.../service/SystemTopicBasedTopicPoliciesService.java 76.42% <46.66%> (+2.23%) ⬆️
... and 2 more

... and 590 files with indirect coverage changes

@BewareMyPower BewareMyPower merged commit e91574a into apache:master Sep 27, 2024
51 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/extensible-lm-unload branch September 27, 2024 09:33
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 22, 2024
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 22, 2024
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 22, 2024
heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 23, 2024
BewareMyPower added a commit that referenced this pull request Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-3.3 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.3.3 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants