[improve] PIP-381: Handle PositionInfo that's too large to serialize as a single entry #22799

dlg99 · 2024-05-29T23:08:02Z

Motivation

In some cases cursor position info can be too large to serialize as a single entry, e.g. in case of too many deleted ranges. Also the serialization can be too slow.

cherry-picks of changes by @eolivelli @nicoloboschi and I.

Modifications

Cursor PositionInfo serialization is reworked to produce less garbage/serialize faster; serialized data can be compressed.
In case the serialized data too large it is chunked and saved as a sequence of entries.

Verifying this change

Make sure that the change passes the CI checks.

This change added tests.

Does this pull request potentially affect one of the following parts:

NO

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository: dlg99#17

nicoloboschi

LGTM

merlimat

Since this is touching on-disk formats it would be good to have a PIP discussion.

lhotari · 2024-08-16T10:00:03Z

I wonder if this is related to PIP-81 which doesn't seem to have an implementation: https://github.com/apache/pulsar/wiki/PIP-81%3A-Split-the-individual-acknowledgments-into-multiple-entries

There was a larger PR for PIP-81 that was closed: #10729
Some parts of it were split, such as #15425 and #15607.
@codelipenghui @315157973 any details about future PIP-81 plans to share?

lhotari · 2024-08-16T10:04:19Z

Btw. I'm currently investigating a Key_Shared subscription type issue where ordinary consumption of message leads to a very large number of "ack holes". The WIP test app where this is reproduced is https://github.com/lhotari/pulsar-playground/blob/master/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java .
The test class is not yet simplified to contain the relevant parts. I started with a very complex test case and it seems that the "ack hole" problem shows up in all possible cases.

No messages get lost. It's just that some messages don't get delivered until all other messages have been processed.

lhotari · 2024-08-16T13:15:49Z

In the 1M message experiment, the number of ack holes goes down from about 150k ack holes to <500 with this experiment: lhotari@a3b0639

lhotari · 2024-08-27T13:53:23Z

It's possible that the root cause of this issue of large PositionInfo is #23200 and it is addressed with PRs #23231 and #23226. There's #23224 for observability. Large msgInReplay counts would confirm the root cause.

rdhabalia · 2024-09-20T23:48:45Z

this is a real problem and it has been solved with a simple and fundamentally proven solution with perf numbers : #9292

But again I am not sure some folks blocked this PR without saying the reason even after asking multiple times and blocked the progress on this PR.

315157973 · 2024-09-21T10:00:28Z

I wonder if this is related to PIP-81 which doesn't seem to have an implementation: https://github.com/apache/pulsar/wiki/PIP-81%3A-Split-the-individual-acknowledgments-into-multiple-entries

There was a larger PR for PIP-81 that was closed: #10729 Some parts of it were split, such as #15425 and #15607. @codelipenghui @315157973 any details about future PIP-81 plans to share?

Since a PR implemented the compression of PositionInfo, the size of PositionInfo can be greatly reduced, and the problem of Entry size exceeding the threshold will no longer occur, so this PIP was not further promoted.

lhotari

Great work Andrey!

If there's way to refactor the logic to avoid byte[] and use Netty ByteBuf when possible, the solution would be more aligned with the "no garbage" style in Pulsar.

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

lhotari · 2024-09-23T06:39:11Z

I wonder if this is related to PIP-81 which doesn't seem to have an implementation: https://github.com/apache/pulsar/wiki/PIP-81%3A-Split-the-individual-acknowledgments-into-multiple-entries
There was a larger PR for PIP-81 that was closed: #10729 Some parts of it were split, such as #15425 and #15607. @codelipenghui @315157973 any details about future PIP-81 plans to share?

Since a PR implemented the compression of PositionInfo, the size of PositionInfo can be greatly reduced, and the problem of Entry size exceeding the threshold will no longer occur, so this PIP was not further promoted.

@315157973 Are you referring to PIP-146: ManagedCursorInfo compression
or ManagedLedgerInfo compression (does that contain a PIP?) ? What if the size exceeds the threshold after compression?

(cherry picked from commit 1ef9664)

* serialize/compress without intermediate byte arrays * use lightproto for cursor serialization to the ledger * Reuse PositionInfo (cherry picked from commit 1887c44)

(cherry picked from commit 98a3d25)

* ManagedCursor: manually serialise PositionInfo * Add tests and save last serialized side to prevent reallocations (cherry picked from commit 8a365d0)

(cherry picked from commit 44ba614)

(cherry picked from commit f1323c6)

(cherry picked from commit d4b94ab)

(cherry picked from commit 5f07f0c)

(cherry picked from commit 6d2e494)

…pache#275) (cherry picked from commit 6a2a010)

(cherry picked from commit 4c5387d)

(cherry picked from commit c3fe80e)

…footer of the chunked data (apache#282) (cherry picked from commit 6e72ecb)

codecov-commenter · 2024-09-24T01:31:51Z

Codecov Report

Attention: Patch coverage is 38.12317% with 844 lines in your changes missing coverage. Please review.

Project coverage is 74.19%. Comparing base (bbc6224) to head (54157d8).
Report is 629 commits behind head on master.

Files with missing lines	Patch %	Lines
...che/bookkeeper/mledger/impl/PositionInfoUtils.java	28.24%	681 Missing and 20 partials ⚠️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java	61.20%	110 Missing and 25 partials ⚠️
.../apache/bookkeeper/mledger/impl/MetaStoreImpl.java	60.00%	4 Missing and 2 partials ⚠️
...e/bookkeeper/mledger/impl/LedgerMetadataUtils.java	50.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #22799      +/-   ##
============================================
+ Coverage     73.57%   74.19%   +0.61%     
- Complexity    32624    34041    +1417     
============================================
  Files          1877     1937      +60     
  Lines        139502   146587    +7085     
  Branches      15299    16087     +788     
============================================
+ Hits         102638   108756    +6118     
- Misses        28908    29472     +564     
- Partials       7956     8359     +403

Flag	Coverage Δ
inttests	`27.41% <20.60%> (+2.83%)`	⬆️
systests	`24.42% <19.94%> (+0.09%)`	⬆️
unittests	`73.56% <38.12%> (+0.71%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...apache/bookkeeper/mledger/ManagedLedgerConfig.java	`96.55% <100.00%> (+0.25%)`	⬆️
...bookkeeper/mledger/ManagedLedgerFactoryConfig.java	`90.47% <100.00%> (+1.00%)`	⬆️
...org/apache/pulsar/broker/ServiceConfiguration.java	`98.97% <100.00%> (-0.43%)`	⬇️
...ache/pulsar/broker/ManagedLedgerClientFactory.java	`90.24% <100.00%> (+6.09%)`	⬆️
...rg/apache/pulsar/broker/service/BrokerService.java	`83.61% <100.00%> (+2.83%)`	⬆️
...e/bookkeeper/mledger/impl/LedgerMetadataUtils.java	`91.66% <50.00%> (-8.34%)`	⬇️
.../apache/bookkeeper/mledger/impl/MetaStoreImpl.java	`83.72% <60.00%> (-2.19%)`	⬇️
...che/bookkeeper/mledger/impl/ManagedCursorImpl.java	`76.73% <61.20%> (-2.57%)`	⬇️
...che/bookkeeper/mledger/impl/PositionInfoUtils.java	`28.24% <28.24%> (ø)`

... and 603 files with indirect coverage changes

…king

dlg99 · 2024-09-26T17:51:18Z

@lhotari I got rid of byte[] usage and added a feature flag, per your comments

lhotari

I added a few review comments.

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/MetaStoreImpl.java

lhotari

LGTM

lhotari · 2024-10-01T16:20:53Z

conf/broker.conf

+#persistentUnackedRangesMaxEntrySize=1048576
+
+# Set the compression type to use for cursor info.
+# Possible options are NONE, LZ4, ZLIB, ZSTD, SNAPPY
+#cursorInfoCompressionType=NONE


It's better to remove the comment from the line. The reason for this is that when it's without the comment, it won't require PULSAR_PREFIX_ with the k8s config hack that Pulsar uses.

Please also add the config changes to standalone.conf

Requested change has been performed. PIP-381 is the PIP.

…lone.conf

eolivelli

LGTM

lhotari

I noticed that this PR adds a lot of info level logs. Just thinking if that could cause log traffic to grow in unexpected ways. In Pulsar, configuring loggers is a bit clumber some, but I think that we should primarily use log4j2 features to achieve it.

btw. SLF4J 2.0 comes with a fluent API that can also support a pattern where the log level is parameterized. However, the performance concerns of using that API are unknown to me. Regarding SLF4J 2.0 API usage, we can use SLF4J 2.0 API on broker side code. However for client side code, we should avoid using the SLF4J 2.0 APIs so that client could continue to use SLF4J 1.7.x implementations in their client apps.

Instead of using the fluent API with a log level parameter, a better approach would be to use a unique logger name for specific log groups since that would give the ability to enable and disable certain type of logging when it's necessary.

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/MetaStoreImpl.java

lhotari

LGTM

rdhabalia · 2024-10-03T18:12:33Z

conf/broker.conf

+
+# Set the compression type to use for cursor info.
+# Possible options are NONE, LZ4, ZLIB, ZSTD, SNAPPY
+cursorInfoCompressionType=NONE


we already have lot of ServiceConfig and I guess we should define best compression type and use it by default instead giving tuning parameter. introducing 3 tuning config for a simple feature is not required. I guess we should introduce later if we see the need of this config.

Let's not break backwards compatibility

rdhabalia · 2024-10-03T18:14:22Z

conf/broker.conf

-# If enabled, the maximum "acknowledgment holes" will not be limited and "acknowledgment holes" are stored in
-# multiple entries.
+# If enabled, the maximum "acknowledgment holes" (as defined by managedLedgerMaxUnackedRangesToPersist)
+# can be stored in multiple entries, allowing the higher limits.


we should define that higher limit because again it will confuse user and no one will have any idea about this limit. PR #9292 by default supports up to 10M unack messages in single entry,. So, we should define higher limit as > 10M unack messages

Let's not break backwards compatibility given that feature is disabled by default

how does it break the compatibility by just updating the comment?

also this is a new feature to write in multiple entry from where backward compatibility came from :-)

It writes to the same ledger as existing single entry write.

rdhabalia · 2024-10-03T18:16:01Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/ManagedLedgerConfig.java

@@ -45,7 +45,9 @@ public class ManagedLedgerConfig {
    private boolean createIfMissing = true;
    private int maxUnackedRangesToPersist = 10000;
    private int maxBatchDeletedIndexToPersist = 10000;
+    private String cursorInfoCompressionType = "NONE";


let's use the best possible compression algo by performing various tests. performing perf tests for each type for such internal implementation is difficult for any user and it's the author's responsibility to give those numbers.

Let's not break backwards compatibility

isnt' it part of this new feature? then how come it will impact backward compatibility? if we will introduce it in this feature then we will have compatibility issue and that's what I would like to avoid by not adding it here,

rdhabalia · 2024-10-03T18:20:27Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

-                    return;
+            LedgerEntry entry = seq.nextElement();
+            ByteBuf data = entry.getEntryBuffer();
+            try {


this should go into a separate method as the current method is already super long.

rdhabalia · 2024-10-03T18:21:29Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

-                    callback.operationFailed(createManagedLedgerException(rc1));
-                    return;
+            LedgerEntry entry = seq.nextElement();
+            ByteBuf data = entry.getEntryBuffer();


do we have to release the ByteBuf?

rdhabalia · 2024-10-03T18:50:00Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

-                startCreatingNewMetadataLedger();
-                break;
+                case NoLedger:
+                    pendingMarkDeleteOps.add(mdEntry);


extra space in format?

rdhabalia · 2024-10-03T18:50:51Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

+            }
+
+            individualDeletedMessages.forEachRawRange((lowerKey, lowerValue, upperKey, upperValue) -> {
+                acksSerializedSize.addAndGet(16 * 4);


can we create a CONSTANT with meaningful name and use it here.?

rdhabalia · 2024-10-03T18:52:22Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

@@ -3164,58 +3365,252 @@ private List<MLDataFormats.BatchedEntryDeletionIndexInfo> buildBatchEntryDeletio
        }
    }

+    private void buildBatchEntryDeletionIndexInfoList(


now, I feel we have multiple implementation of Cursor Recovery and it's a high time to create an interface and move all implementation there as it has create such a huge file with multiple heavy implementation.

While a good idea, this goes beyond the scope of this PR, the blast radius of refactoring with the tests etc is too much to fit into the release schedule and other obligations.

hmm.. I would avoid creating tech debt by adding such heavy implementation in ManagedCursorImpl which is just related to ser/deser of Position ranges.

rdhabalia · 2024-10-03T18:57:23Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedCursorImpl.java

+        }
+    }
+
+    static ByteBuf decompressDataIfNeeded(ByteBuf data, LedgerHandle lh) {


we must need a separate interface and implementation to have this heavy implementation in separate file.

While a good idea, this goes beyond the scope of this PR, the blast radius of refactoring with the tests etc is too much to fit into the release schedule and other obligations.

rdhabalia · 2024-10-03T19:00:09Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/PositionInfoUtils.java

+import org.apache.bookkeeper.mledger.Position;
+import org.apache.pulsar.common.allocator.PulsarByteBufAllocator;
+
+final class PositionInfoUtils {


I request to avoid creating Util Class for Position object as I have seen Position casting into client or in other system eg: Apache Beam connector and Once we have Util class, we are opening can of warm. Instead if we have an interface of CursorRecovery then this entire logic can go into MultiEntryCursorRecoveryImplementation Class,.

PositionInfoUtils addresses problem that appears when serializing large PositionInfo. IIRC LightProto hogs the memory allocated for larger repeated objects and when the size of PositionInfo drop the memory isn't freed.
This is used just for the serialization in the same compatible way that can be read by protobuf and doesn't create problems for any system that relies on proto definitions.

is PositionInfoUtils auto generated? which I don't think and I am kind of confused why it has such complex serialization/deser code which can be avoided if you can use some proto gen framework?

it is generated by light proto with serializePositionInfo() crafted manually to avoid some of the inefficiencies. Unfortunately, the generated code couldn't be (re)used directly as some of the required things are hidden with private access.

oh.. this seems like a big red flag as we are combining auto generated code with actual code. this can make a nightmare when we will have a change in auto-generated code and we need to merge it with actual code. this is a huge red flag of this PR and we should avoid it.

This is some sort of trick, but there is no better option at the moment, Protobuf is not able to handle efficiently serialization/deserialization while Lightproto is great.

Unfortunately there is not a good way to let the LightProto compiler fully create this file at build time, as we (I worked on the original patch) had to amend the generated code.

All the tricks are in this file, and not in other places.

As @dlg99 mentioned introducing some interface/abstraction will make the change more complicated.

But I would avoid adding such a hack for just this feature in Pulsar. This shows that we should make interface for unack-message store/recovery and make it pluggable so, anyone can add pluggable solution and add a hack if needed. because by just looking at the amount of code, complication and hack for this feature in this PR , it's a great example of making it a pluggable and injecting custom code to keep Pulsar project free from hacks.

rdhabalia · 2024-10-03T19:03:22Z

overall let us organize code more appropriately and I am not comfortable with such heavy implementation for a simple feature and it should go to separate interface implementation. I will not block this PR but let's not rush to merge PR and please do not merge this PR without addressing such fundamental separation to avoid creating a messy code base.

dlg99 added the ready-to-test label May 29, 2024

dlg99 requested a review from eolivelli May 29, 2024 23:08

github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 29, 2024

nicoloboschi approved these changes May 30, 2024

View reviewed changes

merlimat previously requested changes May 31, 2024

View reviewed changes

dlg99 force-pushed the cpick/cursor-large-state branch from 6fd14cc to 319ad5f Compare September 9, 2024 21:18

dlg99 force-pushed the cpick/cursor-large-state branch from 319ad5f to 1c405fe Compare September 16, 2024 22:48

dlg99 mentioned this pull request Sep 20, 2024

[improve][pip] PIP-381: Handle large PositionInfo state #23328

Merged

4 tasks

lhotari reviewed Sep 23, 2024

View reviewed changes

eolivelli and others added 14 commits September 23, 2024 15:53

ManagedCursor: compress data written to BookKeeper

e014356

(cherry picked from commit 1ef9664)

serialize/compress without intermediate byte arrays (apache#268)

33e4c71

* serialize/compress without intermediate byte arrays * use lightproto for cursor serialization to the ledger * Reuse PositionInfo (cherry picked from commit 1887c44)

Print time

6d1b93a

(cherry picked from commit 98a3d25)

ManagedCursor: manually serialise PositionInfo (apache#270)

568d446

* ManagedCursor: manually serialise PositionInfo * Add tests and save last serialized side to prevent reallocations (cherry picked from commit 8a365d0)

Fix PositionInfoUtilsTest

0240250

(cherry picked from commit 44ba614)

PositionInfo Util serialization fix and test (apache#272)

ed8df4d

(cherry picked from commit f1323c6)

Remove auto reset of cursor in case of read error

c2f0908

(cherry picked from commit d4b94ab)

Revert removal of 'containsKey' in ManagedCursorImpl

27152ff

(cherry picked from commit 5f07f0c)

Prevent ZK connection loss in case of huge cursor status (apache#273)

564a668

(cherry picked from commit 6d2e494)

[managed-ledger] Compressed cursors: fix problem with little buffers (a…

0d23d5b

…pache#275) (cherry picked from commit 6a2a010)

[tests] Fix build after merge conflict

08af8fc

(cherry picked from commit 4c5387d)

Fix WriteCursorLedgerSize metric

e8d3930

(cherry picked from commit c3fe80e)

try ledger recovery from previous entries in case of corrupt/missing …

89adf38

…footer of the chunked data (apache#282) (cherry picked from commit 6e72ecb)

fix boken test after merge/resolve

43a5b31

dlg99 added 2 commits September 23, 2024 15:54

post-rebase fixes

8db72f4

removed usage of byte[] where possible

1397faf

dlg99 force-pushed the cpick/cursor-large-state branch from 1c405fe to 1397faf Compare September 24, 2024 00:13

added config parameters for teh chunk size and to enable/disable chin…

d4b4195

…king

dlg99 changed the title ~~[improve] Handle PositionInfo that's too large to serialize as a single entry~~ [improve] PIP-381: Handle PositionInfo that's too large to serialize as a single entry Sep 24, 2024

lhotari reviewed Sep 27, 2024

View reviewed changes

dlg99 added 2 commits September 27, 2024 15:10

CR feedback, addComponent(true, ..)

2fdbe63

Updated broker.conf with new entries

9b88801

lhotari approved these changes Oct 1, 2024

View reviewed changes

lhotari reviewed Oct 1, 2024

View reviewed changes

lhotari requested a review from merlimat October 1, 2024 16:21

dlg99 added 2 commits October 1, 2024 16:16

updated configs with docs and new config values, including the standa…

7918f21

…lone.conf

Merge branch 'master' into cpick/cursor-large-state

048142c

eolivelli approved these changes Oct 3, 2024

View reviewed changes

lhotari reviewed Oct 3, 2024

View reviewed changes

info logging to debug

590c5ac

lhotari approved these changes Oct 3, 2024

View reviewed changes

rdhabalia reviewed Oct 3, 2024

View reviewed changes

CR feedback

54157d8

rdhabalia mentioned this pull request Oct 4, 2024

[fix][broker] Support large number of unack message store for cursor recovery #9292

Merged

4 tasks

[improve] PIP-381: Handle PositionInfo that's too large to serialize as a single entry #22799

Are you sure you want to change the base?

[improve] PIP-381: Handle PositionInfo that's too large to serialize as a single entry #22799

Conversation

dlg99 commented May 29, 2024

Motivation

Modifications

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

Matching PR in forked repository

nicoloboschi left a comment

Choose a reason for hiding this comment

merlimat left a comment

Choose a reason for hiding this comment

lhotari commented Aug 16, 2024

lhotari commented Aug 16, 2024

lhotari commented Aug 16, 2024

lhotari commented Aug 27, 2024 • edited Loading

rdhabalia commented Sep 20, 2024

315157973 commented Sep 21, 2024

lhotari left a comment

Choose a reason for hiding this comment

lhotari commented Sep 23, 2024 • edited Loading

codecov-commenter commented Sep 24, 2024 • edited Loading

Codecov Report

dlg99 commented Sep 26, 2024

lhotari left a comment

Choose a reason for hiding this comment

lhotari left a comment

Choose a reason for hiding this comment

lhotari Oct 1, 2024 • edited Loading

Choose a reason for hiding this comment

eolivelli left a comment

Choose a reason for hiding this comment

lhotari left a comment

Choose a reason for hiding this comment

lhotari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdhabalia Oct 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdhabalia Oct 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdhabalia commented Oct 3, 2024

lhotari commented Aug 27, 2024 •

edited

Loading

lhotari commented Sep 23, 2024 •

edited

Loading

codecov-commenter commented Sep 24, 2024 •

edited

Loading

lhotari Oct 1, 2024 •

edited

Loading

rdhabalia Oct 3, 2024 •

edited

Loading

rdhabalia Oct 3, 2024 •

edited

Loading