[improve][broker][PIP-379] Add observability stats for "draining hashes" #23429
…ith array format The String format is very inefficient. It's better to replace it for PIP-379 This is needed for tests
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #23429 +/- ##
============================================
+ Coverage 73.57% 74.33% +0.76%
- Complexity 32624 34398 +1774
============================================
Files 1877 1950 +73
Lines 139502 146980 +7478
Branches 15299 16184 +885
============================================
+ Hits 102638 109261 +6623
- Misses 28908 29298 +390
- Partials 7956 8421 +465
LGTM
...broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java
LGTM
Motivation
PIP-379: Key_Shared Draining Hashes for Improved Message Ordering was implemented in #23352.
One of the major benefits of PIP-379 is the easy-to-understand model of when a hash is blocked.
When a new consumer is added, hash range assignments move from existing consumers to the new consumer. (In some cases, hash range assignments can move between existing consumers after a consumer is added or removed.)
The PIP-379 implementation ensures that no new messages for a hash in a moved hash range can be delivered until all unacknowledged messages for that hash have been acknowledged or the consumer holding them disconnects.
This applies to the AUTO_SPLIT ordered mode of the Key_Shared subscription type.
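The blocking rule described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not the broker's implementation: the class `DrainingHashSketch` and its methods are hypothetical names, consumers are identified by plain strings, and hash range assignment itself is not modelled (the caller decides which consumer a hash is sent to).

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the blocking rule; hypothetical, not the broker's code. */
final class DrainingHashSketch {
    // hash -> consumer that currently holds unacknowledged messages for it
    private final Map<Integer, String> holder = new HashMap<>();
    // hash -> number of unacknowledged messages held by that consumer
    private final Map<Integer, Integer> unacked = new HashMap<>();

    /** A hash is deliverable unless another consumer still holds unacked messages for it. */
    boolean canDeliver(int hash, String consumer) {
        String current = holder.get(hash);
        return current == null || current.equals(consumer);
    }

    /** Attempts a delivery; returns false when the hash is still draining elsewhere. */
    boolean deliver(int hash, String consumer) {
        if (!canDeliver(hash, consumer)) {
            return false;
        }
        holder.put(hash, consumer);
        unacked.merge(hash, 1, Integer::sum);
        return true;
    }

    /** Acknowledges one message; the hash is released once nothing is left unacked. */
    void acknowledge(int hash) {
        Integer remaining =
                unacked.computeIfPresent(hash, (key, count) -> count > 1 ? count - 1 : null);
        if (remaining == null) {
            holder.remove(hash);
        }
    }

    /** A disconnect releases every hash the consumer was holding. */
    void disconnect(String consumer) {
        holder.entrySet().removeIf(entry -> {
            if (!entry.getValue().equals(consumer)) {
                return false;
            }
            unacked.remove(entry.getKey());
            return true;
        });
    }

    public static void main(String[] args) {
        DrainingHashSketch sketch = new DrainingHashSketch();
        sketch.deliver(2862, "c1");
        sketch.deliver(2862, "c1");
        // Suppose the range containing 2862 has now moved from c1 to c2.
        System.out.println(sketch.deliver(2862, "c2")); // false: c1 still has 2 unacked
        sketch.acknowledge(2862);
        sketch.acknowledge(2862);
        System.out.println(sketch.deliver(2862, "c2")); // true: the hash has drained
    }
}
```

The model keeps one entry per hash rather than per hash range, which mirrors the point that only the hashes with unacknowledged messages need to block delivery.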
There's a concept of "draining hashes" in PIP-379 which is now reflected in the consumer stats. This is an intentionally exposed internal detail since the user must have the information available for understanding why messages don't get delivered.
Since there's no mapping between external and internal concepts, the abstraction isn't leaky. The user doesn't need to know about the internal details of how the draining hashes are implemented, but they need to know that the consumer is blocked on unacknowledged messages for a specific hash range. This is all relevant information and doesn't contain unnecessary implementation details.
This PR contains the "consumer stats" changes that provide the information in a clear way.
Modifications
Added consumer-level stats:
- drainingHashesCount - the current number of hashes in the draining state for this consumer
- drainingHashesClearedTotal - the total number of hashes cleared from the draining state since the consumer connected
- drainingHashesUnackedMessages - the total number of unacknowledged messages for all draining hashes for this consumer
- drainingHashes - draining hashes information for this consumer
  - hash - the sticky key hash which is draining
  - unackMsgs - the number of unacknowledged messages for this hash
  - blockedAttempts - the number of times the hash has blocked an attempted delivery of a message
In addition:
- keyHashRangeArrays - the consumer's hash range assignments in a list of lists where each item contains the start and end as elements.
  [ [ 2960, 5968 ], [ 22258, 43033 ], [ 49261, 54464 ], [ 55155, 61273 ] ]
It was necessary to add this field with a new name, keyHashRangeArrays, since there's already an existing keyHashRanges field. Changing the format of the existing field isn't possible since it would break compatibility. A newer admin client couldn't read stats from an older broker and vice-versa.
The previous keyHashRanges is now deprecated. The field format was different.
Example of both fields where the difference is visible:
The field keyHashRanges contains the information as a list of string values, which isn't very usable for most use cases since it would need to be parsed before it can be used.
The stats will continue to contain keyHashRanges and readPositionWhenJoining when the "classic" (3.3.x) implementation of Key_Shared is used by configuring subscriptionKeySharedUseClassicPersistentImplementation=true ("classic" support was added in #23424).
In the default configuration, the fields are removed from the topic stats output, but the client continues to support the fields for backward and forward compatibility.
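To illustrate why the string-based field is less convenient, the following sketch shows the kind of parsing a client would need before it could use a legacy entry. It assumes each string looks like "[2960, 5968]"; that format is an assumption here, since the original example is not preserved in this text. `KeyHashRangeFormats` and its methods are hypothetical helper names, not part of the Pulsar client. With keyHashRangeArrays the numbers are already available and no such step is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reading the legacy string-based range entries. */
final class KeyHashRangeFormats {
    /** Parses one legacy entry. ASSUMPTION: the text looks like "[2960, 5968]". */
    static int[] parseLegacyRange(String text) {
        String trimmed = text.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalArgumentException("Unexpected range format: " + text);
        }
        String[] parts = trimmed.substring(1, trimmed.length() - 1).split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two bounds: " + text);
        }
        return new int[] {Integer.parseInt(parts[0].trim()), Integer.parseInt(parts[1].trim())};
    }

    /** Converts a whole legacy list into the {start, end} shape used by the array field. */
    static List<int[]> parseLegacyRanges(List<String> keyHashRanges) {
        List<int[]> result = new ArrayList<>();
        for (String entry : keyHashRanges) {
            result.add(parseLegacyRange(entry));
        }
        return result;
    }

    public static void main(String[] args) {
        int[] range = parseLegacyRange("[2960, 5968]");
        System.out.println(range[0] + ".." + range[1]); // prints 2960..5968
    }
}
```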
Example of consumer stats for a subscription
Relevant information for consumer c1:
Relevant information in this case about consumer c2:
The PIP-379 implementation will only block hashes that are necessary. For each hash, there's a way to get detailed information to find out why the delivery is blocked.
The major difference from the previous readPositionWhenJoining solution is that it's possible to automate and build CLI and web user interface tools to assist a user, making it very easy to troubleshoot issues when message delivery is blocked by unacknowledged messages in Key_Shared subscriptions.
Client-side tooling could already use the information provided in this PR to determine which consumer is blocked by a hash when there are multiple consumers.
In the above example, the hash 2862 is contained in the hash range [1, 2959], which means 2 unacknowledged messages for that hash are preventing further messages with hash 2862 from being delivered to consumer c2.
The blockedAttempts field contains a counter that increments each time the dispatcher skips delivery to a consumer due to this hash. Using this information alone, it's very convenient to observe Key_Shared AUTO_SPLIT subscriptions and find out the causes of blocked delivery.
A future improvement will be to add a REST API for finding out the message ID of the unacknowledged message for a hash. Using this information, it's possible to find out the details of the message that is blocking a particular hash.
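The client-side lookup mentioned above could be sketched roughly as follows. This is a minimal illustration, not an existing tool: `BlockedConsumerFinder` and `ConsumerView` are hypothetical names standing in for already-parsed consumer stats, range bounds are assumed to be inclusive, and the sample data assumes that c1 is the consumer reporting hash 2862 as draining and that c1 holds the range [2960, 5968].

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side helper; not part of the Pulsar admin client. */
final class BlockedConsumerFinder {
    /** Minimal stand-in for the per-consumer stats fields used by the lookup. */
    static final class ConsumerView {
        final String name;
        final int[][] keyHashRangeArrays; // each item is {start, end}, assumed inclusive
        final int[] drainingHashes;       // only the "hash" value of each draining entry

        ConsumerView(String name, int[][] keyHashRangeArrays, int[] drainingHashes) {
            this.name = name;
            this.keyHashRangeArrays = keyHashRangeArrays;
            this.drainingHashes = drainingHashes;
        }

        /** True when the hash falls inside one of this consumer's assigned ranges. */
        boolean owns(int hash) {
            for (int[] range : keyHashRangeArrays) {
                if (hash >= range[0] && hash <= range[1]) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Maps each draining hash to the name of the other consumer whose range contains it. */
    static Map<Integer, String> blockedConsumerByHash(List<ConsumerView> consumers) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (ConsumerView draining : consumers) {
            for (int hash : draining.drainingHashes) {
                for (ConsumerView candidate : consumers) {
                    if (candidate != draining && candidate.owns(hash)) {
                        result.put(hash, candidate.name);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative data only; the ranges and the draining consumer are assumptions.
        ConsumerView c1 = new ConsumerView("c1", new int[][] {{2960, 5968}}, new int[] {2862});
        ConsumerView c2 = new ConsumerView("c2", new int[][] {{1, 2959}}, new int[0]);
        System.out.println(blockedConsumerByHash(Arrays.asList(c1, c2))); // prints {2862=c2}
    }
}
```

A real tool would presumably also surface unackMsgs and blockedAttempts for each hash so that the most disruptive hashes can be listed first.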
Documentation
doc
doc-required
doc-not-needed
doc-complete