Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat][broker] PIP-264: Add topic messaging metrics #22467

Merged
merged 59 commits into from
May 1, 2024

Conversation

dragosvictor
Copy link
Contributor

@dragosvictor dragosvictor commented Apr 9, 2024

PIP-264

Motivation

This PR adds OpenTelemetry metrics for the topic messaging group. The aim is to replace the Prometheus metrics listed in the reference documentation.

Modifications

  • Added the following general topic attributes (sample output relative to topic persistent://my-tenant/my-namespace/my-topic-partition-2):

    Attribute Notes Sample
    pulsar.domain persistent
    pulsar.tenant my-tenant
    pulsar.namespace Includes both tenant and namespace my-tenant/my-namespace
    pulsar.topic Includes the complete topic name, without partition persistent://my-tenant/my-namespace/my-topic
    pulsar.partition.index Not included if the topic is not partitioned 2
  • Added attributes

    • pulsar.transaction.status (differentiates between committed, aborted and active txns)
    • pulsar.backlog.quota.type (differentiates between size and time based quotas)
    • pulsar.compaction.status (differentiates between success and failure of compaction operation result)
  • Added class OpenTelemetryTopicStats. This is where the metrics are being defined (name, type, unit, description). All added metrics are 'asynchronous': OpenTelemetry uses a callback mechanism to retrieve their values, instead of internally maintaining the counters itself. Thus, we can leverage the existing counters stored throughout Pulsar. It also allows us to 'remove' a topic from OpenTelemetry tracking by simply not recording any metrics for it during a callback op (contrast that to synchronous OpenTelemetry metrics that cannot be 'closed', staying resident in memory). To further improve performance, the PR batches together updates for these metrics, meaning we can record all metrics relevant to a topic in one callback operation.

  • For the metric definitions, please consult [feat][site] PIP-264: Document topic messaging metrics pulsar-site#880

  • Many of the Prometheus metrics exposed both rates and absolute counters of their values. The rates would be redundant in OpenTelemetry, and are excluded here.

  • In the interest of keeping the size of this PR manageable, a few metrics have been omitted and will be added separately: pulsar_delayed_message_*. Replacement for metric pulsar_consumer_msg_ack_rate is defined in the PR but not currently recorded, as more work is needed to fetch this information efficiently. Metrics of type Histogram will only be populated at the namespace level, and are thus excluded here: pulsar_storage_write_latency_le_*, pulsar_entry_size_le_*, pulsar_compaction_latency_le_*, pulsar_delayed_message_index_bucket_op_latency_ms.

  • A couple of metrics did not have a proper backend for reporting:

    • Added getAddEntrySucceedTotal and getReadEntriesSucceededTotal to ManagedLedgerMXBean, supporting metrics pulsar.broker.topic.storage.outgoing and pulsar.broker.topic.storage.incoming.
    • Added field AbstractTopic.totalPublishRateLimitedCounter, reporting back the total hit number for the publish rate limiter.
    • Added fields CompactionRecord.compactionReadBytes and CompactionRecord.compactionWriteBytes, supporting metrics pulsar.broker.topic.compaction.incoming and pulsar.broker.topic.compaction.outgoing.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

(example:)

  • All metrics that are exposed in this PR are tested for value
  • Added OpenTelemetryTopicStatsTest#testMessagingMetrics to validate basic messaging metrics: subscription/producer/consumer counts, message in/out counts, bytes in/out counts etc.
  • Added OpenTelemetryTopicStatsTest#testPublishRateLimitMetric to validate rate limiter metric
  • Modified TransactionTest#testTopicTransactionMetrics to validate OpenTelemetry topic metrics too
  • Modified DelayedDeliveryTest#testDelayedDelivery to validate topic delayed delivery metric
  • Modified CompactorTest#testAllCompactedOut to validate message compaction metrics
  • Modified BacklogQuotaManagerTest#testConsumerBacklogEvictionTimeQuota and BacklogQuotaManagerTest#testConsumerBacklogEvictionSizeQuota to validate eviction backlog metrics
  • Modified AdminApiOffloadTest#testOffload to validate storage offload metric

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

Matching PR in forked repository

PR in forked repository: dragosvictor#17

@dragosvictor dragosvictor marked this pull request as ready for review April 9, 2024 21:11
@merlimat merlimat added this to the 3.3.0 milestone Apr 9, 2024
Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good.

My only comment is that we should be using qualified names in the attributes:

pulsar.namespace Includes the namespace portion only my-namespace
pulsar.topic Includes the local topic name my-topic

eg: pulsar.namespace -> my-tenant/my-namespace
pulsar.topic -> my-tenant/my-namespace/my-topic

This is to keep consistency with many other places in APIs and CLI tools. Also, consistent with OTel metrics attributes in the client SDK.

@dragosvictor
Copy link
Contributor Author

Changes look good.

My only comment is that we should be using qualified names in the attributes:

pulsar.namespace Includes the namespace portion only my-namespace
pulsar.topic Includes the local topic name my-topic

eg: pulsar.namespace -> my-tenant/my-namespace pulsar.topic -> my-tenant/my-namespace/my-topic

This is to keep consistency with many other places in APIs and CLI tools. Also, consistent with OTel metrics attributes in the client SDK.

Note that such a transformation would lead the namespace to occasionally include the "cluster" portion in case of old topic names: pulsar.namespace="my-property/use/my-ns"

The topic name, as is currently filled in by the client also includes the persistence part: pulsar.topic="persistent://my-property/use/my-ns/testAllCompactedOut-07b9ad7f-89cb-4800-88e8-cb3417cf0406".

If we can confirm that this is the intent here, I can go ahead and proceed with this proposal. Alternatively, we can augment the existing implementation with one more attribute, say pulsar.topic.complete.name, even tough it leads to repetition.

@merlimat
Copy link
Contributor

@dragosvictor I think we shouldn't worry about v1 topic names, these were deprecated long ago we should actually start to get rid of them completely.

The topic name, as is currently filled in by the client also includes the persistence part: pulsar.topic="persistent://my-property/use/my-ns/testAllCompactedOut-07b9ad7f-89cb-4800-88e8-cb3417cf0406".

Yes, it's better to include, because it's part of fully-qualified name.

@dragosvictor dragosvictor force-pushed the pip-264-topic-messaging-metrics branch from 7d64327 to f8f2e9a Compare April 25, 2024 22:15
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 99.23954% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 72.47%. Comparing base (bbc6224) to head (f8f2e9a).
Report is 208 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22467      +/-   ##
============================================
- Coverage     73.57%   72.47%   -1.10%     
+ Complexity    32624    32374     -250     
============================================
  Files          1877     1886       +9     
  Lines        139502   140783    +1281     
  Branches      15299    15456     +157     
============================================
- Hits         102638   102037     -601     
- Misses        28908    30872    +1964     
+ Partials       7956     7874      -82     
Flag Coverage Δ
inttests 27.11% <87.45%> (+2.52%) ⬆️
systests 24.59% <60.07%> (+0.26%) ⬆️
unittests 71.21% <99.23%> (-1.64%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...ookkeeper/mledger/impl/ManagedLedgerMBeanImpl.java 89.85% <100.00%> (+0.14%) ⬆️
...n/java/org/apache/pulsar/broker/PulsarService.java 82.27% <100.00%> (-0.11%) ⬇️
...rg/apache/pulsar/broker/service/AbstractTopic.java 88.19% <100.00%> (+0.20%) ⬆️
...che/pulsar/broker/stats/prometheus/TopicStats.java 94.26% <ø> (+0.07%) ⬆️
...org/apache/pulsar/compaction/CompactionRecord.java 100.00% <100.00%> (ø)
.../pulsar/opentelemetry/OpenTelemetryAttributes.java 100.00% <100.00%> (ø)
...e/pulsar/broker/stats/OpenTelemetryTopicStats.java 99.15% <99.15%> (ø)

... and 313 files with indirect coverage changes

@merlimat merlimat merged commit 4f3cc6c into apache:master May 1, 2024
51 of 68 checks passed
if (topicName.isPartitioned()) {
builder.put(OpenTelemetryAttributes.PULSAR_PARTITION_INDEX, topicName.getPartitionIndex());
}
var attributes = builder.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dragosvictor @merlimat Please note this contradicts all the work done to eliminate memory allocation.
The Attributes for each topic should be persisted either in a Map or easier by saving it in the implementation of Topic interface you're getting. Then you simply retrieve it by topic.getTopicOTelAttributes().
When you are not doing it, you are creating a memory allocation each time you run collect, for each topic. This is exactly what I was set out to avoid in the whole implementation rework of OpenTelemetry Java SDK memory mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asafm can you please create an issue to track this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-required Your PR changes impact docs and you will update later. ready-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants