Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Do not record a bundle-unloading into the topic load failed metrics #23334

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

poorbarcode
Copy link
Contributor

Motivation & Modifications

Pulsar has a metric that indicates load topic failed: topic_load_failed_total, it reflects something is not expected, such as

  • Not enough bookies
  • Failed to connect to the Metadata store

But the topic loaded failed due to a bundle in unloading is expected, we should not record it into this metrics.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Sep 23, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 23, 2024
@poorbarcode poorbarcode changed the title [fix] [broker] Do not record a bundle-unloading into the topic load failed metrics [draft] [fix] [broker] Do not record a bundle-unloading into the topic load failed metrics Sep 23, 2024
@poorbarcode poorbarcode changed the title [draft] [fix] [broker] Do not record a bundle-unloading into the topic load failed metrics [fix] [broker] Do not record a bundle-unloading into the topic load failed metrics Sep 23, 2024
Comment on lines 1680 to 1681
MutableInt failedLoadTopic1 = new MutableInt(0);
MutableInt concurrencyLoadTopicAndUnloadBundle1 = new MutableInt(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since multiple threads are involved, shouldn't these be AtomicIntegers?

Copy link
Member

@lhotari lhotari Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't currently understand the reason to use MutableInts.
If the reason is to that a final variable is needed for an inner lambda, it might be better to use a final variable.
One trick that could be useful is to isolate variable scopes with a {....} block. That would allow to introduce the same variable name multiple times in a single unit test without variable name collisions since the variables would be in different scopes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@heesung-sn
Copy link
Contributor

But the topic loaded failed due to a bundle in unloading is expected, we should not record it into this metrics.

Don't we need to count this case as failure as well? To me if the topic load failed on the broker and client lookup failed, it should be counted as a failure.

I think this is one area that Extensible Load Manager can shine as it handles lookups more gracefully during unloading -- the topic lookups are deferred until the unloading completes in the service unit state channel. Hence, the client won't see failure, and the lookup response should be slightly delayed instead of returning error code(no retry as well).

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Sep 24, 2024

@heesung-sn

Don't we need to count this case as failure as well? To me if the topic load failed on the broker and client lookup failed, it should be counted as a failure.

It is useful to check if there is a bug in LoadBalancer, such as: if the variable keep increasing fastly for a long time, maybe the a broker cached a wrong bundle owner or DNS issue

Sorry, I misunderstood the comment. I reply again:

  • Increasespulsar_concurrency_load_topic_and_unload_bundle_count when a topic loaded failed due to the bundle unloading
  • Increasestopic_load_failed_total when an unexpected error occurs

If users assumed it should be combined into failed to load topic, they can get the counter this way pulsar_concurrency_load_topic_and_unload_bundle_count + topic_load_failed_total

@poorbarcode poorbarcode requested review from lhotari and heesung-sn and removed request for heesung-sn September 24, 2024 02:40
@heesung-sn
Copy link
Contributor

I still think this is a still failure case. Can we add a "reason" label instead of adding a new metric?

@poorbarcode
Copy link
Contributor Author

poorbarcode commented Sep 25, 2024

@heesung-sn

I still think this is a still failure case. Can we add a "reason" label instead of adding a new metric?

I will start a PIP for this metric, it is useful to detect the following issue:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs release/3.0.7 release/3.3.2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants