archival: consistent log size probes across replicas #24257

Open
wants to merge 1 commit into
base: dev
from

Conversation

nvartolomei
Contributor

@nvartolomei nvartolomei commented Nov 22, 2024

We called update_probe only on leaders, and only after exiting the upload loop, which led to inconsistent and stale metrics.

Fix this by adding a subscription mechanism to the STM, which is the source of truth for the manifest state and must be consistent across all replicas.

Another option is to report metrics only from the leader :think:
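
For illustration only, the subscription idea can be sketched as below. This is not the code in this PR: `toy_stm`, `toy_probe`, `apply_add_segment` and `probe_value_after` are hypothetical names, and the sketch assumes a single watcher that is invoked after every applied command on whichever replica applies it.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the archival STM. Every replica applies the same
// replicated commands, so state derived from them is identical everywhere.
class toy_stm {
public:
    using callback = std::function<void()>;

    // Register a single watcher that runs after each applied command.
    void subscribe_to_state_change(callback cb) { _on_change = std::move(cb); }

    // Runs on leaders and followers alike when a command is applied.
    void apply_add_segment(std::uint64_t size_bytes) {
        _cloud_log_size += size_bytes;
        if (_on_change) {
            _on_change();
        }
    }

    std::uint64_t cloud_log_size() const { return _cloud_log_size; }

private:
    std::uint64_t _cloud_log_size{0};
    callback _on_change;
};

// Hypothetical probe holding the last published metric value.
struct toy_probe {
    std::uint64_t cloud_log_size{0};
};

// Apply the given segment sizes to one replica and return what its probe
// reports afterwards.
inline std::uint64_t
probe_value_after(const std::vector<std::uint64_t>& segment_sizes) {
    toy_stm stm;
    toy_probe probe;
    stm.subscribe_to_state_change(
      [&stm, &probe] { probe.cloud_log_size = stm.cloud_log_size(); });
    for (auto size : segment_sizes) {
        stm.apply_add_segment(size);
    }
    return probe.cloud_log_size;
}
```

Because the probe is refreshed from the apply path rather than from the upload loop, any replica that applies the same commands publishes the same value.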

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Make the redpanda_cloud_storage_cloud_log_size metric consistent across all replicas. Previously it was updated only infrequently, and only from the leader replica, which led to inconsistent or stale values.

@nvartolomei nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 734375b to 6e467d4 Compare November 22, 2024 15:39
@nvartolomei nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 6e467d4 to 74eb0dd Compare November 22, 2024 16:04
@vbotbuildovich
Collaborator

the below tests from https://buildkite.com/redpanda/redpanda/builds/58575#0193549f-2f41-4bc4-af2a-f496586d336e have failed and will be retried

cloud_storage_rpfixture

Member

@dotnwat dotnwat left a comment

update_probe doesn't block, so i'm wondering why the metrics aren't computed on demand when requested. is it very expensive? other than that i wonder if the subscription is overkill compared to a few more invocations of update_probe? i couldn't follow the code flow exactly, so maybe that wouldn't make sense...

// Ensure we are exception safe and won't leave the probe without
// a watcher in case of exceptions. Also ensures we won't crash calling
// the callback.
static_assert(noexcept(update_probe()));
Member

nit: you could add this requirement to the callback parameter of subscribe_to_state_change?
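
One possible shape for that, as a rough sketch rather than the actual signature in this PR: the subscription function itself rejects callbacks that are not `noexcept`, so each call site no longer needs its own `static_assert`. `checked_stm`, `notify` and `accepted_by_subscribe` are made-up names for illustration.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <utility>

// True when a callable of type F can be invoked with no arguments and is
// declared noexcept. Hypothetical helper, used only by this sketch.
template <typename F>
constexpr bool accepted_by_subscribe = std::is_nothrow_invocable_v<F&>;

// Hypothetical STM-like class that enforces the requirement at the API
// boundary instead of at each caller.
class checked_stm {
public:
    template <typename F>
    void subscribe_to_state_change(F f) {
        static_assert(
          accepted_by_subscribe<F>,
          "state change callbacks must be noexcept");
        _cb = std::move(f);
    }

    // Invoke the watcher, if any. The stored callable was verified to be
    // noexcept when it was registered.
    void notify() {
        if (_cb) {
            _cb();
        }
    }

private:
    std::function<void()> _cb;
};
```

Passing a lambda that is not marked `noexcept` then fails to compile at the `subscribe_to_state_change` call.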

@nvartolomei
Contributor Author

@dotnwat that's a great idea. I'm in favor of computing on demand. Metric scraping should be less frequent than applying stm commands too so even better.
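
A minimal sketch of the on-demand variant, assuming the metrics layer can be given a function to call at scrape time. `toy_manifest`, `on_demand_probe` and `scrape` are hypothetical names and do not correspond to the real probe or metrics API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical manifest exposing only the value the metric needs.
struct toy_manifest {
    std::uint64_t cloud_log_size_bytes{0};
};

// Hypothetical probe that stores how to read the value, not the value itself.
class on_demand_probe {
public:
    explicit on_demand_probe(std::function<std::uint64_t()> read_size)
      : _read_size(std::move(read_size)) {}

    // Called by the metrics endpoint. The size is read at scrape time, so it
    // cannot be stale and needs no update call on the apply path.
    std::uint64_t scrape() const { return _read_size(); }

private:
    std::function<std::uint64_t()> _read_size;
};
```

The work now happens once per scrape instead of once per applied command. The reader captures the manifest, so the manifest must outlive the probe.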

@Lazin
Contributor

Lazin commented Nov 26, 2024

@nvartolomei in order to do this you need to add manifest to the probe as a dependency. I'm not against it but object lifetimes might become tricky.
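
To make the lifetime concern concrete, here is one hedged way a probe could depend on the manifest without risking a dangling reference. It assumes the manifest is held in a `std::shared_ptr`, which may not match how the real code owns it; `shared_manifest` and `weak_probe` are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical manifest exposing only the value the metric needs.
struct shared_manifest {
    std::uint64_t cloud_log_size_bytes{0};
};

// Hypothetical probe that observes the manifest without owning it.
class weak_probe {
public:
    explicit weak_probe(std::weak_ptr<const shared_manifest> m)
      : _manifest(std::move(m)) {}

    // Returns the current size, or nullopt if the manifest has already been
    // destroyed (for example during partition shutdown).
    std::optional<std::uint64_t> scrape() const {
        if (auto m = _manifest.lock()) {
            return m->cloud_log_size_bytes;
        }
        return std::nullopt;
    }

private:
    std::weak_ptr<const shared_manifest> _manifest;
};
```

If the probe instead held a plain reference, it would have to be deregistered before the manifest is destroyed.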


5 participants