archival: consistent log size probes across replicas #24257

nvartolomei · 2024-11-22T14:17:30Z

We called update_probe only from leaders, and only after exiting the upload loop, which led to inconsistent and stale metrics.

Fix this by introducing a subscription mechanism to the STM, which is the source of truth for the manifest state and must be consistent across all replicas.

Another option is to report metrics only from the leader :think:
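For readers who have not opened the diff, the subscription idea can be sketched roughly as below. This is a simplified illustration in plain C++, not the code in this PR: `toy_stm`, `toy_probe`, `apply`, and `unsubscribe` are hypothetical names, and only `subscribe_to_state_change` is taken from the review thread (its real signature is not shown here).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for the archival STM; not Redpanda's actual class.
class toy_stm {
public:
    using subscription_id = std::uint64_t;
    using callback = std::function<void()>;

    // Register a callback that runs after every applied state change.
    subscription_id subscribe_to_state_change(callback cb) {
        auto id = _next_id++;
        _subs.emplace(id, std::move(cb));
        return id;
    }

    // Remove a previously registered callback.
    void unsubscribe(subscription_id id) { _subs.erase(id); }

    // Every replica applies the same commands, so every replica notifies
    // its subscribers, not just the leader.
    void apply(std::uint64_t segment_bytes) {
        _cloud_log_size += segment_bytes;
        for (auto& entry : _subs) {
            entry.second();
        }
    }

    std::uint64_t cloud_log_size() const { return _cloud_log_size; }

private:
    std::uint64_t _cloud_log_size{0};
    subscription_id _next_id{0};
    std::map<subscription_id, callback> _subs;
};

// Hypothetical probe that stores the last value pushed to it.
struct toy_probe {
    std::uint64_t cloud_log_size{0};
};
```

With this shape, a probe stays in sync on any replica by subscribing a callback that copies `stm.cloud_log_size()` into the probe, and it stops receiving updates once it unsubscribes.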

Backports Required

Release Notes

Bug Fixes

Make the redpanda_cloud_storage_cloud_log_size metric consistent across all replicas. Previously it was updated only occasionally, and only from the leader replica, which led to inconsistent or stale values.

src/v/cluster/archival/ntp_archiver_service.cc

src/v/cluster/archival/stm_subscriptions.h


vbotbuildovich · 2024-11-22T18:27:56Z

the below tests from https://buildkite.com/redpanda/redpanda/builds/58575#0193549f-2f41-4bc4-af2a-f496586d336e have failed and will be retried

cloud_storage_rpfixture

vbotbuildovich · 2024-11-22T19:03:41Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58575#019354e4-a507-4c8c-b133-8647ca1bde2a

dotnwat

update_probe doesn't block, so i'm wondering why the metrics aren't computed on demand when requested. is it very expensive? other than that i wonder if the subscription is overkill compared to a few more invocations of update_probe? i couldn't follow the code flow exactly, so maybe that wouldn't make sense...

dotnwat · 2024-11-26T04:38:07Z

src/v/cluster/archival/ntp_archiver_service.cc

+        // Ensure we are exception safe and won't leave the probe without
+        // a watcher in case of exceptions. Also ensures we won't crash calling
+        // the callback.
+        static_assert(noexcept(update_probe()));


nit: you could add this requirement to the callback parameter of subscribe_to_state_change?
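A rough sketch of what that could look like, assuming the subscribe function is (or can become) a template over the callback type. `toy_subscriptions` and `notify` are hypothetical names, and the real `subscribe_to_state_change` signature is not shown in this thread, so treat this only as an illustration of moving the check from the call site into the callee:

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical registry used only to illustrate the suggestion.
class toy_subscriptions {
public:
    template<typename F>
    void subscribe_to_state_change(F f) {
        // Reject callbacks that may throw at compile time, so each caller
        // no longer needs its own static_assert.
        static_assert(
          noexcept(std::declval<F&>()()),
          "state change callbacks must be noexcept");
        _callbacks.emplace_back(std::move(f));
    }

    // Safe to mark noexcept because every stored callback was checked above
    // and the stored std::function objects are never empty.
    void notify() noexcept {
        for (auto& cb : _callbacks) {
            cb();
        }
    }

private:
    std::vector<std::function<void()>> _callbacks;
};
```

A lambda declared as `[&]() noexcept { ... }` is accepted, while the same lambda without `noexcept` fails to compile with the message above.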

nvartolomei · 2024-11-26T09:17:29Z

@dotnwat that's a great idea. I'm in favor of computing on demand. Metric scraping should be less frequent than applying stm commands too so even better.

Lazin · 2024-11-26T13:46:29Z

@nvartolomei in order to do this you need to add manifest to the probe as a dependency. I'm not against it but object lifetimes might become tricky.
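One possible way to picture the on-demand variant and the lifetime concern, as a sketch only: the probe holds a non-owning handle to the manifest and reads it when scraped. `toy_manifest` and `on_demand_probe` are hypothetical, the use of `std::weak_ptr` is an assumption made for illustration (the thread does not say how the manifest is owned), and the metrics registration itself is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical manifest exposing only the value the metric needs.
struct toy_manifest {
    std::uint64_t cloud_log_size_bytes{0};
};

// Hypothetical probe that computes the metric when it is scraped
// instead of caching a value pushed by the archiver.
class on_demand_probe {
public:
    explicit on_demand_probe(std::weak_ptr<const toy_manifest> m)
      : _manifest(std::move(m)) {}

    // Called on scrape. If the manifest is already gone, report 0
    // rather than touching a dangling object.
    std::uint64_t cloud_log_size() const {
        auto m = _manifest.lock();
        return m ? m->cloud_log_size_bytes : 0;
    }

private:
    std::weak_ptr<const toy_manifest> _manifest;
};
```

The scrape always sees the current manifest value, and the explicit expiry check is where the object-lifetime question shows up: the probe must tolerate outliving the manifest.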

github-actions bot added area/build area/redpanda labels Nov 22, 2024

nvartolomei requested review from Lazin, andrwng, dotnwat and WillemKauf November 22, 2024 14:24

WillemKauf reviewed Nov 22, 2024

src/v/cluster/archival/ntp_archiver_service.cc (outdated)

WillemKauf reviewed Nov 22, 2024

src/v/cluster/archival/ntp_archiver_service.cc

WillemKauf reviewed Nov 22, 2024

src/v/cluster/archival/stm_subscriptions.h (outdated)

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 734375b to 6e467d4 Compare November 22, 2024 15:39

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 6e467d4 to 74eb0dd Compare November 22, 2024 16:04

Lazin approved these changes Nov 25, 2024

dotnwat reviewed Nov 26, 2024
