Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced Prometheur metrics' pusher #7346

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
Draft

Enhanced Prometheur metrics' pusher #7346

wants to merge 10 commits into from

Conversation

Scooletz
Copy link
Contributor

@Scooletz Scooletz commented Aug 20, 2024

This PR enhances monitoring capabilities by adding an option to force push of metrics right after a specific event. Next, it uses the option to force push after a block processing is done. This should allow to measure multiple nodes with better comparability, by making snapshots of metrics at the same point of execution. What is expected is that once metrics are published at the end of each block processed, they should be aligned and allow better comparison between instances.

image

Design

With introduction of another signal to publish metrics that is not time-based, there should be a limit of how frequent the other signal can be called. If blocks were processed 10 times per second, clearly we don't want to flush metrics that often. The signal should be throttled in some way. The following design it proposed then:

  1. instead of IMetricsConfig.PushAfterBlock that were added already, introduce IMetricsConfig.MinimalIntervalSeconds that would default to 2
  2. leave MetricPusher triggerable as is, meaning that if a MetricsController updates metrics, it can immediately releate pusher to push
  3. MetricsController to collect metrics and trigger MetricPusher
    1. after update and triggering MetricPusher waits MinimalIntervalSeconds before proceeding
    2. awaits one of the events:
      1. an external signal (block)
      2. passing IntervalSeconds - MinimalIntervalSeconds

With this approach we:

  1. limit the number of updates by minimal wait time MinimalIntervalSeconds so that Graphana is not flooded
  2. limit the time to publish metrics by sticking to IntervalSeconds in the worst case
  3. allow signaling but not oversignaling

Changes

  • MonitoringService with a new MetricPusher that allows to force the push
  • UpdateMetrics can be triggered via timer but also via force push
  • iteration over dictionaries simplified so that no retrieval of the value is needed (using enumerator instead of the indexer to get the value)

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

Optional. Remove if not applicable.

Documentation

Requires documentation update

  • Yes
  • No

If yes, link the PR to the docs update or the issue with the details labeled docs. Remove if not applicable.

Requires explanation in Release Notes

  • Yes
  • No

If yes, fill in the details here. Remove if not applicable.

Remarks

Optional. Remove if not applicable.

@@ -8,6 +8,8 @@ public class MetricsConfig : IMetricsConfig
public string ExposeHost { get; set; } = "+";
public int? ExposePort { get; set; } = null;
public bool Enabled { get; set; } = false;

public bool PushAfterBlock { get; set; } = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If idea here would work I'd probably prefer it to be enabled by default - this way we have much better view on metrics than now.

@Scooletz
Copy link
Contributor Author

Scooletz commented Sep 3, 2024

Two separate nodes run to compare metrics that were incoherent in the past, like gas:

  1. https://github.com/NethermindEth/nethermind/actions/runs/10682675711
  2. https://github.com/NethermindEth/nethermind/actions/runs/10684064868

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants