Telemetry support #169

Merged
merged 7 commits into from
Sep 30, 2024

Maelkum
Contributor

@Maelkum Maelkum commented Sep 25, 2024

This PR introduces telemetry support for b7s nodes.

Two main mechanisms are supported: tracing and metrics.

Tracing

Tracing can be enabled with the CLI flag --enable-tracing, combined with either --tracing-grpc-endpoint or --tracing-http-endpoint, depending on which protocol should be used to transfer tracing data.

Below is an example configuration for a config file:

telemetry:
  tracing:
    enable: true
    exporter-batch-timeout: 5s
    http:
      endpoint: localhost:4318
    grpc:
      endpoint: localhost:4317
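
As a rough illustration of how such a configuration might be checked before any exporter is created, here is a minimal sketch. The type and function names (TracingConfig, Exporter, validateTracing) are hypothetical and not taken from the b7s code base. The sketch also assumes that a batch timeout is always provided and that at least one endpoint must be set when tracing is enabled.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// TracingConfig mirrors the keys of the example config above.
// The Go type and field names are hypothetical, not taken from b7s.
type TracingConfig struct {
	Enable               bool
	ExporterBatchTimeout string // e.g. "5s"
	HTTPEndpoint         string // e.g. "localhost:4318"
	GRPCEndpoint         string // e.g. "localhost:4317"
}

// Exporter describes one configured destination for tracing data.
type Exporter struct {
	Protocol string
	Endpoint string
}

// validateTracing returns the parsed batch timeout and the list of
// configured exporters. Assumption: when tracing is enabled, a valid
// timeout and at least one endpoint are required.
func validateTracing(cfg TracingConfig) (time.Duration, []Exporter, error) {
	if !cfg.Enable {
		// Tracing is off, so there is nothing to validate.
		return 0, nil, nil
	}

	timeout, err := time.ParseDuration(cfg.ExporterBatchTimeout)
	if err != nil {
		return 0, nil, fmt.Errorf("invalid exporter-batch-timeout: %w", err)
	}

	var exporters []Exporter
	if cfg.GRPCEndpoint != "" {
		exporters = append(exporters, Exporter{Protocol: "grpc", Endpoint: cfg.GRPCEndpoint})
	}
	if cfg.HTTPEndpoint != "" {
		exporters = append(exporters, Exporter{Protocol: "http", Endpoint: cfg.HTTPEndpoint})
	}
	if len(exporters) == 0 {
		return 0, nil, errors.New("tracing enabled but no gRPC or HTTP endpoint set")
	}

	return timeout, exporters, nil
}

func main() {
	cfg := TracingConfig{
		Enable:               true,
		ExporterBatchTimeout: "5s",
		HTTPEndpoint:         "localhost:4318",
		GRPCEndpoint:         "localhost:4317",
	}

	timeout, exporters, err := validateTracing(cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("batch timeout:", timeout)
	for _, e := range exporters {
		fmt.Println("exporter:", e.Protocol, e.Endpoint)
	}
}
```

The sketch deliberately does not pick a winner when both endpoints are set. How the node behaves in that case is defined by the actual implementation.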

More background on the implementation can be found in #158.

Metrics

Metrics can be enabled with the CLI flags --enable-metrics and --prometheus-address. The latter controls the address on which the worker node serves metrics. On the head node, metrics are served on the REST API address instead.

Metrics can be found on the /metrics endpoint.
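
To check that a node is exposing metrics, the endpoint can be fetched and inspected. The sketch below is a minimal, standard-library-only example that requests /metrics and looks up a single unlabeled sample in the Prometheus text exposition format. The metric name example_messages_total and the helper names are hypothetical, and a local test server stands in for a real node. In practice, a Prometheus server or a proper parser library would do this work.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// findSample scans Prometheus text exposition output for an unlabeled
// sample with the given name and returns its value. Comment lines
// (# HELP, # TYPE) and labeled samples are skipped in this sketch.
func findSample(body, name string) (float64, bool) {
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

// scrape fetches baseURL + "/metrics" and looks up one sample by name.
func scrape(baseURL, name string) (float64, bool, error) {
	resp, err := http.Get(baseURL + "/metrics")
	if err != nil {
		return 0, false, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return 0, false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, false, err
	}
	v, ok := findSample(string(data), name)
	return v, ok, nil
}

func main() {
	// Stand-in for a node: a test server with a hand-written payload.
	// The metric name is hypothetical, not one defined by b7s.
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "# TYPE example_messages_total counter\nexample_messages_total 42\n")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	v, ok, err := scrape(srv.URL, "example_messages_total")
	fmt.Println(v, ok, err)
}
```

Against a real node, the base URL would be the configured Prometheus address for a worker, or the REST API address for a head node.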

Support for push metrics was dropped because it seems to go against the Prometheus guidelines. For nodes that Prometheus cannot reach directly, pull-based scraping can still be enabled using Prometheus PushProx.

For the config file, the metrics configuration might be:

telemetry:
  metrics:
    enable: true
    prometheus-address: localhost:8080
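
The rule for where metrics end up being served (the Prometheus address on a worker, the REST API address on a head node) can be sketched as a small helper. The names Role, RoleHead, RoleWorker and metricsAddress are hypothetical and only illustrate the behaviour described above. The real node wiring may look quite different.

```go
package main

import (
	"errors"
	"fmt"
)

// Role identifies the kind of node. The type and constants are
// hypothetical names used only for this sketch.
type Role string

const (
	RoleHead   Role = "head"
	RoleWorker Role = "worker"
)

// metricsAddress returns the address on which /metrics would be served.
// It returns an empty string when metrics are disabled.
func metricsAddress(role Role, enabled bool, restAPIAddr, prometheusAddr string) (string, error) {
	if !enabled {
		return "", nil
	}

	switch role {
	case RoleHead:
		// Head nodes reuse the REST API address for metrics.
		if restAPIAddr == "" {
			return "", errors.New("head node has no REST API address")
		}
		return restAPIAddr, nil
	case RoleWorker:
		// Worker nodes serve metrics on the dedicated Prometheus address.
		if prometheusAddr == "" {
			return "", errors.New("worker node has no prometheus-address")
		}
		return prometheusAddr, nil
	default:
		return "", fmt.Errorf("unknown role %q", role)
	}
}

func main() {
	addr, err := metricsAddress(RoleWorker, true, "", "localhost:8080")
	fmt.Println(addr, err)
}
```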

More information about the implementation can be found in #164.

Note that these two PRs are not the final version. Some changes were made afterwards in #167 and, on a smaller scale, in this PR after rebasing the feature branch.

The subsequent changes mainly deal with splitting tracing and metrics so that either one can be enabled on its own, instead of forcing telemetry as a single system.

* WIP: Starting bootstrapping prometheus

* Minor refactor

* Fix init condition

* Add libp2p metrics

* Add echo metrics

* WIP: metrics rename

* Fix closing of libp2p host

* Use milliseconds for time resolution + use prefixes for different packages

* Use variables for metric names for node

* Fstore and host also use vars for metrics

* Add descriptions to defined metrics

* Add more metrics - consensus time and messages sent/published

* Add node info to metrics

* Obsolete comment

* Minor tweaks

* (re)Add node ID to the node_info metric

* Update/fix tests

* Refactor telemetry init functions

* Tests for resource and simple tracing (in mem exporter)

* Add test for tracing wrapped functions

* Add tests for trace info

* Decouple prometheus definitions bootstrapping by using config

* Remove printf

* Add test to trace health check and worker execution

* Go mod tidy

* Offload pipeline type to a separate package

* Telemetry tests are external

* Refactor metric initialization and start adding tests for metrics

* Add tests for metrics config

* Use metrics as a subcomponent instead of the global instance

* Rename receiver in fstore code

* Add test for processed messages

* Rename test file

* Add metrics verification to integration test

* Split tracing and metrics initialization

* Use new context for untraced test

* Remove addressed comments

@Maelkum Maelkum self-assigned this Sep 25, 2024
@Maelkum Maelkum requested a review from dmikey September 25, 2024 08:48
@dmikey dmikey merged commit 84eb6d3 into main Sep 30, 2024
5 checks passed
@dmikey dmikey deleted the open-telemetry branch September 30, 2024 20:02