Experiment with metrics and prometheus exporter with otel4s #353

lenguyenthanh · 2024-10-27T20:08:59Z

No description provided.

project/Dependencies.scala

iRevive · 2024-10-29T08:12:29Z

There is a new otel4s snapshot, 0.11-8e1f500-SNAPSHOT, with your fixes.

lenguyenthanh · 2024-10-29T08:17:24Z

> There is a new otel4s snapshot, 0.11-8e1f500-SNAPSHOT, with your fixes.

thanks a lot @iRevive, I'll bump it later.

btw there is an issue with the Prometheus exporter's built-in server when I deploy using systemd:

Oct 28 16:54:37 sirch systemd[3137355]: lila-search-ingestor.service: Failed to execute command: Permission denied
Oct 28 16:54:37 sirch systemd[3137355]: lila-search-ingestor.service: Failed at step EXEC spawning /home/lila-search-ingestor/bin/app: Permission denied
Oct 28 16:54:37 sirch systemd[1]: lila-search-ingestor.service: Main process exited, code=exited, status=203/EXEC
Oct 28 16:54:37 sirch systemd[1]: lila-search-ingestor.service: Failed with result 'exit-code'.

I think the Prometheus exporter tries to spawn a new process, which is not allowed here. Do you have any idea how to bypass this?

iRevive · 2024-10-29T08:23:37Z

Hm, by default, the Prometheus server is created as:

val routes = PrometheusHttpRoutes.routes[F](exporter, writerConfig)

EmberServerBuilder
  .default[F]
  .withHost(host)
  .withPort(port)
  .withHttpApp(Router("metrics" -> routes).orNotFound)
  .build

Where the default host is host"localhost" and the default port is port"9464". Perhaps your system doesn't allow the process to launch an HTTP server on localhost.

You can change the default host/port via environment variables or system properties. Here is the complete set of available Prometheus exporter settings.

export OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0 should do the trick.
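One way to test the "not allowed to listen on localhost" hypothesis independently of otel4s is to try binding a plain socket from the same service account. The sketch below uses only the JDK; `canBind` is a hypothetical helper name, not part of otel4s or http4s, and it only checks whether the bind succeeds, nothing more.

```scala
import java.net.{InetAddress, ServerSocket}
import scala.util.Using

/** Tries to open (and immediately close) a listening socket on host:port.
  * Returns Right(boundPort) on success or Left(description of the failure).
  * Hypothetical diagnostic helper, not an otel4s API.
  */
def canBind(host: String, port: Int): Either[String, Int] =
  // Using evaluates the resource lazily, so resolution and bind errors are captured too.
  Using(new ServerSocket(port, 50, InetAddress.getByName(host)))(_.getLocalPort)
    .toEither
    .left
    .map(e => s"${e.getClass.getSimpleName}: ${e.getMessage}")
```

Calling `canBind("localhost", 9464)` and `canBind("0.0.0.0", 9464)` under the same user that systemd runs the service as would show whether either address is actually refused.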

lenguyenthanh · 2024-10-29T08:53:04Z

thanks, I'll try it later today.

iRevive · 2024-10-31T08:15:39Z

modules/elastic/src/main/scala/ESClient.scala

+      countDuration
+        .recordDuration(TimeUnit.MILLISECONDS, Attribute(indexAttributeKey, q.index(query).value))
+        .surround:
+          client
+            .execute(q.countDef(query))
+            .flatMap(toResult)
+            .map(_.count)
+            .onError(_ => countErrorCounter.inc(Attribute(indexAttributeKey, q.index(query).value)))


Everything below doesn't necessarily apply to your service and infrastructure. But it might be useful in the future.

You can use attributes to distinguish errored/succeeded actions. The OpenTelemetry Semantic Conventions encourage this approach.

For example, count.duration indicates how long it takes to execute the count query. If you add error.type attribute, you can track successful and unsuccessful queries within the same metric.

In Grafana, you can query data as:

count_duration_count{"error_type" != ""} # shows the number of failed queries
count_duration_count{"error_type" = ""} # shows the number of succeeded queries

Code:

def withErrorType(static: Attributes)(ec: Resource.ExitCase) = ec match
  case Resource.ExitCase.Succeeded => static
  case Resource.ExitCase.Errored(e) => static.added(Attribute("error.type", e.getClass.getName))
  case Resource.ExitCase.Canceled => static.added(Attribute("error.type", "canceled"))

countDuration
  .recordDuration(
    TimeUnit.MILLISECONDS,
    withErrorType(Attributes(Attribute(indexAttributeKey, q.index(query).value)))
  )
  .surround:
    client
      .execute(q.countDef(query))
      .flatMap(toResult)
      .map(_.count)
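To see why a single metric is enough, here is a toy model in plain Scala of how attributes split one metric into several series. `ToyHistogram`, the `Either`-based outcome and the `"index" -> "game"` attribute are all made up for illustration; none of this is the otel4s API, it only mimics the idea.

```scala
import scala.collection.mutable

// Toy stand-in for a histogram: one metric name, many series keyed by attribute set.
// NOT the otel4s API, only a model of how attributes partition recorded values.
final class ToyHistogram(val name: String):
  private val series = mutable.Map.empty[Map[String, String], Vector[Double]]

  /** Appends a value to the series identified by the given attributes. */
  def record(value: Double, attributes: Map[String, String]): Unit =
    series.updateWith(attributes)(old => Some(old.getOrElse(Vector.empty) :+ value))

  /** Number of recorded values across all series whose attributes match the predicate. */
  def count(p: Map[String, String] => Boolean): Int =
    series.collect { case (attrs, values) if p(attrs) => values.size }.sum

// Same idea as withErrorType above: add "error.type" only when the action failed.
def withErrorType(static: Map[String, String])(outcome: Either[Throwable, Any]): Map[String, String] =
  outcome match
    case Right(_) => static
    case Left(e)  => static + ("error.type" -> e.getClass.getName)
```

Recording two successes and one failure into the same `ToyHistogram` and then calling `count(_.contains("error.type"))` and `count(!_.contains("error.type"))` gives the failed and succeeded totals, which is what the two Grafana queries do with label matchers.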

If we take it one step further, we can follow the OTel specification:

opDuration <- meter.histogram[Double]("db.client.operation.duration").withUnit("s").create

def search[A](query: A, from: From, size: Size)(using q: Queryable[A]): F[List[Id]] =
  opDuration
    .recordDuration(
      TimeUnit.SECONDS,
      withErrorType(Attributes(Attribute("db.operation.name", "search"), Attribute(indexAttributeKey, q.index(query).value)))
    )
    .surround:
      ...

def count[A](query: A)(using q: Queryable[A]): F[Long] =
  opDuration
    .recordDuration(
      TimeUnit.SECONDS,
      withErrorType(Attributes(Attribute("db.operation.name", "count"), Attribute(indexAttributeKey, q.index(query).value)))
    )
    .surround:
      ...

And the queries:

db_client_operation_duration_count{"db_operation_name" = "search", "error_type" = ""}
db_client_operation_duration_count{"db_operation_name" = "search", "error_type" != ""}
db_client_operation_duration_count{"db_operation_name" = "count", "error_type" = ""}
db_client_operation_duration_count{"db_operation_name" = "count", "error_type" != ""}
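The names in these queries are the dotted OpenTelemetry names with the dots turned into underscores, plus a histogram suffix such as `_count`. A simplified sketch of that mapping follows; `sanitize` and `histogramSeries` are hypothetical helpers, not the exporter's real code, and whether a unit suffix (for example `milliseconds`) is inserted is assumed here to depend on the exporter settings.

```scala
/** Simplified sketch of OTel-to-Prometheus name sanitization:
  * every character outside [a-zA-Z0-9_] is replaced with an underscore.
  * Hypothetical helper; the real exporter applies more rules than this.
  */
def sanitize(name: String): String =
  name.map { c =>
    val ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '_'
    if ok then c else '_'
  }

/** Series names a histogram would be exposed under, with an optional unit suffix. */
def histogramSeries(metric: String, unitSuffix: Option[String]): List[String] =
  val base = (sanitize(metric) :: unitSuffix.toList).mkString("_")
  List("bucket", "sum", "count").map(suffix => s"${base}_$suffix")
```

For instance, `sanitize("db.operation.name")` yields `db_operation_name`, and `histogramSeries("db.client.operation.duration", None)` includes `db_client_operation_duration_count`.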

lenguyenthanh · 2024-10-31

oh, this is extremely useful, I think I can apply it right now. Thanks a lot @iRevive.

Also, do you mind bumping the otel4s-experimental-metric version to match otel4s 0.11-8e1f500-SNAPSHOT? The new snapshot is somehow binary incompatible with the previous one.

iRevive · 2024-10-31

Here is a new build: 0.4.0-6-8c1230f-SNAPSHOT.


iRevive · 2024-11-02T07:40:00Z

Does everything work? I can release a stable version of otel4s then.

lenguyenthanh · 2024-11-02T08:03:23Z

> Does everything work? I can release a stable version of otel4s then.

yes, it just works! I've run them in prod for a few days, using the built-in server for the ingestor tool and manually hooking the Prometheus exporter into the server. Both work; here is a screenshot of our Grafana.

The only problem is that I don't know much about metrics/Grafana or how to create a meaningful dashboard 😂.

Thanks again @iRevive for your help. Pls take a look and add comments if you have time, much appreciated!

iRevive · 2024-11-02T09:28:36Z

A few ideas:

Average time per ES operation

sum by (db_collection_name, db_operation_name) (irate(db_client_operation_duration_milliseconds_sum[$__rate_interval])) / 
sum by (db_collection_name, db_operation_name) (irate(db_client_operation_duration_milliseconds_count[$__rate_interval]))

Legend: {{db_collection_name}} {{db_operation_name}}
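The arithmetic behind this query: `_sum` and `_count` are cumulative, so dividing their rates over the same window is the same as dividing their deltas, because the time interval cancels out. A small worked sketch, with a made-up `Sample` type and made-up numbers, ignoring counter resets:

```scala
/** One scrape of a histogram's cumulative counters for a single series (illustrative type). */
final case class Sample(sum: Double, count: Long)

/** Average duration per operation between two scrapes,
  * or None if no operation was recorded in between.
  */
def averageBetween(previous: Sample, latest: Sample): Option[Double] =
  val deltaCount = latest.count - previous.count
  Option.when(deltaCount > 0)((latest.sum - previous.sum) / deltaCount)
```

With `Sample(1200.0, 100)` followed by `Sample(1500.0, 110)`, ten operations took 300 ms in total, so the average is 30 ms per operation.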

Heap memory usage

Query A:

sum(jvm_memory_limit_bytes{jvm_memory_type='heap'})

Legend: Max

Query B:

sum(jvm_memory_used_bytes{jvm_memory_type='heap'})

Legend: Current
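For intuition about what the two heap queries add up, the sketch below sums the same kind of values directly from the JVM. It assumes the exported JVM metrics follow the OpenTelemetry JVM conventions, where used and limit values are reported per memory pool and tagged with a memory type; `heapUsage` is a hypothetical helper, not part of otel4s.

```scala
import java.lang.management.{ManagementFactory, MemoryType}
import scala.jdk.CollectionConverters.*

/** Sums used and max bytes over all heap memory pools of the current JVM.
  * Pools reporting an undefined max (-1) are left out of the limit.
  * Returns (usedBytes, limitBytes).
  */
def heapUsage(): (Long, Long) =
  val usages = ManagementFactory.getMemoryPoolMXBeans.asScala.toList
    .filter(_.getType == MemoryType.HEAP)
    .flatMap(pool => Option(pool.getUsage)) // getUsage may return null for an invalid pool
  val used  = usages.map(_.getUsed).sum
  val limit = usages.map(_.getMax).filter(_ >= 0).sum
  (used, limit)
```

Plotting both values on one panel, as Query A and Query B do, shows how close the current heap usage is to the configured maximum.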

lenguyenthanh · 2024-11-09T14:01:22Z

thanks @iRevive !

lenguyenthanh force-pushed the otel branch from b579695 to f190a55 Compare October 27, 2024 20:12

lenguyenthanh marked this pull request as draft October 27, 2024 20:31

lenguyenthanh force-pushed the otel branch 3 times, most recently from dd9bba4 to c464484 Compare October 27, 2024 21:03

iRevive reviewed Oct 28, 2024


lenguyenthanh force-pushed the otel branch from c464484 to ccc4c54 Compare October 28, 2024 08:08

lenguyenthanh changed the title ~~Experiment with metrics and promethues export with otel4s~~ Experiment with metrics and prometheus exporter with otel4s Oct 29, 2024

iRevive reviewed Oct 31, 2024


lenguyenthanh added a commit that referenced this pull request Oct 31, 2024

Smarter metric

9cf010f

Thanks @iRevive: #353 (comment)

lenguyenthanh force-pushed the otel branch 2 times, most recently from 229bf3f to 68881fe Compare November 1, 2024 18:09

lenguyenthanh added 13 commits November 1, 2024 20:36

Implement metrics with otel4s

0ab9f11

Add jvm metric

4cf3cd5

Implement metrics for ingestor

8672c8c

Add metric to ESClient

905d244

Move to otel4s sdk and use prometheus exporter

863d32f

Use otel4s sdk and promethues for ingestor

c90f4ee

Clean up dependencies

d4ad2a5

Manual config prometheus export server

b0ec530

Auto config server doesn't work because: Failed at step EXEC spawning lila-search-ingestor/bin/app: Permission denied

Hook MetricExporter reader with Otel Meter

1a0ac50

Smarter metric

539161d

Thanks @iRevive: #353 (comment)

Shorter metric names

a18cff5

Bump otel4s dependencies

1c1aa00

Follow Open Telemetry Semantic Conventions

b31e25f

https://opentelemetry.io/docs/specs/semconv/database/database-metrics

lenguyenthanh force-pushed the otel branch from 68881fe to b31e25f Compare November 1, 2024 19:40

lenguyenthanh added 2 commits November 1, 2024 21:27

Implement prometheus exporter routes for app

0c813ff

Use prometheus build in server for ingestor

54a7bc9

lenguyenthanh marked this pull request as ready for review November 1, 2024 21:23

lenguyenthanh merged commit 87d682a into master Nov 2, 2024
3 checks passed

lenguyenthanh deleted the otel branch November 2, 2024 07:38

lenguyenthanh mentioned this pull request Dec 1, 2024

Use otel4s for metrics instead of kamon lichess-org/lila-fishnet#371

Open
