Experiment with metrics and prometheus exporter with otel4s #353

Merged
merged 15 commits into from
Nov 2, 2024

lenguyenthanh
Member

No description provided.

@iRevive
Contributor

iRevive commented Oct 29, 2024

There is a new otel4s snapshot, 0.11-8e1f500-SNAPSHOT, with your fixes.

@lenguyenthanh
Member Author

There is a new otel4s snapshot, 0.11-8e1f500-SNAPSHOT, with your fixes.

thanks a lot @iRevive, I'll bump it later.

By the way, there is an issue with the Prometheus exporter's built-in server when I deploy using systemd:

Oct 28 16:54:37 sirch systemd[3137355]: lila-search-ingestor.service: Failed to execute command: Permission denied
Oct 28 16:54:37 sirch systemd[3137355]: lila-search-ingestor.service: Failed at step EXEC spawning /home/lila-search-ingestor/bin/app: Permission denied
Oct 28 16:54:37 sirch systemd[1]: lila-search-ingestor.service: Main process exited, code=exited, status=203/EXEC
Oct 28 16:54:37 sirch systemd[1]: lila-search-ingestor.service: Failed with result 'exit-code'.

I think the Prometheus exporter tries to spawn a new process, which is not allowed here. Do you have any idea how to bypass this?

@iRevive
Contributor

iRevive commented Oct 29, 2024

Hm, by default, the Prometheus server is created as:

val routes = PrometheusHttpRoutes.routes[F](exporter, writerConfig)

EmberServerBuilder
  .default[F]
  .withHost(host)
  .withPort(port)
  .withHttpApp(Router("metrics" -> routes).orNotFound)
  .build

Here the default host is host"localhost" and the default port is port"9464". Perhaps your system doesn't allow the process to launch an HTTP server on localhost.

You can change the default host/port via environment variables or system properties. Here is the complete set of available Prometheus exporter settings.

export OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0 should do the trick.
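
To illustrate how a host setting given "via environment variables or system properties" could be resolved, here is a minimal sketch in plain Scala. It is not otel4s code: `resolveSetting` and `prometheusHost` are hypothetical helpers, the system property name `otel.exporter.prometheus.host` is assumed to mirror the environment variable above, and the precedence (system property first, then environment variable, then the default) is an assumption as well.

```scala
// Hypothetical helper, not part of otel4s: resolves a single exporter setting.
// Assumption: a JVM system property wins over an environment variable,
// and the default applies only when neither is set.
def resolveSetting(
    propKey: String,
    envKey: String,
    default: String,
    props: Map[String, String],
    env: Map[String, String]
): String =
  props.get(propKey).orElse(env.get(envKey)).getOrElse(default)

// Resolves the host against the real JVM properties and process environment.
// The property key is an assumed counterpart of OTEL_EXPORTER_PROMETHEUS_HOST.
def prometheusHost: String =
  resolveSetting(
    propKey = "otel.exporter.prometheus.host",
    envKey = "OTEL_EXPORTER_PROMETHEUS_HOST",
    default = "localhost",
    props = sys.props.toMap,
    env = sys.env
  )
```

Under these assumptions, a systemd unit that sets OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0 in its environment would make `prometheusHost` return 0.0.0.0 instead of localhost.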

@lenguyenthanh
Member Author

thanks, I'll try it later today.

@lenguyenthanh lenguyenthanh changed the title Experiment with metrics and promethues export with otel4s Experiment with metrics and prometheus exporter with otel4s Oct 29, 2024
Comment on lines 98 to 105
countDuration
  .recordDuration(TimeUnit.MILLISECONDS, Attribute(indexAttributeKey, q.index(query).value))
  .surround:
    client
      .execute(q.countDef(query))
      .flatMap(toResult)
      .map(_.count)
      .onError(_ => countErrorCounter.inc(Attribute(indexAttributeKey, q.index(query).value)))
Contributor

Everything below doesn't necessarily apply to your service and infrastructure. But it might be useful in the future.


You can use attributes to distinguish failed and successful actions. The OpenTelemetry Semantic Conventions encourage this approach.

For example, count.duration indicates how long it takes to execute the count query. If you add an error.type attribute, you can track successful and unsuccessful queries within the same metric.

In Grafana, you can query data as:

count_duration_count{error_type!=""} # shows the number of failed queries
count_duration_count{error_type=""} # shows the number of succeeded queries

Code:

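import cats.effect.Resource
import org.typelevel.otel4s.{Attribute, Attributes}

// Adds an error.type attribute when the measured action fails or is canceled
def withErrorType(static: Attributes)(ec: Resource.ExitCase): Attributes = ec match
  case Resource.ExitCase.Succeeded =>
    static
  case Resource.ExitCase.Errored(e) =>
    static.added(Attribute("error.type", e.getClass.getName))
  case Resource.ExitCase.Canceled =>
    static.added(Attribute("error.type", "canceled"))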

countDuration
  .recordDuration(
    TimeUnit.MILLISECONDS, 
    withErrorType(Attributes(Attribute(indexAttributeKey, q.index(query).value)))
  )
  .surround:
    client
      .execute(q.countDef(query))
      .flatMap(toResult)
      .map(_.count)
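
To make the effect of withErrorType concrete, here is a dependency-free sketch of the same mapping. It is a model, not the otel4s API: `Outcome` stands in for cats-effect's Resource.ExitCase, a plain Map stands in for Attributes, and the "index" -> "forum" attribute in the usage note is a made-up value.

```scala
// Stand-in for cats.effect.Resource.ExitCase so the sketch needs no dependencies
enum Outcome:
  case Succeeded
  case Errored(e: Throwable)
  case Canceled

// Attributes are modelled as a plain Map instead of otel4s Attributes
def withErrorType(static: Map[String, String])(outcome: Outcome): Map[String, String] =
  outcome match
    // A successful action keeps the static attributes and gets no error.type
    case Outcome.Succeeded  => static
    // A failure is tagged with the exception's class name
    case Outcome.Errored(e) => static + ("error.type" -> e.getClass.getName)
    // A cancellation is tagged with a fixed marker
    case Outcome.Canceled   => static + ("error.type" -> "canceled")
```

With a hypothetical base of Map("index" -> "forum"), a success yields the base unchanged, while a failure adds an error.type entry. Because Prometheus treats a missing label as an empty one, the success series is the one an `error_type=""` matcher selects.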

If we take it one step further, we can follow the OTel specification:

opDuration <- meter.histogram[Double]("db.client.operation.duration").withUnit("s").create

def search[A](query: A, from: From, size: Size)(using q: Queryable[A]): F[List[Id]] =
  opDuration
    .recordDuration(
      TimeUnit.SECONDS, 
      withErrorType(Attributes(Attribute("db.operation.name", "search"), Attribute(indexAttributeKey, q.index(query).value)))
    )
    .surround:
      ...

def count[A](query: A)(using q: Queryable[A]): F[Long] =
  opDuration
    .recordDuration(
      TimeUnit.SECONDS, 
      withErrorType(Attributes(Attribute("db.operation.name", "count"), Attribute(indexAttributeKey, q.index(query).value)))
    )
    .surround:
      ...

And the queries:

db_client_operation_duration_count{db_operation_name="search", error_type=""}
db_client_operation_duration_count{db_operation_name="search", error_type!=""}
db_client_operation_duration_count{db_operation_name="count", error_type=""}
db_client_operation_duration_count{db_operation_name="count", error_type!=""}
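
The queries use underscores where the instrument and attribute names use dots. The sketch below approximates that translation in plain Scala. It is a simplified, hypothetical model rather than the exporter's actual code: `sanitize`, `prometheusName`, and the small `unitSuffix` table are illustrative names, and whether a unit suffix is appended at all depends on the exporter's settings.

```scala
// Simplified approximation of OTel -> Prometheus naming; not the exporter's code.
// Assumed unit abbreviations and the words they expand to.
val unitSuffix: Map[String, String] =
  Map("s" -> "seconds", "ms" -> "milliseconds", "By" -> "bytes")

// Replaces every character outside [a-zA-Z0-9_] with an underscore
def sanitize(name: String): String =
  name.replaceAll("[^a-zA-Z0-9_]", "_")

// Builds a series name: sanitized base, optional unit word, then the
// histogram series suffix ("count", "sum" or "bucket")
def prometheusName(otelName: String, unit: Option[String], seriesSuffix: String): String =
  val base = sanitize(otelName)
  val withUnit = unit.flatMap(unitSuffix.get).fold(base)(u => s"${base}_$u")
  s"${withUnit}_$seriesSuffix"
```

Under this model, `prometheusName("db.client.operation.duration", None, "count")` matches the series name used in the queries above, and `sanitize("error.type")` gives the `error_type` label. Passing a unit inserts its expanded word before the series suffix.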

Member Author

oh this is extremely useful, I think I could apply it right now. Thanks a lot @iRevive.

Also, do you mind bumping the otel4s-experimental-metric version to match otel4s 0.11-8e1f500-SNAPSHOT? It is somehow binary incompatible with the previous one.

Contributor

Here is a new build: 0.4.0-6-8c1230f-SNAPSHOT.

lenguyenthanh added a commit that referenced this pull request Oct 31, 2024
@lenguyenthanh lenguyenthanh force-pushed the otel branch 2 times, most recently from 229bf3f to 68881fe Compare November 1, 2024 18:09
@lenguyenthanh lenguyenthanh marked this pull request as ready for review November 1, 2024 21:23
@lenguyenthanh lenguyenthanh merged commit 87d682a into master Nov 2, 2024
3 checks passed
@lenguyenthanh lenguyenthanh deleted the otel branch November 2, 2024 07:38
@iRevive
Contributor

iRevive commented Nov 2, 2024

Does everything work? I can release a stable version of otel4s then.

@lenguyenthanh
Member Author

Does everything work? I can release a stable version of otel4s then.

Yes, it just works! I have been running them in prod for a few days, using the built-in server for the ingestor tool and manually hooking the Prometheus exporter into the server. Both work; here is a screenshot of our Grafana.

Screenshot 2024-11-02 at 09 00 42

The only problem is that I don't know much about metrics or Grafana, or how to create a meaningful dashboard 😂.

Thanks again @iRevive for your help. Please take a look and add comments if you have time, much appreciated!

@iRevive
Contributor

iRevive commented Nov 2, 2024

A few ideas:

Average time per ES operation

sum by (db_collection_name, db_operation_name) (irate(db_client_operation_duration_milliseconds_sum[$__rate_interval])) / 
sum by (db_collection_name, db_operation_name) (irate(db_client_operation_duration_milliseconds_count[$__rate_interval]))

Legend: {{db_collection_name}} {{db_operation_name}}
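
The query above divides the rate of the histogram's _sum series by the rate of its _count series. Since irate works from the last two samples in the window, the elapsed time cancels out and the result is the mean duration of the operations recorded between those two scrapes. Here is a minimal sketch of that arithmetic in plain Scala; `Sample` and `averageBetween` are hypothetical names, and unlike PromQL (which yields NaN) the sketch returns None when no new operations were recorded.

```scala
// One scrape of a histogram's _sum and _count series (hypothetical type)
final case class Sample(sum: Double, count: Long)

// Mean duration of the operations observed between two scrapes.
// Returns None when the count did not increase, avoiding a division by zero.
def averageBetween(earlier: Sample, later: Sample): Option[Double] =
  val newOperations = later.count - earlier.count
  if newOperations <= 0 then None
  else Some((later.sum - earlier.sum) / newOperations)
```

For example, if _sum grows from 1200 to 1500 milliseconds while _count grows from 100 to 110, the ten new operations took 30 milliseconds each on average.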

Heap memory usage

Query A:

sum(jvm_memory_limit_bytes{jvm_memory_type='heap'})

Legend: Max

Query B:

sum(jvm_memory_used_bytes{jvm_memory_type='heap'})

Legend: Current

@lenguyenthanh
Member Author

thanks @iRevive !
