Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry integration #699

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

psardana
Copy link

@psardana psardana commented Nov 11, 2024

Fixes Issue

closes #701

Changes proposed

Check List (Check all the applicable boxes)

  • I sign off on contributing this submission to open-source
  • My code follows the code style of this project.
  • My change requires changes to the documentation.
  • I have updated the documentation accordingly.
  • All new and existing tests passed.
  • This PR does not contain plagiarized content.
  • The title of my pull request is a short description of the requested changes.

Screenshots

Note to reviewers

Copy link

netlify bot commented Nov 11, 2024

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit 837780e
🔍 Latest deploy log https://app.netlify.com/sites/opal-docs/deploys/674017ce62978e00083c5718

psardana and others added 3 commits November 11, 2024 16:11
…ck latency of data update events

feat(prometheus_metrics.py): create data_update_latency histogram to monitor latency of data update events
@psardana psardana marked this pull request as draft November 11, 2024 15:30
…etrics to use opal_server.metrics.prometheus_metrics for better organization

chore(requirements.txt): add prometheus_client to dependencies for metrics tracking functionality
…c to track updates per topic

feat(prometheus_metrics.py): introduce data_update_count_per_topic counter for monitoring data updates by topic
… to enhance observability

fix(api.py): increment policy bundle request count and measure latency for bundle generation
fix(callbacks.py): observe size of changed directories in policy update notifications
fix(task.py): track policy update count and latency when triggering policy watcher
@psardana psardana changed the title added initial config for prometheus integration in opal server prometheus integration Nov 15, 2024
@psardana psardana marked this pull request as ready for review November 15, 2024 07:29
@danyi1212 danyi1212 self-requested a review November 17, 2024 12:27
@danyi1212
Copy link
Collaborator

danyi1212 commented Nov 18, 2024

Hey @psardana, thank you for this contribution! 💎

Can you please add documentation about the metrics and explain how to set it up?
Additionally, please add a docker-compose example with a Prometheus container collecting the metrics from OPAL, which will be good for showcasing it in docs 🙂

Notice that there are conflicts against the main branch, please make sure to rebase from master.
Also note that the tests are failing, I think due to invalid label names in the metrics. Please take a look at it 😁

Looking forward for this! 🙏

@psardana
Copy link
Author

Hey @psardana, thank you for this contribution! 💎

Can you please add documentation about the metrics and explain how to set it up? Additionally, please add a docker-compose example with a Prometheus container collecting the metrics from OPAL, which will be good for showcasing it in docs 🙂

Notice that there are conflicts against the main branch, please make sure to rebase from master. Also note that the tests are failing, I think due to invalid label names in the metrics. Please take a look at it 😁

Looking forward for this! 🙏

Thank you for the review! I have added commits for documentation, docker compose and fixed the label names.

Copy link
Collaborator

@danyi1212 danyi1212 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good! 🌟
I've left some comments about specific areas and improvements.

Upon review, some instrumented parts appear to align more with tracing rather than pure metrics. This led to some unnatural metrics and duplications, having separate latency - count and error metrics.

To address this, we suggest exploring OpenTelemetry, which offers native Prometheus integration alongside robust tracing capabilities.

Here's a proposed mapping of the current metrics to OpenTelemetry:

  • opal_server_data_update -> Trace
  • opal_server_policy_update -> Trace
  • opal_server_policy_bundle_request -> Trace
  • opal_server_policy_bundle_size -> Metric
  • opal_server_active_clients -> Metric
  • opal_client_data_subscriptions -> Metric
  • opal_client_data_update_trigger -> Trace
  • opal_client_data_update_apply -> Trace (new)
  • opal_client_policy_update_apply -> Trace (new)
  • opal_client_policy_store_status -> Metric

We believe this approach will provide a more comprehensive observability solution.

Please let us know your thoughts, and let's work together to enhance OPAL's observability 💎

documentation/docs/tutorials/monitoring_opal.mdx Outdated Show resolved Hide resolved

@app.get("/metrics", include_in_schema=False)
async def metrics():
"""Endpoint to expose Prometheus metrics."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add authentication for this endpoint and add an ability to disable it completely. Making it an opt-in feature would be best.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where I disagree, metrics endpoints usually are not protected. Prometheus needs direct and unauthenticated access to scrape metrics. One example would keycloak, metrics are also exposed directly/unauthenticated


@app.get("/metrics", include_in_schema=False)
def metrics():
"""Endpoint to expose Prometheus metrics."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we need to add authentication for this endpoint and add an ability to disable it completely.

packages/opal-server/opal_server/policy/bundles/api.py Outdated Show resolved Hide resolved
packages/opal-server/opal_server/policy/bundles/api.py Outdated Show resolved Hide resolved
@psardana psardana changed the title prometheus integration OpenTelemetry integration Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Prometheus Metrics for OPAL Server and Client
2 participants