Introduce Data Traces into WM stack #12170

Open
vkuznet opened this issue Nov 18, 2024 · 0 comments

Impact of the new feature
Data Traces and OpenTelemetry can significantly improve monitoring of individual services as well as the data flow between them.

Is your feature request related to a problem? Please describe.
Currently, debugging an individual workflow, tracing it, and understanding its errors is very cumbersome within the WM eco-system. This issue proposes to improve the situation by introducing data traces into all WM services, providing a better understanding of running workflows.

Describe the solution you'd like
Use the OpenTelemetry framework to add Data Traces to all WM services.

Describe alternatives you've considered
None

Additional context
This can be as simple as the following (below is an example of a data service with data traces). First, install the required dependencies:

# required dependencies
pip install opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask opentelemetry-exporter-otlp

and then instrument Data Traces in your Python application as follows:

# Basic Example in Python with OpenTelemetry
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.trace import SpanKind
import requests

# Initialize tracer
tracer = trace.get_tracer(__name__)

def send_request_to_service_b():
    with tracer.start_as_current_span("service_a_to_b", kind=SpanKind.CLIENT) as span:
        # Inject tracing context into HTTP headers
        headers = {}
        inject(headers)
        
        response = requests.get("http://service-b.local/process", headers=headers)
        return response
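
The example above only creates spans; to actually ship them anywhere, a tracer provider with an exporter must also be configured. Below is a minimal sketch of such a setup using the OTLP exporter installed above; the endpoint and service name are placeholders, not actual WM values:

# Hedged sketch: configure a tracer provider that exports spans via OTLP
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# identify this service in the trace backend (placeholder name)
resource = Resource.create({"service.name": "service-a"})
provider = TracerProvider(resource=resource)
# batch spans and send them to a local OpenTelemetry Collector (placeholder endpoint)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

This should run once at service start-up, so that the tracer used above exports its spans.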

After traces are instrumented, they can be collected and visualized with the following applications:

  • OpenTelemetry Collector: You can use an OpenTelemetry Collector to gather tracing data and then convert parts of it into metrics, such as latency, request counts, and error rates. The Collector can send these metrics to Prometheus, where they can be scraped and visualized.
  • Jaeger with Prometheus Metrics: Jaeger is a popular tracing tool that integrates with Prometheus. Jaeger can collect tracing data across services and expose certain trace metrics (like service call counts, latency, and errors) to Prometheus, which can then be scraped. You can configure Jaeger to expose metrics in a format Prometheus understands by using the prometheus-metrics endpoint.
  • Custom Trace-to-Metrics Exporter: In some cases, you might write a custom exporter that takes key trace information, such as request duration or error counts, and exports it as metrics that Prometheus can consume (a sketch of this approach follows this list).
  • Grafana Tempo Integration: If you’re using Grafana, Tempo can store trace data, and you can integrate it with Prometheus to query metrics and view traces side by side in Grafana dashboards.

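As an illustration of the custom trace-to-metrics option above, here is a minimal sketch of a span processor that counts finished spans and records their latency as Prometheus metrics; it assumes the prometheus_client package is also installed, and the metric names and port are purely illustrative:

# Hedged sketch: a custom SpanProcessor that derives Prometheus metrics from spans
from prometheus_client import Counter, Histogram, start_http_server
from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.trace import StatusCode

SPAN_COUNT = Counter("wm_spans_total", "Finished spans", ["name", "status"])
SPAN_LATENCY = Histogram("wm_span_duration_seconds", "Span duration in seconds", ["name"])

class TraceToMetricsProcessor(SpanProcessor):
    def on_end(self, span):
        # classify the finished span by its status and count it
        status = "error" if span.status.status_code == StatusCode.ERROR else "ok"
        SPAN_COUNT.labels(span.name, status).inc()
        # record the span duration (span timestamps are in nanoseconds)
        duration = (span.end_time - span.start_time) / 1e9
        SPAN_LATENCY.labels(span.name).observe(duration)

# expose the /metrics endpoint for Prometheus to scrape (illustrative port)
start_http_server(8000)
# register it alongside the OTLP span processor from the earlier setup sketch:
# provider.add_span_processor(TraceToMetricsProcessor())
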
Example Setup for OpenTelemetry + Prometheus

Here’s a basic setup that uses OpenTelemetry to generate metrics based on traces and then scrape these with Prometheus.

  1. Install OpenTelemetry Instrumentation in each service to generate trace data (a minimal Flask example is sketched after this list).
  2. Set up the OpenTelemetry Collector to receive trace data, transform it into metrics (such as request latency), and export these metrics to Prometheus.
  3. Configure Prometheus to scrape the OpenTelemetry Collector's metrics endpoint.

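For step 1, here is a minimal sketch of instrumenting a Flask-based service (for instance, the "service B" side of the earlier example) with the opentelemetry-instrumentation-flask package installed above; the route and port are illustrative only:

# Hedged sketch: auto-instrument a Flask service so incoming requests become spans
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
# creates a server span per request and picks up the trace context
# injected into the HTTP headers by the calling service
FlaskInstrumentor().instrument_app(app)

@app.route("/process")
def process():
    # application logic goes here; it runs inside the request span
    return "processed"

if __name__ == "__main__":
    app.run(port=8080)

The same tracer provider setup sketched earlier would be needed here as well, so that the generated spans are exported to the Collector.
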
By converting key trace events into metrics, Prometheus can gain insights into trace-related metrics without directly handling trace data, and tools like Grafana can correlate the metrics with trace spans if both Prometheus and a trace backend (such as Tempo or Jaeger) are set up.

Please also see the USCMS WM Monitoring and Alerts talk for more details.
