feat: poc implementation of the OTel Metrics API #13780

Open: 10 commits to merge into base: main

zacharycmontoya (Contributor) commented Jun 26, 2025

Motivation

We want to implement support for the OTel Metrics API across our libraries. There is an RFC in development as well as system-tests to define that behavior, and this POC aims to deliver the necessary code to implement that.

For API usage, see the utils/build/docker/python/flask/app.py file in the referenced system-tests PR.

Summary

This PR is a minimal implementation of the OTel Metrics API. The API covers the minimum surface that application and library developers need to:

  • Get/Set a global MeterProvider instance (which provides access to Meters)
  • Get a Meter with a given name+version
  • Create synchronous/asynchronous instruments with a given name, kind, unit, and description. This also allows users to directly record measurements on the synchronous instruments.
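
The three steps above can be sketched against the public `opentelemetry-api` package. This is an illustration only: the meter name, instrument names, and attribute values are made up, and the import guard is there so the snippet still loads where the package is missing or predates the metrics API. It does not show how this PR wires its own MeterProvider.

```python
try:
    from opentelemetry import metrics
    from opentelemetry.metrics import Observation
except ImportError:  # opentelemetry-api is missing or too old for the metrics API
    metrics = None


def build_instruments():
    """Create two synchronous instruments and one asynchronous one, then record."""
    if metrics is None:
        return None

    # Get a Meter from the global MeterProvider; name and version are hypothetical.
    meter = metrics.get_meter("checkout-service", "1.0.0")

    # Synchronous instruments: the caller records measurements directly.
    requests = meter.create_counter(
        "requests", unit="1", description="Handled requests"
    )
    latency = meter.create_histogram(
        "request.duration", unit="ms", description="Request latency"
    )

    def observe_queue_depth(options):
        # Asynchronous instruments are read through callbacks at collection time.
        yield Observation(3, {"queue": "default"})

    queue_depth = meter.create_observable_gauge(
        "queue.depth",
        callbacks=[observe_queue_depth],
        unit="1",
        description="Items waiting in the queue",
    )

    requests.add(1, {"route": "/"})
    latency.record(12.5, {"route": "/"})
    return requests, latency, queue_depth
```

If no MeterProvider has been set, the API hands back no-op instruments, so the calls above are safe to run but record nothing.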

All of this functionality is controlled by the environment variable DD_TRACE_OTEL_METRICS_ENABLED which is disabled by default.
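
As a rough sketch of an off-by-default flag like this one, the helper below reads the variable from a mapping. The accepted truthy spellings and the `metrics_enabled` name are assumptions for illustration; the library's own `_get_config` and `asbool` helpers may behave differently.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the real parser may accept a different set.
_TRUTHY = {"1", "true"}


def metrics_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when DD_TRACE_OTEL_METRICS_ENABLED is explicitly enabled."""
    env = os.environ if environ is None else environ
    raw = env.get("DD_TRACE_OTEL_METRICS_ENABLED", "false")
    return raw.strip().lower() in _TRUTHY
```

Passing the environment in as a mapping keeps the helper easy to exercise without mutating `os.environ`.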

At a high level, the OTel interfaces are implemented by new implementation types. To be able to emit these data types into OTLP as a quick POC, the following approaches were taken:

  • Core OTel types and internal metrics classes needed to represent the fields of the OTLP metrics protobuf were vendored from the open-telemetry/opentelemetry-python repository. Also vendored was a partial implementation of a MetricReader to collect measurements from the asynchronous instruments
  • The OTLP exporter was directly referenced and imported by the code. This may or may not be the best long-term solution, but it expedited the proof-of-concept.

Vendored Files

Many files are directly vendored from the open-telemetry/opentelemetry-python repository, including:

  • ddtrace/internal/opentelemetry/types.py - Defines key values like Attributes, AnyValue
  • ddtrace/internal/opentelemetry/resources.py - Defines types for the Resource concept in OpenTelemetry, which is included in the OTLP metrics payload. A Resource defines the entity producing data, like a service name, env, version.
  • ddtrace/internal/opentelemetry/metrics.py - The part of this file that is vendored is a minimal implementation of the PeriodicExportingMetricReader from the SDK, whose purpose is to collect metrics before exporting them.
  • ddtrace/internal/opentelemetry/metric_points.py - This file defines the dataclasses that map 1:1 to the protobuf format of OTLP metrics
  • ddtrace/internal/opentelemetry/instrumentation.py - Contains the definition of the InstrumentationScope type, which is included in the OTLP metrics payload. This identifies the smallest identifiable scope of instrumentation, like an instrumentation library name and version.
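
To give a feel for how these types nest in an OTLP metrics payload, here is a heavily simplified sketch with plain dataclasses. The field selection is reduced for illustration and covers only the Sum data type; it is not the vendored code, and `example_payload` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Union


class AggregationTemporality(IntEnum):
    UNSPECIFIED = 0
    DELTA = 1
    CUMULATIVE = 2


@dataclass(frozen=True)
class Resource:
    # The entity producing the data, e.g. service name, env, version.
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class InstrumentationScope:
    # The instrumentation library (or Meter) that produced the metrics.
    name: str
    version: str = ""


@dataclass(frozen=True)
class NumberDataPoint:
    attributes: Dict[str, str]
    start_time_unix_nano: int
    time_unix_nano: int
    value: Union[int, float]


@dataclass(frozen=True)
class Sum:
    data_points: List[NumberDataPoint]
    aggregation_temporality: AggregationTemporality
    is_monotonic: bool


@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    unit: str
    data: Sum


@dataclass(frozen=True)
class ScopeMetrics:
    scope: InstrumentationScope
    metrics: List[Metric]


@dataclass(frozen=True)
class ResourceMetrics:
    resource: Resource
    scope_metrics: List[ScopeMetrics]


def example_payload() -> ResourceMetrics:
    """Build one counter data point wrapped in its scope and resource."""
    point = NumberDataPoint({"route": "/"}, 0, 1_000_000_000, 1)
    metric = Metric(
        "requests", "Handled requests", "1",
        Sum([point], AggregationTemporality.DELTA, is_monotonic=True),
    )
    scope = ScopeMetrics(InstrumentationScope("checkout-service", "1.0.0"), [metric])
    return ResourceMetrics(Resource({"service.name": "checkout"}), [scope])
```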

Remaining Files

The rest of the net-new code is:

  • ddtrace/internal/opentelemetry/metrics.py - The main entrypoints of MeterProvider and Meter are defined here.
  • ddtrace/internal/opentelemetry/instrument.py - The classes that implement the various instrument interfaces are defined here. All of the instrument types are as follows:
    • Counter
    • UpDownCounter
    • Gauge
    • Histogram
    • ObservableCounter
    • ObservableUpDownCounter
    • ObservableGauge
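
The difference between the synchronous and observable kinds can be sketched with a few toy classes. These are hypothetical stand-ins, not the classes in instrument.py: the point is only that synchronous instruments are pushed to by the caller, while observable instruments are pulled from callbacks when a reader collects.

```python
from typing import Callable, Iterable, List


class Counter:
    """Synchronous and monotonic: the caller pushes non-negative increments."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.total = 0.0

    def add(self, amount: float) -> None:
        if amount < 0:
            return  # a Counter only goes up; this sketch ignores negative amounts
        self.total += amount


class UpDownCounter(Counter):
    """Synchronous and non-monotonic: increments may be negative."""

    def add(self, amount: float) -> None:
        self.total += amount


class ObservableGauge:
    """Asynchronous: values are pulled from callbacks when a reader collects."""

    def __init__(self, name: str, callbacks: Iterable[Callable[[], float]]) -> None:
        self.name = name
        self._callbacks = list(callbacks)

    def collect(self) -> List[float]:
        return [callback() for callback in self._callbacks]
```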

Testing

This PR doesn't add any unit or integration tests to the repo. At this time, I'd prefer to defer that work and use the referenced system-tests as the testing strategy.

Improvements

This is a POC, so some implementation details are not ideal for the long term, and I'd like to defer them to a follow-up PR. If you feel they must be included in this PR, I can go ahead and change them. This includes:

  • For synchronous instruments, recording a measurement (e.g. Counter.add) blocks on the generation and HTTP submission of this one metric to the OTLP endpoint. This should be batched alongside the other measurements from the asynchronous instruments that are collected and submitted on a configured interval.
  • The resource attributes emitted in the OTLP Metrics are hard-coded. These should take the service name, env, and version values.
  • The scope name emitted in the OTLP Metrics is hard-coded. This should be inherited from the name of the Meter set in the API call.
  • Most instruments require a value to represent aggregation temporality (DELTA vs CUMULATIVE). This is currently hard-coded to the Datadog-preferred value, based on the instrument type.
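
One possible shape for the batching mentioned in the first item is a small thread-safe buffer: recording only appends, and the export happens when something (for example the periodic reader) calls `flush`. The `MeasurementBuffer` class and the `export` callable are hypothetical names for illustration, not code from this PR.

```python
import threading
from typing import Callable, List, Tuple

Measurement = Tuple[str, float]


class MeasurementBuffer:
    """Collect measurements cheaply and hand them to an exporter in batches."""

    def __init__(self, export: Callable[[List[Measurement]], None]) -> None:
        self._export = export
        self._lock = threading.Lock()
        self._pending: List[Measurement] = []

    def record(self, name: str, value: float) -> None:
        # Called on the application thread: no I/O, just an append under a lock.
        with self._lock:
            self._pending.append((name, value))

    def flush(self) -> int:
        # Called on the export interval: swap the buffer out, then export
        # outside the lock so recording is never blocked by network I/O.
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._export(batch)
        return len(batch)
```

With this shape, a call such as `Counter.add` would cost one lock acquisition and one list append, and the HTTP submission would move to whichever thread drives the interval.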

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

CODEOWNERS have been resolved as:

ddtrace/internal/opentelemetry/instrument.py                            @DataDog/apm-sdk-api-python
ddtrace/internal/opentelemetry/instrumentation.py                       @DataDog/apm-sdk-api-python
ddtrace/internal/opentelemetry/metric_points.py                         @DataDog/apm-sdk-api-python
ddtrace/internal/opentelemetry/metrics.py                               @DataDog/apm-sdk-api-python
ddtrace/internal/opentelemetry/resources.py                             @DataDog/apm-sdk-api-python
ddtrace/internal/opentelemetry/types.py                                 @DataDog/apm-sdk-api-python
ddtrace/bootstrap/preload.py                                            @DataDog/apm-core-python
ddtrace/opentelemetry/__init__.py                                       @DataDog/apm-sdk-api-python
ddtrace/settings/_config.py                                             @DataDog/apm-core-python
pyproject.toml                                                          @DataDog/python-guild
tests/opentelemetry/conftest.py                                         @DataDog/apm-sdk-api-python
tests/opentelemetry/flask_app.py                                        @DataDog/apm-sdk-api-python

elif other.schema_url == "":
    schema_url = self.schema_url
elif self.schema_url == other.schema_url:
    schema_url = other.schema_url

🟠 Code Quality Violation

Duplicate blocks between conditions

Code in the branches of an if condition must be unique. If you have duplicated branches, merge the conditions.
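
A sketch of what merging the conditions could look like for the excerpt above. When the two URLs are equal, assigning `self.schema_url` and assigning `other.schema_url` are the same thing, so the two visible branches can share one condition. The leading `if` and the conflict fallthrough are assumptions, since the excerpt starts at the first `elif`, and `merge_schema_url` is a hypothetical standalone helper.

```python
from typing import Optional


def merge_schema_url(own: str, other: str) -> Optional[str]:
    """Pick the schema URL for two merged resources, or None on a conflict."""
    if own == "":
        # Assumed leading branch; the excerpt starts at the first `elif`.
        return other
    if other == "" or own == other:
        # The two branches from the excerpt, merged: when the URLs are equal,
        # returning `own` is the same as returning `other`.
        return own
    # Assumed fallthrough for two different, non-empty URLs.
    return None
```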

@zacharycmontoya changed the title from "feat: POC implementation of the OTel Metrics API" to "feat: poc implementation of the OTel Metrics API" Jun 26, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 276 ± 2 ms.

The average import time from base is: 277 ± 2 ms.

The import time difference between this PR and base is: -0.36 ± 0.1 ms.

Import time breakdown

The following import paths have grown:

ddtrace.auto 2.233 ms (0.81%)
ddtrace.bootstrap.sitecustomize 2.233 ms (0.81%)
ddtrace.bootstrap.preload 2.089 ms (0.76%)
multiprocessing 0.748 ms (0.27%)
multiprocessing.context 0.748 ms (0.27%)
multiprocessing.reduction 0.748 ms (0.27%)
pickle 0.659 ms (0.24%)
ddtrace.internal.remoteconfig._connectors 0.140 ms (0.05%)
ctypes 0.140 ms (0.05%)
ctypes._endian 0.066 ms (0.02%)
ddtrace.internal.flare.flare 0.110 ms (0.04%)
logging.handlers 0.110 ms (0.04%)
ddtrace.internal.symbol_db.remoteconfig 0.109 ms (0.04%)
ddtrace.internal.symbol_db.symbols 0.109 ms (0.04%)
ddtrace.settings.symbol_db 0.109 ms (0.04%)
ddtrace.internal.products 0.101 ms (0.04%)
importlib.metadata 0.101 ms (0.04%)
csv 0.101 ms (0.04%)
multiprocessing.sharedctypes 0.092 ms (0.03%)
ddtrace.settings.errortracking 0.086 ms (0.03%)
ddtrace.appsec._remoteconfiguration 0.072 ms (0.03%)
ddtrace._trace.trace_handlers 0.076 ms (0.03%)
ddtrace.contrib.trace_utils 0.076 ms (0.03%)
ddtrace.contrib.internal.trace_utils 0.076 ms (0.03%)
ddtrace.contrib.internal.trace_utils_base 0.076 ms (0.03%)
shlex 0.069 ms (0.02%)

The following import paths have shrunk:

ddtrace.auto 2.900 ms (1.05%)
ddtrace.bootstrap.sitecustomize 2.207 ms (0.80%)
ddtrace.bootstrap.preload 1.990 ms (0.72%)
ddtrace.internal.remoteconfig.client 0.709 ms (0.26%)
multiprocessing.sharedctypes 0.660 ms (0.24%)
multiprocessing.heap 0.660 ms (0.24%)
mmap 0.660 ms (0.24%)
multiprocessing 0.145 ms (0.05%)
multiprocessing.context 0.145 ms (0.05%)
multiprocessing.reduction 0.067 ms (0.02%)
pickle 0.067 ms (0.02%)
_compat_pickle 0.067 ms (0.02%)
ddtrace.internal.products 0.100 ms (0.04%)
importlib.metadata 0.100 ms (0.04%)
csv 0.100 ms (0.04%)
_csv 0.100 ms (0.04%)
ddtrace.internal.symbol_db.remoteconfig 0.100 ms (0.04%)
ddtrace.internal.symbol_db.symbols 0.100 ms (0.04%)
ddtrace.internal.remoteconfig._connectors 0.099 ms (0.04%)
ctypes 0.099 ms (0.04%)
_ctypes 0.099 ms (0.04%)
ddtrace.internal.flare.flare 0.099 ms (0.04%)
ddtrace.debugging._import 0.078 ms (0.03%)
ddtrace.debugging._function.discovery 0.078 ms (0.03%)
ddtrace._trace.trace_handlers 0.083 ms (0.03%)
ddtrace._trace._inferred_proxy 0.042 ms (0.02%)
ddtrace 0.693 ms (0.25%)
ddtrace.internal._unpatched 0.032 ms (0.01%)
json 0.032 ms (0.01%)
json.decoder 0.032 ms (0.01%)
re 0.032 ms (0.01%)
enum 0.032 ms (0.01%)
types 0.032 ms (0.01%)
ddtrace.settings._config 0.015 ms (0.01%)
ddtrace.internal.schema 0.015 ms (0.01%)

@@ -626,7 +626,8 @@ def __init__(self):
"DD_CIVISIBILITY_EARLY_FLAKE_DETECTION_ENABLED", True, asbool
)
self._otel_enabled = _get_config("DD_TRACE_OTEL_ENABLED", False, asbool, "OTEL_SDK_DISABLED")
if self._otel_enabled:
self._otel_metrics_enabled = _get_config("DD_TRACE_OTEL_METRICS_ENABLED", False, asbool, "OTEL_SDK_DISABLED")
if self._otel_enabled or self._otel_metrics_enabled:

A little unrelated but OTEL_METRICS_EXPORTER is only used to enable/disable runtime metrics. Do we need to do anything with this config now that we support OTLP metrics or will this be handled by otel packages?

@@ -33,6 +33,7 @@ dependencies = [
"importlib_metadata<=6.5.0; python_version<'3.8'",
"legacy-cgi>=2.0.0; python_version>='3.13.0'",
"opentelemetry-api>=1",
"opentelemetry-exporter-otlp>=1",

cc: @brettlangdon @emmettbutler How does this change fit into the nextgen API efforts?

These are all the transitive dependencies of the opentelemetry-exporter-otlp package. It's important to note that OpenTelemetry packages are only compatible with matching versions of the SDK, API, and exporters. For example, you cannot install version 1.20.0 of the API and use it with version 1.20.1 of the SDK or exporter:

certifi==2025.6.15
charset-normalizer==3.4.2
googleapis-common-protos==1.70.0
grpcio==1.73.1
idna==3.10
importlib_metadata==8.7.0
opentelemetry-api==1.34.1
opentelemetry-exporter-otlp==1.34.1
opentelemetry-exporter-otlp-proto-common==1.34.1
opentelemetry-exporter-otlp-proto-grpc==1.34.1
opentelemetry-exporter-otlp-proto-http==1.34.1
opentelemetry-proto==1.34.1
opentelemetry-sdk==1.34.1
opentelemetry-semantic-conventions==0.55b1
protobuf==5.29.5
requests==2.32.4
typing_extensions==4.14.0
urllib3==2.5.0
zipp==3.23.0
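
A quick way to spot a mismatch in an environment is to compare the installed versions of the packages that are released together. The sketch below uses only `importlib.metadata`; the package list is taken from the entries above that share the 1.34.1 version, and the helper names are hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable

# Packages from the list above that share one version number.
LOCKSTEP_PACKAGES = (
    "opentelemetry-api",
    "opentelemetry-sdk",
    "opentelemetry-proto",
    "opentelemetry-exporter-otlp",
    "opentelemetry-exporter-otlp-proto-common",
    "opentelemetry-exporter-otlp-proto-grpc",
    "opentelemetry-exporter-otlp-proto-http",
)


def installed_versions(names: Iterable[str]) -> Dict[str, str]:
    """Map each installed package to its version, skipping missing ones."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found


def versions_in_lockstep(versions: Dict[str, str]) -> bool:
    """True when every listed package reports the same version string."""
    return len(set(versions.values())) <= 1
```

For example, `versions_in_lockstep(installed_versions(LOCKSTEP_PACKAGES))` returns False when the API and SDK were installed at different versions.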

Another thing to note is that the OpenTelemetry exporters require a specific major version of the protobuf library: https://github.com/open-telemetry/opentelemetry-python/blob/698f9a521482d6ab3ec75721ff7ed61a207fa110/opentelemetry-proto/pyproject.toml#L28.

# This is the implementation of the "Any" type as specified by the specifications of OpenTelemetry data model for logs.
# For more details, refer to the OTel specification:
# https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#type-any
AnyValue = Union[

We can import some of these types from the OpenTelemetry API: https://github.com/open-telemetry/opentelemetry-python/blob/v1.0.0/opentelemetry-api/src/opentelemetry/util/types.py. If possible we should avoid defining our own implementations.

We can use an approach similar to this to support changes to the API across versions: https://github.com/DataDog/dd-trace-py/blob/v3.10.0rc1/ddtrace/internal/opentelemetry/trace.py#L53 (only re-define objects/classes when we absolutely have to).
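
One way to follow that advice is an import-with-fallback: prefer the definition from `opentelemetry.util.types` and only define a local equivalent when the import fails. This is a sketch of the general pattern, not necessarily what the linked trace.py does; the fallback aliases are simplified local definitions.

```python
from typing import Mapping, Optional, Sequence, Union

try:
    # Preferred: reuse the type shipped with opentelemetry-api.
    from opentelemetry.util.types import Attributes
except ImportError:
    # Fallback: a local definition, used only when the API package is absent.
    AttributeValue = Union[
        str, bool, int, float,
        Sequence[str], Sequence[bool], Sequence[int], Sequence[float],
    ]
    Attributes = Optional[Mapping[str, AttributeValue]]


def describe(attributes: Attributes) -> str:
    """Render attributes deterministically; None is treated as empty."""
    items = sorted((attributes or {}).items())
    return ", ".join(f"{key}={value!r}" for key, value in items)
```

Callers import `Attributes` from one place, so switching the fallback on or off never touches their code.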

from json import dumps
from typing import Optional

from ddtrace.internal.opentelemetry.types import Attributes, BoundedAttributes

Attributes is defined in the opentelemetry-api for all supported versions. BoundedAttributes was introduced in v1.4.0: open-telemetry/opentelemetry-python@01c6954. We should use the definitions provided by the OpenTelemetry API where possible. This will minimize the maintenance burden on our part and hopefully future-proof the library.

from ddtrace.internal.opentelemetry.types import Attributes, BoundedAttributes


class InstrumentationScope:

Instrumentation scope was added to the SDK in v1.11.0: open-telemetry/opentelemetry-python@7647a11. Since the opentelemetry exporter already installs the SDK we can just re-use this component. If someone is using opentelemetry-api<1.12.0 this component won't exist so we won't need this type.

@@ -75,6 +75,20 @@ def _(_):

set_tracer_provider(TracerProvider())

if config._otel_metrics_enabled:
# This POC uses the OpenTelemetry OTLP exporter.
# If we implement our own exporter, we can remove this import, which seems much safer.

I missed this in my earlier comments. It looks like this PR is attempting to vendor as much code as possible, with the eventual goal of replacing the opentelemetry-exporter with our own implementation.

However, this approach puts us in the worst of both worlds. We're still installing the official opentelemetry-exporter (which brings in the OpenTelemetry SDK, the OpenTelemetry protos, the OpenTelemetry semantic conventions package, and versions of the grpc and protobuf libraries with heavy restrictions), while also hoping our vendored implementations don't conflict. To move forward, we need to fully commit to one direction: either vendor all OpenTelemetry components or none. A partial solution will likely lead to version conflicts and unnecessary complexity.

But ya, if we vendor nothing this PR will be a 10-20 line change. If we vendor everything (including the protobuf and grpc libraries), this change will likely more than double the size of the ddtrace library.
