feat: Basic OpenTelemetry and Datadog support (#397)
Switch to pluggable design so that some monitoring functions can
report to something other than New Relic. Still defaults to just New
Relic, but new setting allows adding OpenTelemetry or Datadog, or
removing New Relic.

Initialization and configuration of OpenTelemetry is left as an
exercise to the deployer, but
https://github.com/mitodl/open-edx-plugins/tree/main/src/ol_openedx_otel_monitoring/
would be a likely candidate.

Part of #389
timmc-edx committed Apr 30, 2024
1 parent b0f3404 commit 4b1e318
Showing 17 changed files with 632 additions and 67 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.rst
@@ -11,6 +11,11 @@ Change Log

.. There should always be an "Unreleased" section for changes pending release.
[5.13.0] - 2024-04-30
---------------------
Added
~~~~~
* Initial support for sending monitoring data to OpenTelemetry collector or Datadog agent, configured by new Django setting ``OPENEDX_TELEMETRY``. See monitoring README for details.

[5.12.0] - 2024-03-29
---------------------
2 changes: 1 addition & 1 deletion edx_django_utils/__init__.py
@@ -2,7 +2,7 @@
 EdX utilities for Django Application development..
 """

-__version__ = "5.12.0"
+__version__ = "5.13.0"

 default_app_config = (
     "edx_django_utils.apps.EdxDjangoUtilsConfig"
34 changes: 34 additions & 0 deletions edx_django_utils/monitoring/README.rst
@@ -7,6 +7,40 @@ See ``__init__.py`` for a list of everything included in the public API.

If, for some reason, you need low level access to the newrelic agent, please extend this library to implement the feature that you want. Applications should never include ``import newrelic.agent`` directly.

Choice of monitoring tools
--------------------------

The most complete feature support is for New Relic (the default), but there is also initial support for OpenTelemetry and Datadog.

The Django setting ``OPENEDX_TELEMETRY`` can be set to a list of implementations, e.g. ``['edx_django_utils.monitoring.NewRelicBackend', 'edx_django_utils.monitoring.OpenTelemetryBackend']``. All of the implementations that can be loaded will be used for all applicable telemetry calls.
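
As a rough illustration (not part of this commit), a deployment that wants to keep New Relic and also report to OpenTelemetry might set the list as follows. The settings-file placement and the choice of these two backends are assumptions; the dotted paths are the ones named above.

```python
# Illustrative Django settings fragment (assumed to live in the service's settings module).
# The value must be a list of dotted paths, not a single string.
OPENEDX_TELEMETRY = [
    "edx_django_utils.monitoring.NewRelicBackend",
    "edx_django_utils.monitoring.OpenTelemetryBackend",
]
```

Entries that cannot be loaded are skipped with a warning, so listing a backend whose package is not installed should not break startup.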

Feature support matrix for built-in telemetry backends:

.. list-table::
   :header-rows: 1
   :widths: 55, 15, 15, 15

   * -
     - New Relic
     - OpenTelemetry
     - Datadog
   * - Custom span attributes (``set_custom_attribute``, ``accumulate``, ``increment``, etc.)
     - ✅ (on root span)
     - ✅ (on current span)
     - ✅ (on root span)
   * - Retrieve and manipulate spans (``function_trace``, ``get_current_transaction``, ``ignore_transaction``, ``set_monitoring_transaction_name``)
     - ✅
     - ❌
     - ❌
   * - Record exceptions (``record_exception``)
     - ✅
     - ✅
     - ✅
   * - Instrument non-web tasks (``background_task``)
     - ✅
     - ❌
     - ❌

Using Custom Attributes
-----------------------

1 change: 1 addition & 0 deletions edx_django_utils/monitoring/__init__.py
@@ -3,6 +3,7 @@
See README.rst for details.
"""
from .internal.backends import DatadogBackend, NewRelicBackend, OpenTelemetryBackend, TelemetryBackend
from .internal.code_owner.middleware import CodeOwnerMonitoringMiddleware
from .internal.code_owner.utils import (
get_code_owner_from_module,
180 changes: 180 additions & 0 deletions edx_django_utils/monitoring/internal/backends.py
@@ -0,0 +1,180 @@
"""
Telemetry abstraction and backends that implement it.
Only a certain subset of the monitoring functions have been made
configurable via this module.
"""

import logging
import sys
from abc import ABC, abstractmethod
from functools import lru_cache

from django.conf import settings
from django.dispatch import receiver
from django.test.signals import setting_changed
from django.utils.module_loading import import_string

log = logging.getLogger(__name__)

# The newrelic package used to not be part of the requirements files
# and so a try-import was used here. This situation is no longer true,
# but we're still preserving that pattern until someone feels like
# doing the work to remove it. (Should just be a major version bump
# and communication to anyone who might be specifically removing the
# package for some reason.)
#
# Ticket for just doing an unconditional import:
# https://github.com/openedx/edx-django-utils/issues/396
try:
    import newrelic.agent
except ImportError:  # pragma: no cover
    newrelic = None  # pylint: disable=invalid-name


class TelemetryBackend(ABC):
    """
    Base class for telemetry sinks.
    """
    @abstractmethod
    def set_attribute(self, key, value):
        """
        Set a key-value attribute on a span. This might be the current
        span or it might be the root span of the process, depending on
        the backend.
        """

    @abstractmethod
    def record_exception(self):
        """
        Record the exception that is currently being handled.
        """


class NewRelicBackend(TelemetryBackend):
    """
    Send telemetry to New Relic.

    https://docs.newrelic.com/docs/apm/agents/python-agent/python-agent-api/guide-using-python-agent-api/
    """
    def __init__(self):
        if newrelic is None:
            raise Exception("Could not load New Relic monitoring backend; package not present.")

    def set_attribute(self, key, value):
        # Sets attribute on the transaction, rather than the current
        # span, matching historical behavior. There is also an
        # `add_custom_span_attribute` that would better match
        # OpenTelemetry's behavior, which we could try exposing
        # through a new, more specific TelemetryBackend method.
        #
        # TODO: Update to newer name `add_custom_attribute`
        # https://docs.newrelic.com/docs/apm/agents/python-agent/python-agent-api/addcustomparameter-python-agent-api/
        newrelic.agent.add_custom_parameter(key, value)

    def record_exception(self):
        # TODO: Replace with newrelic.agent.notice_error()
        # https://docs.newrelic.com/docs/apm/agents/python-agent/python-agent-api/recordexception-python-agent-api/
        newrelic.agent.record_exception()


class OpenTelemetryBackend(TelemetryBackend):
    """
    Send telemetry via OpenTelemetry.

    Requirements to use:

    - Install `opentelemetry-api` Python package
    - Configure and initialize OpenTelemetry

    API reference: https://opentelemetry-python.readthedocs.io/en/latest/
    """
    # pylint: disable=import-outside-toplevel
    def __init__(self):
        # If import fails, the backend won't be used.
        from opentelemetry import trace
        self.otel_trace = trace

    def set_attribute(self, key, value):
        # Sets the value on the current span, not necessarily the root
        # span in the process.
        self.otel_trace.get_current_span().set_attribute(key, value)

    def record_exception(self):
        self.otel_trace.get_current_span().record_exception(sys.exc_info()[1])


class DatadogBackend(TelemetryBackend):
    """
    Send telemetry to Datadog via ddtrace.

    Requirements to use:

    - Install `ddtrace` Python package
    - Initialize ddtrace, either via ddtrace-run or ddtrace.auto

    API reference: https://ddtrace.readthedocs.io/en/stable/api.html
    """
    # pylint: disable=import-outside-toplevel
    def __init__(self):
        # If import fails, the backend won't be used.
        from ddtrace import tracer
        self.dd_tracer = tracer

    def set_attribute(self, key, value):
        if root_span := self.dd_tracer.current_root_span():
            root_span.set_tag(key, value)

    def record_exception(self):
        if span := self.dd_tracer.current_span():
            span.set_traceback()


# We're using an lru_cache instead of assigning the result to a variable on
# module load. With the default settings (pointing to a TelemetryBackend
# in this very module), this function can't be successfully called until
# the module finishes loading, otherwise we get a circular import error
# that will cause the backend to be dropped from the list.
@lru_cache
def configured_backends():
    """
    Produce a list of TelemetryBackend instances from Django settings.
    """
    # .. setting_name: OPENEDX_TELEMETRY
    # .. setting_default: ['edx_django_utils.monitoring.NewRelicBackend']
    # .. setting_description: List of telemetry backends to send data to. Allowable values
    #   are dotted module paths to classes implementing `edx_django_utils.monitoring.TelemetryBackend`,
    #   such as the built-in `NewRelicBackend`, `OpenTelemetryBackend`, and `DatadogBackend`
    #   (in the same module). For historical reasons, this defaults to just
    #   New Relic, and not all monitoring features will report to all backends (New Relic
    #   having the broadest support). Unusable options are ignored. Configuration
    #   of the backends themselves is via environment variables and system config files
    #   rather than via Django settings.
    backend_classes = getattr(settings, 'OPENEDX_TELEMETRY', None)
    if isinstance(backend_classes, str):
        # Prevent a certain kind of easy mistake.
        raise Exception("OPENEDX_TELEMETRY must be a list, not a string.")
    if backend_classes is None:
        backend_classes = ['edx_django_utils.monitoring.NewRelicBackend']

    backends = []
    for backend_class in backend_classes:
        try:
            cls = import_string(backend_class)
            if issubclass(cls, TelemetryBackend):
                backends.append(cls())
            else:
                log.warning(
                    f"Could not load OPENEDX_TELEMETRY option {backend_class!r}: "
                    f"{cls} is not a subclass of TelemetryBackend"
                )
        except BaseException as e:
            log.warning(f"Could not load OPENEDX_TELEMETRY option {backend_class!r}: {e!r}")

    return backends


@receiver(setting_changed)
def _reset_state(sender, **kwargs):  # pylint: disable=unused-argument
    """Reset caches when settings change during unit tests."""
    configured_backends.cache_clear()
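
Since `OPENEDX_TELEMETRY` accepts any dotted path to a `TelemetryBackend` subclass, a deployer could in principle supply their own backend. The following is a minimal sketch of that idea, not code from this commit: `InMemoryBackend` is a hypothetical name, and the `TelemetryBackend` class here is a local stand-in with the same two abstract methods so that the example runs without Django or `edx_django_utils` installed.

```python
import sys
from abc import ABC, abstractmethod


class TelemetryBackend(ABC):
    """Local stand-in mirroring the two abstract methods of the real base class."""

    @abstractmethod
    def set_attribute(self, key, value):
        """Set a key-value attribute."""

    @abstractmethod
    def record_exception(self):
        """Record the exception that is currently being handled."""


class InMemoryBackend(TelemetryBackend):
    """Hypothetical backend that keeps telemetry in memory, e.g. for tests."""

    def __init__(self):
        self.attributes = {}
        self.exceptions = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def record_exception(self):
        # Like the OpenTelemetry backend above, read the active exception
        # from sys.exc_info(); skip if nothing is being handled.
        exc = sys.exc_info()[1]
        if exc is not None:
            self.exceptions.append(exc)


backend = InMemoryBackend()
backend.set_attribute("org", "edX")
try:
    raise ValueError("boom")
except ValueError:
    backend.record_exception()
```

In a real deployment the class would subclass `edx_django_utils.monitoring.TelemetryBackend` instead of the stand-in, and its dotted path would be added to the `OPENEDX_TELEMETRY` list.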
27 changes: 9 additions & 18 deletions edx_django_utils/monitoring/internal/middleware.py
@@ -1,8 +1,5 @@
 """
 Middleware for monitoring.
-At this time, monitoring details can only be reported to New Relic.
 """
 import base64
 import hashlib
@@ -23,12 +20,9 @@
 from edx_django_utils.cache import RequestCache
 from edx_django_utils.logging import encrypt_for_log

+from .backends import configured_backends
+
 log = logging.getLogger(__name__)
-try:
-    import newrelic.agent
-except ImportError:  # pragma: no cover
-    log.warning("Unable to load NewRelic agent module")
-    newrelic = None  # pylint: disable=invalid-name


 _DEFAULT_NAMESPACE = 'edx_django_utils.monitoring'
@@ -77,17 +71,14 @@ class CachedCustomMonitoringMiddleware(MiddlewareMixin):
     Make sure to add below the request cache in MIDDLEWARE.
-    This middleware will only call on the newrelic agent if there are any attributes
+    This middleware will only call on the telemetry collector if there are any attributes
     to report for this request, so it will not incur any processing overhead for
     request handlers which do not record custom attributes.
-    Note: New Relic adds custom attributes to events, which is what is being used here.
     """
     @classmethod
     def _get_attributes_cache(cls):
         """
-        Get a request cache specifically for New Relic custom attributes.
+        Get a request cache specifically for custom attributes.
         """
         return RequestCache(namespace=_REQUEST_CACHE_NAMESPACE)

@@ -126,9 +117,9 @@ def accumulate_metric(cls, name, value): # pragma: no cover
     @classmethod
     def _batch_report(cls):
         """
-        Report the collected custom attributes to New Relic.
+        Report the collected custom attributes.
         """
-        if not newrelic:  # pragma: no cover
+        if not configured_backends():  # pragma: no cover
             return
         attributes_cache = cls._get_attributes_cache()
         for key, value in attributes_cache.data.items():
@@ -157,8 +148,8 @@ def _set_custom_attribute(key, value):
     Note: Can't use public method in ``utils.py`` due to circular reference.
     """
-    if newrelic:  # pragma: no cover
-        newrelic.agent.add_custom_parameter(key, value)
+    for backend in configured_backends():
+        backend.set_attribute(key, value)


 class MonitoringMemoryMiddleware(MiddlewareMixin):
@@ -499,7 +490,7 @@ def split_ascii_log_message(msg, chunk_size):
         yield msg  # no need for continuation messages
     else:
         # Generate a unique-enough collation ID for this message.
-        h = hashlib.shake_128(msg.encode()).digest(6)  # pylint/#4039 pylint: disable=too-many-function-args
+        h = hashlib.shake_128(msg.encode()).digest(6)
         group_id = base64.b64encode(h).decode().rstrip('=')

         for i in range(chunk_count):
30 changes: 13 additions & 17 deletions edx_django_utils/monitoring/internal/utils.py
@@ -17,6 +17,7 @@
 At this time, the custom monitoring will only be reported to New Relic.
 """
+from .backends import configured_backends
 from .middleware import CachedCustomMonitoringMiddleware

 try:
@@ -62,49 +63,44 @@ def set_custom_attributes_for_course_key(course_key):
     """
     Set monitoring custom attributes related to a course key.
-    This is not cached, and only support reporting to New Relic Insights.
+    This is not cached.
     """
-    if newrelic:  # pragma: no cover
-        newrelic.agent.add_custom_parameter('course_id', str(course_key))
-        newrelic.agent.add_custom_parameter('org', str(course_key.org))
+    set_custom_attribute('course_id', str(course_key))
+    set_custom_attribute('org', str(course_key.org))


 def set_custom_attribute(key, value):
     """
     Set monitoring custom attribute.
-    This is not cached, and only support reporting to New Relic Insights.
+    This is not cached.
     """
-    if newrelic:  # pragma: no cover
-        # note: parameter is new relic's older name for attributes
-        newrelic.agent.add_custom_parameter(key, value)
+    for backend in configured_backends():
+        backend.set_attribute(key, value)


 def record_exception():
     """
-    Records a caught exception to the monitoring system.
+    Record a caught exception to the monitoring system.
     Note: By default, only unhandled exceptions are monitored. This function
     can be called to record exceptions as monitored errors, even if you handle
     the exception gracefully from a user perspective.
-    For more details, see:
-    https://docs.newrelic.com/docs/agents/python-agent/python-agent-api/recordexception-python-agent-api
     """
-    if newrelic:  # pragma: no cover
-        newrelic.agent.record_exception()
+    for backend in configured_backends():
+        backend.record_exception()


 def background_task(*args, **kwargs):
     """
     Handles monitoring for background tasks that are not passed in through the web server like
     celery and event consuming tasks.
+    This function only supports New Relic.
     For more details, see:
-    https://docs.newrelic.com/docs/apm/agents/python-agent/supported-features/monitor-non-web-scripts-worker-processes-tasks-functions
+    https://docs.newrelic.com/docs/apm/agents/python-agent/python-agent-api/backgroundtask-python-agent-api/
     """
     def noop_decorator(func):
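
The effect of the `utils.py` changes is that one call to `set_custom_attribute` or `record_exception` now fans out to every configured backend. The sketch below illustrates that pattern in isolation and is not code from this commit: `RecordingBackend` is a hypothetical stand-in for a backend, and `configured_backends` returns a hard-coded list here, whereas the real function builds it from the `OPENEDX_TELEMETRY` setting.

```python
import sys
from functools import lru_cache


class RecordingBackend:
    """Hypothetical stand-in for a TelemetryBackend implementation."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.exceptions = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def record_exception(self):
        self.exceptions.append(sys.exc_info()[1])


@lru_cache
def configured_backends():
    # Hard-coded for illustration; cached so the same instances are reused.
    return [RecordingBackend("first"), RecordingBackend("second")]


def set_custom_attribute(key, value):
    # Every configured backend receives the attribute.
    for backend in configured_backends():
        backend.set_attribute(key, value)


def record_exception():
    # Every configured backend records the exception being handled.
    for backend in configured_backends():
        backend.record_exception()


set_custom_attribute("org", "edX")
try:
    {}["missing"]
except KeyError:
    record_exception()
```

With an empty backend list, both functions become no-ops, which matches how removing New Relic from the setting disables reporting to it.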