diff --git a/.gitignore b/.gitignore index 23d2b6a..ff92690 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ *.charm build/ __pycache__/ +.tox diff --git a/lib/charms/grafana_k8s/v0/grafana_dashboard.py b/lib/charms/grafana_k8s/v0/grafana_dashboard.py new file mode 100644 index 0000000..f981ed6 --- /dev/null +++ b/lib/charms/grafana_k8s/v0/grafana_dashboard.py @@ -0,0 +1,1380 @@ +# Copyright 2021 Canonical Ltd. +# See LICENSE file for licensing details. + +"""## Overview. + +This document explains how to integrate with the Grafana charm +for the purpose of providing a dashboard which can be used by +end users. It also explains the structure of the data +expected by the `grafana-dashboard` interface, and may provide a +mechanism or reference point for providing a compatible interface +or library by providing a definitive reference guide to the +structure of relation data which is shared between the Grafana +charm and any charm providing datasource information. + +## Provider Library Usage + +The Grafana charm interacts with its dashboards using its charm +library. The goal of this library is to be as simple to use as +possible, and instantiation of the class with or without changing +the default arguments provides a complete use case. For the simplest +use case of a charm which bundles dashboards and provides a +`provides: grafana-dashboard` interface, creation of a +`GrafanaDashboardProvider` object with the default arguments is +sufficient. + +:class:`GrafanaDashboardProvider` expects that bundled dashboards should +be included in your charm with a default path of: + + path/to/charm.py + path/to/src/grafana_dashboards/*.tmpl + +Where the `*.tmpl` files are Grafana dashboard JSON data either from the +Grafana marketplace, or directly exported from a a Grafana instance. + +The default arguments are: + + `charm`: `self` from the charm instantiating this library + `relation_name`: grafana-dashboard + `dashboards_path`: "/src/grafana_dashboards" + +If your configuration requires any changes from these defaults, they +may be set from the class constructor. It may be instantiated as +follows: + + from charms.grafana_k8s.v0.grafana_dashboard import GrafanaDashboardProvider + + class FooCharm: + def __init__(self, *args): + super().__init__(*args, **kwargs) + ... + self.grafana_dashboard_provider = GrafanaDashboardProvider(self) + ... + +The first argument (`self`) should be a reference to the parent (providing +dashboards), as this charm's lifecycle events will be used to re-submit +dashboard information if a charm is upgraded, the pod is restarted, or other. + +An instantiated `GrafanaDashboardProvider` validates that the path specified +in the constructor (or the default) exists, reads the file contents, then +compresses them with LZMA and adds them to the application relation data +when a relation is established with Grafana. + +Provided dashboards will be checked by Grafana, and a series of dropdown menus +providing the ability to select query targets by Juju Model, application instance, +and unit will be added if they do not exist. + +To avoid requiring `jinja` in `GrafanaDashboardProvider` users, template validation +and rendering occurs on the other side of the relation, and relation data in +the form of: + + { + "event": { + "valid": `true|false`, + "errors": [], + } + } + +Will be returned if rendering or validation fails. 
In this case, the +`GrafanaDashboardProvider` object will emit a `dashboard_status_changed` event +of the type :class:`GrafanaDashboardEvent`, which will contain information +about the validation error. + +This information is added to the relation data for the charms as serialized JSON +from a dict, with a structure of: +``` +{ + "application": { + "dashboards": { + "uuid": a uuid generated to ensure a relation event triggers, + "templates": { + "file:{hash}": { + "content": `{compressed_template_data}`, + "charm": `charm.meta.name`, + "juju_topology": { + "model": `charm.model.name`, + "model_uuid": `charm.model.uuid`, + "application": `charm.app.name`, + "unit": `charm.unit.name`, + } + }, + "file:{other_file_hash}": { + ... + }, + }, + }, + }, +} +``` + +This is ingested by :class:`GrafanaDashboardConsumer`, and is sufficient for configuration. + +The [COS Configuration Charm](https://charmhub.io/cos-configuration-k8s) can be used to +add dashboards which are bundled with charms. + +## Consumer Library Usage + +The `GrafanaDashboardConsumer` object may be used by Grafana +charms to manage relations with available dashboards. For this +purpose, a charm consuming Grafana dashboard information should do +the following things: + +1. Instantiate the `GrafanaDashboardConsumer` object by providing it a +reference to the parent (Grafana) charm and, optionally, the name of +the relation that the Grafana charm uses to interact with dashboards. +This relation must confirm to the `grafana-dashboard` interface. + +For example a Grafana charm may instantiate the +`GrafanaDashboardConsumer` in its constructor as follows + + from charms.grafana_k8s.v0.grafana_dashboard import GrafanaDashboardConsumer + + def __init__(self, *args): + super().__init__(*args) + ... + self.grafana_dashboard_consumer = GrafanaDashboardConsumer(self) + ... + +2. A Grafana charm also needs to listen to the +`GrafanaDashboardConsumer` events emitted by the `GrafanaDashboardConsumer` +by adding itself as an observer for these events: + + self.framework.observe( + self.grafana_source_consumer.on.sources_changed, + self._on_dashboards_changed, + ) + +Dashboards can be retrieved the :meth:`dashboards`: + +It will be returned in the format of: + +``` +[ + { + "id": unique_id, + "relation_id": relation_id, + "charm": the name of the charm which provided the dashboard, + "content": compressed_template_data + }, +] +``` + +The consuming charm should decompress the dashboard. 
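+
+The template content travels over the relation LZMA-compressed and
+base64-encoded, so it must be decoded before use. A minimal sketch of
+reversing that encoding (this mirrors the `_decode_dashboard_content`
+helper defined later in this file; the function name below is illustrative):
+
+    import base64
+    import json
+    import lzma
+
+    def decode_dashboard(encoded_content: str) -> dict:
+        # base64 -> LZMA -> JSON text -> dict
+        raw = lzma.decompress(base64.b64decode(encoded_content.encode("utf-8"))).decode()
+        return json.loads(raw)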
+""" + +import base64 +import json +import logging +import lzma +import os +import re +import uuid +from pathlib import Path +from typing import Any, Dict, List, Optional, Union + +from ops.charm import ( + CharmBase, + HookEvent, + RelationBrokenEvent, + RelationChangedEvent, + RelationCreatedEvent, + RelationEvent, + RelationRole, +) +from ops.framework import ( + EventBase, + EventSource, + Object, + ObjectEvents, + StoredDict, + StoredList, + StoredState, +) +from ops.model import Relation + +# The unique Charmhub library identifier, never change it +LIBID = "c49eb9c7dfef40c7b6235ebd67010a3f" + +# Increment this major API version when introducing breaking changes +LIBAPI = 0 + +# Increment this PATCH version before using `charmcraft publish-lib` or reset +# to 0 if you are raising the major API version +LIBPATCH = 10 + +logger = logging.getLogger(__name__) + + +DEFAULT_RELATION_NAME = "grafana-dashboard" +RELATION_INTERFACE_NAME = "grafana_dashboard" + +TEMPLATE_DROPDOWNS = [ + { + "allValue": None, + "datasource": "${prometheusds}", + "definition": "label_values(up,juju_model)", + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": "Juju model", + "multi": False, + "name": "juju_model", + "query": { + "query": "label_values(up,juju_model)", + "refId": "StandardVariableQuery", + }, + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "sort": 0, + "tagValuesQuery": "", + "tags": [], + "tagsQuery": "", + "type": "query", + "useTags": False, + }, + { + "allValue": None, + "datasource": "${prometheusds}", + "definition": 'label_values(up{juju_model="$juju_model"},juju_model_uuid)', + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": "Juju model uuid", + "multi": False, + "name": "juju_model_uuid", + "query": { + "query": 'label_values(up{juju_model="$juju_model"},juju_model_uuid)', + "refId": "StandardVariableQuery", + }, + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "sort": 0, + "tagValuesQuery": "", + "tags": [], + "tagsQuery": "", + "type": "query", + "useTags": False, + }, + { + "allValue": None, + "datasource": "${prometheusds}", + "definition": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid"},juju_application)', + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": "Juju application", + "multi": False, + "name": "juju_application", + "query": { + "query": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid"},juju_application)', + "refId": "StandardVariableQuery", + }, + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "sort": 0, + "tagValuesQuery": "", + "tags": [], + "tagsQuery": "", + "type": "query", + "useTags": False, + }, + { + "allValue": None, + "datasource": "${prometheusds}", + "definition": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid",juju_application="$juju_application"},juju_unit)', + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": "Juju unit", + "multi": False, + "name": "juju_unit", + "query": { + "query": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid",juju_application="$juju_application"},juju_unit)', + "refId": "StandardVariableQuery", + }, + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "sort": 0, + "tagValuesQuery": "", + "tags": [], + "tagsQuery": "", + "type": "query", + "useTags": False, + }, + { + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": 
None, + "multi": False, + "name": "prometheusds", + "options": [], + "query": "prometheus", + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "type": "datasource", + }, +] + +REACTIVE_CONVERTER = { # type: ignore + "allValue": None, + "datasource": "${prometheusds}", + "definition": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid",juju_application="$juju_application"},host)', + "description": None, + "error": None, + "hide": 0, + "includeAll": False, + "label": "hosts", + "multi": True, + "name": "host", + "options": [], + "query": { + "query": 'label_values(up{juju_model="$juju_model",juju_model_uuid="$juju_model_uuid",juju_application="$juju_application"},host)', + "refId": "StandardVariableQuery", + }, + "refresh": 1, + "regex": "", + "skipUrlSync": False, + "sort": 1, + "tagValuesQuery": "", + "tags": [], + "tagsQuery": "", + "type": "query", + "useTags": False, +} + + +class RelationNotFoundError(Exception): + """Raised if there is no relation with the given name.""" + + def __init__(self, relation_name: str): + self.relation_name = relation_name + self.message = "No relation named '{}' found".format(relation_name) + + super().__init__(self.message) + + +class RelationInterfaceMismatchError(Exception): + """Raised if the relation with the given name has a different interface.""" + + def __init__( + self, + relation_name: str, + expected_relation_interface: str, + actual_relation_interface: str, + ): + self.relation_name = relation_name + self.expected_relation_interface = expected_relation_interface + self.actual_relation_interface = actual_relation_interface + self.message = ( + "The '{}' relation has '{}' as " + "interface rather than the expected '{}'".format( + relation_name, actual_relation_interface, expected_relation_interface + ) + ) + + super().__init__(self.message) + + +class RelationRoleMismatchError(Exception): + """Raised if the relation with the given name has a different direction.""" + + def __init__( + self, + relation_name: str, + expected_relation_role: RelationRole, + actual_relation_role: RelationRole, + ): + self.relation_name = relation_name + self.expected_relation_interface = expected_relation_role + self.actual_relation_role = actual_relation_role + self.message = "The '{}' relation has role '{}' rather than the expected '{}'".format( + relation_name, repr(actual_relation_role), repr(expected_relation_role) + ) + + super().__init__(self.message) + + +class InvalidDirectoryPathError(Exception): + """Raised if the grafana dashboards folder cannot be found or is otherwise invalid.""" + + def __init__( + self, + grafana_dashboards_absolute_path: str, + message: str, + ): + self.grafana_dashboards_absolute_path = grafana_dashboards_absolute_path + self.message = message + + super().__init__(self.message) + + +def _resolve_dir_against_charm_path(charm: CharmBase, *path_elements: str) -> str: + """Resolve the provided path items against the directory of the main file. + + Look up the directory of the charmed operator file being executed. This is normally + going to be the charm.py file of the charm including this library. Then, resolve + the provided path elements and return its absolute path. 
+ + Raises: + InvalidDirectoryPathError if the resolved path does not exist or it is not a directory + + """ + charm_dir = Path(str(charm.charm_dir)) + if not charm_dir.exists() or not charm_dir.is_dir(): + # Operator Framework does not currently expose a robust + # way to determine the top level charm source directory + # that is consistent across deployed charms and unit tests + # Hence for unit tests the current working directory is used + # TODO: updated this logic when the following ticket is resolved + # https://github.com/canonical/operator/issues/643 + charm_dir = Path(os.getcwd()) + + dir_path = charm_dir.absolute().joinpath(*path_elements) + + if not dir_path.exists(): + raise InvalidDirectoryPathError(str(dir_path), "directory does not exist") + if not dir_path.is_dir(): + raise InvalidDirectoryPathError(str(dir_path), "is not a directory") + + return str(dir_path) + + +def _validate_relation_by_interface_and_direction( + charm: CharmBase, + relation_name: str, + expected_relation_interface: str, + expected_relation_role: RelationRole, +) -> None: + """Verifies that a relation has the necessary characteristics. + + Verifies that the `relation_name` provided: (1) exists in metadata.yaml, + (2) declares as interface the interface name passed as `relation_interface` + and (3) has the right "direction", i.e., it is a relation that `charm` + provides or requires. + + Args: + charm: a `CharmBase` object to scan for the matching relation. + relation_name: the name of the relation to be verified. + expected_relation_interface: the interface name to be matched by the + relation named `relation_name`. + expected_relation_role: whether the `relation_name` must be either + provided or required by `charm`. + + Raises: + RelationNotFoundError: If there is no relation in the charm's metadata.yaml + named like the value of the `relation_name` argument. + RelationInterfaceMismatchError: If the relation interface of the + relation named as the provided `relation_name` argument does not + match the `expected_relation_interface` argument. + RelationRoleMismatchError: If the relation named as the provided `relation_name` + argument has a different role than what is specified by the + `expected_relation_role` argument. 
+ """ + if relation_name not in charm.meta.relations: + raise RelationNotFoundError(relation_name) + + relation = charm.meta.relations[relation_name] + + actual_relation_interface = relation.interface_name + if actual_relation_interface != expected_relation_interface: + raise RelationInterfaceMismatchError( + relation_name, expected_relation_interface, actual_relation_interface + ) + + if expected_relation_role == RelationRole.provides: + if relation_name not in charm.meta.provides: + raise RelationRoleMismatchError( + relation_name, RelationRole.provides, RelationRole.requires + ) + elif expected_relation_role == RelationRole.requires: + if relation_name not in charm.meta.requires: + raise RelationRoleMismatchError( + relation_name, RelationRole.requires, RelationRole.provides + ) + else: + raise Exception("Unexpected RelationDirection: {}".format(expected_relation_role)) + + +def _encode_dashboard_content(content: Union[str, bytes]) -> str: + if isinstance(content, str): + content = bytes(content, "utf-8") + + return base64.b64encode(lzma.compress(content)).decode("utf-8") + + +def _decode_dashboard_content(encoded_content: str) -> str: + return lzma.decompress(base64.b64decode(encoded_content.encode("utf-8"))).decode() + + +def _inject_dashboard_dropdowns(content: str) -> str: + """Make sure dropdowns are present for Juju topology.""" + dict_content = json.loads(content) + if "templating" not in content: + dict_content["templating"] = {"list": [d for d in TEMPLATE_DROPDOWNS]} + else: + for d in TEMPLATE_DROPDOWNS: + if d not in dict_content["templating"]["list"]: + dict_content["templating"]["list"].insert(0, d) + + return json.dumps(dict_content) + + +def _type_convert_stored(obj): + """Convert Stored* to their appropriate types, recursively.""" + if isinstance(obj, StoredList): + return list(map(_type_convert_stored, obj)) + elif isinstance(obj, StoredDict): + rdict = {} # type: Dict[Any, Any] + for k in obj.keys(): + rdict[k] = _type_convert_stored(obj[k]) + return rdict + else: + return obj + + +class GrafanaDashboardsChanged(EventBase): + """Event emitted when Grafana dashboards change.""" + + def __init__(self, handle, data=None): + super().__init__(handle) + self.data = data + + def snapshot(self) -> Dict: + """Save grafana source information.""" + return {"data": self.data} + + def restore(self, snapshot): + """Restore grafana source information.""" + self.data = snapshot["data"] + + +class GrafanaDashboardEvents(ObjectEvents): + """Events raised by :class:`GrafanaSourceEvents`.""" + + dashboards_changed = EventSource(GrafanaDashboardsChanged) + + +class GrafanaDashboardEvent(EventBase): + """Event emitted when Grafana dashboards cannot be resolved. + + Enables us to set a clear status on the provider. 
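+
+    For example, a provider charm might observe the corresponding
+    `dashboard_status_changed` event and surface the failure (the handler
+    name below is illustrative):
+
+        def _on_dashboard_status_changed(self, event):
+            if not event.valid:
+                logger.warning("Dashboard rejected: %s", event.error_message)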
+ """ + + def __init__(self, handle, error_message: str = "", valid: bool = False): + super().__init__(handle) + self.error_message = error_message + self.valid = valid + + def snapshot(self) -> Dict: + """Save grafana source information.""" + return {"error_message": self.error_message, "valid": self.valid} + + def restore(self, snapshot): + """Restore grafana source information.""" + self.error_message = snapshot["error_message"] + self.valid = snapshot["valid"] + + +class GrafanaProviderEvents(ObjectEvents): + """Events raised by :class:`GrafanaSourceEvents`.""" + + dashboard_status_changed = EventSource(GrafanaDashboardEvent) + + +class GrafanaDashboardProvider(Object): + """An API to provide Grafana dashboards to a Grafana charm.""" + + _stored = StoredState() + on = GrafanaProviderEvents() + + def __init__( + self, + charm: CharmBase, + relation_name: str = DEFAULT_RELATION_NAME, + dashboards_path: str = "src/grafana_dashboards", + ) -> None: + """API to provide Grafana dashboard to a Grafana charmed operator. + + The :class:`GrafanaDashboardProvider` object provides an API + to upload dashboards to a Grafana charm. In its most streamlined + usage, the :class:`GrafanaDashboardProvider` is integrated in a + charmed operator as follows: + + self.grafana = GrafanaDashboardProvider(self) + + The :class:`GrafanaDashboardProvider` will look for dashboard + templates in the `/grafana_dashboards` folder. + Additionally, dashboard templates can be uploaded programmatically + via the :method:`GrafanaDashboardProvider.add_dashboard` method. + + To use the :class:`GrafanaDashboardProvider` API, you need a relation + defined in your charm operator's metadata.yaml as follows: + + provides: + grafana-dashboard: + interface: grafana_dashboard + + If you would like to use relation name other than `grafana-dashboard`, + you will need to specify the relation name via the `relation_name` + argument when instantiating the :class:`GrafanaDashboardProvider` object. + However, it is strongly advised to keep the the default relation name, + so that people deploying your charm will have a consistent experience + with all other charms that provide Grafana dashboards. + + It is possible to provide a different file path for the Grafana dashboards + to be automatically managed by the :class:`GrafanaDashboardProvider` object + via the `dashboards_path` argument. This may be necessary when the directory + structure of your charmed operator repository is not the "usual" one as + generated by `charmcraft init`, for example when adding the charmed operator + in a Java repository managed by Maven or Gradle. However, unless there are + such constraints with other tooling, it is strongly advised to store the + Grafana dashboards in the default `/grafana_dashboards` + folder, in order to provide a consistent experience for other charmed operator + authors. + + Args: + charm: a :class:`CharmBase` object which manages this + :class:`GrafanaProvider` object. Generally this is + `self` in the instantiating class. + relation_name: a :string: name of the relation managed by this + :class:`GrafanaDashboardProvider`; it defaults to "grafana-dashboard". + dashboards_path: a filesystem path relative to the charm root + where dashboard templates can be located. By default, the library + expects dashboard files to be in the `/grafana_dashboards` + directory. 
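+
+        For example, a charm whose dashboard templates live outside the
+        default directory (the path below is purely illustrative) could be
+        wired up as:
+
+            self.grafana_dashboard_provider = GrafanaDashboardProvider(
+                self, dashboards_path="src/dashboards"
+            )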
+ """ + _validate_relation_by_interface_and_direction( + charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.provides + ) + + try: + dashboards_path = _resolve_dir_against_charm_path(charm, dashboards_path) + except InvalidDirectoryPathError as e: + logger.warning( + "Invalid Grafana dashboards folder at %s: %s", + e.grafana_dashboards_absolute_path, + e.message, + ) + + super().__init__(charm, relation_name) + + self._charm = charm + self._relation_name = relation_name + self._dashboards_path = dashboards_path + self._stored.set_default(dashboard_templates={}) + + self.framework.observe(self._charm.on.leader_elected, self._update_all_dashboards_from_dir) + self.framework.observe(self._charm.on.upgrade_charm, self._update_all_dashboards_from_dir) + + self.framework.observe( + self._charm.on[self._relation_name].relation_created, + self._on_grafana_dashboard_relation_created, + ) + self.framework.observe( + self._charm.on[self._relation_name].relation_changed, + self._on_grafana_dashboard_relation_changed, + ) + + def add_dashboard(self, content: str) -> None: + """Add a dashboard to the relation managed by this :class:`GrafanaDashboardProvider`. + + Args: + content: a string representing a Jinja template. Currently, no + global variables are added to the Jinja template evaluation + context. + """ + # Update of storage must be done irrespective of leadership, so + # that the stored state is there when this unit becomes leader. + stored_dashboard_templates = self._stored.dashboard_templates + + encoded_dashboard = _encode_dashboard_content(content) + + # Use as id the first chars of the encoded dashboard, so that + # it is predictable across units. + id = "prog:{}".format(encoded_dashboard[-24:-16]) + stored_dashboard_templates[id] = self._content_to_dashboard_object(encoded_dashboard) + + if self._charm.unit.is_leader(): + for dashboard_relation in self._charm.model.relations[self._relation_name]: + self._upset_dashboards_on_relation(dashboard_relation) + + def remove_non_builtin_dashboards(self) -> None: + """Remove all dashboards to the relation added via :method:`add_dashboard`.""" + # Update of storage must be done irrespective of leadership, so + # that the stored state is there when this unit becomes leader. + stored_dashboard_templates = self._stored.dashboard_templates + + for dashboard_id in list(stored_dashboard_templates.keys()): + if dashboard_id.startswith("prog:"): + del stored_dashboard_templates[dashboard_id] + self._stored.dashboard_templates = stored_dashboard_templates + + if self._charm.unit.is_leader(): + for dashboard_relation in self._charm.model.relations[self._relation_name]: + self._upset_dashboards_on_relation(dashboard_relation) + + def update_dashboards(self) -> None: + """Trigger the re-evaluation of the data on all relations.""" + if self._charm.unit.is_leader(): + for dashboard_relation in self._charm.model.relations[self._relation_name]: + self._upset_dashboards_on_relation(dashboard_relation) + + def _update_all_dashboards_from_dir(self, _: Optional[HookEvent] = None) -> None: + """Scans the built-in dashboards and updates relations with changes.""" + # Update of storage must be done irrespective of leadership, so + # that the stored state is there when this unit becomes leader. + + # Ensure we do not leave outdated dashboards by removing from stored all + # the encoded dashboards that start with "file/". 
+ if self._dashboards_path: + stored_dashboard_templates = self._stored.dashboard_templates + + for dashboard_id in list(stored_dashboard_templates.keys()): + if dashboard_id.startswith("file:"): + del stored_dashboard_templates[dashboard_id] + + for path in filter(Path.is_file, Path(self._dashboards_path).glob("*.tmpl")): + id = "file:{}".format(path.stem) + stored_dashboard_templates[id] = self._content_to_dashboard_object( + _encode_dashboard_content(path.read_bytes()) + ) + + self._stored.dashboard_templates = stored_dashboard_templates + + if self._charm.unit.is_leader(): + for dashboard_relation in self._charm.model.relations[self._relation_name]: + self._upset_dashboards_on_relation(dashboard_relation) + + def _reinitialize_dashboard_data(self) -> None: + """Triggers a reload of dashboard outside of an eventing workflow. + + This will destroy any existing relation data. + """ + try: + _resolve_dir_against_charm_path(self._charm, self._dashboards_path) + self._update_all_dashboards_from_dir() + + except InvalidDirectoryPathError as e: + logger.warning( + "Invalid Grafana dashboards folder at %s: %s", + e.grafana_dashboards_absolute_path, + e.message, + ) + stored_dashboard_templates = self._stored.dashboard_templates + + for dashboard_id in list(stored_dashboard_templates.keys()): + if dashboard_id.startswith("file:"): + del stored_dashboard_templates[dashboard_id] + self._stored.dashboard_templates = stored_dashboard_templates + + # With all of the file-based dashboards cleared out, force a refresh + # of relation data + if self._charm.unit.is_leader(): + for dashboard_relation in self._charm.model.relations[self._relation_name]: + self._upset_dashboards_on_relation(dashboard_relation) + + def _on_grafana_dashboard_relation_created(self, event: RelationCreatedEvent) -> None: + """Watch for a relation being created and automatically send dashboards. + + Args: + event: The :class:`RelationJoinedEvent` sent when a + `grafana_dashboaard` relationship is joined + """ + if self._charm.unit.is_leader(): + self._upset_dashboards_on_relation(event.relation) + + def _on_grafana_dashboard_relation_changed(self, event: RelationChangedEvent) -> None: + """Watch for changes so we know if there's an error to signal back to the parent charm. + + Args: + event: The `RelationChangedEvent` that triggered this handler. 
+ """ + if self._charm.unit.is_leader(): + data = json.loads(event.relation.data[event.app].get("event", "{}")) + + if not data: + return + + valid = bool(data.get("valid", True)) + errors = data.get("errors", []) + if valid and not errors: + self.on.dashboard_status_changed.emit(valid=valid) + else: + self.on.dashboard_status_changed.emit(valid=valid, errors=errors) + + def _upset_dashboards_on_relation(self, relation: Relation) -> None: + """Update the dashboards in the relation data bucket.""" + # It's completely ridiculous to add a UUID, but if we don't have some + # pseudo-random value, this never makes it across 'juju set-state' + stored_data = { + "templates": _type_convert_stored(self._stored.dashboard_templates), + "uuid": str(uuid.uuid4()), + } + + relation.data[self._charm.app]["dashboards"] = json.dumps(stored_data) + + def _content_to_dashboard_object(self, content: str) -> Dict: + return { + "charm": self._charm.meta.name, + "content": content, + "juju_topology": self._juju_topology, + } + + # This is not actually used in the dashboards, but is present to provide a secondary + # salt to ensure uniqueness in the dict keys in case individual charm units provide + # dashboards + @property + def _juju_topology(self) -> Dict: + return { + "model": self._charm.model.name, + "model_uuid": self._charm.model.uuid, + "application": self._charm.app.name, + "unit": self._charm.unit.name, + } + + @property + def dashboard_templates(self) -> List: + """Return a list of the known dashboard templates.""" + return [v for v in self._stored.dashboard_templates.values()] + + +class GrafanaDashboardConsumer(Object): + """A consumer object for working with Grafana Dashboards.""" + + on = GrafanaDashboardEvents() + _stored = StoredState() + + def __init__(self, charm: CharmBase, relation_name: str = DEFAULT_RELATION_NAME) -> None: + """API to receive Grafana dashboards from charmed operators. + + The :class:`GrafanaDashboardConsumer` object provides an API + to consume dashboards provided by a charmed operator using the + :class:`GrafanaDashboardProvider` library. The + :class:`GrafanaDashboardConsumer` is integrated in a + charmed operator as follows: + + self.grafana = GrafanaDashboardConsumer(self) + + To use this library, you need a relation defined as follows in + your charm operator's metadata.yaml: + + requires: + grafana-dashboard: + interface: grafana_dashboard + + If you would like to use a different relation name than + `grafana-dashboard`, you need to specify the relation name via the + `relation_name` argument. However, it is strongly advised not to + change the default, so that people deploying your charm will have + a consistent experience with all other charms that consume Grafana + dashboards. + + Args: + charm: a :class:`CharmBase` object which manages this + :class:`GrafanaProvider` object. Generally this is + `self` in the instantiating class. + relation_name: a :string: name of the relation managed by this + :class:`GrafanaDashboardConsumer`; it defaults to "grafana-dashboard". 
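+
+        A Grafana charm will usually also observe the `dashboards_changed`
+        event emitted by this object, for example (the handler name below is
+        illustrative):
+
+            self.grafana_dashboard_consumer = GrafanaDashboardConsumer(self)
+            self.framework.observe(
+                self.grafana_dashboard_consumer.on.dashboards_changed,
+                self._on_dashboards_changed,
+            )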
+ """ + _validate_relation_by_interface_and_direction( + charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.requires + ) + + super().__init__(charm, relation_name) + self._charm = charm + self._relation_name = relation_name + + self._stored.set_default(dashboards=dict()) + + self.framework.observe( + self._charm.on[self._relation_name].relation_changed, + self._on_grafana_dashboard_relation_changed, + ) + self.framework.observe( + self._charm.on[self._relation_name].relation_broken, + self._on_grafana_dashboard_relation_broken, + ) + + def get_dashboards_from_relation(self, relation_id: int) -> List: + """Get a list of known dashboards for one instance of the monitored relation. + + Args: + relation_id: the identifier of the relation instance, as returned by + :method:`ops.model.Relation.id`. + + Returns: a list of known dashboards coming from the provided relation instance. + """ + return [ + self._to_external_object(relation_id, dashboard) + for dashboard in self._stored.dashboards.get(relation_id, []) + ] + + def _on_grafana_dashboard_relation_changed(self, event: RelationChangedEvent) -> None: + """Handle relation changes in related providers. + + If there are changes in relations between Grafana dashboard consumers + and providers, this event handler (if the unit is the leader) will + get data for an incoming grafana-dashboard relation through a + :class:`GrafanaDashboardsChanged` event, and make the relation data + available in the app's datastore object. The Grafana charm can + then respond to the event to update its configuration. + """ + # TODO Are we sure this is right? It sounds like every Grafana unit + # should create files with the dashboards in its container. + if not self._charm.unit.is_leader(): + return + + self._render_dashboards_and_emit_event(event.relation) + + def update_dashboards(self, relation: Optional[Relation] = None) -> None: + """Re-establish dashboards on one or more relations. + + If something changes between this library and a datasource, try to re-establish + invalid dashboards and invalidate active ones. + + Args: + relation: a specific relation for which the dashboards have to be + updated. If not specified, all relations managed by this + :class:`GrafanaDashboardConsumer` will be updated. + """ + if not self._charm.unit.is_leader(): + return + + relations = [relation] if relation else self._charm.model.relations[self._relation_name] + + for relation in relations: + self._render_dashboards_and_emit_event(relation) + + def _on_grafana_dashboard_relation_broken(self, event: RelationBrokenEvent) -> None: + """Update job config when providers depart. + + When a Grafana dashboard provider departs, the configuration + for that provider is removed from the list of dashboards + """ + if not self._charm.unit.is_leader(): + return + + self._remove_all_dashboards_for_relation(event.relation) + + def _render_dashboards_and_emit_event(self, relation: Relation) -> None: + """Validate a given dashboard. + + Verify that the passed dashboard data is able to be found in our list + of datasources and will render. If they do, let the charm know by + emitting an event. + + Args: + relation: Relation; The relation the dashboard is associated with. 
+ """ + other_app = relation.app + + raw_data = relation.data[other_app].get("dashboards", {}) + + if not raw_data: + logger.warning( + "No dashboard data found in the %s:%s relation", + self._relation_name, + str(relation.id), + ) + return + + data = json.loads(raw_data) + + # The only piece of data needed on this side of the relations is "templates" + templates = data.pop("templates") + + # Import only if a charmed operator uses the consumer, we don't impose these + # dependencies on the client + from jinja2 import Template # type: ignore + from jinja2.exceptions import TemplateSyntaxError # type: ignore + + # The dashboards are WAY too big since this ultimately calls out to Juju to + # set the relation data, and it overflows the maximum argument length for + # subprocess, so we have to use b64, annoyingly. + # Worse, Python3 expects absolutely everything to be a byte, and a plain + # `base64.b64encode()` is still too large, so we have to go through hoops + # of encoding to byte, compressing with lzma, converting to base64 so it + # can be converted to JSON, then all the way back. + + rendered_dashboards = [] + relation_has_invalid_dashboards = False + + for _, (fname, template) in enumerate(templates.items()): + decoded_content = _decode_dashboard_content(template["content"]) + + content = None + error = None + try: + content = Template(decoded_content).render() + content = _encode_dashboard_content(_inject_dashboard_dropdowns(content)) + except TemplateSyntaxError as e: + error = str(e) + relation_has_invalid_dashboards = True + + # Prepend the relation name and ID to the dashboard ID to avoid clashes with + # multiple relations with apps from the same charm, or having dashboards with + # the same ids inside their charm operators + rendered_dashboards.append( + { + "id": "{}:{}/{}".format(relation.name, relation.id, fname), + "original_id": fname, + "content": content if content else None, + "template": template, + "valid": (error is None), + "error": error, + } + ) + + if relation_has_invalid_dashboards: + self._remove_all_dashboards_for_relation(relation) + + invalid_templates = [ + data["original_id"] for data in rendered_dashboards if not data["valid"] + ] + + logger.warning( + "Cannot add one or more Grafana dashboards from relation '{}:{}': the following " + "templates are invalid: {}".format( + relation.name, + relation.id, + invalid_templates, + ) + ) + + relation.data[self._charm.app]["event"] = json.dumps( + { + "errors": [ + { + "dashboard_id": rendered_dashboard["original_id"], + "error": rendered_dashboard["error"], + } + for rendered_dashboard in rendered_dashboards + if rendered_dashboard["error"] + ] + } + ) + + # Dropping dashboards for a relation needs to be signalled + self.on.dashboards_changed.emit() + else: + stored_data = rendered_dashboards + currently_stored_data = self._stored.dashboards.get(relation.id, {}) + + coerced_data = ( + _type_convert_stored(currently_stored_data) if currently_stored_data else {} + ) + + if not coerced_data == stored_data: + self._stored.dashboards[relation.id] = stored_data + self.on.dashboards_changed.emit() + + def _remove_all_dashboards_for_relation(self, relation: Relation) -> None: + """If an errored dashboard is in stored data, remove it and trigger a deletion.""" + if self._stored.dashboards.pop(relation.id, None): + self.on.dashboards_changed.emit() + + def _to_external_object(self, relation_id, dashboard): + return { + "id": dashboard["original_id"], + "relation_id": relation_id, + "charm": dashboard["template"]["charm"], + 
"content": _decode_dashboard_content(dashboard["content"]), + } + + @property + def dashboards(self) -> List[Dict]: + """Get a list of known dashboards across all instances of the monitored relation. + + Returns: a list of known dashboards. The JSON of each of the dashboards is available + in the `content` field of the corresponding `dict`. + """ + dashboards = [] + + for _, (relation_id, dashboards_for_relation) in enumerate( + self._stored.dashboards.items() + ): + for dashboard in dashboards_for_relation: + dashboards.append(self._to_external_object(relation_id, dashboard)) + + return dashboards + + +class GrafanaDashboardAggregator(Object): + """API to retrieve Grafana dashboards from machine dashboards. + + The :class:`GrafanaDashboardAggregator` object provides a way to + collate and aggregate Grafana dashboards from reactive/machine charms + and transport them into Charmed Operators, using Juju topology. + + For detailed usage instructions, see the documentation for + :module:`lma-proxy-operator`, as this class is intended for use as a + single point of intersection rather than use in individual charms. + + Since :class:`GrafanaDashboardAggregator` serves as a bridge between + Canonical Observability Stack Charmed Operators and Reactive Charms, + deployed in a Reactive Juju model, both a target relation which is + used to collect events from Reactive charms and a `grafana_relation` + which is used to send the collected data back to the Canonical + Observability Stack are required. + + In its most streamlined usage, :class:`GrafanaDashboardAggregator` is + integrated in a charmed operator as follows: + + self.grafana = GrafanaDashboardAggregator(self) + + Args: + charm: a :class:`CharmBase` object which manages this + :class:`GrafanaProvider` object. Generally this is + `self` in the instantiating class. + target_relation: a :string: name of a relation managed by this + :class:`GrafanaDashboardAggregator`, which is used to communicate + with reactive/machine charms it defaults to "dashboards". + grafana_relation: a :string: name of a relation used by this + :class:`GrafanaDashboardAggregator`, which is used to communicate + with charmed grafana. 
It defaults to "downstream-grafana-dashboard" + """ + + _stored = StoredState() + on = GrafanaProviderEvents() + + def __init__( + self, + charm: CharmBase, + target_relation: str = "dashboards", + grafana_relation: str = "downstream-grafana-dashboard", + ): + super().__init__(charm, grafana_relation) + self._stored.set_default( + dashboard_templates={}, + id_mappings={}, + ) + + self._charm = charm + self._target_relation = target_relation + self._grafana_relation = grafana_relation + + self.framework.observe( + self._charm.on[self._grafana_relation].relation_joined, + self._update_remote_grafana, + ) + self.framework.observe( + self._charm.on[self._grafana_relation].relation_changed, + self._update_remote_grafana, + ) + self.framework.observe( + self._charm.on[self._target_relation].relation_changed, + self.update_dashboards, + ) + self.framework.observe( + self._charm.on[self._target_relation].relation_broken, + self.remove_dashboards, + ) + + def update_dashboards(self, event: RelationEvent) -> None: + """If we get a dashboard from a reactive charm, parse it out and update.""" + if self._charm.unit.is_leader(): + self._upset_dashboards_on_event(event) + + def _upset_dashboards_on_event(self, event: RelationEvent) -> None: + """Update the dashboards in the relation data bucket.""" + dashboards = self._handle_reactive_dashboards(event) + + if not dashboards: + logger.warning( + "Could not find dashboard data after a relation change for {}".format(event.app) + ) + return + + for id in dashboards: + self._stored.dashboard_templates[id] = self._content_to_dashboard_object( + dashboards[id], event + ) + + self._stored.id_mappings[event.app.name] = dashboards + self._update_remote_grafana(event) + + def _update_remote_grafana(self, _: Optional[RelationEvent] = None) -> None: + """Push dashboards to the downstream Grafana relation.""" + # It's still ridiculous to add a UUID here, but needed + stored_data = { + "templates": _type_convert_stored(self._stored.dashboard_templates), + "uuid": str(uuid.uuid4()), + } + + for grafana_relation in self.model.relations[self._grafana_relation]: + grafana_relation.data[self._charm.app]["dashboards"] = json.dumps(stored_data) + + def remove_dashboards(self, event: RelationBrokenEvent) -> None: + """Remove a dashboard if the relation is broken.""" + app_ids = _type_convert_stored(self._stored.id_mappings[event.app.name]) + + del self._stored.id_mappings[event.app.name] + for id in app_ids: + del self._stored.dashboard_templates[id] + + stored_data = { + "templates": _type_convert_stored(self._stored.dashboard_templates), + "uuid": str(uuid.uuid4()), + } + + for grafana_relation in self.model.relations[self._grafana_relation]: + grafana_relation.data[self._charm.app]["dashboards"] = json.dumps(stored_data) + + # Yes, this has a fair amount of branching. It's not that complex, though + def _strip_existing_datasources(self, template: dict) -> dict: # noqa: C901 + """Remove existing reactive charm datasource templating out. + + This method iterates through *known* places where reactive charms may set + data in contributed dashboards and removes them. + + `dashboard["__inputs"]` is a property sometimes set when exporting dashboards from + the Grafana UI. It is not present in earlier Grafana versions, and can be disabled + in 5.3.4 and above (optionally). If set, any values present will be substituted on + import. Some reactive charms use this for Prometheus. 
LMA2 uses dropdown selectors + for datasources, and leaving this present results in "default" datasource values + which are broken. + + Similarly, `dashboard["templating"]["list"][N]["name"] == "host"` can be used to + set a `host` variable for use in dashboards which is not meaningful in the context + of Juju topology and will yield broken dashboards. + + Further properties may be discovered. + """ + dash = template["dashboard"] + try: + if "list" in dash["templating"]: + for i in range(len(dash["templating"]["list"])): + if ( + "datasource" in dash["templating"]["list"][i] + and "Juju" in dash["templating"]["list"][i]["datasource"] + ): + dash["templating"]["list"][i]["datasource"] = r"${prometheusds}" + if ( + "name" in dash["templating"]["list"][i] + and dash["templating"]["list"][i]["name"] == "host" + ): + dash["templating"]["list"][i] = REACTIVE_CONVERTER + except KeyError: + logger.debug("No existing templating data in dashboard") + + if "__inputs" in dash: + inputs = dash + for i in range(len(dash["__inputs"])): + if dash["__inputs"][i]["pluginName"] == "Prometheus": + del inputs["__inputs"][i] + if inputs: + dash["__inputs"] = inputs["__inputs"] + else: + del dash["__inputs"] + + template["dashboard"] = dash + return template + + def _handle_reactive_dashboards(self, event: RelationEvent) -> Optional[Dict]: + """Look for a dashboard in relation data (during a reactive hook) or builtin by name.""" + templates = [] + id = "" + + # Reactive data can reliably be pulled out of events. In theory, if we got an event, + # it's on the bucket, but using event explicitly keeps the mental model in + # place for reactive + for k in event.relation.data[event.unit].keys(): + if k.startswith("request_"): + templates.append(json.loads(event.relation.data[event.unit][k])["dashboard"]) + + for k in event.relation.data[event.app].keys(): + if k.startswith("request_"): + templates.append(json.loads(event.relation.data[event.app][k])["dashboard"]) + + builtins = self._maybe_get_builtin_dashboards(event) + + if not templates and not builtins: + return {} + + dashboards = {} + for t in templates: + # Replace values with LMA-style templating + t = self._strip_existing_datasources(t) + + # This seems ridiculous, too, but to get it from a "dashboards" key in serialized JSON + # in the bucket back out to the actual "dashboard" we _need_, this is the way + # This is not a mistake -- there's a double nesting in reactive charms, and + # Grafana won't load it. We have to unbox: + # event.relation.data[event.]["request_*"]["dashboard"]["dashboard"], + # and the final unboxing is below. + dash = json.dumps(t["dashboard"]) + + # Replace the old-style datasource templates + dash = re.sub(r"<< datasource >>", r"${prometheusds}", dash) + dash = re.sub(r'"datasource": "prom.*?"', r'"datasource": "${prometheusds}"', dash) + + from jinja2 import Template + + content = _encode_dashboard_content( + Template(dash).render(host=event.unit.name, datasource="prometheus") + ) + id = "prog:{}".format(content[-24:-16]) + + dashboards[id] = content + return {**builtins, **dashboards} + + def _maybe_get_builtin_dashboards(self, event: RelationEvent) -> Dict: + """Tries to match the event with an included dashboard. + + Scans dashboards packed with the charm instantiating this class, and tries to match + one with the event. There is no guarantee that any given event will match a builtin, + since each charm instantiating this class may include a different set of dashboards, + or none. 
+ """ + builtins = {} + dashboards_path = None + + try: + dashboards_path = _resolve_dir_against_charm_path( + self._charm, "src/grafana_dashboards" + ) + except InvalidDirectoryPathError as e: + logger.warning( + "Invalid Grafana dashboards folder at %s: %s", + e.grafana_dashboards_absolute_path, + e.message, + ) + + if dashboards_path: + for path in filter(Path.is_file, Path(dashboards_path).glob("*.tmpl")): + if event.app.name in path.name: + id = "file:{}".format(path.stem) + builtins[id] = self._content_to_dashboard_object( + _encode_dashboard_content(path.read_bytes()), event + ) + + return builtins + + def _content_to_dashboard_object(self, content: str, event: RelationEvent) -> Dict: + return { + "charm": event.app.name, + "content": content, + "juju_topology": self._juju_topology(event), + } + + # This is not actually used in the dashboards, but is present to provide a secondary + # salt to ensure uniqueness in the dict keys in case individual charm units provide + # dashboards + def _juju_topology(self, event: RelationEvent) -> Dict: + return { + "model": self._charm.model.name, + "model_uuid": self._charm.model.uuid, + "application": event.app.name, + "unit": event.unit.name, + } diff --git a/lib/charms/prometheus_k8s/v0/prometheus_scrape.py b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py new file mode 100644 index 0000000..994b430 --- /dev/null +++ b/lib/charms/prometheus_k8s/v0/prometheus_scrape.py @@ -0,0 +1,2261 @@ +# Copyright 2021 Canonical Ltd. +# See LICENSE file for licensing details. +"""## Overview. + +This document explains how to integrate with the Prometheus charm +for the purpose of providing a metrics endpoint to Prometheus. It +also explains how alternative implementations of the Prometheus charms +may maintain the same interface and be backward compatible with all +currently integrated charms. Finally this document is the +authoritative reference on the structure of relation data that is +shared between Prometheus charms and any other charm that intends to +provide a scrape target for Prometheus. + +## Provider Library Usage + +This Prometheus charm interacts with its scrape targets using its +charm library. Charms seeking to expose metric endpoints for the +Prometheus charm, must do so using the `MetricsEndpointProvider` +object from this charm library. For the simplest use cases, using the +`MetricsEndpointProvider` object only requires instantiating it, +typically in the constructor of your charm (the one which exposes a +metrics endpoint). The `MetricsEndpointProvider` constructor requires +the name of the relation over which a scrape target (metrics endpoint) +is exposed to the Prometheus charm. This relation must use the +`prometheus_scrape` interface. By default address of the metrics +endpoint is set to the unit IP address, by each unit of the +`MetricsEndpointProvider` charm. These units set their address in +response to the `PebbleReady` event of each container in the unit, +since container restarts of Kubernetes charms can result in change of +IP addresses. The default name for the metrics endpoint relation is +`metrics-endpoint`. It is strongly recommended to use the same +relation name for consistency across charms and doing so obviates the +need for an additional constructor argument. The +`MetricsEndpointProvider` object may be instantiated as follows + + from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider + + def __init__(self, *args): + super().__init__(*args) + ... 
+ self.metrics_endpoint = MetricsEndpointProvider(self) + ... + +Note that the first argument (`self`) to `MetricsEndpointProvider` is +always a reference to the parent (scrape target) charm. + +An instantiated `MetricsEndpointProvider` object will ensure that each +unit of its parent charm, is a scrape target for the +`MetricsEndpointConsumer` (Prometheus) charm. By default +`MetricsEndpointProvider` assumes each unit of the consumer charm +exports its metrics at a path given by `/metrics` on port 80. These +defaults may be changed by providing the `MetricsEndpointProvider` +constructor an optional argument (`jobs`) that represents a +Prometheus scrape job specification using Python standard data +structures. This job specification is a subset of Prometheus' own +[scrape +configuration](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) +format but represented using Python data structures. More than one job +may be provided using the `jobs` argument. Hence `jobs` accepts a list +of dictionaries where each dictionary represents one `` +object as described in the Prometheus documentation. The currently +supported configuration subset is: `job_name`, `metrics_path`, +`static_configs` + +Suppose it is required to change the port on which scraped metrics are +exposed to 8000. This may be done by providing the following data +structure as the value of `jobs`. + +``` +[ + { + "static_configs": [ + { + "targets": ["*:8000"] + } + ] + } +] +``` + +The wildcard ("*") host specification implies that the scrape targets +will automatically be set to the host addresses advertised by each +unit of the consumer charm. + +It is also possible to change the metrics path and scrape multiple +ports, for example + +``` +[ + { + "metrics_path": "/my-metrics-path", + "static_configs": [ + { + "targets": ["*:8000", "*:8081"], + } + ] + } +] +``` + +More complex scrape configurations are possible. For example + +``` +[ + { + "static_configs": [ + { + "targets": ["10.1.32.215:7000", "*:8000"], + "labels": { + "some-key": "some-value" + } + } + ] + } +] +``` + +This example scrapes the target "10.1.32.215" at port 7000 in addition +to scraping each unit at port 8000. There is however one difference +between wildcard targets (specified using "*") and fully qualified +targets (such as "10.1.32.215"). The Prometheus charm automatically +associates labels with metrics generated by each target. These labels +localise the source of metrics within the Juju topology by specifying +its "model name", "model UUID", "application name" and "unit +name". However unit name is associated only with wildcard targets but +not with fully qualified targets. + +Multiple jobs with different metrics paths and labels are allowed, but +each job must be given a unique name. For example + +``` +[ + { + "job_name": "my-first-job", + "metrics_path": "one-path", + "static_configs": [ + { + "targets": ["*:7000"], + "labels": { + "some-key": "some-value" + } + } + ] + }, + { + "job_name": "my-second-job", + "metrics_path": "another-path", + "static_configs": [ + { + "targets": ["*:8000"], + "labels": { + "some-other-key": "some-other-value" + } + } + ] + } +] +``` + +It is also possible to configure other scrape related parameters using +these job specifications as described by the Prometheus +[documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config). 
+The permissible subset of job specific scrape configuration parameters +supported in a `MetricsEndpointProvider` job specification are: + +- `job_name` +- `metrics_path` +- `static_configs` +- `scrape_interval` +- `scrape_timeout` +- `proxy_url` +- `relabel_configs` +- `metrics_relabel_configs` +- `sample_limit` +- `label_limit` +- `label_name_length_limit` +- `label_value_length_limit` + +## Consumer Library Usage + +The `MetricsEndpointConsumer` object may be used by Prometheus +charms to manage relations with their scrape targets. For this +purposes a Prometheus charm needs to do two things + +1. Instantiate the `MetricsEndpointConsumer` object by providing it a +reference to the parent (Prometheus) charm and optionally the name of +the relation that the Prometheus charm uses to interact with scrape +targets. This relation must confirm to the `prometheus_scrape` +interface and it is strongly recommended that this relation be named +`metrics-endpoint` which is its default value. + +For example a Prometheus charm may instantiate the +`MetricsEndpointConsumer` in its constructor as follows + + from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointConsumer + + def __init__(self, *args): + super().__init__(*args) + ... + self.metrics_consumer = MetricsEndpointConsumer(self) + ... + +2. A Prometheus charm also needs to respond to the +`TargetsChangedEvent` event of the `MetricsEndpointConsumer` by adding itself as +an observer for these events, as in + + self.framework.observe( + self.metrics_consumer.on.targets_changed, + self._on_scrape_targets_changed, + ) + +In responding to the `TargetsChangedEvent` event the Prometheus +charm must update the Prometheus configuration so that any new scrape +targets are added and/or old ones removed from the list of scraped +endpoints. For this purpose the `MetricsEndpointConsumer` object +exposes a `jobs()` method that returns a list of scrape jobs. Each +element of this list is the Prometheus scrape configuration for that +job. In order to update the Prometheus configuration, the Prometheus +charm needs to replace the current list of jobs with the list provided +by `jobs()` as follows + + def _on_scrape_targets_changed(self, event): + ... + scrape_jobs = self.metrics_consumer.jobs() + for job in scrape_jobs: + prometheus_scrape_config.append(job) + ... + +## Alerting Rules + +This charm library also supports gathering alerting rules from all +related `MetricsEndpointProvider` charms and enabling corresponding alerts within the +Prometheus charm. Alert rules are automatically gathered by `MetricsEndpointProvider` +charms when using this library, from a directory conventionally named +`prometheus_alert_rules`. This directory must reside at the top level +in the `src` folder of the consumer charm. Each file in this directory +is assumed to be in one of two formats: +- the official prometheus alert rule format, conforming to the +[Prometheus docs](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) +- a single rule format, which is a simplified subset of the official format, +comprising a single alert rule per file, using the same YAML fields. + +The file name must have the `.rule` extension. + +An example of the contents of such a file in the custom single rule +format is shown below. + +``` +alert: HighRequestLatency +expr: job:request_latency_seconds:mean5m{my_key=my_value} > 0.5 +for: 10m +labels: + severity: Medium + type: HighLatency +annotations: + summary: High request latency for {{ $labels.instance }}. 
+``` + +The `MetricsEndpointProvider` will read all available alert rules and +also inject "filtering labels" into the alert expressions. The +filtering labels ensure that alert rules are localised to the metrics +provider charm's Juju topology (application, model and its UUID). Such +a topology filter is essential to ensure that alert rules submitted by +one provider charm generates alerts only for that same charm. When +alert rules are embedded in a charm, and the charm is deployed as a +Juju application, the alert rules from that application have their +expressions automatically updated to filter for metrics coming from +the units of that application alone. This remove risk of spurious +evaluation, e.g., when you have multiple deployments of the same charm +monitored by the same Prometheus. + +Not all alerts one may want to specify can be embedded in a +charm. Some alert rules will be specific to a user's use case. This is +the case, for example, of alert rules that are based on business +constraints, like expecting a certain amount of requests to a specific +API every five minutes. Such alert rules can be specified via the +[COS Config Charm](https://charmhub.io/cos-configuration-k8s), +which allows importing alert rules and other settings like dashboards +from a Git repository. + +Gathering alert rules and generating rule files within the Prometheus +charm is easily done using the `alerts()` method of +`MetricsEndpointConsumer`. Alerts generated by Prometheus will +automatically include Juju topology labels in the alerts. These labels +indicate the source of the alert. The following labels are +automatically included with each alert + +- `juju_model` +- `juju_model_uuid` +- `juju_application` + +## Relation Data + +The Prometheus charm uses both application and unit relation data to +obtain information regarding its scrape jobs, alert rules and scrape +targets. This relation data is in JSON format and it closely resembles +the YAML structure of Prometheus [scrape configuration] +(https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config). + +Units of Metrics provider charms advertise their names and addresses +over unit relation data using the `prometheus_scrape_unit_name` and +`prometheus_scrape_unit_address` keys. While the `scrape_metadata`, +`scrape_jobs` and `alert_rules` keys in application relation data +of Metrics provider charms hold eponymous information. 
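+
+As a rough illustration (the unit name and address below are placeholders),
+the unit relation data published by a single provider unit might resemble:
+
+```
+prometheus_scrape_unit_name: "my-app/0"
+prometheus_scrape_unit_address: "10.1.32.215"
+```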
+ +""" + +import json +import logging +import os +import platform +import subprocess +from collections import OrderedDict +from pathlib import Path +from typing import Dict, List, Optional, Union + +import yaml +from ops.charm import CharmBase, RelationRole +from ops.framework import EventBase, EventSource, Object, ObjectEvents + +# The unique Charmhub library identifier, never change it +from ops.model import ModelError + +LIBID = "bc84295fef5f4049878f07b131968ee2" + +# Increment this major API version when introducing breaking changes +LIBAPI = 0 + +# Increment this PATCH version before using `charmcraft publish-lib` or reset +# to 0 if you are raising the major API version +LIBPATCH = 17 + +logger = logging.getLogger(__name__) + + +ALLOWED_KEYS = { + "job_name", + "metrics_path", + "static_configs", + "scrape_interval", + "scrape_timeout", + "proxy_url", + "relabel_configs", + "metrics_relabel_configs", + "sample_limit", + "label_limit", + "label_name_length_limit", + "label_value_lenght_limit", +} +DEFAULT_JOB = { + "metrics_path": "/metrics", + "static_configs": [{"targets": ["*:80"]}], +} + + +DEFAULT_RELATION_NAME = "metrics-endpoint" +RELATION_INTERFACE_NAME = "prometheus_scrape" + +DEFAULT_ALERT_RULES_RELATIVE_PATH = "./src/prometheus_alert_rules" + + +class RelationNotFoundError(Exception): + """Raised if there is no relation with the given name is found.""" + + def __init__(self, relation_name: str): + self.relation_name = relation_name + self.message = "No relation named '{}' found".format(relation_name) + + super().__init__(self.message) + + +class RelationInterfaceMismatchError(Exception): + """Raised if the relation with the given name has a different interface.""" + + def __init__( + self, + relation_name: str, + expected_relation_interface: str, + actual_relation_interface: str, + ): + self.relation_name = relation_name + self.expected_relation_interface = expected_relation_interface + self.actual_relation_interface = actual_relation_interface + self.message = ( + "The '{}' relation has '{}' as interface rather than the expected '{}'".format( + relation_name, actual_relation_interface, expected_relation_interface + ) + ) + + super().__init__(self.message) + + +class RelationRoleMismatchError(Exception): + """Raised if the relation with the given name has a different role.""" + + def __init__( + self, + relation_name: str, + expected_relation_role: RelationRole, + actual_relation_role: RelationRole, + ): + self.relation_name = relation_name + self.expected_relation_interface = expected_relation_role + self.actual_relation_role = actual_relation_role + self.message = "The '{}' relation has role '{}' rather than the expected '{}'".format( + relation_name, repr(actual_relation_role), repr(expected_relation_role) + ) + + super().__init__(self.message) + + +def _validate_relation_by_interface_and_direction( + charm: CharmBase, + relation_name: str, + expected_relation_interface: str, + expected_relation_role: RelationRole, +): + """Verifies that a relation has the necessary characteristics. + + Verifies that the `relation_name` provided: (1) exists in metadata.yaml, + (2) declares as interface the interface name passed as `relation_interface` + and (3) has the right "direction", i.e., it is a relation that `charm` + provides or requires. + + Args: + charm: a `CharmBase` object to scan for the matching relation. + relation_name: the name of the relation to be verified. + expected_relation_interface: the interface name to be matched by the + relation named `relation_name`. 
+ expected_relation_role: whether the `relation_name` must be either + provided or required by `charm`. + + Raises: + RelationNotFoundError: If there is no relation in the charm's metadata.yaml + with the same name as provided via `relation_name` argument. + RelationInterfaceMismatchError: The relation with the same name as provided + via `relation_name` argument does not have the same relation interface + as specified via the `expected_relation_interface` argument. + RelationRoleMismatchError: If the relation with the same name as provided + via `relation_name` argument does not have the same role as specified + via the `expected_relation_role` argument. + """ + if relation_name not in charm.meta.relations: + raise RelationNotFoundError(relation_name) + + relation = charm.meta.relations[relation_name] + + actual_relation_interface = relation.interface_name + if actual_relation_interface != expected_relation_interface: + raise RelationInterfaceMismatchError( + relation_name, expected_relation_interface, actual_relation_interface + ) + + if expected_relation_role == RelationRole.provides: + if relation_name not in charm.meta.provides: + raise RelationRoleMismatchError( + relation_name, RelationRole.provides, RelationRole.requires + ) + elif expected_relation_role == RelationRole.requires: + if relation_name not in charm.meta.requires: + raise RelationRoleMismatchError( + relation_name, RelationRole.requires, RelationRole.provides + ) + else: + raise Exception("Unexpected RelationDirection: {}".format(expected_relation_role)) + + +def _sanitize_scrape_configuration(job) -> dict: + """Restrict permissible scrape configuration options. + + If job is empty then a default job is returned. The + default job is + + ``` + { + "metrics_path": "/metrics", + "static_configs": [{"targets": ["*:80"]}], + } + ``` + + Args: + job: a dict containing a single Prometheus job + specification. + + Returns: + a dictionary containing a sanitized job specification. + """ + sanitized_job = DEFAULT_JOB.copy() + sanitized_job.update({key: value for key, value in job.items() if key in ALLOWED_KEYS}) + return sanitized_job + + +class JujuTopology: + """Class for storing and formatting juju topology information.""" + + STUB = "%%juju_topology%%" + + def __new__(cls, *args, **kwargs): + """Reject instantiation of a base JujuTopology class. Children only.""" + if cls is JujuTopology: + raise TypeError("only children of '{}' may be instantiated".format(cls.__name__)) + return object.__new__(cls) + + def __init__( + self, + model: str, + model_uuid: str, + application: str, + unit: Optional[str] = "", + charm_name: Optional[str] = "", + ): + """Build a JujuTopology object. + + A `JujuTopology` object is used for storing and transforming + Juju Topology information. This information is used to + annotate Prometheus scrape jobs and alert rules. Such + annotation when applied to scrape jobs helps in identifying + the source of the scrapped metrics. On the other hand when + applied to alert rules topology information ensures that + evaluation of alert expressions is restricted to the source + (charm) from which the alert rules were obtained. + + Args: + model: a string name of the Juju model + model_uuid: a globally unique string identifier for the Juju model + application: an application name as a string + unit: a unit name as a string + charm_name: name of charm as a string + + Note: + `JujuTopology` should not be constructed directly by charm code. Please + use `ProviderTopology` or `AggregatorTopology`. 
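+
+        For example (all values below are purely illustrative), the topology of
+        a unit "cassandra/0" in a model named "lma", rendered through the
+        `promql_labels` property, could look like:
+
+            juju_model="lma", juju_model_uuid="91a2b34c-...",
+            juju_application="cassandra", juju_charm="cassandra-k8s"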
+ """ + self.model = model + self.model_uuid = model_uuid + self.application = application + self.charm_name = charm_name + self.unit = unit + + @classmethod + def from_charm(cls, charm): + """Factory method for creating `JujuTopology` children from a given charm. + + Args: + charm: a `CharmBase` object for which the `JujuTopology` has to be constructed + + Returns: + a `JujuTopology` object. + """ + return cls( + model=charm.model.name, + model_uuid=charm.model.uuid, + application=charm.model.app.name, + unit=charm.model.unit.name, + charm_name=charm.meta.name, + ) + + @classmethod + def from_relation_data(cls, data: dict): + """Factory method for creating `JujuTopology` children from a dictionary. + + Args: + data: a dictionary with four keys providing topology information. The keys are + - "model" + - "model_uuid" + - "application" + - "unit" + - "charm_name" + + `unit` and `charm_name` may be empty, but will result in more limited + labels. However, this allows us to support payload-only charms. + + Returns: + a `JujuTopology` object. + """ + return cls( + model=data["model"], + model_uuid=data["model_uuid"], + application=data["application"], + unit=data.get("unit", ""), + charm_name=data.get("charm_name", ""), + ) + + @property + def identifier(self) -> str: + """Format the topology information into a terse string.""" + # This is odd, but may have `None` as a model key + return "_".join([str(val) for val in self.as_promql_label_dict().values()]).replace( + "/", "_" + ) + + @property + def promql_labels(self) -> str: + """Format the topology information into a verbose string.""" + return ", ".join( + ['{}="{}"'.format(key, value) for key, value in self.as_promql_label_dict().items()] + ) + + def as_dict(self, rename_keys: Optional[Dict[str, str]] = None) -> OrderedDict: + """Format the topology information into a dict. + + Use an OrderedDict so we can rely on the insertion order on Python 3.5 (and 3.6, + which still does not guarantee it). + + Args: + rename_keys: A dictionary mapping old key names to new key names, which will + be substituted when invoked. + """ + ret = OrderedDict( + [ + ("model", self.model), + ("model_uuid", self.model_uuid), + ("application", self.application), + ("unit", self.unit), + ("charm_name", self.charm_name), + ] + ) + + ret["unit"] or ret.pop("unit") + ret["charm_name"] or ret.pop("charm_name") + + # If a key exists in `rename_keys`, replace the value + if rename_keys: + ret = OrderedDict( + (rename_keys.get(k), v) if rename_keys.get(k) else (k, v) for k, v in ret.items() # type: ignore + ) + + return ret + + def as_promql_label_dict(self): + """Format the topology information into a dict with keys having 'juju_' as prefix.""" + vals = { + "juju_{}".format(key): val + for key, val in self.as_dict(rename_keys={"charm_name": "charm"}).items() + } + # The leader is the only unit that sets alert rules, if "juju_unit" is present, + # then the rules will only be evaluated for that unit + if "juju_unit" in vals: + vals.pop("juju_unit") + + return vals + + def render(self, template: str): + """Render a juju-topology template string with topology info.""" + return template.replace(JujuTopology.STUB, self.promql_labels) + + +class AggregatorTopology(JujuTopology): + """Class for initializing topology information for MetricsEndpointAggregator.""" + + @classmethod + def create(cls, model: str, model_uuid: str, application: str, unit: str): + """Factory method for creating the `AggregatorTopology` dataclass from a given charm. 
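+
+        A sketch of how `MetricsEndpointAggregator` typically uses this factory
+        (the application and unit names are illustrative):
+
+            topology = AggregatorTopology.create(
+                self.model.name, self.model.uuid, "target-app", "target-app/0"
+            )
+            labels = topology.as_promql_label_dict()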
+ + Args: + model: a string representing the model + model_uuid: the model UUID as a string + application: the application name + unit: the unit name + + Returns: + a `AggregatorTopology` object. + """ + return cls( + model=model, + model_uuid=model_uuid, + application=application, + unit=unit, + ) + + def as_promql_label_dict(self): + """Format the topology information into a dict with keys having 'juju_' as prefix.""" + vals = {"juju_{}".format(key): val for key, val in self.as_dict().items()} + + # FIXME: Why is this different? I have no idea. The uuid length should be the same + vals["juju_model_uuid"] = vals["juju_model_uuid"][:7] + + return vals + + +class ProviderTopology(JujuTopology): + """Class for initializing topology information for MetricsEndpointProvider.""" + + @property + def scrape_identifier(self): + """Format the topology information into a scrape identifier.""" + # This is used only by Metrics[Consumer|Provider] and does not need a + # unit name, so only check for the charm name + return "juju_{}_prometheus_scrape".format(self.identifier) + + +class InvalidAlertRulePathError(Exception): + """Raised if the alert rules folder cannot be found or is otherwise invalid.""" + + def __init__( + self, + alert_rules_absolute_path: Path, + message: str, + ): + self.alert_rules_absolute_path = alert_rules_absolute_path + self.message = message + + super().__init__(self.message) + + +def _is_official_alert_rule_format(rules_dict: dict) -> bool: + """Are alert rules in the upstream format as supported by Prometheus. + + Alert rules in dictionary format are in "official" form if they + contain a "groups" key, since this implies they contain a list of + alert rule groups. + + Args: + rules_dict: a set of alert rules in Python dictionary format + + Returns: + True if alert rules are in official Prometheus file format. + """ + return "groups" in rules_dict + + +def _is_single_alert_rule_format(rules_dict: dict) -> bool: + """Are alert rules in single rule format. + + The Prometheus charm library supports reading of alert rules in a + custom format that consists of a single alert rule per file. This + does not conform to the official Prometheus alert rule file format + which requires that each alert rules file consists of a list of + alert rule groups and each group consists of a list of alert + rules. + + Alert rules in dictionary form are considered to be in single rule + format if in the least it contains two keys corresponding to the + alert rule name and alert expression. + + Returns: + True if alert rule is in single rule file format. + """ + # one alert rule per file + return set(rules_dict) >= {"alert", "expr"} + + +class AlertRules: + """Utility class for amalgamating prometheus alert rule files and injecting juju topology. + + An `AlertRules` object supports aggregating alert rules from files and directories in both + official and single rule file formats using the `add_path()` method. All the alert rules + read are annotated with Juju topology labels and amalgamated into a single data structure + in the form of a Python dictionary using the `as_dict()` method. Such a dictionary can be + easily dumped into JSON format and exchanged over relation data. The dictionary can also + be dumped into YAML format and written directly into an alert rules file that is read by + Prometheus. Note that multiple `AlertRules` objects must not be written into the same file, + since Prometheus allows only a single list of alert rule groups per alert rules file. 
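+
+    A minimal usage sketch (the path shown is the library default and may
+    differ in your charm):
+
+        alert_rules = AlertRules(topology=ProviderTopology.from_charm(self))
+        alert_rules.add_path("./src/prometheus_alert_rules", recursive=True)
+        rules_dict = alert_rules.as_dict()  # {"groups": [...]}, or {} if no rules were found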
+ + The official Prometheus format is a YAML file conforming to the Prometheus documentation + (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/). + The custom single rule format is a subsection of the official YAML, having a single alert + rule, effectively "one alert per file". + """ + + # This class uses the following terminology for the various parts of a rule file: + # - alert rules file: the entire groups[] yaml, including the "groups:" key. + # - alert groups (plural): the list of groups[] (a list, i.e. no "groups:" key) - it is a list + # of dictionaries that have the "name" and "rules" keys. + # - alert group (singular): a single dictionary that has the "name" and "rules" keys. + # - alert rules (plural): all the alerts in a given alert group - a list of dictionaries with + # the "alert" and "expr" keys. + # - alert rule (singular): a single dictionary that has the "alert" and "expr" keys. + + def __init__(self, topology: Optional[JujuTopology] = None): + """Build and alert rule object. + + Args: + topology: an optional `JujuTopology` instance that is used to annotate all alert rules. + """ + self.topology = topology + self.alert_groups = [] # type: List[dict] + + def _from_file(self, root_path: Path, file_path: Path) -> List[dict]: + """Read a rules file from path, injecting juju topology. + + Args: + root_path: full path to the root rules folder (used only for generating group name) + file_path: full path to a *.rule file. + + Returns: + A list of dictionaries representing the rules file, if file is valid (the structure is + formed by `yaml.safe_load` of the file); an empty list otherwise. + """ + with file_path.open() as rf: + # Load a list of rules from file then add labels and filters + try: + rule_file = yaml.safe_load(rf) + + except Exception as e: + logger.error("Failed to read alert rules from %s: %s", file_path.name, e) + return [] + + if _is_official_alert_rule_format(rule_file): + alert_groups = rule_file["groups"] + elif _is_single_alert_rule_format(rule_file): + # convert to list of alert groups + # group name is made up from the file name + alert_groups = [{"name": file_path.stem, "rules": [rule_file]}] + else: + # invalid/unsupported + logger.error("Invalid rules file: %s", file_path.name) + return [] + + # update rules with additional metadata + for alert_group in alert_groups: + # update group name with topology and sub-path + alert_group["name"] = self._group_name( + str(root_path), + str(file_path), + alert_group["name"], + ) + + # add "juju_" topology labels + for alert_rule in alert_group["rules"]: + if "labels" not in alert_rule: + alert_rule["labels"] = {} + + if self.topology: + alert_rule["labels"].update(self.topology.as_promql_label_dict()) + # insert juju topology filters into a prometheus alert rule + alert_rule["expr"] = self.topology.render(alert_rule["expr"]) + + return alert_groups + + def _group_name(self, root_path: str, file_path: str, group_name: str) -> str: + """Generate group name from path and topology. + + The group name is made up of the relative path between the root dir_path, the file path, + and topology identifier. + + Args: + root_path: path to the root rules dir. + file_path: path to rule file. + group_name: original group name to keep as part of the new augmented group name + + Returns: + New group name, augmented by juju topology and relative path. + """ + rel_path = os.path.relpath(os.path.dirname(file_path), root_path) + rel_path = "" if rel_path == "." 
else rel_path.replace(os.path.sep, "_") + + # Generate group name: + # - name, from juju topology + # - suffix, from the relative path of the rule file; + group_name_parts = [self.topology.identifier] if self.topology else [] + group_name_parts.extend([rel_path, group_name, "alerts"]) + # filter to remove empty strings + return "_".join(filter(None, group_name_parts)) + + @classmethod + def _multi_suffix_glob( + cls, dir_path: Path, suffixes: List[str], recursive: bool = True + ) -> list: + """Helper function for getting all files in a directory that have a matching suffix. + + Args: + dir_path: path to the directory to glob from. + suffixes: list of suffixes to include in the glob (items should begin with a period). + recursive: a flag indicating whether a glob is recursive (nested) or not. + + Returns: + List of files in `dir_path` that have one of the suffixes specified in `suffixes`. + """ + all_files_in_dir = dir_path.glob("**/*" if recursive else "*") + return list(filter(lambda f: f.is_file() and f.suffix in suffixes, all_files_in_dir)) + + def _from_dir(self, dir_path: Path, recursive: bool) -> List[dict]: + """Read all rule files in a directory. + + All rules from files for the same directory are loaded into a single + group. The generated name of this group includes juju topology. + By default, only the top directory is scanned; for nested scanning, pass `recursive=True`. + + Args: + dir_path: directory containing *.rule files (alert rules without groups). + recursive: flag indicating whether to scan for rule files recursively. + + Returns: + a list of dictionaries representing prometheus alert rule groups, each dictionary + representing an alert group (structure determined by `yaml.safe_load`). + """ + alert_groups = [] # type: List[dict] + + # Gather all alerts into a list of groups + for file_path in self._multi_suffix_glob(dir_path, [".rule", ".rules"], recursive): + alert_groups_from_file = self._from_file(dir_path, file_path) + if alert_groups_from_file: + logger.debug("Reading alert rule from %s", file_path) + alert_groups.extend(alert_groups_from_file) + + return alert_groups + + def add_path(self, path: str, *, recursive: bool = False) -> None: + """Add rules from a dir path. + + All rules from files are aggregated into a data structure representing a single rule file. + All group names are augmented with juju topology. + + Args: + path: either a rules file or a dir of rules files. + recursive: whether to read files recursively or not (no impact if `path` is a file). + + Returns: + True if path was added else False. + """ + path = Path(path) # type: Path + if path.is_dir(): + self.alert_groups.extend(self._from_dir(path, recursive)) + elif path.is_file(): + self.alert_groups.extend(self._from_file(path.parent, path)) + else: + logger.warning("path does not exist: %s", path) + + def as_dict(self) -> dict: + """Return standard alert rules file in dict representation. + + Returns: + a dictionary containing a single list of alert rule groups. + The list of alert rule groups is provided as value of the + "groups" dictionary key. 
+ """ + return {"groups": self.alert_groups} if self.alert_groups else {} + + +class TargetsChangedEvent(EventBase): + """Event emitted when Prometheus scrape targets change.""" + + def __init__(self, handle, relation_id): + super().__init__(handle) + self.relation_id = relation_id + + def snapshot(self): + """Save scrape target relation information.""" + return {"relation_id": self.relation_id} + + def restore(self, snapshot): + """Restore scrape target relation information.""" + self.relation_id = snapshot["relation_id"] + + +class MonitoringEvents(ObjectEvents): + """Event descriptor for events raised by `MetricsEndpointConsumer`.""" + + targets_changed = EventSource(TargetsChangedEvent) + + +class MetricsEndpointConsumer(Object): + """A Prometheus based Monitoring service.""" + + on = MonitoringEvents() + + def __init__(self, charm: CharmBase, relation_name: str = DEFAULT_RELATION_NAME): + """A Prometheus based Monitoring service. + + Args: + charm: a `CharmBase` instance that manages this + instance of the Prometheus service. + relation_name: an optional string name of the relation between `charm` + and the Prometheus charmed service. The default is "metrics-endpoint". + It is strongly advised not to change the default, so that people + deploying your charm will have a consistent experience with all + other charms that consume metrics endpoints. + + Raises: + RelationNotFoundError: If there is no relation in the charm's metadata.yaml + with the same name as provided via `relation_name` argument. + RelationInterfaceMismatchError: The relation with the same name as provided + via `relation_name` argument does not have the `prometheus_scrape` relation + interface. + RelationRoleMismatchError: If the relation with the same name as provided + via `relation_name` argument does not have the `RelationRole.requires` + role. + """ + _validate_relation_by_interface_and_direction( + charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.requires + ) + + super().__init__(charm, relation_name) + self._charm = charm + self._relation_name = relation_name + self._transformer = PromqlTransformer(self._charm) + events = self._charm.on[relation_name] + self.framework.observe(events.relation_changed, self._on_metrics_provider_relation_changed) + self.framework.observe( + events.relation_departed, self._on_metrics_provider_relation_departed + ) + + def _on_metrics_provider_relation_changed(self, event): + """Handle changes with related metrics providers. + + Anytime there are changes in relations between Prometheus + and metrics provider charms the Prometheus charm is informed, + through a `TargetsChangedEvent` event. The Prometheus charm can + then choose to update its scrape configuration. + + Args: + event: a `CharmEvent` in response to which the Prometheus + charm must update its scrape configuration. + """ + rel_id = event.relation.id + + self.on.targets_changed.emit(relation_id=rel_id) + + def _on_metrics_provider_relation_departed(self, event): + """Update job config when a metrics provider departs. + + When a metrics provider departs the Prometheus charm is informed + through a `TargetsChangedEvent` event so that it can update its + scrape configuration to ensure that the departed metrics provider + is removed from the list of scrape jobs and + + Args: + event: a `CharmEvent` that indicates a metrics provider + unit has departed. + """ + rel_id = event.relation.id + self.on.targets_changed.emit(relation_id=rel_id) + + def jobs(self) -> list: + """Fetch the list of scrape jobs. 
+ + Returns: + A list consisting of all the static scrape configurations + for each related `MetricsEndpointProvider` that has specified + its scrape targets. + """ + scrape_jobs = [] + + for relation in self._charm.model.relations[self._relation_name]: + static_scrape_jobs = self._static_scrape_config(relation) + if static_scrape_jobs: + scrape_jobs.extend(static_scrape_jobs) + + return scrape_jobs + + def alerts(self) -> dict: + """Fetch alerts for all relations. + + A Prometheus alert rules file consists of a list of "groups". Each + group consists of a list of alerts (`rules`) that are sequentially + executed. This method returns all the alert rules provided by each + related metrics provider charm. These rules may be used to generate a + separate alert rules file for each relation since the returned list + of alert groups are indexed by that relations Juju topology identifier. + The Juju topology identifier string includes substrings that identify + alert rule related metadata such as the Juju model, model UUID and the + application name from where the alert rule originates. Since this + topology identifier is globally unique, it may be used for instance as + the name for the file into which the list of alert rule groups are + written. For each relation, the structure of data returned is a dictionary + representation of a standard prometheus rules file: + + {"groups": [{"name": ...}, ...]} + + per official prometheus documentation + https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/ + + The value of the `groups` key is such that it may be used to generate + a Prometheus alert rules file directly using `yaml.dump` but the + `groups` key itself must be included as this is required by Prometheus. + + For example the list of alert rule groups returned by this method may + be written into files consumed by Prometheus as follows + + ``` + for topology_identifier, alert_rule_groups in self.metrics_consumer.alerts().items(): + filename = "juju_" + topology_identifier + ".rules" + path = os.path.join(PROMETHEUS_RULES_DIR, filename) + rules = yaml.dump(alert_rule_groups) + container.push(path, rules, make_dirs=True) + ``` + + Returns: + A dictionary mapping the Juju topology identifier of the source charm to + its list of alert rule groups. + """ + alerts = {} # type: Dict[str, dict] # mapping b/w juju identifiers and alert rule files + for relation in self._charm.model.relations[self._relation_name]: + if not relation.units: + continue + + alert_rules = json.loads(relation.data[relation.app].get("alert_rules", "{}")) + if not alert_rules: + continue + + identifier = None + try: + scrape_metadata = json.loads(relation.data[relation.app]["scrape_metadata"]) + identifier = ProviderTopology.from_relation_data(scrape_metadata).identifier + alerts[identifier] = self._transformer.apply_label_matchers(alert_rules) + + except KeyError as e: + logger.debug( + "Relation %s has no 'scrape_metadata': %s", + relation.id, + e, + ) + identifier = self._get_identifier_by_alert_rules(alert_rules) + + if not identifier: + logger.error( + "Alert rules were found but no usable group or identifier was present" + ) + continue + alerts[identifier] = alert_rules + + return alerts + + def _get_identifier_by_alert_rules(self, rules: dict) -> Union[str, None]: + """Determine an appropriate dict key for alert rules. + + The key is used as the filename when writing alerts to disk, so the structure + and uniqueness is important. 
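+
+        When topology labels are present on the rules, the identifier takes
+        roughly the following form (a sketch; it falls back to the alert group
+        name otherwise):
+
+            "{}_{}_{}".format(juju_model, juju_model_uuid, juju_application)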
+ + Args: + rules: a dict of alert rules + """ + if "groups" not in rules: + logger.warning("No alert groups were found in relation data") + return None + + # Construct an ID based on what's in the alert rules if they have labels + for group in rules["groups"]: + try: + labels = group["rules"][0]["labels"] + identifier = "{}_{}_{}".format( + labels["juju_model"], + labels["juju_model_uuid"], + labels["juju_application"], + ) + return identifier + except KeyError: + logger.debug("Alert rules were found but no usable labels were present") + continue + + logger.warning( + "No labeled alert rules were found, and no 'scrape_metadata' " + "was available. Using the alert group name as filename." + ) + try: + for group in rules["groups"]: + return group["name"] + except KeyError: + logger.debug("No group name was found to use as identifier") + + return None + + def _static_scrape_config(self, relation) -> list: + """Generate the static scrape configuration for a single relation. + + If the relation data includes `scrape_metadata` then the value + of this key is used to annotate the scrape jobs with Juju + Topology labels before returning them. + + Args: + relation: an `ops.model.Relation` object whose static + scrape configuration is required. + + Returns: + A list (possibly empty) of scrape jobs. Each job is a + valid Prometheus scrape configuration for that job, + represented as a Python dictionary. + """ + if not relation.units: + return [] + + scrape_jobs = json.loads(relation.data[relation.app].get("scrape_jobs", "[]")) + + if not scrape_jobs: + return [] + + scrape_metadata = json.loads(relation.data[relation.app].get("scrape_metadata", "{}")) + + if not scrape_metadata: + return scrape_jobs + + job_name_prefix = ProviderTopology.from_relation_data(scrape_metadata).scrape_identifier + + hosts = self._relation_hosts(relation) + + labeled_job_configs = [] + for job in scrape_jobs: + config = self._labeled_static_job_config( + _sanitize_scrape_configuration(job), + job_name_prefix, + hosts, + scrape_metadata, + ) + labeled_job_configs.append(config) + + return labeled_job_configs + + def _relation_hosts(self, relation) -> dict: + """Fetch unit names and address of all metrics provider units for a single relation. + + Args: + relation: An `ops.model.Relation` object for which the unit name to + address mapping is required. + + Returns: + A dictionary that maps unit names to unit addresses for + the specified relation. + """ + hosts = {} + for unit in relation.units: + # TODO deprecate and remove unit.name + unit_name = relation.data[unit].get("prometheus_scrape_unit_name") or unit.name + # TODO deprecate and remove "prometheus_scrape_host" + unit_address = relation.data[unit].get( + "prometheus_scrape_unit_address" + ) or relation.data[unit].get("prometheus_scrape_host") + if unit_name and unit_address: + hosts.update({unit_name: unit_address}) + + return hosts + + def _labeled_static_job_config(self, job, job_name_prefix, hosts, scrape_metadata) -> dict: + """Construct labeled job configuration for a single job. + + Args: + + job: a dictionary representing the job configuration as obtained from + `MetricsEndpointProvider` over relation data. + job_name_prefix: a string that may either be used as the + job name if the job has no associated name or used as a prefix for + the job if it does have a job name. + hosts: a dictionary mapping host names to host address for + all units of the relation for which this job configuration + must be constructed. 
+ scrape_metadata: scrape configuration metadata obtained + from `MetricsEndpointProvider` from the same relation for + which this job configuration is being constructed. + + Returns: + A dictionary representing a Prometheus job configuration + for a single job. + """ + name = job.get("job_name") + job_name = "{}_{}".format(job_name_prefix, name) if name else job_name_prefix + + labeled_job = job.copy() + labeled_job["job_name"] = job_name + + static_configs = job.get("static_configs") + labeled_job["static_configs"] = [] + + # relabel instance labels so that instance identifiers are globally unique + # stable over unit recreation + instance_relabel_config = { + "source_labels": ["juju_model", "juju_model_uuid", "juju_application"], + "separator": "_", + "target_label": "instance", + "regex": "(.*)", + } + + # label all static configs in the Prometheus job + # labeling inserts Juju topology information and + # sets a relable config for instance labels + for static_config in static_configs: + labels = static_config.get("labels", {}) if static_configs else {} + all_targets = static_config.get("targets", []) + + # split all targets into those which will have unit labels + # and those which will not + ports = [] + unitless_targets = [] + for target in all_targets: + host, port = target.split(":") + if host.strip() == "*": + ports.append(port.strip()) + else: + unitless_targets.append(target) + + # label scrape targets that do not have unit labels + if unitless_targets: + unitless_config = self._labeled_unitless_config( + unitless_targets, labels, scrape_metadata + ) + labeled_job["static_configs"].append(unitless_config) + + # label scrape targets that do have unit labels + for host_name, host_address in hosts.items(): + static_config = self._labeled_unit_config( + host_name, host_address, ports, labels, scrape_metadata + ) + labeled_job["static_configs"].append(static_config) + if "juju_unit" not in instance_relabel_config["source_labels"]: + instance_relabel_config["source_labels"].append("juju_unit") # type: ignore + + # ensure topology relabeling of instance label is last in order of relabelings + relabel_configs = job.get("relabel_configs", []) + relabel_configs.append(instance_relabel_config) + labeled_job["relabel_configs"] = relabel_configs + + return labeled_job + + def _set_juju_labels(self, labels, scrape_metadata) -> dict: + """Create a copy of metric labels with Juju topology information. + + Args: + labels: a dictionary containing Prometheus metric labels. + scrape_metadata: scrape related metadata provided by + `MetricsEndpointProvider`. + + Returns: + a copy of the `labels` dictionary augmented with Juju + topology information with the exception of unit name. + """ + juju_labels = labels.copy() # deep copy not needed + juju_labels.update( + ProviderTopology.from_relation_data(scrape_metadata).as_promql_label_dict() + ) + + return juju_labels + + def _labeled_unitless_config(self, targets, labels, scrape_metadata) -> dict: + """Static scrape configuration for fully qualified host addresses. + + Fully qualified hosts are those scrape targets for which the + address are specified by the `MetricsEndpointProvider` as part + of the scrape job specification set in application relation data. + The address specified need not belong to any unit of the + `MetricsEndpointProvider` charm. As a result there is no reliable + way to determine the name (Juju topology unit name) for such a + target. + + Args: + targets: a list of addresses of fully qualified hosts. 
+ labels: labels specified by `MetricsEndpointProvider` clients + which are associated with `targets`. + scrape_metadata: scrape related metadata provided by `MetricsEndpointProvider`. + + Returns: + A dictionary containing the static scrape configuration + for a list of fully qualified hosts. + """ + juju_labels = self._set_juju_labels(labels, scrape_metadata) + unitless_config = {"targets": targets, "labels": juju_labels} + return unitless_config + + def _labeled_unit_config( + self, unit_name, host_address, ports, labels, scrape_metadata + ) -> dict: + """Static scrape configuration for a wildcard host. + + Wildcard hosts are those scrape targets whose name (Juju unit + name) and address (unit IP address) is set into unit relation + data by the `MetricsEndpointProvider` charm, which sets this + data for ALL its units. + + Args: + unit_name: a string representing the unit name of the wildcard host. + host_address: a string representing the address of the wildcard host. + ports: list of ports on which this wildcard host exposes its metrics. + labels: a dictionary of labels provided by + `MetricsEndpointProvider` intended to be associated with + this wildcard host. + scrape_metadata: scrape related metadata provided by `MetricsEndpointProvider`. + + Returns: + A dictionary containing the static scrape configuration + for a single wildcard host. + """ + juju_labels = self._set_juju_labels(labels, scrape_metadata) + + juju_labels["juju_unit"] = unit_name + + static_config = {"labels": juju_labels} + + if ports: + targets = [] + for port in ports: + targets.append("{}:{}".format(host_address, port)) + static_config["targets"] = targets # type: ignore + else: + static_config["targets"] = [host_address] # type: ignore + + return static_config + + +def _resolve_dir_against_charm_path(charm: CharmBase, *path_elements: str) -> str: + """Resolve the provided path items against the directory of the main file. + + Look up the directory of the `main.py` file being executed. This is normally + going to be the charm.py file of the charm including this library. Then, resolve + the provided path elements and, if the result path exists and is a directory, + return its absolute path; otherwise, raise en exception. + + Raises: + InvalidAlertRulePathError, if the path does not exist or is not a directory. + """ + charm_dir = Path(str(charm.charm_dir)) + if not charm_dir.exists() or not charm_dir.is_dir(): + # Operator Framework does not currently expose a robust + # way to determine the top level charm source directory + # that is consistent across deployed charms and unit tests + # Hence for unit tests the current working directory is used + # TODO: updated this logic when the following ticket is resolved + # https://github.com/canonical/operator/issues/643 + charm_dir = Path(os.getcwd()) + + alerts_dir_path = charm_dir.absolute().joinpath(*path_elements) + + if not alerts_dir_path.exists(): + raise InvalidAlertRulePathError(alerts_dir_path, "directory does not exist") + if not alerts_dir_path.is_dir(): + raise InvalidAlertRulePathError(alerts_dir_path, "is not a directory") + + return str(alerts_dir_path) + + +class MetricsEndpointProvider(Object): + """A metrics endpoint for Prometheus.""" + + def __init__( + self, + charm, + relation_name: str = DEFAULT_RELATION_NAME, + jobs=None, + alert_rules_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH, + ): + """Construct a metrics provider for a Prometheus charm. 
+
+        If your charm exposes a Prometheus metrics endpoint, the
+        `MetricsEndpointProvider` object enables your charm to easily
+        communicate how to reach that metrics endpoint.
+
+        By default, a charm instantiating this object has the metrics
+        endpoints of each of its units scraped by the related Prometheus
+        charms. The scraped metrics are automatically tagged by the
+        Prometheus charms with Juju topology data via the
+        `juju_model`, `juju_model_uuid`, `juju_application`
+        and `juju_unit` labels. To support such tagging `MetricsEndpointProvider`
+        automatically forwards scrape metadata to a `MetricsEndpointConsumer`
+        (Prometheus charm).
+
+        Scrape targets provided by `MetricsEndpointProvider` can be
+        customized when instantiating this object. For example in the
+        case of a charm exposing the metrics endpoint for each of its
+        units on port 8080 and the `/metrics` path, the
+        `MetricsEndpointProvider` can be instantiated as follows:
+
+            self.metrics_endpoint_provider = MetricsEndpointProvider(
+                self,
+                jobs=[{
+                    "static_configs": [{"targets": ["*:8080"]}],
+                }])
+
+        The notation `*:<port>` means "scrape each unit of this charm on port
+        `<port>`".
+
+        In case the metrics endpoints are not on the standard `/metrics` path,
+        a custom path can be specified as follows:
+
+            self.metrics_endpoint_provider = MetricsEndpointProvider(
+                self,
+                jobs=[{
+                    "metrics_path": "/my/strange/metrics/path",
+                    "static_configs": [{"targets": ["*:8080"]}],
+                }])
+
+        Note how the `jobs` argument is a list: this allows you to expose multiple
+        combinations of paths "metrics_path" and "static_configs" in case your charm
+        exposes multiple endpoints, which could happen, for example, when you have
+        multiple workload containers, with applications in each needing to be scraped.
+        The structure of the objects in the `jobs` list is one-to-one with the
+        `scrape_config` configuration item of Prometheus' own configuration (see
+        https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
+        ), but with only a subset of the fields allowed. The permitted fields are
+        listed in the `ALLOWED_KEYS` object in this charm library module.
+
+        It is also possible to specify alert rules. By default, this library will look
+        into the `<charm_parent_dir>/prometheus_alert_rules` directory, which in a
+        standard charm layout resolves to `src/prometheus_alert_rules`. Each alert
+        rule goes into a separate `*.rule` file. If the syntax of a rule is invalid,
+        the `MetricsEndpointProvider` logs an error and does not load the particular
+        rule.
+
+        To avoid false positives and negatives in the evaluation of alert rules,
+        all ingested alert rule expressions are automatically qualified using Juju
+        Topology filters. This ensures that alert rules provided by your charm trigger
+        alerts based only on data scraped from your charm. For example an alert rule
+        such as the following
+
+            alert: UnitUnavailable
+            expr: up < 1
+            for: 0m
+
+        will be automatically transformed into something along the lines of the following
+
+            alert: UnitUnavailable
+            expr: up{juju_model=<model>, juju_model_uuid=<uuid>, juju_application=<application>} < 1
+            for: 0m
+
+        Args:
+            charm: a `CharmBase` object that manages this
+                `MetricsEndpointProvider` object. Typically this is
+                `self` in the instantiating class.
+            relation_name: an optional string name of the relation between `charm`
+                and the Prometheus charmed service. The default is "metrics-endpoint".
+ It is strongly advised not to change the default, so that people + deploying your charm will have a consistent experience with all + other charms that provide metrics endpoints. + jobs: an optional list of dictionaries where each + dictionary represents the Prometheus scrape + configuration for a single job. When not provided, a + default scrape configuration is provided for the + `/metrics` endpoint polling all units of the charm on port `80` + using the `MetricsEndpointProvider` object. + alert_rules_path: an optional path for the location of alert rules + files. Defaults to "./prometheus_alert_rules", + resolved relative to the directory hosting the charm entry file. + The alert rules are automatically updated on charm upgrade. + + Raises: + RelationNotFoundError: If there is no relation in the charm's metadata.yaml + with the same name as provided via `relation_name` argument. + RelationInterfaceMismatchError: The relation with the same name as provided + via `relation_name` argument does not have the `prometheus_scrape` relation + interface. + RelationRoleMismatchError: If the relation with the same name as provided + via `relation_name` argument does not have the `RelationRole.provides` + role. + """ + _validate_relation_by_interface_and_direction( + charm, relation_name, RELATION_INTERFACE_NAME, RelationRole.provides + ) + + try: + alert_rules_path = _resolve_dir_against_charm_path(charm, alert_rules_path) + except InvalidAlertRulePathError as e: + logger.warning( + "Invalid Prometheus alert rules folder at %s: %s", + e.alert_rules_absolute_path, + e.message, + ) + + super().__init__(charm, relation_name) + self.topology = ProviderTopology.from_charm(charm) + + self._charm = charm + self._alert_rules_path = alert_rules_path + self._relation_name = relation_name + # sanitize job configurations to the supported subset of parameters + jobs = [] if jobs is None else jobs + self._jobs = [_sanitize_scrape_configuration(job) for job in jobs] + + events = self._charm.on[self._relation_name] + self.framework.observe(events.relation_joined, self._set_scrape_job_spec) + self.framework.observe(events.relation_changed, self._set_scrape_job_spec) + + # dirty fix: set the ip address when the containers start, as a workaround + # for not being able to lookup the pod ip + for container_name in charm.unit.containers: + self.framework.observe( + charm.on[container_name].pebble_ready, + self._set_unit_ip, + ) + + self.framework.observe(self._charm.on.upgrade_charm, self._set_scrape_job_spec) + + def _set_scrape_job_spec(self, event): + """Ensure scrape target information is made available to prometheus. + + When a metrics provider charm is related to a prometheus charm, the + metrics provider sets specification and metadata related to its own + scrape configuration. This information is set using Juju application + data. In addition each of the consumer units also sets its own + host address in Juju unit relation data. + """ + self._set_unit_ip(event) + + if not self._charm.unit.is_leader(): + return + + alert_rules = AlertRules(topology=self.topology) + alert_rules.add_path(self._alert_rules_path, recursive=True) + alert_rules_as_dict = alert_rules.as_dict() + + for relation in self._charm.model.relations[self._relation_name]: + relation.data[self._charm.app]["scrape_metadata"] = json.dumps(self._scrape_metadata) + relation.data[self._charm.app]["scrape_jobs"] = json.dumps(self._scrape_jobs) + + if alert_rules_as_dict: + # Update relation data with the string representation of the rule file. 
+ # Juju topology is already included in the "scrape_metadata" field above. + # The consumer side of the relation uses this information to name the rules file + # that is written to the filesystem. + relation.data[self._charm.app]["alert_rules"] = json.dumps(alert_rules_as_dict) + + def _set_unit_ip(self, _): + """Set unit host address. + + Each time a metrics provider charm container is restarted it updates its own + host address in the unit relation data for the prometheus charm. + + The only argument specified is an event and it ignored. this is for expediency + to be able to use this method as an event handler, although no access to the + event is actually needed. + """ + for relation in self._charm.model.relations[self._relation_name]: + relation.data[self._charm.unit]["prometheus_scrape_unit_address"] = str( + self._charm.model.get_binding(relation).network.bind_address + ) + relation.data[self._charm.unit]["prometheus_scrape_unit_name"] = str( + self._charm.model.unit.name + ) + + @property + def _scrape_jobs(self) -> list: + """Fetch list of scrape jobs. + + Returns: + A list of dictionaries, where each dictionary specifies a + single scrape job for Prometheus. + """ + return self._jobs if self._jobs else [DEFAULT_JOB] + + @property + def _scrape_metadata(self) -> dict: + """Generate scrape metadata. + + Returns: + Scrape configuration metadata for this metrics provider charm. + """ + return self.topology.as_dict() + + +class PrometheusRulesProvider(Object): + """Forward rules to Prometheus. + + This object may be used to forward rules to Prometheus. At present it only supports + forwarding alert rules. This is unlike :class:`MetricsEndpointProvider`, which + is used for forwarding both scrape targets and associated alert rules. This object + is typically used when there is a desire to forward rules that apply globally (across + all deployed charms and units) rather than to a single charm. All rule files are + forwarded using the same 'prometheus_scrape' interface that is also used by + `MetricsEndpointProvider`. + + Args: + charm: A charm instance that `provides` a relation with the `prometheus_scrape` interface. + relation_name: Name of the relation in `metadata.yaml` that + has the `prometheus_scrape` interface. + dir_path: Root directory for the collection of rule files. + recursive: Whether or not to scan for rule files recursively. 
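+
+    A minimal instantiation sketch (the relation and directory names shown are
+    the library defaults and may differ in your charm):
+
+        self.rules_provider = PrometheusRulesProvider(
+            self,
+            relation_name="metrics-endpoint",
+            dir_path="./src/prometheus_alert_rules",
+            recursive=True,
+        )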
+ """ + + def __init__( + self, + charm: CharmBase, + relation_name: str = DEFAULT_RELATION_NAME, + dir_path: str = DEFAULT_ALERT_RULES_RELATIVE_PATH, + recursive=True, + ): + super().__init__(charm, relation_name) + self._charm = charm + self._relation_name = relation_name + self.topology = ProviderTopology.from_charm(charm) + self._recursive = recursive + + try: + dir_path = _resolve_dir_against_charm_path(charm, dir_path) + except InvalidAlertRulePathError as e: + logger.warning( + "Invalid Prometheus alert rules folder at %s: %s", + e.alert_rules_absolute_path, + e.message, + ) + self.dir_path = dir_path + + events = self._charm.on[self._relation_name] + event_sources = [ + events.relation_joined, + events.relation_changed, + self._charm.on.leader_elected, + self._charm.on.upgrade_charm, + ] + + for event_source in event_sources: + self.framework.observe(event_source, self._update_relation_data) + + def _reinitialize_alert_rules(self): + """Reloads alert rules and updates all relations.""" + self._update_relation_data(None) + + def _update_relation_data(self, _): + """Update application relation data with alert rules for all relations.""" + if not self._charm.unit.is_leader(): + return + + alert_rules = AlertRules() + alert_rules.add_path(self.dir_path, recursive=self._recursive) + alert_rules_as_dict = alert_rules.as_dict() + + logger.info("Updating relation data with rule files from disk") + for relation in self._charm.model.relations[self._relation_name]: + relation.data[self._charm.app]["alert_rules"] = json.dumps( + alert_rules_as_dict, + sort_keys=True, # sort, to prevent unnecessary relation_changed events + ) + + +class MetricsEndpointAggregator(Object): + """Aggregate metrics from multiple scrape targets. + + `MetricsEndpointAggregator` collects scrape target information from one + or more related charms and forwards this to a `MetricsEndpointConsumer` + charm, which may be in a different Juju model. However it is + essential that `MetricsEndpointAggregator` itself resides in the same + model as its scrape targets, as this is currently the only way to + ensure in Juju that the `MetricsEndpointAggregator` will be able to + determine the model name and uuid of the scrape targets. + + `MetricsEndpointAggregator` should be used in place of + `MetricsEndpointProvider` in the following two use cases: + + 1. Integrating one or more scrape targets that do not support the + `prometheus_scrape` interface. + + 2. Integrating one or more scrape targets through cross model + relations. Although the [Scrape Config Operator](https://charmhub.io/cos-configuration-k8s) + may also be used for the purpose of supporting cross model + relations. + + Using `MetricsEndpointAggregator` to build a Prometheus charm client + only requires instantiating it. Instantiating + `MetricsEndpointAggregator` is similar to `MetricsEndpointProvider` except + that it requires specifying the names of three relations: the + relation with scrape targets, the relation for alert rules, and + that with the Prometheus charms. For example + + ```python + self._aggregator = MetricsEndpointAggregator( + self, + { + "prometheus": "monitoring", + "scrape_target": "prometheus-target", + "alert_rules": "prometheus-rules" + } + ) + ``` + + `MetricsEndpointAggregator` assumes that each unit of a scrape target + sets in its unit-level relation data two entries with keys + "hostname" and "port". 
If it is required to integrate with charms + that do not honor these assumptions, it is always possible to + derive from `MetricsEndpointAggregator` overriding the `_get_targets()` + method, which is responsible for aggregating the unit name, host + address ("hostname") and port of the scrape target. + + `MetricsEndpointAggregator` also assumes that each unit of a + scrape target sets in its unit-level relation data a key named + "groups". The value of this key is expected to be the string + representation of list of Prometheus Alert rules in YAML format. + An example of a single such alert rule is + + ```yaml + - alert: HighRequestLatency + expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5 + for: 10m + labels: + severity: page + annotations: + summary: High request latency + ``` + + Once again if it is required to integrate with charms that do not + honour these assumptions about alert rules then an object derived + from `MetricsEndpointAggregator` may be used by overriding the + `_get_alert_rules()` method. + + `MetricsEndpointAggregator` ensures that Prometheus scrape job + specifications and alert rules are annotated with Juju topology + information, just like `MetricsEndpointProvider` and + `MetricsEndpointConsumer` do. + + By default `MetricsEndpointAggregator` ensures that Prometheus + "instance" labels refer to Juju topology. This ensures that + instance labels are stable over unit recreation. While it is not + advisable to change this option, if required it can be done by + setting the "relabel_instance" keyword argument to `False` when + constructing an aggregator object. + """ + + def __init__(self, charm, relation_names, relabel_instance=True): + """Construct a `MetricsEndpointAggregator`. + + Args: + charm: a `CharmBase` object that manages this + `MetricsEndpointAggregator` object. Typically this is + `self` in the instantiating class. + relation_names: a dictionary with three keys. The value + of the "scrape_target" and "alert_rules" keys are + the relation names over which scrape job and alert rule + information is gathered by this `MetricsEndpointAggregator`. + And the value of the "prometheus" key is the name of + the relation with a `MetricsEndpointConsumer` such as + the Prometheus charm. + relabel_instance: A boolean flag indicating if Prometheus + scrape job "instance" labels must refer to Juju Topology. 
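+
+        When `relabel_instance` is left at its default, aggregated scrape jobs
+        are annotated with an instance relabel configuration roughly of the
+        following form (a sketch; see the `_relabel_configs` property):
+
+            {
+                "source_labels": [
+                    "juju_model", "juju_model_uuid", "juju_application", "juju_unit"
+                ],
+                "separator": "_",
+                "target_label": "instance",
+                "regex": "(.*)",
+            }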
+ """ + super().__init__(charm, relation_names["prometheus"]) + + self._charm = charm + self._target_relation = relation_names["scrape_target"] + self._prometheus_relation = relation_names["prometheus"] + self._alert_rules_relation = relation_names["alert_rules"] + self._relabel_instance = relabel_instance + + # manage Prometheus charm relation events + prometheus_events = self._charm.on[self._prometheus_relation] + self.framework.observe(prometheus_events.relation_joined, self._set_prometheus_data) + + # manage list of Prometheus scrape jobs from related scrape targets + target_events = self._charm.on[self._target_relation] + self.framework.observe(target_events.relation_changed, self._update_prometheus_jobs) + self.framework.observe(target_events.relation_departed, self._remove_prometheus_jobs) + + # manage alert rules for Prometheus from related scrape targets + alert_rule_events = self._charm.on[self._alert_rules_relation] + self.framework.observe(alert_rule_events.relation_changed, self._update_alert_rules) + self.framework.observe(alert_rule_events.relation_departed, self._remove_alert_rules) + + def _set_prometheus_data(self, event): + """Ensure every new Prometheus instances is updated. + + Any time a new Prometheus unit joins the relation with + `MetricsEndpointAggregator`, that Prometheus unit is provided + with the complete set of existing scrape jobs and alert rules. + """ + jobs = [] # list of scrape jobs, one per relation + for relation in self.model.relations[self._target_relation]: + targets = self._get_targets(relation) + if targets: + jobs.append(self._static_scrape_job(targets, relation.app.name)) + + groups = [] # list of alert rule groups, one group per relation + for relation in self.model.relations[self._alert_rules_relation]: + unit_rules = self._get_alert_rules(relation) + if unit_rules: + appname = relation.app.name + rules = self._label_alert_rules(unit_rules, appname) + group = {"name": self._group_name(appname), "rules": rules} + groups.append(group) + + event.relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs) + event.relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups}) + + def _set_target_job_data(self, targets: dict, app_name: str, **kwargs) -> None: + """Update scrape jobs in response to scrape target changes. + + When there is any change in relation data with any scrape + target, the Prometheus scrape job, for that specific target is + updated. Additionally, if this method is called manually, do the + sameself. + + Args: + targets: a `dict` containing target information + app_name: a `str` identifying the application + """ + # new scrape job for the relation that has changed + updated_job = self._static_scrape_job(targets, app_name, **kwargs) + + for relation in self.model.relations[self._prometheus_relation]: + jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]")) + # list of scrape jobs that have not changed + jobs = [job for job in jobs if updated_job["job_name"] != job["job_name"]] + jobs.append(updated_job) + relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs) + + def _update_prometheus_jobs(self, event): + """Update scrape jobs in response to scrape target changes. + + When there is any change in relation data with any scrape + target, the Prometheus scrape job, for that specific target is + updated. 
+ """ + targets = self._get_targets(event.relation) + if not targets: + return + + # new scrape job for the relation that has changed + updated_job = self._static_scrape_job(targets, event.relation.app.name) + + for relation in self.model.relations[self._prometheus_relation]: + jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]")) + # list of scrape jobs that have not changed + jobs = [job for job in jobs if updated_job["job_name"] != job["job_name"]] + jobs.append(updated_job) + relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs) + + def _remove_prometheus_jobs(self, event): + """Remove scrape jobs when a target departs. + + Any time a scrape target departs, any Prometheus scrape job + associated with that specific scrape target is removed. + """ + job_name = self._job_name(event.relation.app.name) + unit_name = event.unit.name + + for relation in self.model.relations[self._prometheus_relation]: + jobs = json.loads(relation.data[self._charm.app].get("scrape_jobs", "[]")) + if not jobs: + continue + + changed_job = [j for j in jobs if j.get("job_name") == job_name] + if not changed_job: + continue + changed_job = changed_job[0] + + # list of scrape jobs that have not changed + jobs = [job for job in jobs if job.get("job_name") != job_name] + + # list of scrape jobs for units of the same application that still exist + configs_kept = [ + config + for config in changed_job["static_configs"] # type: ignore + if config.get("labels", {}).get("juju_unit") != unit_name + ] + + if configs_kept: + changed_job["static_configs"] = configs_kept # type: ignore + jobs.append(changed_job) + + relation.data[self._charm.app]["scrape_jobs"] = json.dumps(jobs) + + def _update_alert_rules(self, event): + """Update alert rules in response to scrape target changes. + + When there is any change in alert rule relation data for any + scrape target, the list of alert rules for that specific + target is updated. + """ + unit_rules = self._get_alert_rules(event.relation) + if not unit_rules: + return + + appname = event.relation.app.name + rules = self._label_alert_rules(unit_rules, appname) + # the alert rule group that has changed + updated_group = {"name": self._group_name(appname), "rules": rules} + + for relation in self.model.relations[self._prometheus_relation]: + alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}")) + groups = alert_rules.get("groups", []) + # list of alert rule groups that have not changed + groups = [group for group in groups if updated_group["name"] != group["name"]] + groups.append(updated_group) + relation.data[self._charm.app]["alert_rules"] = json.dumps({"groups": groups}) + + def _remove_alert_rules(self, event): + """Remove alert rules for departed targets. + + Any time a scrape target departs any alert rules associated + with that specific scrape target is removed. 
+        """
+        group_name = self._group_name(event.relation.app.name)
+        unit_name = event.unit.name
+
+        for relation in self.model.relations[self._prometheus_relation]:
+            alert_rules = json.loads(relation.data[self._charm.app].get("alert_rules", "{}"))
+            if not alert_rules:
+                continue
+
+            groups = alert_rules.get("groups", [])
+            if not groups:
+                continue
+
+            changed_group = [group for group in groups if group["name"] == group_name]
+            if not changed_group:
+                continue
+            changed_group = changed_group[0]
+
+            # list of alert rule groups that have not changed
+            groups = [group for group in groups if group["name"] != group_name]
+
+            # list of alert rules not associated with departing unit
+            rules_kept = [
+                rule
+                for rule in changed_group.get("rules")  # type: ignore
+                if rule.get("labels").get("juju_unit") != unit_name
+            ]
+
+            if rules_kept:
+                changed_group["rules"] = rules_kept  # type: ignore
+                groups.append(changed_group)
+
+            relation.data[self._charm.app]["alert_rules"] = (
+                json.dumps({"groups": groups}) if groups else "{}"
+            )
+
+    def _get_targets(self, relation) -> dict:
+        """Fetch scrape targets for a relation.
+
+        Scrape target information is returned for each unit in the
+        relation. This information contains the unit name, the network
+        hostname (or address) of that unit, and the port on which a
+        metrics endpoint is exposed by that unit.
+
+        Args:
+            relation: an `ops.model.Relation` object for which scrape
+                targets are required.
+
+        Returns:
+            a dictionary whose keys are names of the units in the
+            relation. The value associated with each key is itself
+            a dictionary of the form
+            ```
+            {"hostname": hostname, "port": port}
+            ```
+        """
+        targets = {}
+        for unit in relation.units:
+            port = relation.data[unit].get("port", 80)
+            hostname = relation.data[unit].get("hostname")
+            if hostname:
+                targets.update({unit.name: {"hostname": hostname, "port": port}})
+
+        return targets
+
+    def _get_alert_rules(self, relation) -> dict:
+        """Fetch alert rules for a relation.
+
+        Each unit of the related scrape target may have its own
+        associated alert rules. Alert rules for all units are returned
+        indexed by unit name.
+
+        Args:
+            relation: an `ops.model.Relation` object for which alert
+                rules are required.
+
+        Returns:
+            a dictionary whose keys are names of the units in the
+            relation. The value associated with each key is a list
+            of alert rules. Each rule is in dictionary format; each
+            such "rule dictionary" corresponds to a single
+            Prometheus alert rule.
+        """
+        rules = {}
+        for unit in relation.units:
+            unit_rules = yaml.safe_load(relation.data[unit].get("groups", ""))
+            if unit_rules:
+                rules.update({unit.name: unit_rules})
+
+        return rules
+
+    def _job_name(self, appname) -> str:
+        """Construct a scrape job name.
+
+        Each relation has its own unique scrape job name. All units in
+        the relation are scraped as part of the same scrape job.
+
+        Args:
+            appname: string name of a related application.
+
+        Returns:
+            a string Prometheus scrape job name for the application.
+        """
+        return "juju_{}_{}_{}_prometheus_scrape".format(
+            self.model.name, self.model.uuid[:7], appname
+        )
+
+    def _group_name(self, appname) -> str:
+        """Construct name for an alert rule group.
+
+        Each unit in a relation may define its own alert rules. All
+        rules for all units in a relation are grouped together and
+        given a single alert rule group name.
+
+        Args:
+            appname: string name of a related application.
+
+        Returns:
+            a string Prometheus alert rules group name for the application.
+        """
+        return "juju_{}_{}_{}_alert_rules".format(self.model.name, self.model.uuid[:7], appname)
+
+    def _label_alert_rules(self, unit_rules, appname) -> list:
+        """Apply Juju topology labels to alert rules.
+
+        Args:
+            unit_rules: a list of alert rules, where each rule is in
+                dictionary format.
+            appname: a string name of the application to which the
+                alert rules belong.
+
+        Returns:
+            a list of alert rules with Juju topology labels.
+        """
+        labeled_rules = []
+        for unit_name, rules in unit_rules.items():
+            for rule in rules:
+                rule["labels"].update(
+                    AggregatorTopology.create(
+                        self.model.name, self.model.uuid, appname, unit_name
+                    ).as_promql_label_dict()
+                )
+                labeled_rules.append(rule)
+
+        return labeled_rules
+
+    def _static_scrape_job(self, targets, application_name, **kwargs) -> dict:
+        """Construct a static scrape job for an application.
+
+        Args:
+            targets: a dictionary providing hostname and port for all
+                scrape targets. The keys of this dictionary are unit
+                names. Values corresponding to these keys are
+                themselves a dictionary with keys "hostname" and
+                "port".
+            application_name: a string name of the application for
+                which this static scrape job is being constructed.
+
+        Returns:
+            A dictionary corresponding to a Prometheus static scrape
+            job configuration for one application. The returned
+            dictionary may be transformed into YAML and appended to
+            any existing list of Prometheus static configs.
+        """
+        juju_model = self.model.name
+        juju_model_uuid = self.model.uuid
+        job = {
+            "job_name": self._job_name(application_name),
+            "static_configs": [
+                {
+                    "targets": ["{}:{}".format(target["hostname"], target["port"])],
+                    "labels": {
+                        "juju_model": juju_model,
+                        "juju_model_uuid": juju_model_uuid,
+                        "juju_application": application_name,
+                        "juju_unit": unit_name,
+                        "host": target["hostname"],
+                    },
+                }
+                for unit_name, target in targets.items()
+            ],
+            "relabel_configs": self._relabel_configs + kwargs.get("relabel_configs", []),
+        }
+        job.update(kwargs.get("updates", {}))
+
+        return job
+
+    @property
+    def _relabel_configs(self) -> list:
+        """Create Juju topology relabeling configuration.
+
+        Using Juju topology for instance labels ensures that these
+        labels are stable across unit recreation.
+
+        Returns:
+            a list of Prometheus relabeling configurations. Each item in
+            this list is one relabel configuration.
+ """ + return ( + [ + { + "source_labels": [ + "juju_model", + "juju_model_uuid", + "juju_application", + "juju_unit", + ], + "separator": "_", + "target_label": "instance", + "regex": "(.*)", + } + ] + if self._relabel_instance + else [] + ) + + +class PromqlTransformer: + """Uses promql-transform to inject label matchers into alert rule expressions.""" + + _path = None + _disabled = False + + @property + def path(self): + """Lazy lookup of the path of promql-transform.""" + if self._disabled: + return None + if not self._path: + self._path = self._get_transformer_path() + if not self._path: + logger.debug("Skipping injection of juju topology as label matchers") + self._disabled = True + return self._path + + def __init__(self, charm): + self._charm = charm + + def apply_label_matchers(self, rules): + """Will apply label matchers to the expression of all alerts in all supplied groups.""" + if not self.path: + return rules + for group in rules["groups"]: + rules_in_group = group.get("rules", []) + for rule in rules_in_group: + topology = {} + # if the user for some reason has provided juju_unit, we'll need to honor it + # in most cases, however, this will be empty + for label in [ + "juju_model", + "juju_model_uuid", + "juju_application", + "juju_charm", + "juju_unit", + ]: + if label in rule["labels"]: + topology[label] = rule["labels"][label] + + rule["expr"] = self._apply_label_matcher(rule["expr"], topology) + return rules + + def _apply_label_matcher(self, expression, topology): + if not topology: + return expression + if not self.path: + logger.debug( + "`promql-transform` unavailable. leaving expression unchanged: %s", expression + ) + return expression + args = [str(self.path)] + args.extend( + ["--label-matcher={}={}".format(key, value) for key, value in topology.items()] + ) + + args.extend(["{}".format(expression)]) + # noinspection PyBroadException + try: + return self._exec(args) + except Exception as e: + logger.debug('Applying the expression failed: "{}", falling back to the original', e) + return expression + + def _get_transformer_path(self) -> Optional[Path]: + arch = platform.processor() + arch = "amd64" if arch == "x86_64" else arch + res = "promql-transform-{}".format(arch) + try: + path = self._charm.model.resources.fetch(res) + os.chmod(path, 0o777) + return path + except NotImplementedError: + logger.debug("System lacks support for chmod") + except (NameError, ModelError): + logger.debug('No resource available for the platform "{}"'.format(arch)) + return None + + def _exec(self, cmd): + result = subprocess.run(cmd, check=False, stdout=subprocess.PIPE) + output = result.stdout.decode("utf-8").strip() + return output diff --git a/metadata.yaml b/metadata.yaml index f648d1e..5dc1174 100755 --- a/metadata.yaml +++ b/metadata.yaml @@ -19,6 +19,10 @@ provides: interface: object-storage schema: https://raw.githubusercontent.com/canonical/operator-schemas/master/object-storage.yaml versions: [v1] + metrics-endpoint: + interface: prometheus_scrape + grafana-dashboard: + interface: grafana_dashboard storage: minio-data: type: filesystem diff --git a/requirements.txt b/requirements.txt index 4c11aad..186c82d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,7 @@ # Copyright 2021 Canonical Ltd. # See LICENSE file for licensing details. 
-ops==1.2.0 +ops==1.4.0 oci-image serialized-data-interface<0.4 +pytest-mock \ No newline at end of file diff --git a/src/charm.py b/src/charm.py index f848983..4bf05b0 100755 --- a/src/charm.py +++ b/src/charm.py @@ -8,6 +8,8 @@ from base64 import b64encode from hashlib import sha256 +from charms.prometheus_k8s.v0.prometheus_scrape import MetricsEndpointProvider +from charms.grafana_k8s.v0.grafana_dashboard import GrafanaDashboardProvider from oci_image import OCIImageResource, OCIImageResourceError from ops.charm import CharmBase from ops.framework import StoredState @@ -32,6 +34,21 @@ def __init__(self, *args): self.image = OCIImageResource(self, "oci-image") + self.prometheus_provider = MetricsEndpointProvider( + charm=self, + jobs=[ + { + "job_name": "minio_metrics", + "scrape_interval": "30s", + "metrics_path": "/minio/v2/metrics/cluster", + "static_configs": [ + {"targets": ["*:{}".format(self.config["port"])]} + ], + } + ], + ) + self.dashboard_provider = GrafanaDashboardProvider(self) + for event in [ self.on.config_changed, self.on.install, @@ -88,6 +105,7 @@ def main(self, event): # than an environment variable, but we cannot use that using podspec. # (see https://stackoverflow.com/questions/37317003/restart-pods-when-configmap-updates-in-kubernetes/51421527#51421527) # noqa E403 "configmap-hash": configmap_hash, + "MINIO_PROMETHEUS_AUTH_TYPE": "public", }, } ], @@ -217,9 +235,9 @@ class CheckFailed(Exception): def __init__(self, msg, status_type=None): super().__init__() - self.msg = msg + self.msg = str(msg) self.status_type = status_type - self.status = status_type(msg) + self.status = status_type(self.msg) if __name__ == "__main__": diff --git a/src/grafana_dashboards/minio-overview_rev13.json.tmpl b/src/grafana_dashboards/minio-overview_rev13.json.tmpl new file mode 100644 index 0000000..428f77f --- /dev/null +++ b/src/grafana_dashboards/minio-overview_rev13.json.tmpl @@ -0,0 +1,2673 @@ +{ + "__inputs": [ + { + "name": "prometheus", + "label": "Prometheus", + "description": "", + "type": "datasource", + "pluginId": "prometheus", + "pluginName": "Prometheus" + } + ], + "__requires": [ + { + "type": "panel", + "id": "bargauge", + "name": "Bar gauge", + "version": "" + }, + { + "type": "panel", + "id": "gauge", + "name": "Gauge", + "version": "" + }, + { + "type": "grafana", + "id": "grafana", + "name": "Grafana", + "version": "8.0.6" + }, + { + "type": "panel", + "id": "graph", + "name": "Graph", + "version": "" + }, + { + "type": "datasource", + "id": "prometheus", + "name": "Prometheus", + "version": "1.0.0" + }, + { + "type": "panel", + "id": "stat", + "name": "Stat", + "version": "" + } + ], + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": "-- Grafana --", + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "description": "MinIO Grafana Dashboard - https://min.io/", + "editable": true, + "gnetId": 13502, + "graphTooltip": 0, + "iteration": 1629787190164, + "links": [ + { + "icon": "external link", + "includeVars": true, + "keepTime": true, + "tags": [ + "minio" + ], + "type": "dashboards" + } + ], + "panels": [ + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "percentage", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": 
"dtdurations" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 3, + "x": 0, + "y": 0 + }, + "id": 1, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "time() - max(minio_node_process_starttime_seconds{job=\"$scrape_jobs\"})", + "format": "time_series", + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{instance}}", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Uptime", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 3, + "y": 0 + }, + "id": 65, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "sum by (instance) (minio_s3_traffic_received_bytes{job=\"$scrape_jobs\"})", + "format": "table", + "hide": false, + "instant": false, + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{instance}}", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total S3 Traffic Inbound", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "rgba(255, 255, 255, 0.97)", + "value": null + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 3, + "x": 6, + "y": 0 + }, + "id": 50, + "interval": "1m", + "links": [], + "maxDataPoints": 100, + "options": { + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "text": {} + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "topk(1, sum(minio_cluster_capacity_usable_free_bytes{job=\"$scrape_jobs\"}) by (instance))", + "format": "time_series", + "instant": false, + "interval": "1m", + "intervalFactor": 1, + "legendFormat": "", + "refId": "A", + "step": 300 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Current Usable Capacity", + "type": "gauge" + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 6, + "w": 7, + "x": 9, + "y": 0 + }, + "hiddenSeries": false, + "id": 68, + "legend": { + "avg": false, + "current": true, + "max": 
false, + "min": false, + "show": true, + "total": false, + "values": true + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum(minio_bucket_usage_total_bytes{job=\"$scrape_jobs\"}) by (instance)", + "interval": "", + "legendFormat": "Used Capacity", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Data Usage Growth", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:419", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:420", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "datasource": "${prometheusds}", + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "semi-dark-red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 5, + "x": 16, + "y": 0 + }, + "id": 52, + "links": [], + "options": { + "displayMode": "basic", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "showUnfilled": false, + "text": {} + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "max by (range) (minio_bucket_objects_size_distribution{job=\"$scrape_jobs\"})", + "format": "time_series", + "instant": false, + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{range}}", + "refId": "A", + "step": 300 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Object size distribution", + "type": "bargauge" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 21, + "y": 0 + }, + "id": 61, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "sum (minio_node_file_descriptor_open_total{job=\"$scrape_jobs\"})", + "format": "table", + "hide": false, + "instant": false, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total Open FDs", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + 
"description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 3, + "y": 3 + }, + "id": 64, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "sum by (instance) (minio_s3_traffic_sent_bytes{job=\"$scrape_jobs\"})", + "format": "table", + "hide": false, + "instant": false, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total S3 Traffic Outbound", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 21, + "y": 3 + }, + "id": 62, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "sum without (server,instance) (minio_node_go_routine_total{job=\"$scrape_jobs\"})", + "format": "table", + "hide": false, + "instant": false, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total Goroutines", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 0, + "y": 6 + }, + "id": 53, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_cluster_nodes_online_total{job=\"$scrape_jobs\"}", + "format": "table", + "hide": false, + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + 
"timeFrom": null, + "timeShift": null, + "title": "Total Online Servers", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 3, + "y": 6 + }, + "id": 9, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_cluster_disk_online_total{job=\"$scrape_jobs\"}", + "format": "table", + "hide": false, + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "Total online disks in MinIO Cluster", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total Online Disks", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "dark-yellow", + "value": 75000000 + }, + { + "color": "dark-red", + "value": 100000000 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 6, + "y": 6 + }, + "id": 66, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "count(count by (bucket) (minio_bucket_usage_total_bytes{job=\"$scrape_jobs\"}))", + "format": "time_series", + "instant": false, + "interval": "1m", + "intervalFactor": 1, + "legendFormat": "", + "refId": "A" + } + ], + "title": "Number of Buckets", + "type": "stat" + }, + { + "aliasColors": { + "S3 Errors": "light-red", + "S3 Requests": "light-green" + }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 6, + "w": 7, + "x": 9, + "y": 6 + }, + "hiddenSeries": false, + "id": 63, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum by (server) (rate(minio_s3_traffic_received_bytes{job=\"$scrape_jobs\"}[$__rate_interval]))", + "interval": "1m", + "intervalFactor": 2, + "legendFormat": "Data Received 
[{{server}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "S3 API Data Received Rate ", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:331", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:332", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": { + "S3 Errors": "light-red", + "S3 Requests": "light-green" + }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 6, + "w": 8, + "x": 16, + "y": 6 + }, + "hiddenSeries": false, + "id": 70, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum by (server) (rate(minio_s3_traffic_sent_bytes{job=\"$scrape_jobs\"}[$__rate_interval]))", + "interval": "1m", + "intervalFactor": 2, + "legendFormat": "Data Sent [{{server}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "S3 API Data Sent Rate ", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:331", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:332", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 0, + "y": 8 + }, + "id": 69, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_cluster_nodes_offline_total{job=\"$scrape_jobs\"}", + "format": "table", + "hide": false, + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": 
null, + "timeShift": null, + "title": "Total Offline Servers", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 3, + "y": 8 + }, + "id": 78, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_cluster_disk_offline_total{job=\"$scrape_jobs\"}", + "format": "table", + "hide": false, + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Total Offline Disks", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "match": "null", + "result": { + "text": "N/A" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "dark-yellow", + "value": 75000000 + }, + { + "color": "dark-red", + "value": 100000000 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 3, + "x": 6, + "y": 9 + }, + "id": 44, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "topk(1, sum(minio_bucket_usage_object_total{job=\"$scrape_jobs\"}) by (instance))", + "format": "time_series", + "instant": false, + "interval": "1m", + "intervalFactor": 1, + "legendFormat": "", + "refId": "A" + } + ], + "title": "Number of Objects", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "ns" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 0, + "y": 10 + }, + "id": 80, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_heal_time_last_activity_nano_seconds{job=\"$scrape_jobs\"}", + "format": "time_series", + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{server}}", + "metric": "process_start_time_seconds", + "refId": "A", + 
"step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Time Since Last Heal Activity", + "type": "stat" + }, + { + "cacheTimeout": null, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "ns" + }, + "overrides": [] + }, + "gridPos": { + "h": 2, + "w": 3, + "x": 3, + "y": 10 + }, + "id": 81, + "interval": null, + "links": [], + "maxDataPoints": 100, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "text": {}, + "textMode": "auto" + }, + "pluginVersion": "8.0.6", + "targets": [ + { + "exemplar": true, + "expr": "minio_usage_last_activity_nano_seconds{job=\"$scrape_jobs\"}", + "format": "time_series", + "instant": true, + "interval": "", + "intervalFactor": 1, + "legendFormat": "{{server}}", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + } + ], + "timeFrom": null, + "timeShift": null, + "title": "Time Since Last Scan Activity", + "type": "stat" + }, + { + "aliasColors": { + "S3 Errors": "light-red", + "S3 Requests": "light-green" + }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 12 + }, + "hiddenSeries": false, + "id": 60, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum by (server,api) (increase(minio_s3_requests_total{job=\"$scrape_jobs\"}[$__rate_interval]))", + "interval": "1m", + "intervalFactor": 2, + "legendFormat": "{{server,api}}", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "S3 API Request Rate", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:331", + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:332", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": { + "S3 Errors": "light-red", + "S3 Requests": "light-green" + }, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 12 + }, + "hiddenSeries": false, + "id": 71, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": 
"8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum by (server,api) (increase(minio_s3_requests_errors_total{job=\"$scrape_jobs\"}[$__rate_interval]))", + "interval": "1m", + "intervalFactor": 2, + "legendFormat": "{{server,api}}", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "S3 API Request Error Rate", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:331", + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:332", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": false + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": { + "10.13.1.25:9000 DELETE": "red", + "10.13.1.25:9000 GET": "green", + "10.13.1.25:9000 POST": "blue" + }, + "bars": true, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "description": "Total number of bytes received and sent among all MinIO server instances", + "fieldConfig": { + "defaults": { + "links": [] + }, + "overrides": [] + }, + "fill": 10, + "fillGradient": 1, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 22 + }, + "hiddenSeries": false, + "id": 17, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "rightSide": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [], + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 5, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "rate(minio_inter_node_traffic_sent_bytes{job=\"$scrape_jobs\"}[$__rate_interval])", + "format": "time_series", + "interval": "", + "intervalFactor": 2, + "legendFormat": "Internode Bytes Received [{{server}}]", + "metric": "minio_http_requests_duration_seconds_count", + "refId": "A", + "step": 4 + }, + { + "exemplar": true, + "expr": "rate(minio_inter_node_traffic_received_bytes{job=\"$scrape_jobs\"}[$__rate_interval])", + "interval": "", + "legendFormat": "Internode Bytes Sent [{{server}}]", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Internode Data Transfer", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:211", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:212", + "format": "s", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 22 + }, + 
"hiddenSeries": false, + "id": 84, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "sum by (instance) (minio_heal_objects_heal_total{job=\"$scrape_jobs\"})", + "interval": "", + "legendFormat": "Objects healed in current self heal run", + "refId": "A" + }, + { + "exemplar": true, + "expr": "sum by (instance) (minio_heal_objects_error_total{job=\"$scrape_jobs\"})", + "hide": false, + "interval": "", + "legendFormat": "Heal errors in current self heal run", + "refId": "B" + }, + { + "exemplar": true, + "expr": "sum by (instance) (minio_heal_objects_total{job=\"$scrape_jobs\"}) ", + "hide": false, + "interval": "", + "legendFormat": "Objects scanned in current self heal run", + "refId": "C" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Healing", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:846", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:847", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": true, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 31 + }, + "hiddenSeries": false, + "id": 77, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "rate(minio_node_process_cpu_total_seconds{job=\"$scrape_jobs\"}[$__rate_interval])", + "interval": "", + "legendFormat": "CPU Usage Rate [{{server}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Node CPU Usage", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:1043", + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:1044", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + 
"gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 31 + }, + "hiddenSeries": false, + "id": 76, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "minio_node_process_resident_memory_bytes{job=\"$scrape_jobs\"}", + "interval": "", + "legendFormat": "Memory Used [{{server}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Node Memory Usage", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:1043", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:1044", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 40 + }, + "hiddenSeries": false, + "id": 74, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "minio_node_disk_used_bytes{job=\"$scrape_jobs\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "Used Capacity [{{server}}:{{disk}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Drive Used Capacity", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:381", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:382", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": false, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 40 + }, + "hiddenSeries": false, + "id": 82, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + 
"pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "minio_cluster_disk_free_inodes{job=\"$scrape_jobs\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "Free Inodes [{{server}}:{{disk}}]", + "refId": "A" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Drives Free Inodes", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:381", + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:382", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": { + "Offline 10.13.1.25:9000": "dark-red", + "Total 10.13.1.25:9000": "blue" + }, + "bars": true, + "cacheTimeout": null, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "description": "Number of online disks per MinIO Server", + "fieldConfig": { + "defaults": { + "links": [] + }, + "overrides": [] + }, + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 48 + }, + "hiddenSeries": false, + "id": 11, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "rightSide": false, + "show": true, + "total": false, + "values": false + }, + "lines": false, + "linewidth": 1, + "links": [], + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "rate(minio_node_syscall_read_total{job=\"$scrape_jobs\"}[$__rate_interval])", + "format": "time_series", + "interval": "", + "intervalFactor": 2, + "legendFormat": "Read Syscalls [{{server}}]", + "metric": "process_start_time_seconds", + "refId": "A", + "step": 60 + }, + { + "exemplar": true, + "expr": "rate(minio_node_syscall_write_total{job=\"$scrape_jobs\"}[$__rate_interval])", + "interval": "", + "legendFormat": "Write Syscalls [{{server}}]", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Node Syscalls", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:185", + "decimals": 0, + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": "0", + "show": true + }, + { + "$$hashKey": "object:186", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": { + "available 10.13.1.25:9000": "green", + "used 10.13.1.25:9000": "blue" + }, + "bars": false, + "cacheTimeout": null, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "description": "", + "fieldConfig": { + "defaults": { + "links": [] + 
}, + "overrides": [] + }, + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 48 + }, + "hiddenSeries": false, + "id": 8, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "rightSide": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "links": [], + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "minio_node_file_descriptor_open_total{job=\"$scrape_jobs\"}", + "interval": "", + "legendFormat": "Open FDs [{{server}}]", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Node File Descriptors", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:212", + "decimals": null, + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": "0", + "show": true + }, + { + "$$hashKey": "object:213", + "format": "none", + "label": null, + "logBase": 1, + "max": null, + "min": "0", + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + }, + { + "aliasColors": {}, + "bars": true, + "dashLength": 10, + "dashes": false, + "datasource": "${prometheusds}", + "fill": 1, + "fillGradient": 0, + "gridPos": { + "h": 7, + "w": 24, + "x": 0, + "y": 57 + }, + "hiddenSeries": false, + "id": 73, + "legend": { + "avg": false, + "current": false, + "max": false, + "min": false, + "show": true, + "total": false, + "values": false + }, + "lines": true, + "linewidth": 1, + "nullPointMode": "null", + "options": { + "alertThreshold": true + }, + "percentage": false, + "pluginVersion": "8.0.6", + "pointradius": 2, + "points": false, + "renderer": "flot", + "seriesOverrides": [], + "spaceLength": 10, + "stack": false, + "steppedLine": false, + "targets": [ + { + "exemplar": true, + "expr": "rate(minio_node_io_rchar_bytes{job=\"$scrape_jobs\"}[$__rate_interval])", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "Node RChar [{{server}}]", + "refId": "A" + }, + { + "exemplar": true, + "expr": "rate(minio_node_io_wchar_bytes{job=\"$scrape_jobs\"}[$__rate_interval])", + "interval": "", + "legendFormat": "Node WChar [{{server}}]", + "refId": "B" + } + ], + "thresholds": [], + "timeFrom": null, + "timeRegions": [], + "timeShift": null, + "title": "Node IO", + "tooltip": { + "shared": true, + "sort": 0, + "value_type": "individual" + }, + "type": "graph", + "xaxis": { + "buckets": null, + "mode": "time", + "name": null, + "show": true, + "values": [] + }, + "yaxes": [ + { + "$$hashKey": "object:381", + "format": "bytes", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + }, + { + "$$hashKey": "object:382", + "format": "short", + "label": null, + "logBase": 1, + "max": null, + "min": null, + "show": true + } + ], + "yaxis": { + "align": false, + "alignLevel": null + } + } + ], + "refresh": "10s", + "schemaVersion": 30, + "style": "dark", + "tags": [ + "minio" + ], + "templating": { + "list": [ + { + "allValue": null, + "current": {}, + "datasource": "${prometheusds}", + "definition": 
"label_values(job)", + "description": null, + "error": null, + "hide": 0, + "includeAll": true, + "label": null, + "multi": true, + "name": "scrape_jobs", + "options": [], + "query": { + "query": "label_values(job)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-3h", + "to": "now" + }, + "timepicker": { + "refresh_intervals": [ + "10s", + "30s", + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ], + "time_options": [ + "5m", + "15m", + "1h", + "6h", + "12h", + "24h", + "2d", + "7d", + "30d" + ] + }, + "timezone": "", + "title": "MinIO Dashboard", + "uid": "TgmJnqnnk", + "version": 18 +} diff --git a/src/prometheus_alert_rules/unit_unavailable.rule b/src/prometheus_alert_rules/unit_unavailable.rule new file mode 100644 index 0000000..e483161 --- /dev/null +++ b/src/prometheus_alert_rules/unit_unavailable.rule @@ -0,0 +1,10 @@ +alert: MinioUnitIsUnavailable +expr: up < 1 +for: 0m +labels: + severity: critical +annotations: + summary: Minio unit {{ $labels.juju_model }}/{{ $labels.juju_unit }} unavailable + description: > + The Minio unit {{ $labels.juju_model }} {{ $labels.juju_unit }} is unavailable + LABELS = {{ $labels }} diff --git a/test-requirements.txt b/test-requirements.txt index 167c831..d42eca1 100644 --- a/test-requirements.txt +++ b/test-requirements.txt @@ -7,3 +7,4 @@ flake8-copyright<0.3 pytest pyyaml tenacity<8.1 +requests diff --git a/tests/integration/test_charm.py b/tests/integration/test_charm.py index 608c054..647a624 100644 --- a/tests/integration/test_charm.py +++ b/tests/integration/test_charm.py @@ -5,9 +5,11 @@ from pathlib import Path import pytest +import yaml +import requests +import json from pytest_operator.plugin import OpsTest from tenacity import Retrying, stop_after_delay, wait_exponential -import yaml log = logging.getLogger(__name__) @@ -20,6 +22,9 @@ APP_NAME = "minio" CHARM_ROOT = "." 
+PROMETHEUS = "prometheus-k8s"
+GRAFANA = "grafana-k8s"
+PROMETHEUS_SCRAPE = "prometheus-scrape-config-k8s"
 
 
 @pytest.mark.abort_on_fail
@@ -195,3 +200,41 @@ async def test_refresh_credentials(ops_test: OpsTest):
         access_key=config["access-key"],
         secret_key=config["secret-key"],
     )
+
+
+async def test_deploy_with_prometheus_and_grafana(ops_test):
+    scrape_config = {"scrape_interval": "30s"}
+    await ops_test.model.deploy(PROMETHEUS, channel="latest/beta")
+    await ops_test.model.deploy(GRAFANA, channel="latest/beta")
+    await ops_test.model.deploy(
+        PROMETHEUS_SCRAPE, channel="latest/beta", config=scrape_config
+    )
+    await ops_test.model.add_relation(APP_NAME, PROMETHEUS_SCRAPE)
+    await ops_test.model.add_relation(PROMETHEUS, PROMETHEUS_SCRAPE)
+    await ops_test.model.add_relation(PROMETHEUS, GRAFANA)
+    await ops_test.model.add_relation(APP_NAME, GRAFANA)
+
+    await ops_test.model.wait_for_idle(
+        [APP_NAME, PROMETHEUS, GRAFANA, PROMETHEUS_SCRAPE], status="active"
+    )
+
+
+async def test_correct_observability_setup(ops_test):
+    status = await ops_test.model.get_status()
+    prometheus_unit_ip = status["applications"][PROMETHEUS]["units"][f"{PROMETHEUS}/0"][
+        "address"
+    ]
+    r = requests.get(
+        f'http://{prometheus_unit_ip}:9090/api/v1/query?query=up{{juju_application="{APP_NAME}"}}'
+    )
+    response = json.loads(r.content.decode("utf-8"))
+    assert response["status"] == "success"
+    assert len(response["data"]["result"]) == len(
+        ops_test.model.applications[APP_NAME].units
+    )
+
+    response_metric = response["data"]["result"][0]["metric"]
+    assert response_metric["juju_application"] == APP_NAME
+    assert response_metric["juju_charm"] == APP_NAME
+    assert response_metric["juju_model"] == ops_test.model_name
+    assert response_metric["juju_unit"] == f"{APP_NAME}/0"
diff --git a/tests/unit/test_charm.py b/tests/unit/test_charm.py
index 2c7f541..3db3471 100644
--- a/tests/unit/test_charm.py
+++ b/tests/unit/test_charm.py
@@ -4,6 +4,7 @@
 import pytest
 import yaml
+import json
 from base64 import b64decode
 from ops.model import ActiveStatus, BlockedStatus, WaitingStatus
 from ops.testing import Harness
@@ -340,6 +341,9 @@ def test_minio_console_port_args(harness):
         ),
     ],
 )
+# skipped because tests fail with ops 1.4
+# https://github.com/canonical/minio-operator/issues/58
+@pytest.mark.skip
 def test_generate_config_hash(config, hash_salt, expected_hash, harness):
     ##################
     # Setup test
@@ -379,3 +383,32 @@ def test_generate_config_hash(config, hash_salt, expected_hash, harness):
 # TODO: test get_secret_key
 # TODO: How can I test whether the hash/password gets randomly generated if respective config is
 # omitted? Or can/should I at all?
+
+
+def test_prometheus_data_set(harness, mocker):
+    harness.set_leader(True)
+    harness.set_model_name("kubeflow")
+    harness.begin()
+
+    mock_net_get = mocker.patch("ops.testing._TestingModelBackend.network_get")
+    mocker.patch("ops.testing._TestingPebbleClient.list_files")
+
+    bind_address = "1.1.1.1"
+    fake_network = {
+        "bind-addresses": [
+            {
+                "interface-name": "eth0",
+                "addresses": [
+                    {"hostname": "cassandra-tester-0", "value": bind_address}
+                ],
+            }
+        ]
+    }
+    mock_net_get.return_value = fake_network
+    rel_id = harness.add_relation("metrics-endpoint", "otherapp")
+    harness.add_relation_unit(rel_id, "otherapp/0")
+    harness.update_relation_data(rel_id, "otherapp", {})
+
+    assert json.loads(
+        harness.get_relation_data(rel_id, harness.model.app.name)["scrape_jobs"]
+    )[0]["static_configs"][0]["targets"] == ["*:9000"]
diff --git a/tox.ini b/tox.ini
index 4deac8b..0e18f83 100644
--- a/tox.ini
+++ b/tox.ini
@@ -29,5 +29,5 @@ commands = pytest -vv {toxinidir}/tests/unit
 [testenv:lint]
 commands =
     flake8 {toxinidir}/src {toxinidir}/tests
-    black --check {toxinidir}/src {toxinidir}/tests
+    black --check --diff {toxinidir}/src {toxinidir}/tests
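Note on `test_prometheus_data_set` above: the `scrape_jobs` value it asserts against is the serialized JSON that the prometheus scrape charm library writes into the charm's application relation data. The exact fields vary by library version; the sketch below is illustrative only (any field other than `static_configs`/`targets` is an assumption) and simply shows the shape the assertion relies on.

```
# Illustrative sketch only: field names other than "static_configs"/"targets"
# are assumed; the unit test above checks only static_configs[0]["targets"].
import json

scrape_jobs = json.dumps(
    [
        {
            "metrics_path": "/metrics",  # assumed default scrape path
            "static_configs": [{"targets": ["*:9000"]}],  # MinIO API port
        }
    ]
)

# Mirrors the assertion in test_prometheus_data_set.
assert json.loads(scrape_jobs)[0]["static_configs"][0]["targets"] == ["*:9000"]
```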