Document OpenShift-on-OpenStack correlated observability

Add instructions how to query OpenShift and OpenStack metrics together. Co-authored-by: Pierre Prinetti <[email protected]> Co-authored-by: Martin André <[email protected]>
openshift · Oct 17, 2024 · b946306 · b946306
1 parent c69dd7a
commit b946306
Showing 2 changed files with 383 additions and 0 deletions.
diff --git a/docs/user/openstack/README.md b/docs/user/openstack/README.md
@@ -54,6 +54,7 @@ It covers the installation with the default CNI (OVNKubernetes).
 
 ## Reference Documents
 
+- [Observability](observability.md)
 - [Privileges](privileges.md)
 - [Control plane machine set](control-plane-machine-set.md)
 - [Known Issues and Workarounds](known-issues.md)

diff --git a/docs/user/openstack/observability.md b/docs/user/openstack/observability.md
@@ -0,0 +1,382 @@
+# Observability of OpenShift on OpenStack
+
+This document explains how to correlate OpenStack and OpenShift metrics to get
+a better view of the stack and help troubleshoot issues affecting your
+clusters.
+
+## Make your OpenStack and OpenShift metrics available in the same metric store
+
+The strategy we will be outlining in this document is to make both OpenStack
+and OpenShift metrics available in a single Prometheus instance.
+
+There are a number of ways to achieve this goal. Here we document two methods:
+
+* Method A: use the Prometheus feature
+  [Remote-Write][prometheus-docs-remote-write] to send both OpenStack and
+  OpenShift metrics to an external instance
+* Method B: configure the OpenStack Prometheus instance to pull selected
+  metrics from the OpenShift federation endpoint, so that the data is combined
+  in the single OpenStack Prometheus instance
+
+[prometheus-docs-remote-write]: https://prometheus.io/docs/specs/remote_write_spec/ "Prometheus Remote-Write Specification"
+
+### Method A: Use Remote-Write to send RHOSO and OCP metrics to an external instance
+
+#### Set up the external storage
+
+In this example, we are using an external Prometheus instance to store the
+metrics.
+
+We will set up remote-write from both OpenStack and OpenShift, authenticating
+them with mTLS (mutual TLS). The target Prometheus needs to be configured to
+[accept client TLS certificates][prometheus-mtls], and
+[Remote-Write][prometheus-remote-write-receiver-flag].
+
+[prometheus-mtls]: https://prometheus.io/docs/prometheus/latest/configuration/https/
+[prometheus-remote-write-receiver-flag]: https://prometheus.io/docs/prometheus/latest/feature_flags/#remote-write-receiver "Prometheus feature flags: Remote-Write receiver"
+
+
+<!--
+To generate test certificates:
+
+```bash
+# Generate a CA if you don't have one already
+openssl genrsa -out ca.key 4096
+openssl req -batch -new -x509 -key ca.key -out ca.crt
+
+# Generate the client certificates and sign them:
+for target in server ocp-client osp-client; do
+    openssl genrsa -out "${target}.key" 4096
+    openssl req -batch -new -key "${target}.key" -out "${target}.csr"
+    openssl x509 -req -CA ca.crt -CAkey ca.key -CAcreateserial -in "${target}.csr" -out "${target}.crt"
+done
+```
+-->
+
+<!--
+For testing purposes, we can do the following to set up basic auth in addition
+to or instead of mTLS:
+
+1. Provision a Fedora VM
+2. Install `dnf install golang-github-prometheus caddy`
+3. Configure prometheus to enable remote write (and limit retention to avoid
+   filling up disk space). In `/etc/default/prometheus`, add the following
+   line:
+
+```
+ARGS='--enable-feature=remote-write-receiver --storage.tsdb.retention.time=1d'
+```
+
+4. Enable and restart the Prometheus systemd unit
+5. Add a security group rule to allow HTTPS (port 443)
+6. Set up Caddy with (`/etc/caddy/Caddyfile`):
+
+```Caddyfile
+https://external-prometheus.example {
+
+    basicauth {
+        # caddy hash-password
+        user hashed-password
+    }
+
+    reverse_proxy http://localhost:9090
+}
+```
+-->
+
+
+
+We will assume that the external Prometheus is reachable at the URL
+`https://external-prometheus.example`.
+
+#### Set up remote-write from RHOSO's telemetry-operator
+
+Telemetry should be enabled in the RHOSO environment. If it is not the case,
+refer to the
+[documentation](https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/rhoso-observability_custom_dataplane#rhoso-observability_rhoso-observability).
+
+<!--
+Essentially, enabling telemetry boils down to flipping a property of the
+openstackcontrolplane object:
+
+```bash
+oc patch OpenStackControlPlane/controlplane --type merge -p '{"spec":{"telemetry":{"enabled": true, "template":{"ceilometer":{"enabled": true}}}}}'
+```
+-->
+
+> [!NOTE]
+> Make sure you have the Cluster Observability Operator installed in the
+> OpenShift cluster running the OpenStack control plane, as this is a requirement
+> for the OpenStack Telemetry Operator. Follow [these
+> directions](https://github.com/openstack-k8s-operators/architecture/blob/main/examples/dt/uni01alpha/control-plane.md#cluster-observability-operator)
+> to install it.
+
+To check that the telemetry machinery is correctly installed, issue this
+command:
+
+```bash
+oc -n openstack get monitoringstacks metric-storage -o yaml
+```
+
+The `monitoringstacks` CRD being installed is a good indicator that telemetry
+is functional.
+
+Before configuring remote-write in RHOSO's telemetry operator, create a secret
+in the `openstack` namespace containing the HTTPS client certificates for
+authenticating to Prometheus. We'll call it `mtls-bundle`:
+
+```bash
+oc --namespace openstack \
+    create secret generic mtls-bundle \
+        --from-file=./ca.crt \
+        --from-file=osp-client.crt \
+        --from-file=osp-client.key
+```
+
+Then, edit the controlplane configuration to set up the metric storage:
+
+```bash
+oc edit openstackcontrolplane/controlplane
+```
+
+We will configure RHOSO's telemetry operator to write metrics to our external
+Prometheus instance.
+
+Look for the `metricStorage` stanza. It can be found at the
+`.spec.telemetry.template.metricStorage` path. We will need to use a
+`customMonitoringStack` structure that cannot coexist with the
+`monitoringStack` one. Replace the `metricStorage` structure with one that
+looks like this:
+
+```yaml
+      metricStorage:
+        customMonitoringStack:
+          alertmanagerConfig:
+            disabled: false
+          logLevel: info
+          prometheusConfig:
+            scrapeInterval: 30s
+            remoteWrite:
+            - url: https://external-prometheus.example/api/v1/write
+              tlsConfig:
+                ca:
+                  secret:
+                    name: mtls-bundle
+                    key: ca.crt
+                cert:
+                  secret:
+                    name: mtls-bundle
+                    key: osp-client.crt
+                keySecret:
+                  name: mtls-bundle
+                  key: osp-client.key
+            replicas: 2
+          resourceSelector:
+            matchLabels:
+              service: metricStorage
+          resources:
+            limits:
+              cpu: 500m
+              memory: 512Mi
+            requests:
+              cpu: 100m
+              memory: 256Mi
+          retention: 1d # Set the desired retention interval
+        dashboardsEnabled: false
+        dataplaneNetwork: ctlplane
+        enabled: true
+        prometheusTls: {}
+```
+
+After saving the file and letting the change propagate, verify that you receive
+OpenStack metrics in the external Prometheus.
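+
+As a quick check, a query along these lines can be run in the expression
+browser of the external Prometheus. It is only a sketch: it assumes that the
+OpenStack metrics of interest carry the `ceilometer_` prefix, like the
+`ceilometer_cpu` metric used later in this document.
+
+```PromQL
+# Number of series received for each Ceilometer metric.
+# An empty result suggests that remote-write from RHOSO is not working yet.
+count by (__name__) ({__name__=~"ceilometer_.+"})
+```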
+
+#### Set up remote-write from the OCP cluster-monitoring-operator
+
+Refer to the [OpenShift documentation][ocp_docs] for configuring its monitoring stack.
+
+In this example we will [create a cluster monitoring
+configuration][create_cluster_monitoring_config], [set up
+remote-write][setup_remote_write], and [label the cluster metrics with
+a cluster identifier][add_labels].
+
+Optionally, since metrics will be collected externally, you can set a reduced retention for local metrics.
+
+The resulting `cluster-monitoring-config` ConfigMap could then resemble this:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: cluster-monitoring-config
+  namespace: openshift-monitoring
+data:
+  config.yaml: |
+    prometheusK8s:
+      retention: 1d # Set the desired retention interval
+      remoteWrite:
+      - url: "https://external-prometheus.example/api/v1/write"
+        writeRelabelConfigs:
+        - sourceLabels:
+          - __tmp_openshift_cluster_id__
+          targetLabel: cluster_id
+          action: replace
+        tlsConfig:
+          ca:
+            secret:
+              name: mtls-bundle
+              key: ca.crt
+          cert:
+            secret:
+              name: mtls-bundle
+              key: ocp-client.crt
+          keySecret:
+            name: mtls-bundle
+            key: ocp-client.key
+```
+
+Save it to a file named `cluster-monitoring-config.yaml`. Before applying it,
+create the secret containing the HTTPS client certificates, similar to what we
+did for RHOSO. We're still calling the secret `mtls-bundle`, but this time in
+the `openshift-monitoring` namespace:
+
+```bash
+oc --namespace openshift-monitoring \
+    create secret generic mtls-bundle \
+        --from-file=./ca.crt \
+        --from-file=ocp-client.crt \
+        --from-file=ocp-client.key
+```
+
+Once you have created the secret, it's time to apply the cluster-monitoring configuration:
+
+```bash
+oc apply -f cluster-monitoring-config.yaml
+```
+
+Let the change propagate and verify that you receive OpenShift metrics in the
+external Prometheus.
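+
+A sketch of a query to run in the external Prometheus for that check, assuming
+the `cluster_id` relabeling shown above is in place and that `kube_node_info`
+is among the metrics being forwarded:
+
+```PromQL
+# Number of distinct nodes reported by each OpenShift cluster.
+count by (cluster_id) (
+  group by (cluster_id, node) (kube_node_info)
+)
+```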
+
+[ocp_docs]: https://docs.openshift.com/container-platform/4.17/observability/monitoring/configuring-the-monitoring-stack.html#configuring_remote_write_storage_configuring-the-monitoring-stack "Configuring the monitoring stack"
+[create_cluster_monitoring_config]: https://docs.openshift.com/container-platform/4.17/observability/monitoring/configuring-the-monitoring-stack.html#creating-cluster-monitoring-configmap_configuring-the-monitoring-stack "Creating a cluster monitoring config map"
+[setup_remote_write]: https://docs.openshift.com/container-platform/4.17/observability/monitoring/configuring-the-monitoring-stack.html#configuring-remote-write-storage_configuring-the-monitoring-stack "Configuring remote write storage"
+[add_labels]: https://docs.openshift.com/container-platform/4.17/observability/monitoring/configuring-the-monitoring-stack.html#adding-cluster-id-labels-to-metrics_configuring-the-monitoring-stack "Adding cluster ID labels to metrics"
+
+### Method B: Scrape OCP metrics from RHOSO
+
+As an alternative to writing both sets of metrics to an external store, it's
+possible to consolidate them in the Prometheus instance managed by RHOSO's
+existing telemetry-operator.
+
+OpenShift provides a federation endpoint that exposes a subset of metrics to an
+external scraper. You can follow [these instructions][federation] to get
+acquainted with the endpoint.
+
+[federation]: https://docs.redhat.com/en/documentation/openshift_container_platform/4.17/html/monitoring/accessing-third-party-monitoring-apis#monitoring-querying-metrics-by-using-the-federation-endpoint-for-prometheus_accessing-monitoring-apis-by-using-the-cli "OpenShift documentation: Querying metrics by using the federation endpoint for Prometheus"
+
+#### Gather credentials and coordinates
+
+While logged in to the OpenShift cluster as a user authenticated with a
+password (as opposed to using the `kubeconfig` file generated by the
+installer), fetch a token:
+
+```bash
+oc whoami -t
+```
+
+Then get the Prometheus federation route URL:
+
+```bash
+oc -n openshift-monitoring get route prometheus-k8s-federate -o jsonpath='{.status.ingress[].host}'
+```
+
+#### Let RHOSO scrape OpenShift's federation endpoint
+
+As stated in the [OpenShift documentation][ocp-federation-docs], it is
+recommended to limit scraping to fewer than 1000 samples per request, and to
+scrape at most once every 30 seconds.
+
+[ocp-federation-docs]: https://docs.openshift.com/container-platform/4.17/observability/monitoring/accessing-third-party-monitoring-apis.html#monitoring-querying-metrics-by-using-the-federation-endpoint-for-prometheus_accessing-monitoring-apis-by-using-the-cli
+
+In this example, we will only request two metrics: `kube_node_info` and
+`kube_persistentvolume_info` (see the `params.match[]` query below).
+
+While connected to the RHOSO cluster, apply this manifest:
+
+```yaml
+apiVersion: monitoring.rhobs/v1alpha1
+kind: ScrapeConfig
+metadata:
+  labels:
+    service: metricStorage
+  name: sos1-federated
+  namespace: openstack
+spec:
+  params:
+    'match[]':
+    - '{__name__=~"kube_node_info|kube_persistentvolume_info"}'
+  metricsPath: '/federate'
+  authorization:
+    type: Bearer
+    credentials:
+      name: ocp-federated
+      key: token
+  scheme: HTTPS # or HTTP
+  scrapeInterval: 30s
+  staticConfigs:
+  - targets:
+    - prometheus-k8s-federate-openshift-monitoring.apps.openshift.example # This is the URL fetched previously
+  # add a tlsConfig stanza in case the endpoint is HTTPS but uses a custom CA
+```
+
+Don't forget to make the token available as a secret (in the example above, the name is `ocp-federated`):
+
+```bash
+oc -n openstack create secret generic ocp-federated --from-literal=token=<the token fetched previously>
+```
+
+Once the new `ScrapeConfig` propagates, the requested OpenShift metrics will be
+accessible for querying in RHOSO's OpenShift UI.
+
+## Available mappings
+
+To query metrics and identify resources across the stack, OpenShift exposes
+helper metrics that establish a correlation between OpenStack infrastructure
+resources and their representation in OpenShift.
+
+To map **Kubernetes nodes** to **OpenStack Nova instances**:
+* in the metric `kube_node_info`:
+  * `node` is the Kubernetes node name
+  * `provider_id` contains the identifier of the corresponding OpenStack Nova instance
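+
+As an illustration, the instance identifier can be extracted into its own
+label. This sketch assumes that `provider_id` has the form
+`openstack:///<instance UUID>` (optionally with a region between the second
+and third slash); `nova_instance_id` is a label name chosen for this example:
+
+```PromQL
+# One series per node, with the Nova instance ID in a dedicated label.
+group by (node, nova_instance_id) (
+  label_replace(
+    kube_node_info,
+    "nova_instance_id", "$1",
+    "provider_id", "openstack://[^/]*/(.+)"
+  )
+)
+```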
+
+To map **Kubernetes persistent volumes** to **OpenStack Cinder volumes or Manila shares**:
+* in the metric `kube_persistentvolume_info`:
+  * `persistentvolume` is the Kubernetes volume name
+  * `csi_volume_handle` is the Cinder volume or Manila share identifier
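+
+For example, the following sketch lists the persistent volumes backed by
+Cinder together with their volume identifiers. It assumes that the
+`csi_driver` label is present on `kube_persistentvolume_info` and that the
+Cinder CSI driver is registered as `cinder.csi.openstack.org`; for Manila
+shares the driver name would be different:
+
+```PromQL
+# One series per Cinder-backed persistent volume.
+group by (persistentvolume, csi_volume_handle) (
+  kube_persistentvolume_info{csi_driver="cinder.csi.openstack.org"}
+)
+```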
+
+### Example
+
+By default, the Nova VMs backing the OpenShift control plane nodes are created
+in a server group with policy "soft-anti-affinity". As a consequence, Nova will
+create them on separate hypervisors, on a best effort basis. However, if the
+state of the OpenStack cluster doesn't permit it (for example, because only two
+hypervisors are available), the VMs will be created anyway.
+
+In combination with the default soft-anti-affinity policy, it might be
+interesting to set up an alert firing when a hypervisor hosts more than one
+control plane node of a given OpenShift cluster, to highlight the degraded
+level of high availability.
+
+This query returns the number of OpenShift control plane (master) nodes per OpenStack host:
+
+```PromQL
+sum by (vm_instance) (
+  group by (vm_instance, resource) (ceilometer_cpu)
+    / on (resource) group_right(vm_instance) (
+      group by (node, resource) (
+        label_replace(kube_node_info, "resource", "$1", "system_uuid", "(.+)")
+      )
+    / on (node) group_left group by (node) (
+      cluster:master_nodes
+    )
+  )
+)
+```
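+
+To turn this into an alert condition, the same expression can be compared
+against a threshold. A sketch, assuming that more than one control plane node
+on the same host is what should fire:
+
+```PromQL
+# Returns only the hosts running more than one control plane node.
+sum by (vm_instance) (
+  group by (vm_instance, resource) (ceilometer_cpu)
+    / on (resource) group_right(vm_instance) (
+      group by (node, resource) (
+        label_replace(kube_node_info, "resource", "$1", "system_uuid", "(.+)")
+      )
+    / on (node) group_left group by (node) (
+      cluster:master_nodes
+    )
+  )
+) > 1
+```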