
USHIFT-5279: MicroShift telemetry enhancement #1742

Open
wants to merge 5 commits into base: master

Conversation

pacevedom
Contributor

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 27, 2025

openshift-ci-robot commented Jan 27, 2025

@pacevedom: This pull request references USHIFT-5279 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jcantrill and sosiouxme January 27, 2025 13:30
deployments this might be a bit excessive due to resource usage, network
traffic and usage patterns.

MicroShift should send data once a day to minimize network traffic as it can be
Contributor

this isn't how Prometheus/Thanos works: if you don't send metrics at least every 5m, they will be marked as stale and disappear from the query result. Working with intermittent metrics is very hard in practice, and not following the same principle as OCP Telemetry metrics would complicate usage.

Contributor Author

What is the retention period for metrics in the Thanos server? It might be challenging for MicroShift deployments to send metrics at such a high cadence because of customer limitations (provided they are connected, which is also not a common use case).

Contributor

Not sure what you mean by retention period. The first-level Telemetry backend keeps metrics for 15 days; they are then "copied" to another tier and kept virtually forever.

Contributor Author

Was asking because of the querying. MicroShift's use cases are usually disconnected, but those that are connected might not have the best network, which is why we thought of reducing the amount of metrics we send. The current queries you mention are based on OpenShift, I understand? We should have separate dashboards/aggregation layers because the product is different from OCP. Are the ones you are talking about in Tableau or are they in Grafana?

Contributor

I'm talking about the Prometheus query API & Grafana (I'm not knowledgeable about Tableau).

Contributor Author

Would that be an issue if we had different dashboards? Our primary interest is to know how many clusters there are out there; usage is a secondary need in this case. Having sparse metrics should not make a huge difference for our purposes.

Contributor

Sparse metrics lead to gaps when querying. A workaround would be to use functions like last_over_time(foo[24h]) (e.g. "return the last received datapoint in the last 24 hours"). From my personal experience, this is less than ideal:

  • you'll have to educate users about this particularity
  • you'll need to maintain different dashboards
  • you can't differentiate between an instance being deleted vs. an instance which hasn't reported yet.
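
To make the workaround concrete, here is a minimal Go sketch that issues such a query against a Prometheus-compatible query API; the endpoint URL and the `microshift_version` metric name are placeholders for illustration only, not part of this proposal:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder query endpoint; any Prometheus-compatible API would do.
	base := "https://telemetry.example.com/api/v1/query"

	// Instead of relying on the 5m staleness window, take the last sample
	// seen for each series over the past 24 hours.
	params := url.Values{}
	params.Set("query", `last_over_time(microshift_version[24h])`)

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```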

Contributor Author

Spoke to our PM and decided to change the requirement from "once every 24h" to "once every hour", as there is not a huge impact (each request is about 2KB in size). None of the issues listed above should apply, right?
We were also wondering, is there some kind of rate limiting on the server side? Or some kind of specific 4xx error saying "send again in X minutes/at Y time"?

Contributor

As I said before, the Telemetry server expects to receive samples from a given cluster at least every 5m because of how Prometheus/Thanos handles stale metrics. This is why the OCP telemeter client is configured to report metrics every 4m30s. From https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, this time series will be marked as stale. If a target is removed, the previously retrieved time series will be marked as stale soon after removal.

If a query is evaluated at a sampling timestamp after a time series is marked as stale, then no value is returned for that time series. If new samples are subsequently ingested for that time series, they will be returned as expected.

A time series will go stale when it is no longer exported, or the target no longer exists. Such time series will disappear from graphs at the times of their latest collected sample, and they will not be returned in queries after they are marked stale.

Contributor Author

Added notes about staleness and dashboards here.

|version|$microshift_version|
|type|microshift-rpm, microshift-ostree|

Metrics from MicroShift are already supported in the API because OpenShift is
Contributor

The Telemetry server has an allow-list of metrics which comes from OCP, and incoming samples which don't match the list are dropped. What's the strategy to keep MicroShift metrics in sync with OCP metrics?

Contributor Author

Given the characteristics of MicroShift we think the list included here is enough for our purposes, as there are many parts of OCP that are missing and we do not plan to adopt any of them.
Also, MicroShift is an RPM-based product, so we also have other means of reporting what's installed and what isn't. These metrics help build a closer-to-reality picture of what a system is running.
Keeping MicroShift metrics in sync can be done by having CI clusters report their metrics and checking our stats in Grafana as part of our CI duties.

Contributor

can you put that in the enhancement?

having CI clusters report their metrics and check our stats using Grafana as part of our CI duties.

I expect the process to be automated; otherwise I suspect that it will drift fast.

Contributor Author

Added info here.

instance, not viable in the typical resource constrained MicroShift deployment.
The direct write requests take raw Prometheus data and require both special
labels (`_id` for the cluster id) and crafting the authentication headers
(cluster id and pull secret in a specific format in HTTP headers).
Contributor

it would be good to assess the potential impact on the Telemetry write endpoint (https://infogw.api.openshift.com/metrics/v1/receive).

Contributor Author

You mean in bandwidth used per day? Number of connections?

Contributor

@moadz would have more insights but my concerns would mainly be about network traffic and authentication requests.

Contributor Author

@pacevedom pacevedom Feb 6, 2025

Included an estimation of data sent per metrics payload/report.
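
As a rough illustration of what such a direct write request looks like, here is a hedged Go sketch; the endpoint is the one mentioned above in this thread, while the payload and the header composition are placeholders, since the enhancement only states that the cluster id and pull secret go into HTTP headers in a specific format:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Telemetry write endpoint referenced in this thread.
	endpoint := "https://infogw.api.openshift.com/metrics/v1/receive"

	// Placeholders: the real client derives the token from the pull secret
	// and labels every series with `_id` set to the cluster id.
	clusterID := "00000000-0000-0000-0000-000000000000"
	token := "<token-derived-from-pull-secret>"

	// The body would be a snappy-compressed Prometheus remote-write protobuf;
	// an empty byte slice stands in for it in this sketch.
	payload := []byte{}

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	// Header composition is illustrative only; the exact format is not
	// spelled out in this conversation.
	req.Header.Set("Authorization", "Bearer "+clusterID+" "+token)
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```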

@openshift-ci openshift-ci bot requested a review from moadz January 30, 2025 13:59
API to get data from connected MicroShift clusters.

### User Stories
As Red Hat, I want to enable MicroShift clusters to report back to me to get

Where do users see this telemetry data? Are there any increased system requirements, storage, or networking to use telemetry?

Contributor Author

@pacevedom pacevedom Feb 6, 2025

Understanding the user as an application admin using MicroShift, users can't see any of this, as the reports are intended to provide information to Red Hat. Users who want to have metrics on their systems should look into an observability stack instead (like Prometheus).
As for system requirements, the resource usage of this feature is negligible, so there is no increase in any aspect. There is an estimation here.

telemetry data is sent.


### Goals

How would a user potentially apply this information to their use case?

Contributor Author

Understanding the user as an application admin using MicroShift, the only relevant metrics for them might be those about resource usage, besides their own applications running on top. All metrics here relate to the system and MicroShift core components.


As a MicroShift admin, I want to have the option to opt-out of telemetry.

As a MicroShift admin, I want to have the option to configure how often

Is this actually possible?

Contributor Author

This is about which moment in the day (or rather, after how many hours/minutes/seconds since MicroShift started) you want MicroShift to send metrics to Red Hat. It will default to 24h.

Contributor Author

In light of the latest changes, we don't have this goal anymore (and the configuration option disappeared).

* Get a better understanding of deployment and usage patterns on MicroShift.

### Non-Goals
* Have an in-cluster metrics service.

Contributor Author

This line is basically mentioning the monitoring stack (Prometheus, Alertmanager, etc.). The document in the link is targeting OCP systems and operators, and we have none of them. In this enhancement we are using the same API on the backend, but the client is not the same.

these reasons there must be an opt-out for this functionality, as OpenShift
already [allows](https://docs.openshift.com/container-platform/4.17/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html).

Taking advantage of the configuration options that the feature requires, an

So will a user toggle off and also have to use the OCP steps, or can we automate those steps after the opt-out toggle is selected?

Contributor Author

Just using the configuration option to disable it (status: Disabled) will do. No need to follow OCP's opt-out procedure, which involves changing pull secrets.

**MicroShift** is the MicroShift main process.

1. MicroShift starts up.
2. MicroShift reads configuration. If telemetry is not enabled, finish here. If telemetry is enabled proceed to next step.

OK, so it seems like we want to configure this at install, prior to first boot. Can we turn this on later and restart? (And vice versa, turn it off and restart.)

Contributor Author

Yes, you can enable/disable this every time you restart MicroShift; it is not tied to the install/first boot.

### Risks and Mitigations
Using the telemeter API requires connected clusters, and this might not always
be the case with MicroShift. For those clusters that are not connected there
is no easy way of getting metrics from them. A possible mitigation would be to

Or is this something we gather that an admin can then use when an edge device is about to be upgraded or decommissioned?

Contributor Author

That would also be interesting; it would need to be connected, though. Let me give it some thought and include it here.


MicroShift runs as an application on top of R4E (RHEL for Edge); the usage of RHEL
Insights may yield some information about MicroShift, but there might be gaps.
RHEL Insights knows which packages are installed, but not whether

Is it useful to mention RHEL Insights as a counterpoint in the MicroShift docs?

Contributor Author

Is there something about this in the R4E docs?

No, the RHEL docs have that completely separate from the install types.

@pacevedom pacevedom force-pushed the USHIFT-5279 branch 2 times, most recently from f5df57c to f88ea83 on February 6, 2025 08:51
@simonpasquier
Contributor

/assign @jan--f
/assign @moadz


openshift-ci bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jan--f. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pacevedom pacevedom force-pushed the USHIFT-5279 branch 2 times, most recently from 49ce572 to 46ee728 on February 7, 2025 16:07

@simonpasquier simonpasquier left a comment

I'm not sure if the intent is to

  1. collect metrics at short intervals and send everything in batch every hour.
  2. or collect and send metrics every hour.

|resource|Used to specify K8s resource names: pods, namespaces, etc.|
|instance|IP address of the node|
|version|MicroShift version|
|ostree_commit|OSTree commit ID, if the system is deployed using ostree|
Contributor

I'm not sure I understand where these labels apply. Is it expected that they are added to all reported metrics?

Contributor Author

There is a summary on which labels each metric uses here.

@pacevedom
Contributor Author

I'm not sure if the intent is to
collect metrics at short intervals and send everything in batch every hour.
or collect and send metrics every hour.

@simonpasquier the intention is to send them every hour, collecting them on the spot; there will be no batching. There are no references to batching anymore and the 1h sending schedule is described here.
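
A minimal Go sketch of that flow, with hypothetical names (`telemetryEnabled`, `collectAndSend`) standing in for the actual MicroShift implementation: check the config, bail out if telemetry is disabled, otherwise collect and send on a 1h ticker and once more on shutdown, with nothing batched or cached in between.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// collectAndSend stands in for "collect the metrics on the spot and send them
// right away"; nothing is batched or cached between runs.
func collectAndSend() {
	fmt.Println("report sent at", time.Now().Format(time.RFC3339))
}

func main() {
	telemetryEnabled := true // would come from the MicroShift config (status: Enabled/Disabled)
	if !telemetryEnabled {
		return // opt-out: nothing is collected or sent
	}

	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	ticker := time.NewTicker(time.Hour) // the 1h cadence discussed in this thread
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			collectAndSend()
		case <-ctx.Done():
			collectAndSend() // last report as part of the stop/finish process
			return
		}
	}
}
```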

### Sending metrics
OpenShift is sending metrics through this API every 4.5 minutes. For MicroShift
deployments this might be a bit excessive due to resource usage, network
traffic and usage patterns.
Contributor

I'd expect concrete data explaining the cost in terms of CPU/mem/network usage. Also, if you go with sending metrics every hour, the trade-offs need to be documented IMHO (e.g. the query tweaks I've described in a previous comment).

Contributor Author

Included some more info on this and the trade-offs.

Combining metrics with their labels:
| Metric | Labels |
|:---|:---|
|cluster:capacity_cpu_cores:sum|_id, instance, ostree_commit, version|
Contributor

ostree_commit and version should go into another metric (microshift_info for instance) instead of being duplicated with every metric.
Also the corresponding OCP metrics have no instance label (they are cluster-aggregated metrics).

Contributor Author

Done
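
For reference, this is the usual Prometheus "info metric" pattern the comment refers to: a constant gauge whose value is always 1 and whose labels carry the metadata, so the other series do not need to repeat them. The sketch below uses client_golang purely to illustrate the shape of the metric; the label values are placeholders, and MicroShift's telemetry client pushes metrics rather than exposing an endpoint.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// Info-style metric: the value is always 1 and the interesting data lives
	// in the labels, so version/ostree_commit are not duplicated elsewhere.
	info := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "microshift_info",
		Help: "Metadata about the MicroShift deployment.",
	}, []string{"version", "ostree_commit"})
	reg.MustRegister(info)

	// Placeholder values; the real ones come from the running system.
	info.WithLabelValues("4.19.0", "abc123").Set(1)

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

On the query side, the metadata can then be joined onto other series when needed, e.g. something like `cluster:usage:containers:sum * on(_id) group_left(version, ostree_commit) microshift_info`.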

|cluster:usage:resources:sum|_id, instance, ostree_commit, resource|
|cluster:usage:containers:sum|_id, instance, ostree_commit|

Each metric, including a single sample and all labels, is under 250B in size.
Contributor

cluster:usage:resources:sum will have more than one sample (many resource label values).

Contributor Author

Included an estimation of sizing for each label and the whole write request. Also included a note about this metric having 6 different time series based on the resource.
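
As a back-of-envelope check on those numbers, a quick Go sketch of the estimate; the per-series figure and the six resource series come from this thread, while the count of single-series metrics is an assumption for illustration:

```go
package main

import "fmt"

func main() {
	// Figures from this thread: each metric with one sample and all its labels
	// is under 250B; cluster:usage:resources:sum contributes 6 time series.
	// The number of single-series metrics below is an assumption for the sketch.
	const (
		bytesPerSeries = 250
		singleSeries   = 4
		resourceSeries = 6
	)
	total := bytesPerSeries * (singleSeries + resourceSeries)
	fmt.Printf("rough upper bound per report: %d bytes (~%.1f KiB)\n", total, float64(total)/1024)
}
```

This lands in the same ballpark as the ~2KB per request mentioned earlier in the conversation.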

@pacevedom pacevedom force-pushed the USHIFT-5279 branch 2 times, most recently from d8972fd to 80ec9c3 on February 10, 2025 22:02

Using the names and label sizes listed above, each sample requiring a float64
and an int64 value, `cluster:usage:resources:sum` having 6 different time
series (one per resource type, as listed in the proposal), all metrics combined
Contributor

cluster:usage:resources:sum having 6 different time
series (one per resource type, as listed in the proposal)

hmm I don't understand why only 6? Are you saying that you will only report a specific set of resources?

Contributor Author

Ah, this is because of the requirements from the feature. For k8s resources it lists namespaces, pods, routes, ingresses, services and the number of CRDs. They are also listed in the enhancement.

Contributor

hmm cluster:usage:resources:sum wouldn't report the number of CRDs?

I'd also go with the detailed characteristics for each metric, it'd be more explicit from my POV. E.g.

### Metrics and their labels:
#### `cluster:usage:resources:sum`

This metric reports the number of Kubernetes resources by resource type.

Labels
* `_id`
  * The ID of the cluster (UUID string).
  * storage requirements: ...
* `resource`
  * The name of the Kubernetes resource. The values can be `pod`, `namespace`, <full list>
  * storage requirements: ...
* ...

Contributor Author

Changed the format to the one in the comment, which is better IMHO. I am not clear on the storage requirements, though.

|cluster:memory_usage_bytes:sum|_id, ostree_commit|
|cluster:usage:resources:sum|_id, ostree_commit, resource|
|cluster:usage:containers:sum|_id, ostree_commit|
|cluster_version|_id, ostree_commit, version|
Contributor

I'd advocate for a different metric name; otherwise it might break other users of the existing cluster_version since the labels are going to be different.

Contributor Author

@pacevedom pacevedom Feb 11, 2025

This makes sense. Would using different values for the type label help in any way? We would like to avoid creating dependencies on other teams, if possible.
MicroShift does not have the concept of current/failed upgrade because it is atomic and requires a restart.

Contributor

Would using different values for the type label help in any way?

If 2 different systems emit the same metric, we want ideally that the metrics have the same semantics and that the label sets are the same. Otherwise it makes querying less robust (queries might assume that a particular label is always present). In particular cluster_version is a base metric used to derive other Telemetry metrics.

Contributor Author

Agree with keeping semantics. Will apply for a new metric because this one serves as a base to many other things and we do not want to disrupt OpenShift metrics queries.

Combining metrics with their labels:
| Metric | Labels |
|:---|:---|
|cluster:capacity_cpu_cores:sum|_id, label_kubernetes_io_arch, ostree_commit|
Contributor

What about the other labels found in OCP: label_beta_kubernetes_io_instance_type, label_node_role_kubernetes_io, label_node_openshift_io_os_id? Are there useful values in the case of MicroShift?

Note that cluster:capacity_cpu_cores:sum is eventually used by the subscription watch service to report core usage and we don't want to introduce side effects that would disrupt this reporting. From what I can tell, the proposed implementation should be safe but it would be good to mention it in the proposal.

https://github.com/openshift/telemeter/blob/f63fef91d659c4127838da6e3dc696e07c0e5536/jsonnet/telemeter/rules.libsonnet#L182-L198

Contributor Author

@pacevedom pacevedom Feb 11, 2025

MicroShift does not have support for some of the labels in OpenShift. For example, these are from my dev deployment:

{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "localhost.localdomain",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhel",
  "topology.topolvm.io/node": "localhost.localdomain"
}

All of them except for the hostname are always the same.

I understand that OpenShift always has rhcos instead of linux/rhel as the OS in every deployment (including HyperShift, HCP, etc.)? Would that be something we can use to differentiate?

Contributor

I understand that OpenShift is always having rhcos instead of linux/rhel as the OS in every deployment (including hypershift, hcp, etc.)?

No, there are different values depending on the node OS (e.g. rhel, linux, windows, ...). Again, the goal here should be to emit the same label set: even if the values are always the same for MicroShift instances, data consistency on the Telemetry server side is very important.

Contributor Author

Since MicroShift is not using label_node_kubernetes_io_instance_type yet, we will fix it to rhde to signal that this cluster belongs to Red Hat Device Edge. This way we have the same labels and the same semantics, with only another possible value to help us distinguish the product.

@jan--f
Contributor

jan--f commented Feb 19, 2025

This looks good with all limitations already pointed out in mind.
One thing that might be good to mention explicitly is that metrics will be collected ad hoc before they are sent and that no batching (or caching) will be implemented, as this would have unforeseen consequences in the telemetry backend.

@pacevedom
Contributor Author

This looks good with all limitations already pointed out in mind. One thing that might be good to mention explicitly is that metrics will be collected ad hoc before they are sent and that no batching (or caching) will be implemented, as this would have unforeseen consequences in the telemetry backend.

Added that info in a note. Thanks!


openshift-ci bot commented Feb 20, 2025

@pacevedom: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

`ostree`, `bootc`.
* `deployment_id`: In case `deployment_type` is `ostree`, this will hold the
ostree commit. In case of `bootc`, it will hold the image SHA. The format is a
64-character SHA.
Contributor

what's the cardinality of this label? E.g. given 10,000 MicroShift instances reporting to Telemetry, how many different values can we expect?

Contributor Author

The intention of this label is to get to know about fleets of devices, which should all have the same ostree commit/bootc image IDs. The typical use case will be one value per MicroShift version, but it could be above that.
Each customer will have their own IDs because they customize base images/commits. So a rough estimation of the number of different values for this would be the number of customers * the number of different MicroShift versions they have.

For these reasons, and because the intention of this enhancement is to get the
basic structure of a metrics reporting system, MicroShift will start using 1h
intervals to send metrics, which is a good compromise between resource usage
and data granularity.
Contributor

After reading it again, I'm not sure how sending metrics every hour helps with unreliable networking. It's quite the opposite in fact, since there's less chance for the transfer to complete if it isn't attempted often. Except if the system retries until it succeeds?

Contributor Author

By mentioning unreliable networks the enhancement refers to the possibility of having gaps in the metrics.
There is no mention of retries because we allow gaps to happen and we do not require continuous reporting, so if some request gets lost we can tolerate waiting until the next one.

hour.
It will also send metrics when issued to stop, as part of the finish process.
This helps in sending a last report and also hints at when the cluster has been
stopped.
Contributor

(nit) from experience: when a cluster stops reporting metrics, it's almost impossible to differentiate between the cluster having been shut down and an issue in the telemetry exchange.
