USHIFT-5279: MicroShift telemetry enhancement #1742
base: master
Conversation
@pacevedom: This pull request references USHIFT-5279, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
deployments this might be a bit excessive due to resource usage, network
traffic and usage patterns.

MicroShift should send data once a day to minimize network traffic as it can be
this isn't how Prometheus/Thanos works: if you don't send metrics at least every 5m, they will be marked as stale and disappear from the query result. Working with intermittent metrics is very hard in practice and not following the same principle as OCP Telemetry metrics would complicate usage.
What is the retention period for metrics in the Thanos server? It might be challenging for MicroShift deployments to send metrics at such a high cadence because of customer limitations (provided they are connected, which is also not a common use case).
Not sure what you mean by retention period. The first-level Telemetry backend keeps metrics for 15 days, they are then "copied" to another tier and kept virtually forever.
Was asking because of the querying. MicroShift's use cases are usually disconnected, but for those that are connected they might not have the best network, which is why we thought of reducing the amount of metrics we send. The current queries you mention are based on OpenShift, I understand? We should have separate dashboards/aggregation layers because the product is different than OCP. Are the ones you are talking about in tableau or are they in grafana?
I'm talking about the Prometheus query API & Grafana (I'm not knowledgeable about Tableau).
Would that be an issue if we had different dashboards? Our primary interest is to know how many clusters there are out there, usage is a secondary need in this case. Having sparse metrics should not make a huge difference for our intentions.
Sparse metrics lead to gaps when querying. A workaround would be to use functions like `last_over_time(foo[24h])` (e.g. "return the last received datapoint in the last 24 hours"). From my personal experience, this is less than ideal:
- you'll have to educate users about this particularity
- you'll need to maintain different dashboards
- you can't differentiate between an instance being deleted vs. an instance which hasn't reported yet.
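For illustration only, a minimal Go sketch of what that query-side workaround looks like. The querier address is a placeholder and the metric name is just one of those proposed in this PR; none of this is part of the enhancement itself.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; the real Telemetry/Thanos querier would go here.
	client, err := api.NewClient(api.Config{Address: "https://thanos-querier.example.com"})
	if err != nil {
		panic(err)
	}
	queryAPI := promv1.NewAPI(client)

	// A plain `cluster:capacity_cpu_cores:sum` query shows gaps once samples
	// arrive less often than the 5m staleness window; wrapping it in
	// last_over_time keeps the most recent sample visible over the lookback.
	const query = `last_over_time(cluster:capacity_cpu_cores:sum[24h])`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := queryAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```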
Spoke to our PM and decided to change the requirement from "once every 24h" to "once every hour", as there is not a huge impact (each request is about 2KB in size). None of the issues listed above should apply, right?
We were also wondering, is there some kind of rate limiting on the server side? Or some kind of specific 4xx error saying "send again in X minutes/at Y time"?
As I said before, the Telemetry server expects to receive samples from a given cluster at least every 5m because of how Prometheus/Thanos handles stale metrics. This is why the OCP telemeter client is configured to report metrics every 4m30s. From https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
If a target scrape or rule evaluation no longer returns a sample for a time series that was previously present, this time series will be marked as stale. If a target is removed, the previously retrieved time series will be marked as stale soon after removal.
If a query is evaluated at a sampling timestamp after a time series is marked as stale, then no value is returned for that time series. If new samples are subsequently ingested for that time series, they will be returned as expected.
A time series will go stale when it is no longer exported, or the target no longer exists. Such time series will disappear from graphs at the times of their latest collected sample, and they will not be returned in queries after they are marked stale.
Added notes about staleness and dashboards here.
|version|$microshift_version|
|type|microshift-rpm, microshift-ostree|

Metrics from MicroShift are already supported in the API because OpenShift is
The Telemetry server has an allow-list of metrics which comes from OCP and incoming samples which don't match the list are dropped. What's the strategy to keep MicroShift metrics in sync with OCP metrics?
Given the characteristics of MicroShift, we think the list included here is enough for our purposes, as there are many parts of OCP that are missing and we do not plan to adopt any of them.
Also, MicroShift is an RPM-based product, so we also have other means of reporting what's installed and what isn't. These metrics help build a closer-to-reality picture of what a system is running.
Keeping MicroShift metrics in sync may be done by having CI clusters report their metrics and checking our stats in Grafana as part of our CI duties.
can you put that in the enhancement?
having CI clusters report their metrics and check our stats using Grafana as part of our CI duties.
I expect the process to be automated otherwise I suspect that it will drift fast.
Added info here.
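One hedged possibility for automating that check (none of this is in the enhancement; the querier URL, cluster id and metric list below are placeholders) is a small Go job that fails CI when an expected metric has no recent sample for the CI cluster:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// missingMetrics returns the telemetry metrics that have no sample for the
// given CI cluster id within the last hour.
func missingMetrics(ctx context.Context, q promv1.API, clusterID string, metrics []string) ([]string, error) {
	var missing []string
	for _, name := range metrics {
		// last_over_time tolerates the 1h reporting interval discussed in this PR.
		query := fmt.Sprintf(`last_over_time(%s{_id=%q}[1h])`, name, clusterID)
		res, _, err := q.Query(ctx, query, time.Now())
		if err != nil {
			return nil, err
		}
		if vec, ok := res.(model.Vector); !ok || vec.Len() == 0 {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

func main() {
	// Placeholder querier address and cluster id.
	client, err := api.NewClient(api.Config{Address: "https://telemetry-querier.example.com"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	missing, err := missingMetrics(ctx, promv1.NewAPI(client), "00000000-0000-0000-0000-000000000000",
		[]string{"cluster:capacity_cpu_cores:sum", "cluster:usage:resources:sum", "cluster:usage:containers:sum"})
	if err != nil {
		panic(err)
	}
	if len(missing) > 0 {
		panic(fmt.Sprintf("metrics missing from Telemetry: %v", missing))
	}
	fmt.Println("all expected metrics present")
}
```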
instance, not viable in the typical resource constrained MicroShift deployment.
The direct write requests take raw Prometheus data and require both special
labels (`_id` for the cluster id) and crafting the authentication headers
(cluster id and pull secret in a specific format in HTTP headers).
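As a purely illustrative sketch of those mechanics: the endpoint is the one discussed below, while the metric, cluster id and especially the Authorization header value are placeholders, since the exact header format is defined by the Telemetry service rather than by this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func buildWriteRequest(clusterID string, cores float64) ([]byte, error) {
	now := time.Now().UnixMilli()
	wr := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			// Every series carries the `_id` label so the server can attribute
			// the samples to this cluster.
			Labels: []prompb.Label{
				{Name: "__name__", Value: "cluster:capacity_cpu_cores:sum"},
				{Name: "_id", Value: clusterID},
			},
			Samples: []prompb.Sample{{Value: cores, Timestamp: now}},
		}},
	}
	raw, err := wr.Marshal()
	if err != nil {
		return nil, err
	}
	// Remote-write bodies are snappy-compressed protobuf.
	return snappy.Encode(nil, raw), nil
}

func main() {
	body, err := buildWriteRequest("00000000-0000-0000-0000-000000000000", 4)
	if err != nil {
		panic(err)
	}
	req, err := http.NewRequest(http.MethodPost,
		"https://infogw.api.openshift.com/metrics/v1/receive", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	// Placeholder: the real header carries the cluster id and pull-secret token
	// in whatever format the Telemetry service expects.
	req.Header.Set("Authorization", "Bearer <cluster-id-and-pull-secret-token>")
	fmt.Println("prepared request for", req.URL, "with", len(body), "payload bytes")
}
```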
It would be good to assess the potential impact on the Telemetry write endpoint (https://infogw.api.openshift.com/metrics/v1/receive).
You mean in bandwidth used per day? Number of connections?
@moadz would have more insights but my concerns would mainly be about network traffic and authentication requests.
Included an estimation of data sent per metrics payload/report.
API to get data from connected MicroShift clusters.

### User Stories
As Red Hat, I want to enable MicroShift clusters to report back to me to get
Where do users see this telemetry data? Are there any increased system requirements, storage, or networking to use telemetry?
Understanding "user" as an application admin using MicroShift, users can't see any of this, as the reports are intended to provide information to Red Hat. Users who want metrics on their systems should look into an observability stack instead (like Prometheus).
As for system requirements, the resource usage of this feature is negligible, so there is no increase in any aspect. There is an estimation here.
telemetry data is sent.

### Goals
How would a user potentially apply this information to their use case?
Understanding "user" as an application admin using MicroShift, the only relevant metrics for them might be those about resource usage, besides their own applications running on top. All metrics here are related to the system and MicroShift core components.
As a MicroShift admin, I want to have the option to opt-out of telemetry.

As a MicroShift admin, I want to have the option to configure how often
Is this actually possible?
This is about which moment in the day (or rather, after how many hours/minutes/seconds since MicroShift started) you want MicroShift to send metrics to Red Hat. It will default to 24h.
In light of the latest changes, we don't have this goal anymore (and the configuration option disappeared).
* Get a better understanding of deployment and usage patterns on MicroShift.

### Non-Goals
* Have an in-cluster metrics service.
Per https://rhobs-handbook.netlify.app/products/openshiftmonitoring/telemetry.md/#requirements, is this required?
This line is basically mentioning the monitoring stack (Prometheus, alert-manager, etc). The document in the link is targeting OCP systems and operators and we have none of them. In this enhancement we are using the same API on the backend, but the client is not the same.
these reasons there must be an opt-out for this functionality, as OpenShift
already [allows](https://docs.openshift.com/container-platform/4.17/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html).

Taking advantage of the configuration options that the feature requires, an
So will a user toggle off and also have to use the OCP steps, or can we automate those steps after the opt-out toggle is selected?
Just using the configuration option to disable it (`status: Disabled`) will do. No need to follow OCP's opt-out procedure, which involves changing pull secrets.
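A minimal sketch of what that opt-out could look like in code, assuming a hypothetical config section shape: only the `status: Disabled` option comes from the discussion above, the Go field and section names are illustrative and not the final MicroShift config schema.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// TelemetryConfig is a hypothetical config section; only the status option
// is taken from the enhancement discussion.
type TelemetryConfig struct {
	// "Enabled" (default) or "Disabled".
	Status string `yaml:"status"`
}

type Config struct {
	Telemetry TelemetryConfig `yaml:"telemetry"`
}

func main() {
	raw := []byte("telemetry:\n  status: Disabled\n")
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	// If disabled, MicroShift would simply never start the telemetry sender.
	fmt.Println("telemetry enabled:", cfg.Telemetry.Status != "Disabled")
}
```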
**MicroShift** is the MicroShift main process.

1. MicroShift starts up.
2. MicroShift reads configuration. If telemetry is not enabled, finish here. If telemetry is enabled, proceed to the next step.
OK, so it seems like we want to configure this at install, prior to first boot. Can we turn this on later and restart? (And vice versa, turn it off and restart.)
Yes, you can enable/disable this every time you restart MicroShift; it is not tied to the install/first boot.
### Risks and Mitigations
Using the telemeter API requires connected clusters, and this might not always
be the case with MicroShift. For those clusters that are not connected there
is no easy way of getting metrics from them. A possible mitigation would be to
Or is this something we gather that an admin can then use when an edge device is about to be upgraded or decommissioned?
That would also be interesting; it would need to be connected, though. Let me give it some thought and include it here.
MicroShift runs as an application on top of R4E (RHEL for Edge); the usage of RHEL
Insights may yield some information about MicroShift, but there might be gaps.
RHEL Insights knows about which packages are installed, but not whether
Is it useful to mention RHEL Insights as a counterpoint in the MicroShift docs?
Is there something about this in the R4E docs?
No, the RHEL docs have that completely separate from the install types.
I'm not sure if the intent is to
- collect metrics at short intervals and send everything in batch every hour.
- or collect and send metrics every hour.
|resource|Used to specify K8s resource names: pods, namespaces, etc.|
|instance|IP address of the node|
|version|MicroShift version|
|ostree_commit|OSTree commit id, if the system is deployed using ostree|
I'm not sure to understand where these labels apply. Is it expected that they are added to all reported metrics?
There is a summary on which labels each metric uses here.
@simonpasquier the intention is to send them every hour, collecting them on the spot; there will be no batching. There are no references to batching anymore and the 1h sending schedule is described here.
### Sending metrics
OpenShift is sending metrics through this API every 4.5 minutes. For MicroShift
deployments this might be a bit excessive due to resource usage, network
traffic and usage patterns.
I'd expect concrete data explaining the cost in terms of CPU/mem/network usage. Also, if you go with sending metrics every hour, the trade-offs need to be documented IMHO (e.g. the query tweaks I've described in a previous comment).
Included some more info on this and the trade-offs.
Combining metrics with their labels:
| Metric | Labels |
|:---|:---|
|cluster:capacity_cpu_cores:sum|_id, instance, ostree_commit, version|
`ostree_commit` and `version` should go into another metric (`microshift_info` for instance) instead of being duplicated with every metric.
Also the corresponding OCP metrics have no `instance` label (they are cluster-aggregated metrics).
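A hedged sketch of that suggestion (the `microshift_info` name comes from the comment above; the label values are made up): an info-style gauge fixed at 1 carries the identity labels, so the usage metrics stay label-light and the details can be recovered with a join on the cluster id when needed.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// microshift_info carries version and deployment identity as labels and is
// always set to 1, so usage metrics don't need to repeat those labels.
var microshiftInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "microshift_info",
		Help: "Constant 1; carries version and deployment identity as labels.",
	},
	[]string{"version", "ostree_commit"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(microshiftInfo)

	// Illustrative values only.
	microshiftInfo.WithLabelValues("4.19.0", "abcdef0123456789").Set(1)

	families, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, mf := range families {
		fmt.Println(mf.GetName(), "-", len(mf.GetMetric()), "series")
	}
}
```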
Done
|cluster:usage:resources:sum|_id, instance, ostree_commit, resource|
|cluster:usage:containers:sum|_id, instance, ostree_commit|

Each metric, including a single sample and all labels, is under 250B in size.
`cluster:usage:resources:sum` will have more than one sample (many `resource` label values).
Included an estimation of sizing for each label and the whole write request. Also included a note about this metric having 6 different time series based on the resource.
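For what it's worth, one way to sanity-check such an estimation is to marshal a representative write request and measure it. A hedged Go sketch follows; the cluster id, commit and resource values are placeholders based on the metrics and resource types mentioned in this thread.

```go
package main

import (
	"fmt"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// series builds one time series with the common labels plus any extras.
func series(name string, extra map[string]string) prompb.TimeSeries {
	labels := []prompb.Label{
		{Name: "__name__", Value: name},
		{Name: "_id", Value: "00000000-0000-0000-0000-000000000000"},
		{Name: "ostree_commit", Value: "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"},
	}
	for k, v := range extra {
		labels = append(labels, prompb.Label{Name: k, Value: v})
	}
	return prompb.TimeSeries{
		Labels:  labels,
		Samples: []prompb.Sample{{Value: 1, Timestamp: 1700000000000}},
	}
}

func main() {
	wr := &prompb.WriteRequest{}
	wr.Timeseries = append(wr.Timeseries,
		series("cluster:capacity_cpu_cores:sum", map[string]string{"label_kubernetes_io_arch": "amd64"}),
		series("cluster:memory_usage_bytes:sum", nil),
		series("cluster:usage:containers:sum", nil),
	)
	// One series per reported resource type; the values are placeholders.
	for _, r := range []string{"namespaces", "pods", "routes", "ingresses", "services", "customresourcedefinitions"} {
		wr.Timeseries = append(wr.Timeseries, series("cluster:usage:resources:sum", map[string]string{"resource": r}))
	}

	raw, err := wr.Marshal()
	if err != nil {
		panic(err)
	}
	compressed := snappy.Encode(nil, raw)
	fmt.Printf("uncompressed: %d bytes, snappy-compressed: %d bytes\n", len(raw), len(compressed))
}
```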
Using the names and label sizes listed above, each sample requiring a float64
and an int64 value, and `cluster:usage:resources:sum` having 6 different time
series (one per resource type, as listed in the proposal), all metrics combined
Regarding "`cluster:usage:resources:sum` having 6 different time series (one per resource type, as listed in the proposal)": hmm, I don't understand why only 6? Are you saying that you will only report a specific set of resources?
Ah, this is because of the requirements from the feature. For k8s resources it lists namespaces, pods, routes, ingress, services and number of CRDs. They are also listed in the enhancement.
hmm, `cluster:usage:resources:sum` wouldn't report the number of CRDs?
I'd also go with the detailed characteristics for each metric, it'd be more explicit from my POV. E.g.:

### Metrics and their labels:

#### `cluster:usage:resources:sum`

This metric reports the number of Kubernetes resources by resource type.

Labels:
* `_id`
  * The ID of the cluster (UUID string).
  * storage requirements: ...
* `resource`
  * The name of the Kubernetes resource. The values can be `pod`, `namespace`, <full list>
  * storage requirements: ...
* ...
Changed the format to the one in the comment, which is better imho. I am not clear on the storage requirements, though?
|cluster:memory_usage_bytes:sum|_id, ostree_commit|
|cluster:usage:resources:sum|_id, ostree_commit, resource|
|cluster:usage:containers:sum|_id, ostree_commit|
|cluster_version|_id, ostree_commit, version|
I'd advocate for a different metric name, otherwise it might break other users of the existing `cluster_version` since the labels are going to be different.
This makes sense. Would using different values for the `type` label help in any way? We would like to avoid creating dependencies on other teams, if possible.
MicroShift does not have the concept of a current/failed upgrade because it is atomic and requires a restart.
Would using different values for the type label help in any way?

If 2 different systems emit the same metric, we ideally want the metrics to have the same semantics and the label sets to be the same. Otherwise it makes querying less robust (queries might assume that a particular label is always present). In particular, `cluster_version` is a base metric used to derive other Telemetry metrics.
Agree with keeping semantics. Will apply for a new metric because this one serves as a base for many other things and we do not want to disrupt OpenShift metrics queries.
Combining metrics with their labels:
| Metric | Labels |
|:---|:---|
|cluster:capacity_cpu_cores:sum|_id, label_kubernetes_io_arch, ostree_commit|
What about the other labels found in OCP: `label_beta_kubernetes_io_instance_type`, `label_node_role_kubernetes_io`, `label_node_openshift_io_os_id`? Are there useful values in the case of MicroShift?
Note that `cluster:capacity_cpu_cores:sum` is eventually used by the subscription watch service to report core usage and we don't want to introduce side effects that would disrupt this reporting. From what I can tell, the proposed implementation should be safe but it would be good to mention it in the proposal.
MicroShift does not have support for some of the labels in OpenShift. For example, these are from my dev deployment:
```json
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "localhost.localdomain",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": "",
  "node-role.kubernetes.io/master": "",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhel",
  "topology.topolvm.io/node": "localhost.localdomain"
}
```
All of them except for the hostname are always the same.
I understand that OpenShift always has `rhcos` instead of `linux`/`rhel` as the OS in every deployment (including HyperShift, HCP, etc.)? Would that be something we can use to differentiate?
I understand that OpenShift is always having rhcos instead of linux/rhel as the OS in every deployment (including hypershift, hcp, etc.)?
No, there are different values depending on the node OS (e.g. rhel, linux, windows, ...). Again, the goal here should be to emit the same label set: even if the values are always the same for MicroShift instances, data consistency on the Telemetry server side is very important.
Since MicroShift is not using `label_node_kubernetes_io_instance_type` yet, we will fix it to `rhde` to signal that the cluster belongs to Red Hat Device Edge. This way we have the same labels and the same semantics, with just another possible value to help us distinguish the product.
This looks good with all limitations already pointed out in mind.
Added that info in a note. Thanks!
@pacevedom: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
`ostree`, `bootc`.
* `deployment_id`: In case `deployment_type` is `ostree`, this will hold the
ostree commit. In case of `bootc`, include image SHA. Format is a 64
character SHA.
what's the cardinality of this label? E.g. given 10,000 MicroShift instances reporting to Telemetry, how many different values can we expect?
The intention of this label is to get to know about fleets of devices, which should all have the same ostree commit/bootc image IDs. The typical use case will be one value per MicroShift version, but it could be more than that.
Each customer will have their own IDs because they customize base images/commits. So a rough estimation of the number of different values would be the number of customers multiplied by the number of different MicroShift versions they have.
For these reasons, and because the intention of this enhancement is to get the
basic structure of a metrics reporting system, MicroShift will start using 1h
intervals to send metrics, which is a good compromise between resource usage
and data granularity.
After reading it again, I'm not sure how sending metrics every hour helps with unreliable networking. It's quite the opposite in fact, since there's less chance for the transfer to complete if it isn't attempted often. Except if the system retries until it succeeds?
By mentioning unreliable networks the enhancement refers to the possibility of having gaps in the metrics.
There is no mention of retries because we allow gaps to happen and we do not require continuous reporting, so if some request gets lost we can tolerate waiting until the next one.
hour.
It will also send metrics when issued to stop, as part of the finish process.
This helps in sending a last report and also hints at when the cluster has been
stopped.
(nit) From experience: when a cluster stops reporting metrics, it's almost impossible to differentiate between the cluster having been shut down and there being an issue in the telemetry exchange.
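For context, a minimal sketch (function names are placeholders, not MicroShift's actual implementation) of the behaviour being discussed: collect and send on the spot every hour, no batching and no retries, plus one final report when asked to stop.

```go
package main

import (
	"context"
	"log"
	"time"
)

// sendReport would collect the current metric values and POST them to the
// Telemetry receive endpoint; a failed attempt is simply logged, leaving a
// gap until the next tick.
func sendReport(ctx context.Context) error {
	// ... collect metrics and perform the remote-write request ...
	return nil
}

func runTelemetryLoop(ctx context.Context) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := sendReport(ctx); err != nil {
				log.Printf("telemetry report failed, will retry at the next interval: %v", err)
			}
		case <-ctx.Done():
			// Shutdown: send one last report so the backend gets a hint that
			// the cluster stopped (with the caveat in the nit above that a
			// missing report alone cannot distinguish shutdown from a
			// telemetry issue).
			shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()
			if err := sendReport(shutdownCtx); err != nil {
				log.Printf("final telemetry report failed: %v", err)
			}
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go runTelemetryLoop(ctx)
	// ... MicroShift runs ...
	time.Sleep(time.Second)
	cancel()
	time.Sleep(time.Second) // give the final report a moment in this toy example
}
```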