Add GCE recommended alerts for GPU VMs #774
a5b648d to d33b8b0
@@ -69,3 +69,31 @@ alert_policy_templates:
related_integrations:
- id: gce
platform: GCP
-
id: gpu-utilization-too-high
description: "Monitors GPU utilization across all GCE VMs in the current project and will notify you if the GPU utilization on any VM instance rises above 90% for 5 minutes or more. This requires the Ops Agent to be installed on VMs to collect the gpu utilization metric."
gpu -> GPU (update for all descriptions please)
"trigger": {
"count": 1
},
"query": "{ fetch gce_instance\n | metric 'agent.googleapis.com/gpu/memory/bytes_used'\n | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n | filter metric.memory_state == 'used'\n | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n | every 5m\n | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)]\n; fetch gce_instance\n | metric 'agent.googleapis.com/gpu/memory/bytes_used' \n | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n | every 5m\n | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)] }\n| ratio\n| mul (100)\n| cast_units ('%')\n| every 5m\n| condition val() > 0.9 '10^2.%'"
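For readers less familiar with MQL, the arithmetic in the query above can be sketched in Python. This is only an illustration of the ratio, not code from the PR: the function names, the `THRESHOLD_PERCENT` constant, and the sample byte counts are hypothetical. It assumes the second `fetch` (which has no `memory_state` filter) sums `bytes_used` over every memory state, and that `0.9 '10^2.%'` reads as 90%.

```python
from typing import Mapping

# Assumed reading of the condition `val() > 0.9 '10^2.%'`: 0.9 x 100% = 90%.
THRESHOLD_PERCENT = 90.0


def gpu_memory_used_percent(bytes_by_state: Mapping[str, float]) -> float:
    """Return the used share of one GPU's memory as a percentage.

    `bytes_by_state` maps a memory_state label (for example "used" or
    "free") to the 5-minute mean of bytes_used for that state.
    """
    # Denominator: the unfiltered branch, summed across all memory states.
    total = sum(bytes_by_state.values())
    if total <= 0:
        raise ValueError("no GPU memory reported for this series")
    # Numerator: the branch filtered to memory_state == 'used'.
    used = bytes_by_state.get("used", 0.0)
    # Mirrors `| ratio | mul (100) | cast_units ('%')`.
    return 100.0 * used / total


def should_alert(bytes_by_state: Mapping[str, float]) -> bool:
    """True when the used share is strictly above the threshold."""
    return gpu_memory_used_percent(bytes_by_state) > THRESHOLD_PERCENT


if __name__ == "__main__":
    sample = {"used": 9.5e9, "free": 0.5e9}  # hypothetical values
    print(gpu_memory_used_percent(sample), should_alert(sample))
```

The comparison is strict, matching the `>` in the condition, so a GPU sitting at exactly 90% would not trigger under this reading.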
Mild preference to express these new recommended alerts as equivalent PromQL instead of MQL going forward
cc @lyanco
Strong preference. We're announcing MQL Deprecation on July 17th.
b/343920635
This PR adds 4 new alert templates covering:
agent.googleapis.com/gpu/memory/bytes_used
The queries are implemented using MQL.
Screenshots of alert notifications: