Merge pull request #176 from jhajduk-microsoft/hpc-alerting-amba
Hpc alerting amba
JoeyBarnes authored Mar 26, 2024
2 parents 9a6d516 + 37926e0 commit 2fb9a0d
Showing 13 changed files with 336 additions and 5 deletions.
4 changes: 2 additions & 2 deletions docs/content/patterns/alz/_index.md
@@ -34,9 +34,9 @@ Monitoring baselines for the above components are proposed to be deployed levera
- Network security groups
- Azure route tables

In addition to the component specific alerts mentioned above the repo also contains policies for deploying service health alerts by subscription.
In addition to the component specific alerts mentioned above the repo also contains policies for deploying service health alerts by subscription.

Alerts are based on Microsoft public guidance where available, and on practical application experience where public guidance is not available. For more details on which alerts are included please refer to [Alert Details](../alz/Alerts-Details).
Alerts are based on Microsoft public guidance where available, and on practical application experience where public guidance is not available. For more details on which alerts are included please refer to [Alert Details](../alz/Alerts-Details).

For details on how policies are grouped into initiatives please refer to [Azure Policy Initiatives](../alz/Policy-Initiatives)

13 changes: 13 additions & 0 deletions docs/content/patterns/specialized/hpc/Alerting-and-Monitoring.md
@@ -0,0 +1,13 @@
---
title: HPC Monitoring and Alerting
geekdocCollapseSection: true
weight: 30
---

## Overview

This page provides the alert settings for HPC infrastructure. We may update these settings as we continue to work with a breadth of customers.

## Alerts

{{< hpcMetricAlerts >}}
24 changes: 24 additions & 0 deletions docs/content/patterns/specialized/hpc/_index.md
@@ -0,0 +1,24 @@
---
title: High Performance Compute
geekdocCollapseSection: true
---

## Overview

High Performance Compute supports a variety of workloads. Seismic modeling, fluid dynamics, and Artificial Intelligence workloads all require more powerful compute, networking, and storage than traditional workloads. Monitoring these environments is critical to ensure business continuity. You cannot improve what you do not measure. Monitoring HPC workload infrastructure involves implementing alerts and monitoring for Virtual Machines, Storage, and Networking across the stack. Alerting for these resources involves monitoring CPU/GPU utilization, throughput/availability, and stability. In this section we provide alert recommendations for the following HPC-centric resources:

* Virtual Machines
* Azure Batch Service
* Azure NetApp Files
* Azure Blob Storage
* Azure Managed Lustre Filesystem - Coming Soon!

Please note that an HPC Landing Zone is built on top of the best practices of the Azure Landing Zone. The approach for broader monitoring and alerting in the context of the Azure Landing Zone can be found [here](https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/Monitoring-and-Alerting/).

## Azure High Performance Computing on Demand

[Azure High Performance Computing on Demand (Az-HOP)](https://learn.microsoft.com/azure/cloud-adoption-framework/scenarios/azure-hpc/azure-hpc-landing-zone-accelerator) is our HPC Landing Zone accelerator. It provides Grafana dashboards to monitor your cluster. It uses [Azure CycleCloud](https://learn.microsoft.com/azure/cyclecloud/overview?view=cyclecloud-8) as a scheduler.

## GPU Monitoring

We are working on explicit GPU metrics to monitor for HPC/AI workloads. Until then, Azure HPC Ubuntu VMs come with [Moneo](https://github.com/Azure/Moneo).
95 changes: 95 additions & 0 deletions docs/layouts/shortcodes/hpcMetricAlerts.html
@@ -0,0 +1,95 @@
<div><table>
<tr>
<th>Alert Name</th>
<th>Component</th>
<th>Metric</th>
<th>Aggregation</th>
<th>Operator</th>
<th>Threshold</th>
<th>Window</th>
<th>Frequency</th>
<th>Severity</th>
<th>Scope</th>
<th>Support for Multiple Resources</th>
<th>Verified</th>
<th>References</th>
</tr>

{{ range $category, $types := $.Site.Data }}
{{ range $type, $rules := $types }}
{{ range $rules.alerts }}
{{ if or (eq .visible true) (eq $.Site.Params.ambaDevMode true) }}
{{ if and (eq .type "Metric") (in .tags "hpc") }}
{{ $data := newScratch }}
{{ if isset . "deployments" }}
{{ range where .deployments "type" "Policy" }}
{{ if in .tags "hpc" }}
{{ $data.Set "name" .name }}
{{ $data.Set "url" (relURL (path.Join "services" $category $type .template)) }}
{{ $data.Set "scope" .properties.scope }}
{{ $data.Set "multiResource" .properties.multiResource }}
{{ end }}
{{ end }}
{{ end }}
<tr>
<td>
<a href='{{ $data.Get "url" }}'>{{ $data.Get "name" }}</a>
</td>
<td>
{{ .properties.metricNamespace }}
</td>
<td>
{{ .properties.metricName }}
</td>
<td>
{{ .properties.timeAggregation }}
</td>
<td>
{{ .properties.operator }}
</td>
<td>
{{ if eq .properties.criterionType "DynamicThresholdCriterion" }}
dynamic
{{ else }}
{{ .properties.threshold }}
{{ end }}
</td>
<td>
{{ .properties.windowSize }}
</td>
<td>
{{ .properties.evaluationFrequency }}
</td>
<td>
{{ .properties.severity }}
</td>
<td>
{{ $data.Get "scope" }}
</td>
<td>
{{ if ($data.Get "multiResource") }}
Yes
{{ else }}
No
{{ end }}
</td>
<td>
{{ if .verified }}
Y
{{ else }}
N
{{ end }}
</td>
<td>
{{ range .references }}
<a href="{{ .url }}" target="_blank">{{ .name }}</a>
{{ end }}
</td>
</tr>
{{ end }}
{{ end }}
{{ end }}
{{ end }}
{{ end }}

</table></div>
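The selection logic in this shortcode can be restated outside Hugo as a rough Python sketch. It assumes the data tree has the shape category → type → `alerts`, as the nested `range` calls imply. The function name `hpc_metric_rows` and the `dev_mode` argument (standing in for `$.Site.Params.ambaDevMode`) are hypothetical, and the `relURL` step is left out.

```python
import posixpath


def hpc_metric_rows(site_data, dev_mode=False):
    """Collect one row per visible, hpc-tagged Metric alert (illustrative only)."""
    rows = []
    # Go templates range over maps in sorted key order, so sort here too.
    for category, types in sorted(site_data.items()):
        for type_name, rules in sorted(types.items()):
            for alert in rules.get("alerts") or []:
                if not (alert.get("visible") is True or dev_mode):
                    continue
                if alert.get("type") != "Metric" or "hpc" not in (alert.get("tags") or []):
                    continue
                # Mirrors the Scratch: these stay unset unless a Policy
                # deployment that is itself tagged "hpc" supplies them.
                row = {"name": None, "url": None, "scope": None, "multiResource": None}
                for dep in alert.get("deployments") or []:
                    if dep.get("type") == "Policy" and "hpc" in (dep.get("tags") or []):
                        dep_props = dep.get("properties") or {}
                        row["name"] = dep.get("name")
                        row["url"] = posixpath.join(
                            "services", category, type_name, dep.get("template", "")
                        )
                        row["scope"] = dep_props.get("scope")
                        row["multiResource"] = dep_props.get("multiResource")
                props = alert["properties"]
                row["metric"] = props["metricName"]
                dynamic = props.get("criterionType") == "DynamicThresholdCriterion"
                row["threshold"] = "dynamic" if dynamic else props.get("threshold")
                rows.append(row)
    return rows
```

Under this reading, an alert that defines no matching `deployments` entry (such as the `UnusableNodeCount` rule added in this commit) still produces a table row, but its name and link cells appear to come out empty, because both are taken from the deployment rather than from the alert itself.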
60 changes: 59 additions & 1 deletion services/Batch/batchAccounts/alerts.yaml
@@ -1,11 +1,32 @@
- name: UnusableNodeCount
description: Number of unusable nodes
type: Metric
verified: false
visible: true
tags:
- auto-generated
- agc-520
- hpc
properties:
metricName: UnusableNodeCount
metricNamespace: Microsoft.Batch/batchAccounts
severity: 2
windowSize: PT5M
evaluationFrequency: PT1M
timeAggregation: Total
operator: GreaterThan
criterionType: StaticThresholdCriterion
threshold: 2.5
autoMitigate: false
- name: OfflineNodeCount
description: Number of offline nodes
type: Metric
verified: false
visible: false
visible: true
tags:
- auto-generated
- agc-416
- hpc
properties:
metricName: OfflineNodeCount
metricNamespace: Microsoft.Batch/batchAccounts
@@ -24,6 +45,8 @@
visible: true
tags:
- auto-generated
- agc-329
- hpc
- agc-371
properties:
metricName: TaskFailEvent
@@ -35,3 +58,38 @@
operator: GreaterThan
criterionType: StaticThresholdCriterion
threshold: 0.0
autoMitigate: false
- name: Rebooting Node Count
description: Number of rebooting nodes
type: Metric
verified: false
visible: true
tags: [hpc]
properties:
metricName: RebootingNodeCount
metricNamespace: Microsoft.Batch/batchAccounts
severity: 1
windowSize: PT5M
evaluationFrequency: PT1M
timeAggregation: Total
operator: GreaterThan
criterionType: StaticThresholdCriterion
threshold: 0.0
autoMitigate: false
- name: Preempted Node Count
description: Number of preempted nodes
type: Metric
verified: false
visible: true
tags: [hpc]
properties:
metricName: PreemptedNodeCount
metricNamespace: Microsoft.Batch/batchAccounts
severity: 1
windowSize: PT5M
evaluationFrequency: PT1M
timeAggregation: Total
operator: GreaterThan
criterionType: StaticThresholdCriterion
threshold: 0.0
autoMitigate: false
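To make the rule fields above concrete, here is a simplified Python sketch of how a static-threshold rule such as `UnusableNodeCount` could be read: aggregate the samples that fall in one `windowSize`, then compare the result with `threshold`. This is an illustration under stated assumptions, not Azure Monitor's evaluation engine. The helper names `parse_duration` and `would_fire` and the sample values are hypothetical, and only the `PT…H…M` duration forms used in this file are handled.

```python
import operator
import re
from datetime import timedelta

# Assumed mapping from the YAML's timeAggregation values to plain functions.
AGGREGATIONS = {
    "Total": sum,
    "Average": lambda xs: sum(xs) / len(xs),
    "Minimum": min,
    "Maximum": max,
    "Count": len,
}

# Assumed mapping from the YAML's operator values to comparisons.
OPERATORS = {
    "GreaterThan": operator.gt,
    "GreaterThanOrEqual": operator.ge,
    "LessThan": operator.lt,
    "LessThanOrEqual": operator.le,
}


def parse_duration(text):
    """Parse ISO 8601 durations of the form PT<h>H<m>M, e.g. 'PT5M'."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes)


def would_fire(properties, samples):
    """Return True if the aggregated window breaches the static threshold."""
    aggregate = AGGREGATIONS[properties["timeAggregation"]]
    compare = OPERATORS[properties["operator"]]
    return compare(aggregate(samples), properties["threshold"])
```

With `Total` aggregation and a threshold of 2.5, five one-minute samples of 1, 1, 1, 0, 0 sum to 3 and would fire, whereas 1, 1, 0, 0, 0 sum to 2 and would not. The rules with a threshold of 0.0 fire on any non-zero total in the window.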
6 changes: 6 additions & 0 deletions services/Compute/virtualMachineScaleSets/alerts.yaml
@@ -56,6 +56,8 @@
visible: true
tags:
- auto-generated
- agc-11764
- hpc
- agc-9755
properties:
metricName: Percentage CPU
@@ -164,6 +166,8 @@
visible: true
tags:
- auto-generated
- agc-1543
- hpc
- agc-1740
properties:
metricName: Available Memory Bytes
@@ -261,6 +265,8 @@
visible: true
tags:
- auto-generated
- agc-422
- hpc
- agc-258
properties:
metricName: Network In
20 changes: 20 additions & 0 deletions services/Compute/virtualMachines/alerts.yaml
@@ -6,6 +6,7 @@
visible: false
tags:
- alz
- hpc
properties:
metricName: Available Memory Bytes
metricNamespace: Microsoft.Compute/virtualMachines
@@ -36,6 +37,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -93,6 +95,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: LessThan
@@ -150,6 +153,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -248,6 +252,7 @@
type: Policy
tags:
- alz
- hpc
properties:
scope: Subscription
multiResource: false
@@ -266,6 +271,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -321,6 +327,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -376,6 +383,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -431,6 +439,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: LessThan
@@ -486,6 +495,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -539,6 +549,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: GreaterThan
@@ -588,6 +599,7 @@
visible: true
tags:
- alz
- hpc
properties:
severity: 2
operator: LessThan
@@ -692,6 +704,9 @@
verified: true
visible: true
tags:
- auto-generated
- agc-130712
- hpc
properties:
metricName: Available Memory Bytes
metricNamespace: Microsoft.Compute/virtualMachines
@@ -742,6 +757,9 @@
verified: true
visible: true
tags:
- auto-generated
- agc-83394
- hpc
properties:
metricName: VmAvailabilityMetric
metricNamespace: Microsoft.Compute/virtualMachines
@@ -835,6 +853,8 @@
visible: true
tags:
- auto-generated
- agc-8619
- hpc
- agc-6701
properties:
metricName: Data Disk Queue Depth