Merge pull request #176 from jhajduk-microsoft/hpc-alerting-amba
Hpc alerting amba
Showing 13 changed files with 336 additions and 5 deletions.
13 changes: 13 additions & 0 deletions
docs/content/patterns/specialized/hpc/Alerting-and-Monitoring.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
--- | ||
title: HPC Monitoring and Alerting | ||
geekdocCollapseSection: true | ||
weight: 30 | ||
--- | ||
|
||
## Overview | ||
|
||
This page provides the alert settings for HPC infrastructure. We may update these settings as we continue to work with a broad range of customers. | ||
|
||
## Alerts | ||
|
||
{{< hpcMetricAlerts >}} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
--- | ||
title: High Performance Compute | ||
geekdocCollapseSection: true | ||
--- | ||
|
||
## Overview | ||
|
||
High Performance Compute supports a variety of workloads. Seismic modeling, fluid dynamics, and artificial intelligence workloads all require more powerful compute, networking, and storage than traditional workloads. Monitoring these environments is critical to business continuity. You cannot manage what you do not measure. Monitoring HPC workload infrastructure involves implementing alerts and monitoring for Virtual Machines, Storage, and Networking across the stack. Alerting for these resources involves monitoring CPU/GPU utilization, throughput/availability, and stability. In this section we provide alert recommendations for the following HPC-centric resources: | ||
|
||
* Virtual Machines | ||
* Azure Batch Service | ||
* Azure NetApp Files | ||
* Azure Blob Storage | ||
* Azure Managed Lustre Filesystem - Coming Soon! | ||
|
||
Please note that an HPC Landing Zone is built on top of the best practices of the Azure Landing Zone. The approach for broader monitoring and alerting in the context of the Azure Landing Zone can be found [here](https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/Monitoring-and-Alerting/). | ||
|
||
## Azure High Performance Computing on Demand | ||
|
||
[Azure High Performance Computing on Demand (Az-HOP)](https://learn.microsoft.com/azure/cloud-adoption-framework/scenarios/azure-hpc/azure-hpc-landing-zone-accelerator) is our HPC Landing Zone accelerator. It provides Grafana dashboards to monitor your cluster. It uses [Azure CycleCloud](https://learn.microsoft.com/azure/cyclecloud/overview?view=cyclecloud-8) as a scheduler. | ||
|
||
## GPU Monitoring | ||
|
||
We are working on explicit GPU metrics to monitor for HPC/AI workloads. Until then, Azure HPC Ubuntu VMs come with [Moneo](https://github.com/Azure/Moneo). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
<div><table> | ||
<tr> | ||
<th>Alert Name</th> | ||
<th>Component</th> | ||
<th>Metric</th> | ||
<th>Aggregation</th> | ||
<th>Operator</th> | ||
<th>Threshold</th> | ||
<th>Window</th> | ||
<th>Frequency</th> | ||
<th>Severity</th> | ||
<th>Scope</th> | ||
<th>Support for Multiple Resources</th> | ||
<th>Verified</th> | ||
<th>References</th> | ||
</tr> | ||
|
||
{{ range $category, $types := $.Site.Data }} | ||
{{ range $type, $rules := $types }} | ||
{{ range $rules.alerts }} | ||
{{ if or (eq .visible true) (eq $.Site.Params.ambaDevMode true) }} | ||
{{ if and (eq .type "Metric") (in .tags "hpc") }} | ||
{{ $data := newScratch }} | ||
{{ if isset . "deployments" }} | ||
{{ range where .deployments "type" "Policy" }} | ||
{{ if in .tags "hpc" }} | ||
{{ $data.Set "name" .name }} | ||
{{ $data.Set "url" (relURL (path.Join "services" $category $type .template)) }} | ||
{{ $data.Set "scope" .properties.scope }} | ||
{{ $data.Set "multiResource" .properties.multiResource }} | ||
{{ end }} | ||
{{ end }} | ||
{{ end }} | ||
<tr> | ||
<td> | ||
<a href='{{ $data.Get "url" }}'>{{ $data.Get "name" }}</a> | ||
</td> | ||
<td> | ||
{{ .properties.metricNamespace }} | ||
</td> | ||
<td> | ||
{{ .properties.metricName }} | ||
</td> | ||
<td> | ||
{{ .properties.timeAggregation }} | ||
</td> | ||
<td> | ||
{{ .properties.operator }} | ||
</td> | ||
<td> | ||
{{ if eq .properties.criterionType "DynamicThresholdCriterion" }} | ||
dynamic | ||
{{ else }} | ||
{{ .properties.threshold }} | ||
{{ end }} | ||
</td> | ||
<td> | ||
{{ .properties.windowSize }} | ||
</td> | ||
<td> | ||
{{ .properties.evaluationFrequency }} | ||
</td> | ||
<td> | ||
{{ .properties.severity }} | ||
</td> | ||
<td> | ||
{{ $data.Get "scope" }} | ||
</td> | ||
<td> | ||
{{ if ($data.Get "multiResource") }} | ||
Yes | ||
{{ else }} | ||
No | ||
{{ end }} | ||
</td> | ||
<td> | ||
{{ if .verified }} | ||
Y | ||
{{ else }} | ||
N | ||
{{ end }} | ||
</td> | ||
<td> | ||
{{ range .references }} | ||
<a href="{{ .url }}" target="_blank">{{ .name }}</a> | ||
{{ end }} | ||
</td> | ||
</tr> | ||
{{ end }} | ||
{{ end }} | ||
{{ end }} | ||
{{ end }} | ||
{{ end }} | ||
|
||
</table></div> |
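The selection logic of the shortcode above can be sketched outside Hugo for quick sanity checks of the data files. This is a minimal Python approximation, not the site's implementation: the function name `hpc_metric_rows`, the `dev_mode` argument (standing in for `$.Site.Params.ambaDevMode`), and every value in `sample` are hypothetical, and the URL is built with a plain path join rather than Hugo's `relURL`.

```python
import posixpath
from typing import Any


def hpc_metric_rows(site_data: dict[str, Any], dev_mode: bool = False) -> list[dict[str, Any]]:
    """Collect one row per Metric alert tagged "hpc" that is visible (or dev mode is on)."""
    rows = []
    for category, types in site_data.items():
        for type_name, rules in types.items():
            for alert in rules.get("alerts", []):
                # Outer guard: the alert is visible, or dev mode shows everything.
                if not (alert.get("visible") is True or dev_mode):
                    continue
                # Inner guard: only Metric alerts carrying the "hpc" tag.
                if alert.get("type") != "Metric" or "hpc" not in alert.get("tags", []):
                    continue
                row = {"name": None, "url": None, "scope": None, "multi_resource": False}
                # Name, URL, and scope come from Policy deployments tagged "hpc".
                for dep in alert.get("deployments", []):
                    if dep.get("type") == "Policy" and "hpc" in dep.get("tags", []):
                        dep_props = dep.get("properties", {})
                        row["name"] = dep.get("name")
                        row["url"] = posixpath.join(
                            "services", category, type_name, dep.get("template", "")
                        )
                        row["scope"] = dep_props.get("scope")
                        row["multi_resource"] = bool(dep_props.get("multiResource"))
                props = alert.get("properties", {})
                dynamic = props.get("criterionType") == "DynamicThresholdCriterion"
                row["metric"] = props.get("metricName")
                row["threshold"] = "dynamic" if dynamic else props.get("threshold")
                row["verified"] = bool(alert.get("verified"))
                rows.append(row)
    return rows


# Hypothetical data shaped like the fields the template reads; not taken from the repository.
sample = {
    "Compute": {
        "virtualMachines": {
            "alerts": [
                {
                    "type": "Metric",
                    "visible": True,
                    "verified": False,
                    "tags": ["hpc"],
                    "properties": {"metricName": "Percentage CPU", "threshold": 85},
                    "deployments": [
                        {
                            "name": "Example CPU alert policy",
                            "type": "Policy",
                            "tags": ["hpc"],
                            "template": "templates/policy/example.json",
                            "properties": {
                                "scope": "Microsoft.Compute/virtualMachines",
                                "multiResource": False,
                            },
                        }
                    ],
                }
            ]
        }
    }
}

for row in hpc_metric_rows(sample):
    print(row)
```

As in the template, an alert with no matching Policy deployment still produces a row, just with an empty name and link, which makes missing `hpc` tags on deployments easy to spot.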