Commit

Merge pull request #375 from jhajduk-microsoft/main
adding AMLFS alerting
JoeyBarnes authored Oct 14, 2024
2 parents 78ca438 + 3135e40 commit 85e267f
Showing 13 changed files with 489 additions and 1,390 deletions.
@@ -0,0 +1,7 @@
# Overview

There are numerous ways to implement AI solutions on Azure, and each comes with its own monitoring solution. Monitoring an AI solution combines monitoring of the underlying infrastructure or PaaS resources with any utilization metrics that can be exposed through the platform or other tooling. This page summarizes the recommended monitoring solutions for different scenarios.

## AI on Infrastructure (BYOM)

Running AI workloads on Azure infrastructure involves monitoring each component of the solution, including virtual machines, storage, and networking. Refer to the defined metrics in [HPC](../../specialized/hpc/Alerting-and-Monitoring.md). To monitor GPU/CPU metrics, use [Moneo](https://github.com/Azure/Moneo).
4 changes: 4 additions & 0 deletions docs/content/patterns/artificial intelligence/index.md
@@ -0,0 +1,4 @@
---
title: Artificial Intelligence
geekdocCollapseSection: true
---
10 changes: 3 additions & 7 deletions docs/content/patterns/specialized/hpc/_index.md
@@ -11,14 +11,10 @@ High Performance Compute supports a variety of workloads. Seismic modeling, flui
* Azure Batch Service
* Azure NetApp Files
* Azure Blob Storage
-* Azure Managed Lustre Filesystem - Coming Soon!
+* Azure Managed Lustre Filesystem

Please note that an HPC Landing Zone is built on top of the best practices of the Azure Landing Zone. The approach for broader monitoring and alerting in the context of the Azure Landing Zone can be found [here](https://azure.github.io/azure-monitor-baseline-alerts/patterns/alz/Monitoring-and-Alerting/).

-## Azure High Performance Computing on Demand
+## Azure CycleCloud Workspace for Slurm

-[Azure High Performance Computing on Demand (Az-HOP)](https://learn.microsoft.com/azure/cloud-adoption-framework/scenarios/azure-hpc/azure-hpc-landing-zone-accelerator) is our HPC Landing Zone accelerator. It provides Grafana Dashboards to monitor your cluster. It uses [Azure CycleCloud](https://learn.microsoft.com/azure/cyclecloud/overview?view=cyclecloud-8) as a scheduler.
-
-## GPU Monitoring
-
-We are working on explicit GPU metrics to monitor for HPC/AI workloads. Until then, Azure HPC Ubuntu VMs come with [Moneo](https://github.com/Azure/Moneo)
+[Azure CycleCloud Workspace for Slurm](https://learn.microsoft.com/azure/cyclecloud/overview-ccws?view=cyclecloud-8) is our HPC Landing Zone accelerator.
5 changes: 5 additions & 0 deletions services/NetApp/netAppAccounts/alerts.yaml
@@ -6,6 +6,7 @@
tags:
- auto-generated
- agc-19726
- hpc
properties:
metricName: VolumeConsumedSizePercentage
metricNamespace: Microsoft.NetApp/netAppAccounts/capacityPools/volumes
@@ -25,6 +26,7 @@
tags:
- auto-generated
- agc-1914
- hpc
properties:
metricName: VolumeLogicalSize
metricNamespace: Microsoft.NetApp/netAppAccounts/capacityPools/volumes
@@ -87,6 +89,7 @@
tags:
- auto-generated
- agc-374
- hpc
properties:
metricName: AverageReadLatency
metricNamespace: Microsoft.NetApp/netAppAccounts/capacityPools/volumes
@@ -107,6 +110,7 @@
tags:
- auto-generated
- agc-305
- hpc
properties:
metricName: CbsVolumeOperationComplete
metricNamespace: Microsoft.NetApp/netAppAccounts/capacityPools/volumes
@@ -126,6 +130,7 @@
tags:
- auto-generated
- agc-301
- hpc
properties:
metricName: VolumeAllocatedSize
metricNamespace: Microsoft.NetApp/netAppAccounts/capacityPools/volumes
4 changes: 3 additions & 1 deletion services/StorageCache/AmlFilesystems/_index.md
@@ -1,5 +1,7 @@
---
title: amlFilesystems
geekdocCollapseSection: true
-geekdocHidden: true
+geekdocHidden: false
---
+
+{{< alertList name="alertList" >}}
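
The `hpc` tag added to the NetApp alert definitions in `services/NetApp/netAppAccounts/alerts.yaml` can be used to select the HPC-relevant alerts. The sketch below is illustrative only and is not taken from the repository's tooling: it assumes the YAML has already been parsed into a list of dictionaries, and it uses only the keys visible in the diff (`tags`, `properties.metricName`, `properties.metricNamespace`). The helper name `metrics_with_tag` and the untagged `ExampleUntaggedMetric` entry are hypothetical.

```python
from typing import Any

NAMESPACE = "Microsoft.NetApp/netAppAccounts/capacityPools/volumes"

# In-memory stand-in for parsed alerts.yaml entries (assumption: a list of
# mappings with "tags" and "properties", as suggested by the diff hunks).
alerts: list[dict[str, Any]] = [
    {"tags": ["auto-generated", "agc-19726", "hpc"],
     "properties": {"metricName": "VolumeConsumedSizePercentage",
                    "metricNamespace": NAMESPACE}},
    {"tags": ["auto-generated", "agc-374", "hpc"],
     "properties": {"metricName": "AverageReadLatency",
                    "metricNamespace": NAMESPACE}},
    # Hypothetical entry without the hpc tag, included only to show filtering.
    {"tags": ["auto-generated"],
     "properties": {"metricName": "ExampleUntaggedMetric",
                    "metricNamespace": NAMESPACE}},
]


def metrics_with_tag(entries: list[dict[str, Any]], tag: str) -> list[str]:
    """Return the metric name of every entry whose tag list contains `tag`."""
    return [
        entry["properties"]["metricName"]
        for entry in entries
        if tag in entry.get("tags", [])
    ]


hpc_metrics = metrics_with_tag(alerts, "hpc")
print(hpc_metrics)
```

Entries without a `tags` key are treated as untagged rather than raising an error, which keeps the filter tolerant of incomplete definitions.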