Merge branch 'main' into 185870_padmajayaraman_ChangeNETMONParserPort
padmajayaraman authored May 28, 2024
2 parents 5be7888 + aa499f1 commit 2caad9f
Showing 309 changed files with 2,664 additions and 1,172 deletions.
15 changes: 15 additions & 0 deletions .openpublishing.redirection.json
@@ -1,5 +1,10 @@
{
"redirections": [
{
"source_path": "support/windows-server/group-policy/give-users-access-group-policy-objects.md",
"redirect_url": "/troubleshoot/windows-server/group-policy/cannot-apply-user-gpo-when-computer-objects-dont-have-read-permissions",
"redirect_document_id": false
},
{
"source_path": "support/windows-server/networking/periodic-smbclient-event-id-30818.md",
"redirect_url": "/troubleshoot/windows-server/networking/troubleshoot-smb-guidance#windows-server-2012-r2-periodically-logs-SMBClient-event-id-30818",
@@ -11124,6 +11129,16 @@
"source_path": "support/windows-server/active-directory/syntax-build-answer-files-unattended-installation-ad-ds.md",
"redirect_url": "/windows-server/identity/ad-ds/deploy/dcpromo",
"redirect_document_id": false
},
{
"source_path": "support/windows-server/active-directory/dcdiag-verifyreferences-test-fails.md",
"redirect_url": "/previous-versions/troubleshoot/windows-server/dcdiag-verifyreferences-test-fails",
"redirect_document_id": false
},
{
"source_path": "support/windows-server/active-directory/lingering-objects-remain.md",
"redirect_url": "/previous-versions/troubleshoot/windows-server/lingering-objects-remain",
"redirect_document_id": false
}
]
}
5 changes: 5 additions & 0 deletions support/azure/.openpublishing.redirection.azure.json
@@ -5449,6 +5449,11 @@
"source_path": "general/cannot-sign-up-subscription.md",
"redirect_url": "/azure/cost-management-billing/troubleshoot-subscription/cannot-sign-up-subscription",
"redirect_document_id": false
},
{
"source_path": "virtual-machines/windows/swap-file-not-recreated-linux-vm-restart.md",
"redirect_url": "/troubleshoot/azure/virtual-machines/linux/swap-file-not-recreated-linux-vm-restart",
"redirect_document_id": true
}
]
}
@@ -0,0 +1,69 @@
---
title: Azure Container Instances fails to run GPU-enabled containers
description: Provides solutions to GPU-enabled Azure container instance deployment failures.
ms.date: 05/27/2024
ms.reviewer: v-rekhanain, momajed, v-weizhu
ms.service: container-instances
ms.custom: sap:Configuration and Setup
---
# GPU-enabled container deployment fails with "service unavailable" error

This article discusses the causes of GPU-enabled Azure container instance deployment failures and provides solutions.

## Symptoms

When you try to deploy a GPU-enabled container to Azure Container Instances (ACI), you encounter the following symptoms:

- You receive the following error message:

    > Service unavailable. Please try again later or contact support if this problem persists.
- When you check the status of your container group, you see that the provisioning state of the GPU-enabled container group is "Failed" and the error code is "ServiceUnavailable."
- When you view the logs of your container group, you see that the GPU driver installation failed or timed out.

## Cause 1: Not enough GPU quota

Your subscription or region doesn't have enough GPU quota to deploy the container group. GPU quota is limited and subject to availability.

### Solution 1: Increase GPU quota

Check the GPU quota and availability for your subscription and region, and request GPU quota increases by using the Azure CLI or Azure portal.

## Cause 2: Incompatible container group configuration

Your container group configuration is incompatible with the GPU SKU. Running GPU-enabled containers requires specific CPU, memory, and operating system (OS) settings.

### Solution 2: Update container group configuration to match GPU SKU

Check your container group configuration and make sure that it matches the GPU SKU requirements. You can re-create or update your container group configuration by using the Azure CLI or Azure portal.
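
As a rough illustration of what a GPU resource request looks like, the following Python sketch assembles the GPU-related part of a container group definition as a plain dictionary. The field names (`osType`, `resources.requests`, `cpu`, `memoryInGB`, `gpu.count`, `gpu.sku`) are assumed to mirror the ACI YAML deployment format. The helper name `build_gpu_request` and the CPU and memory values are hypothetical placeholders, not documented requirements for any SKU.

```python
def build_gpu_request(sku, gpu_count, cpu, memory_gb, os_type="Linux"):
    """Return a dict describing a GPU resource request (hypothetical helper).

    The dict layout is an assumption modeled on the ACI YAML deployment
    format. It isn't validated against the service.
    """
    if os_type != "Linux":
        # This article lists GPU SKU availability for Linux only.
        raise ValueError("GPU SKUs in this article are listed for Linux only")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "osType": os_type,
        "resources": {
            "requests": {
                "cpu": cpu,
                "memoryInGB": memory_gb,
                "gpu": {"count": gpu_count, "sku": sku},
            }
        },
    }


# Placeholder values for illustration only.
request = build_gpu_request("V100", gpu_count=1, cpu=1.0, memory_gb=1.5)
print(request["resources"]["requests"]["gpu"])
```

Building the request in one place makes it easier to compare the CPU, memory, and OS values against the requirements of the GPU SKU before you re-create the container group.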

Check the region availability of the GPU SKUs that you want to use. Not all regions support all GPU SKUs. The following table shows the current region availability of GPU SKUs for Linux OS.

|Region|OS|Available GPU SKU|
|---|---|---|
|East US|Linux|V100|
|West Europe|Linux|V100|
|West US 2|Linux|V100|
|Southeast Asia|Linux|V100|
|Central India|Linux|V100|

If your region doesn't support the GPU SKU that you need, you can choose a different region or a GPU SKU that's available in your region.
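
The region check can be sketched as a small lookup. The following Python example hard-codes the availability table from this article, so it's only a snapshot and can become outdated. The names `GPU_AVAILABILITY`, `available_skus`, and `regions_with_sku` are hypothetical and aren't part of any Azure SDK.

```python
# Snapshot of the availability table in this article (Linux only).
GPU_AVAILABILITY = {
    "East US": {"V100"},
    "West Europe": {"V100"},
    "West US 2": {"V100"},
    "Southeast Asia": {"V100"},
    "Central India": {"V100"},
}


def available_skus(region, os_type="Linux"):
    """Return the GPU SKUs listed for a region, or an empty set."""
    if os_type != "Linux":
        # The table covers Linux container groups only.
        return set()
    return set(GPU_AVAILABILITY.get(region, set()))


def regions_with_sku(sku):
    """Return the listed regions that offer the given GPU SKU, sorted by name."""
    return sorted(r for r, skus in GPU_AVAILABILITY.items() if sku in skus)


if "V100" not in available_skus("North Europe"):
    # Fall back to any listed region that offers the SKU.
    print("Try one of:", regions_with_sku("V100"))
```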


## Cause 3: Incorrect GPU driver or toolkit is installed

Your container image doesn't have the correct GPU driver or toolkit installed. GPU-enabled containers require NVIDIA drivers and CUDA or TensorRT libraries to access GPU resources.

### Solution 3: Install the NVIDIA Container Toolkit or use Azure Machine Learning base images

Check your container image and make sure that it has the correct GPU driver and toolkit installed. You can install the NVIDIA Container Toolkit or use the Azure Machine Learning base images to build and run your GPU-enabled containers.

## Cause 4: No NVIDIA drivers or libraries are installed

Your container image doesn't have NVIDIA drivers or libraries installed. GPU-enabled containers require NVIDIA drivers and CUDA or TensorRT libraries to access the GPU resources.

### Solution 4: Use the NVIDIA GPU Cloud (NGC) repository

Check your container image and make sure that it has the NVIDIA drivers and libraries installed. You can use the NVIDIA GPU Cloud (NGC) repository to find and pull prebuilt GPU-accelerated images for various frameworks and applications.

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
2 changes: 2 additions & 0 deletions support/azure/azure-container-instances/toc.yml
@@ -15,6 +15,8 @@ items:
href: configuration-setup/error-codes-spot-container-creation.md
- name: Image pull takes a long time to run
href: configuration-setup/long-image-pulls.md
- name: Error "service unavailable" when deploying GPU-enabled containers
href: configuration-setup/service-unavailable-gpu-enabled-aci-deployment-failures.md
- name: ServiceUnavailable - container group quota exceeded in region
href: configuration-setup/service-unavailable-container-group-quota-exceeded.md
- name: ServiceUnavailable (409) - requested resource is not available in the location