https://aka.ms/hpcdiag redirects to this repo.
This repo holds a script that, when run on an Azure VM, gathers diagnostic information for troubleshooting common HPC, Infiniband, and GPU problems. It runs a suite of diagnostic tools ranging from built-in Linux tools like lscpu to vendor-specific CLIs like nvidia-smi. The resulting information is packaged into a tarball so that it can be shared with support engineers to speed up troubleshooting.
If you are reading this, you are likely troubleshooting problems on an Azure HPC VM. In that case, we suggest you contact support if you have not already, then run this tool on your VM so that you can provide the output to support engineers when prompted.
If you have special privacy requirements concerning logs leaving your VM, make sure to open up the tarball and redact any sensitive information before re-tarring it and handing it off to support engineers.
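The unpack, redact, and re-tar steps above could be sketched roughly as follows. This is a minimal sketch, not part of the tool: the function name `redact_tarball`, the `REDACTED` marker, and the choice to touch only `.log`, `.txt`, and `.out` files are all assumptions, and it relies on GNU `sed -i` as found on typical Linux VMs. The pattern is a basic regular expression and must not contain `/`.

```shell
# Sketch only: redact a pattern from text files in a diagnostics tarball.
# Usage: redact_tarball <input.tar.gz> <output.tar.gz> <sed-regex>
redact_tarball() {
    in_tar=$1
    out_tar=$2
    pattern=$3

    # Unpack into a scratch directory.
    work_dir=$(mktemp -d) || return 1
    tar -xzf "$in_tar" -C "$work_dir" || return 1

    # Rewrite matches in plain-text outputs only. Compressed or binary
    # members (for example *.gz or *.zip) and extensionless files are left
    # untouched, so review those by hand.
    find "$work_dir" -type f \( -name '*.log' -o -name '*.txt' -o -name '*.out' \) \
        -exec sed -i "s/$pattern/REDACTED/g" {} +

    # Repack the redacted tree and clean up.
    tar -czf "$out_tar" -C "$work_dir" . || return 1
    rm -rf "$work_dir"
}
```

For example, `redact_tarball in.tar.gz redacted.tar.gz '10\.0\.0\.[0-9]*'` would replace addresses in a hypothetical 10.0.0.x subnet. Inspect the result before sharing it, since a single pattern rarely covers everything that counts as sensitive.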
This tool is meant for diagnosing inactive systems. It runs benchmarks that stress system devices such as memory, GPU, and Infiniband, so it will degrade the performance of, or otherwise interfere with, other active processes that use these resources. Do not run this tool on systems where other jobs are currently running.
To stop the tool while it is running, interrupt the process (e.g. with Ctrl-C). This forces it to reset system state and terminate.
After cloning this repo, no further installation is required. To run the script, run the following command, replacing {repo-root} with the name of this repo's directory on your VM:
sudo bash {repo-root}/Linux/src/gather_azhpc_vm_diagnostics.sh
Alternatively, a version of this tool is included in PerfInsights for Linux under the HPC scenario. Running this scenario directly from the Azure Portal is not supported at this time, so PerfInsights must be downloaded and run from the command line, but the results of this tool are included in the report generated.
This section describes the output of the script and the configuration options available.
Option (Short) | Option (Long) | Parameters | Description | Example | Example Description |
---|---|---|---|---|---|
-d | --dir | Directory Name | Specify custom output location | --dir=. | Put the tarball in the current directory |
-V | --version | | display version information and exit | --version | Outputs 0.0.1 |
-h | --help | | display help text | -h | Outputs the help message |
-v | --verbose | | verbose output | --verbose | Enables more verbose terminal output |
| --gpu-level | 1 (default), 2, or 3 | GPU diagnostics run-level | --gpu-level=3 | Sets dcgmi run-level to 3 |
| --mem-level | 0 (default) or 1 | Memory diagnostics run-level | --mem-level=1 | Enables stream benchmark test |
| --no-update | | Disables auto-update | --no-update | Refrains from checking for updates to the script |
| --offline | | Prevents internet access | --offline | Skips stream benchmark and lsvmbus if not installed |
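As an illustration of combining several of these options, the sketch below assembles one command line and prints it for review instead of running it. The paths in `repo_root` and `out_dir` are hypothetical placeholders, and the helper name `hpcdiag_cmd` is made up for this example. That these options can be combined freely in one invocation is also an assumption.

```shell
# Hypothetical paths; adjust both to match your VM.
repo_root="${REPO_ROOT:-$HOME/azhpc-diagnostics}"
out_dir="${OUT_DIR:-/tmp/hpcdiag-output}"

# Build a command line from options listed in the table above:
# custom output directory, deepest GPU run-level, stream benchmark, verbose.
hpcdiag_cmd() {
    printf 'sudo bash %s/Linux/src/gather_azhpc_vm_diagnostics.sh --dir=%s --gpu-level=3 --mem-level=1 --verbose\n' \
        "$repo_root" "$out_dir"
}

# Print the command so it can be reviewed and then run by hand.
hpcdiag_cmd
```

Higher run-levels stress the GPU and memory more heavily, so this combination is best kept for VMs that are otherwise idle.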
Note that not all of these files will be generated on every run. What appears below is the union of all files that could be generated; the actual set depends on script parameters and VM size:
{vm-id}.{timestamp}.tar.gz
|-- transcript.log (logs for the tool itself)
|-- hpcdiag.err (stderr output from the run, including set -x trace)
|-- VM
|   |-- dmesg.log
|   |-- waagent.log
|   |-- lspci.txt
|   |-- lsvmbus.log
|   |-- ipconfig.txt
|   |-- sysctl.txt
|   |-- uname.txt
|   |-- dmidecode.txt
|   |-- lsmod.txt
|   |-- journald.log|syslog|messages
|   |-- services
|   |-- selinux
|   |-- hyperv/kvp_pool*.txt
|-- CPU
|   |-- lscpu.txt
|   |-- ulimit
|   |-- zone_reclaim_mode
|-- Memory
|   |-- stream.txt
|-- Infiniband
|   |-- ib-vmext.log
|   |-- ibstat.out
|   |-- ibstatus.out
|   |-- ibv_devinfo.out
|   |-- pkeys/*
|   |-- ethtool.out (ENDURE)
|   |-- rate (ENDURE)
|   |-- state (ENDURE)
|   |-- phys_state (ENDURE)
|-- Nvidia
    |-- nvidia-bug-report.log.gz
    |-- nvidia-installer.log
    |-- nvidia-vmext.log
    |-- nvidia-smi.out
    |-- nvidia-smi-q.out
    |-- nvidia-smi-nvlink.out
    |-- nvidia-debugdump.zip (only Nvidia can read)
    |-- dcgm-diag-2.log
    |-- dcgm-diag-3.log
    |-- nvvs.log
    |-- stats_*.json
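Because the set of files varies from run to run, it can help to list what a given tarball actually contains before sending it. The helper below is a small sketch; the name `list_diag_files` is made up for illustration, and it assumes only that category directories such as `VM` or `Nvidia` appear somewhere in the member paths.

```shell
# Sketch only: list the archive members under one category directory.
# Usage: list_diag_files <tarball> <category>
list_diag_files() {
    # tar -t prints member paths without extracting anything; the pattern
    # matches the category at the archive root or below a top-level folder.
    tar -tzf "$1" | grep -E "(^|/)$2/"
}
```

For example, `list_diag_files ./diag.tar.gz Nvidia` (with a hypothetical tarball name) shows which GPU outputs were collected on that run.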
Tool | Command | Output File(s) | Description | EULA |
---|---|---|---|---|
dmesg | dmesg | VM/dmesg.log | Dump of kernel ring buffer | |
rsyslog | cp syslog\|messages | VM/syslog\|messages | Dump of system log | |
journald | journalctl | VM/journald.log | Dump of system log | |
Azure IMDS | curl http://169.254.169.254/metadata/... | transcript.log | VM Metadata (ID,Region,OS Image, etc) | |
Azure VM Agent | cp /var/log/waagent.log | VM/waagent.log | Logs from the Azure VM Agent | |
lspci | lspci | VM/lspci.txt | Info on installed PCI devices | |
lsvmbus | lsvmbus | VM/lsvmbus.log | Displays devices attached to the Hyper-V VMBus | |
Hyper-V KVP | custom-made | VM/hyperv/kvp_pool*.txt | Exposes certain Windows Registry data from the Azure Host | |
ipconfig | ipconfig | VM/ipconfig.txt | Checking TCP/IP configuration | |
sysctl | sysctl | VM/sysctl.txt | Checking kernel parameters | |
uname | uname | VM/uname.txt | Checking system information | |
systemd | systemctl | VM/services | Checking for certain active services (tuning only) | |
selinux | cp /etc/sysconfig/selinux | VM/selinux | Checking for selinux activity (tuning only) | |
ulimit | cp /etc/security/limits.conf | Memory/ulimit | Checking for default user resource limits (tuning only) | |
- | cp /proc/sys/vm/zone_reclaim_mode | Memory/zone_reclaim_mode | Checking NUMA memory reclamation policy (tuning only) | |
dmidecode | dmidecode | VM/dmidecode.txt | DMI table dump (info on hardware components) | |
lsmod | lsmod | VM/lsmod.txt | List of active kernel modules | |
lscpu | lscpu | CPU/lscpu.txt | Information about the system CPU architecture | |
stream | stream_zen_double | Memory/stream.txt | The stream benchmark suite (AMD Only) | Stream License |
ibstat | ibstat | Infiniband/ibstat.out | Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
ibstatus | ibstatus | Infiniband/ibstatus.out | Lightweight Mellanox OFED command for checking Infiniband status | MOFED End-User Agreement |
ibv_devinfo | ibv_devinfo | Infiniband/ibv_devinfo.out | Mellanox OFED command for checking Infiniband device info | MOFED End-User Agreement |
Partition Key | cp /sys/class/infiniband/.../pkeys/... | Infiniband/.../pkeys/... | Checks the configured Infiniband Partition Keys | |
Infiniband Driver Extension Logs | cp /var/log/azure/ib-vmext-status | Infiniband/ib-vmext-status | Logs from the Infiniband Driver Extension | |
ethtool | ethtool eth1 | Infiniband/ethtool.out | Status of IB interface on ENDURE VMs | |
sysfs | cp /sys/class/infiniband/... | Infiniband/rate,state,phys_state | Status of IB interface on ENDURE VMs | |
NVIDIA Bug Report | nvidia-bug-report.sh | Nvidia/nvidia-bug-report.log.gz | A script that Nvidia has customers run when reporting hardware problems. | CUDA EULA GRID EULA |
NVIDIA System Management Interface | nvidia-smi | Nvidia/nvidia-smi.out Nvidia/nvidia-smi-q.out Nvidia/nvidia-smi-nvlink.out | Checks GPU health and configuration | CUDA EULA GRID EULA |
NVIDIA Debug Dump | nvidia-debugdump | Nvidia/nvidia-debugdump.zip | Generates a binary blob for use with Nvidia internal engineering tools | CUDA EULA GRID EULA |
NVIDIA Data Center GPU Manager | dcgmi | Nvidia/dcgm-diag-2.log Nvidia/dcgm-diag-3.log Nvidia/nvvs.log Nvidia/stats_*.json | Health monitoring for GPUs in cluster environments | DCGM EULA |
GPU Driver Extension Logs | cp /var/log/azure/nvidia-vmext-status | Nvidia/nvidia-vmext-status | Logs from the GPU Driver Extension | |
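Since several rows in this table depend on vendor tools that may not be present on every VM size or image, a quick check of which commands are on the PATH can explain why some output files are missing. The sketch below is an illustration only: the function name `missing_tools` is made up, and how the script itself handles a missing tool is not shown here.

```shell
# Sketch only: print each named command that is not found on the PATH.
# Usage: missing_tools <command>...
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check a selection of the tools named in the table above.
missing_tools lspci lsvmbus dmidecode ibstat ibv_devinfo nvidia-smi dcgmi
```

Some of these tools live in directories such as /usr/sbin that may only be on root's PATH, so running the check under sudo can give a different answer than running it as a regular user.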
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.