Bug: GPU metrics don't show #145

mdagost · 2023-11-04T17:39:02Z

I'm not sure if this is a problem with the container or with the Databricks platform, but the GPU metrics don't display properly when using these GPU containers.

I ran the script at the very bottom of this issue on a regular (i.e. non-docker) cluster running Databricks runtime 14.1 ML on a g4dn.xlarge instance type. It shows that the GPU is being used, and when I click on the metrics tab in the Databricks UI for the GPU I see time series and valid entries like this:

However, when I run the same code on a 14.1 cluster with the official image databricksruntime/gpu-pytorch:cuda11.8, the metrics don't display correctly. The script shows the same output, i.e. that the GPU is being used. But the metrics tab for the GPU doesn't show any time series; it just shows this:

Any idea what's going on or how to get this fixed?

Script is below:

# Databricks notebook source
!pip install nvidia-ml-py3

# COMMAND ----------

import torch
from pynvml import *

# COMMAND ----------

torch.cuda.is_available()

# COMMAND ----------

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------

# 10_000 x 100_000 float32 values should take about 4 GB of GPU memory, which is roughly 25% of the T4's 16 GB
t = torch.rand(10_000, 100_000, dtype=torch.float32, device="cuda")

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------

t.sum()

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------
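
As a sanity check on the size comment in the script, the arithmetic can be sketched in plain Python without a GPU. This assumes the T4's nominal 16 GiB of memory (the usable amount that NVML reports is somewhat lower), and the helper name `tensor_bytes` is made up for illustration; it is not part of torch or pynvml.

```python
# Rough size check for the tensor allocated in the script above.
# Assumption: the T4 has a nominal 16 GiB of memory.

BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2
GIB = 1024 ** 3
T4_NOMINAL_BYTES = 16 * GIB  # assumed nominal capacity, not a measured value


def tensor_bytes(rows, cols, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the raw storage size of a dense rows x cols tensor in bytes."""
    return rows * cols * bytes_per_element


size = tensor_bytes(10_000, 100_000)
# Same unit conversion that print_gpu_utilization() uses (integer MiB).
size_mib = size // MIB
fraction_of_t4 = size / T4_NOMINAL_BYTES

print(f"Tensor size: {size} bytes ({size_mib} MB as the script reports it)")
print(f"Share of a nominal 16 GiB T4: {fraction_of_t4:.1%}")
```

So the second call to print_gpu_utilization() should report roughly 3814 MB more than the first, plus whatever the CUDA context itself occupies. That gives a concrete number to compare against the metrics tab once it shows data.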
