I'm not sure if this is a problem with the container or with the Databricks platform, but the GPU metrics don't display properly when using these GPU containers.
I ran the script at the very bottom of this issue on a regular (i.e. non-Docker) cluster running Databricks Runtime 14.1 ML on a g4dn.xlarge instance type. It shows that the GPU is being used, and when I click on the metrics tab in the Databricks UI for the GPU I see time series and valid entries like this:
However, when I run the same code on a 14.1 cluster with the official image databricksruntime/gpu-pytorch:cuda11.8, the metrics don't display correctly. The script shows the same output, i.e. that the GPU is being used, but the metrics tab for the GPU doesn't show any time series; it just shows this:
Any idea what's going on or how to get this fixed?
Script is below:
# Databricks notebook source
!pip install nvidia-ml-py3
# COMMAND ----------
import torch
from pynvml import *
# COMMAND ----------
torch.cuda.is_available()
# COMMAND ----------
def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")
# COMMAND ----------
print_gpu_utilization()
# COMMAND ----------
# should be 4 GB of GPU memory, which is roughly 25% of the T4's memory
t = torch.rand(10_000, 100_000, dtype=torch.float32, device="cuda")
# COMMAND ----------
print_gpu_utilization()
# COMMAND ----------
t.sum()
# COMMAND ----------
print_gpu_utilization()
# COMMAND ----------
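For reference, the "4 GB, roughly 25%" figure in the script's comment can be checked with plain arithmetic. The sketch below is not part of the original script. The helper name `tensor_size_bytes` and the constant `T4_MEMORY_MIB` are made up for illustration, and the 16 GiB figure is the T4's nominal capacity (an assumption; the usable amount reported by the driver is usually somewhat lower).

```python
# Back-of-the-envelope check of the tensor allocated in the script above.
# Assumption: a T4 has a nominal 16 GiB of memory; usable memory is a bit lower.
BYTES_PER_FLOAT32 = 4
T4_MEMORY_MIB = 16 * 1024


def tensor_size_bytes(rows: int, cols: int,
                      bytes_per_element: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw storage size of a dense rows x cols tensor in bytes."""
    return rows * cols * bytes_per_element


size_bytes = tensor_size_bytes(10_000, 100_000)
size_mib = size_bytes // 1024**2          # same unit the script prints as "MB"
fraction = size_mib / T4_MEMORY_MIB

print(f"tensor size: {size_bytes / 1e9:.1f} GB ({size_mib} MiB)")
print(f"share of a nominal 16 GiB T4: {fraction:.0%}")
```

So after the `torch.rand` cell, `print_gpu_utilization()` should report roughly 3,800 MB more than before, plus whatever the CUDA context itself occupies. If that jump shows up in the script output but not in the metrics tab, the GPU is in use and the problem is on the metrics side.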