
Bug: GPU metrics don't show #145

Open
mdagost opened this issue Nov 4, 2023 · 0 comments
mdagost commented Nov 4, 2023

I'm not sure if this is a problem with the container or with the Databricks platform, but the GPU metrics don't display properly when using these GPU containers.

I ran the script at the very bottom of this issue on a regular (i.e. non-docker) cluster running Databricks runtime 14.1 ML on a g4dn.xlarge instance type. It shows that the GPU is being used, and when I click on the metrics tab in the Databricks UI for the GPU, I see time series with valid entries like this:

[Screenshot (2023-11-04 12:34 PM): GPU metrics tab showing valid time series]

However, when I run the same code on a 14.1 cluster with the official image databricksruntime/gpu-pytorch:cuda11.8, the metrics are messed up. The script shows the same output, i.e. that the GPU is being used. But the metrics tab for the GPU doesn't show any time series, it just shows this:

[Screenshot (2023-11-04 12:36 PM): GPU metrics tab showing no time series]

Any idea what's going on or how to get this fixed?

Script is below:

# Databricks notebook source
!pip install nvidia-ml-py3

# COMMAND ----------

import torch
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

# COMMAND ----------

torch.cuda.is_available()

# COMMAND ----------

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------

# should be 4 GB of GPU memory, which is roughly 25% of the T4's memory
t = torch.rand(10_000, 100_000, dtype=torch.float32, device="cuda")

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------

t.sum()

# COMMAND ----------

print_gpu_utilization()

# COMMAND ----------
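For reference, the tensor allocated in the script should occupy about 4 GB (roughly 25% of the T4's 16 GB), which can be verified with simple arithmetic and no GPU at all:

```python
# Expected size of the test tensor: 10,000 x 100,000 float32 values,
# 4 bytes each -> 4e9 bytes, i.e. ~3.73 GiB of GPU memory.
rows, cols, bytes_per_float32 = 10_000, 100_000, 4
size_bytes = rows * cols * bytes_per_float32
size_gib = size_bytes / 1024**3
print(f"{size_bytes:,} bytes ~= {size_gib:.2f} GiB")
```

So if `print_gpu_utilization()` reports roughly 3,800-4,100 MB more after the allocation than before, the GPU itself is behaving as expected, and the problem is confined to the metrics display.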