Change default CHGnet.load(check_cuda_mem: bool) to False #164

Merged
merged 4 commits into main from default-check_cuda_mem-False on Jun 11, 2024

Conversation

janosh (Collaborator) commented on Jun 11, 2024

There's a problem with cuda_devices_sorted_by_free_mem on Slurm clusters:

import torch
import nvidia_smi

def cuda_devices_sorted_by_free_mem() -> list[int]:
    """List available CUDA devices sorted by increasing available memory.
    To get the device with the most free memory, use the last list item.
    """
    if not torch.cuda.is_available():
        return []
    free_memories = []
    nvidia_smi.nvmlInit()
    device_count = nvidia_smi.nvmlDeviceGetCount()
    for idx in range(device_count):
        handle = nvidia_smi.nvmlDeviceGetHandleByIndex(idx)
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        free_memories.append(info.free)
    return sorted(range(device_count), key=free_memories.__getitem__)

It returns whichever GPU has the most free memory, so the model tries to use that GPU even if the job was allocated a different one. This results in a cryptic CUDA error:

    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Process finished with exit code 1
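The index mismatch can be reproduced without any GPUs. NVML enumerates every physical GPU in the node, while torch only sees the ones Slurm exposes via CUDA_VISIBLE_DEVICES, so the "freest" NVML index can be invalid inside the job. The function name below mimics the real helper; the memory figures and device indices are made up:

```python
def freest_physical_gpu(free_mem_by_physical_idx: list[int]) -> int:
    """Mimics cuda_devices_sorted_by_free_mem: return the physical
    index of the GPU with the most free memory."""
    order = sorted(range(len(free_mem_by_physical_idx)),
                   key=free_mem_by_physical_idx.__getitem__)
    return order[-1]

# Node has 4 GPUs; Slurm allocated only physical GPU 1 to this job,
# so torch sees exactly one device. (Figures are hypothetical.)
free_mem_gib = [2, 8, 30, 16]
visible = {1}

chosen = freest_physical_gpu(free_mem_gib)
print(chosen)             # 2: the globally freest GPU
print(chosen in visible)  # False -> "invalid device ordinal"
```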

Given that CHGNet is expected to be used frequently on queued HPC infrastructure, where this error can occur and the message is not obvious to debug, @BowenD-UCB and I agreed to change the default from True to False.

@janosh added the ux (User experience), breaking (Breaking change), and hardware (Running on accelerated hardware) labels on Jun 11, 2024
@janosh changed the title from "Change default check_cuda_mem: bool to False" to "Change default CHGnet.load(check_cuda_mem: bool) to False" on Jun 11, 2024
@janosh merged commit d3f1b30 into main on Jun 11, 2024
10 checks passed
@janosh deleted the default-check_cuda_mem-False branch on June 11, 2024