Initialization error on MIG enabled device #122

Open
jacquetpi opened this issue Jan 29, 2025 · 0 comments

I get an initialization error when MIG is enabled on at least one of the GPUs. It occurs even when the MIG-enabled GPU is not the one selected to run gpu-burn.
Any help is appreciated!

Behavior when MIG is disabled on all GPUs:

user@server:~/gpu-burn$ sudo nvidia-smi -i 0 -mig 0
Disabled MIG Mode for GPU 00000000:06:00.0
All done.
user@server:~/gpu-burn$ ./gpu_burn -i 2 -d 60
Using compare file: compare.ptx
Burning for 60 seconds.
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-44e23424-da96-3a01-9e59-896c4de6ee90)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-ccb5ec78-7977-9753-365c-d527095b8bd9)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-924c81a8-7df7-f55c-c7cc-cdaaa9730f2e)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-ce2334f1-d892-fb37-e4d9-57d788bccb46)
Initialized device 2 with 40339 MB of memory (39711 MB available, using 35739 MB of it), using DOUBLES
Results are 536870912 bytes each, thus performing 67 iterations
8.3%  proc'd: 67 (15881 Gflop/s)   errors: 0   temps: 35 C ^C

Behavior when MIG is enabled on one GPU:

user@server:~/gpu-burn$ sudo nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:06:00.0
All done.
user@server:~/gpu-burn$ ./gpu_burn -i 2 -d 60
Using compare file: compare.ptx
Burning for 60 seconds.
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-44e23424-da96-3a01-9e59-896c4de6ee90)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-ccb5ec78-7977-9753-365c-d527095b8bd9)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-924c81a8-7df7-f55c-c7cc-cdaaa9730f2e)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-ce2334f1-d892-fb37-e4d9-57d788bccb46)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error (gpu_burn-drv.cpp:302): initialization error
0.0%  proc'd: -1 (0 Gflop/s)   errors: 1738166190  (DIED!)  temps: -- 

No clients are alive!  Aborting
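
For anyone trying to reproduce this, it may help to record which GPUs report MIG mode right before launching gpu_burn. Below is a minimal sketch, not part of gpu-burn. It assumes `nvidia-smi` is on the PATH and that the driver supports the `mig.mode.current` query field; the helper names `parse_mig_modes` and `query_mig_modes` are hypothetical.

```python
import subprocess


def parse_mig_modes(csv_text):
    """Parse 'index, mode' lines (e.g. '0, Enabled') into {index: bool}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",", 1)]
        # Anything other than "Enabled" (e.g. "Disabled" or "[N/A]") counts as off.
        modes[int(index)] = (mode == "Enabled")
    return modes


def query_mig_modes():
    """Ask nvidia-smi for the current MIG mode of every GPU (assumed available)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,mig.mode.current",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    for index, enabled in sorted(query_mig_modes().items()):
        state = "enabled" if enabled else "disabled"
        print(f"GPU {index}: MIG {state}")
```

Running it before each `./gpu_burn -i 2 -d 60` attempt would show whether the failing runs coincide with any GPU reporting MIG as enabled, which is the pattern described above.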