[BUG REPORT] "an illegal memory access was encountered" and "nanobind leak" #10

Open · xk-huang opened this issue Aug 31, 2022 · 4 comments


xk-huang commented Aug 31, 2022

When I used joblib.Parallel with the loky backend to launch multiple jobs in parallel, the error below occurred:

cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:473.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:474.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:475.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:476.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:477.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:478.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:479.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:480.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:481.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:482.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:483.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:484.

Also, the GPU memory allocation was strange: multiple processes allocated memory on GPU 0.
[screenshot: GPU memory usage showing several worker processes allocating memory on GPU 0]

I tried deleting the corresponding code, but it did not work 😢.
Would you mind giving any suggestions? Thanks in advance!
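
For reference, my launch pattern roughly looks like the sketch below. The identity matrix is only a stand-in for the real sparse system from my project, and the exact solver arguments may differ slightly from what I actually call:

import torch
from joblib import Parallel, delayed
from cholespy import CholeskySolverF, MatrixType

def run_job(job_idx, n_rows=128):
    # Each loky worker builds its own solver. The identity matrix below is
    # just a stand-in for the real sparse system.
    rows = torch.arange(n_rows, device="cuda")
    cols = torch.arange(n_rows, device="cuda")
    data = torch.ones(n_rows, device="cuda")
    solver = CholeskySolverF(n_rows, rows, cols, data, MatrixType.COO)

    b = torch.ones(n_rows, device="cuda")
    x = torch.zeros_like(b)
    solver.solve(b, x)
    return x.cpu()

results = Parallel(n_jobs=4, backend="loky")(delayed(run_job)(i) for i in range(4))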


xk-huang commented Sep 1, 2022

I tried building from source with all the cuda_check() calls removed, but I still hit "RuntimeError: CUDA error: an illegal memory access was encountered".

My build command was CC=gcc-8 CXX=g++-8 pip install ., since building directly with pip install . failed.


xk-huang commented Sep 1, 2022

There is also a leak issue reported by nanobind:

nanobind: leaked 4 instances!
nanobind: leaked 2 types!
 - leaked type "CholeskySolverF"
 - leaked type "MatrixType"
nanobind: leaked 2 functions!
 - leaked function "solve"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

xk-huang changed the title from [BUG REPORT] "an illegal memory access was encountered" with joblib launcher to [BUG REPORT] "an illegal memory access was encountered" and "nanobind leak" on Sep 1, 2022

bathal1 commented Sep 1, 2022

Hi,

Thanks for the report. I haven't tested cholespy on multiple GPU setups, so it's possible that memory allocation is broken there.

Deleting the cuda_check calls is absolutely not going to solve your issue: it is just a wrapper that checks the return code of CUDA API calls and generates those error messages.

From what you described, it sounds like cholespy only uploads data to GPU 0, so the other devices can't access it. That makes sense, since the module initializes the CUDA context on device 0:

cuda_check(cuDeviceGet(&cu_device, 0));

As a sanity check, if you can control the number of GPUs on which you run your code, could you try setting it to 1 and see if it works then?

It would also be helpful to have a minimal reproducer (if possible) to try to reproduce the issue on my end.


bathal1 commented Sep 1, 2022

Resolving the multiple GPU case will require a few API changes to allow the user to explicitly specify a device. You would then be able to specify it for each thread in your parallel job.

In the meantime, you should be able to work around this by exposing only the desired device to each worker via the CUDA_VISIBLE_DEVICES environment variable:

import os

# Expose only the desired device to this process. This must run before the
# CUDA context is created, i.e. before constructing the solver.
os.environ['CUDA_VISIBLE_DEVICES'] = str(device_id)

# ...
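
For example, with joblib something along these lines should work. This is an untested sketch: the worker function and the round-robin device assignment are only illustrative, and it assumes CUDA has not been initialized in the worker before the variable is set.

import os
from joblib import Parallel, delayed

def run_job(job_idx, n_gpus=4):
    # Must run before any CUDA initialization in this worker process,
    # i.e. before importing torch/cholespy or constructing the solver.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(job_idx % n_gpus)

    import torch
    from cholespy import CholeskySolverF, MatrixType
    # ... build the matrix and right-hand side, construct the solver and call
    # solve() as usual; "device 0" now refers to the GPU selected above.

results = Parallel(n_jobs=4, backend='loky')(delayed(run_job)(i) for i in range(4))

One caveat: loky reuses worker processes, and CUDA_VISIBLE_DEVICES is only read when the CUDA context is first created, so the masking only takes effect for the first CUDA initialization in each worker.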

Steelwall2014 added a commit to Steelwall2014/cholespy that referenced this issue Mar 2, 2024