[BUG REPORT] "an illegal memory access was encountered" and "nanobind leak" #10

Open · xk-huang opened this issue Aug 31, 2022 · 4 comments


xk-huang commented Aug 31, 2022

When I used joblib.Parallel with the loky backend to launch multiple jobs in parallel, the error below occurred:

cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:473.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:474.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:475.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:476.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:477.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:478.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:479.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:480.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:481.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:482.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:483.
cuda_check(): API error = 0700 (CUDA_ERROR_ILLEGAL_ADDRESS): "an illegal memory access was encountered" in /project/src/cholesky_solver.cpp:484.

Also, the GPU memory allocation was strange: multiple processes allocated memory on GPU 0.
[screenshot: GPU memory usage showing several worker processes allocating memory on GPU 0]

I tried deleting the corresponding code, but it did not work 😢.
Would you mind giving any suggestions? Thanks in advance!
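
For reference, my launch pattern roughly looks like the sketch below. The identity matrix is only a stand-in for the real sparse system from my project, and the exact solver arguments may differ slightly from what I actually call:

import torch
from joblib import Parallel, delayed
from cholespy import CholeskySolverF, MatrixType

def run_job(job_idx, n_rows=128):
    # Each loky worker builds its own solver. The identity matrix below is
    # just a stand-in for the real sparse system.
    rows = torch.arange(n_rows, device="cuda")
    cols = torch.arange(n_rows, device="cuda")
    data = torch.ones(n_rows, device="cuda")
    solver = CholeskySolverF(n_rows, rows, cols, data, MatrixType.COO)

    b = torch.ones(n_rows, device="cuda")
    x = torch.zeros_like(b)
    solver.solve(b, x)
    return x.cpu()

results = Parallel(n_jobs=4, backend="loky")(delayed(run_job)(i) for i in range(4))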


xk-huang commented Sep 1, 2022

I tried building from source with all the cuda_check() calls removed, but I still hit "RuntimeError: CUDA error: an illegal memory access was encountered".

My build command was CC=gcc-8 CXX=g++-8 pip install ., since building directly with pip install . failed.


xk-huang commented Sep 1, 2022

There is also a leak issue reported by nanobind:

nanobind: leaked 4 instances!
nanobind: leaked 2 types!
 - leaked type "CholeskySolverF"
 - leaked type "MatrixType"
nanobind: leaked 2 functions!
 - leaked function "solve"
 - leaked function "__init__"
nanobind: this is likely caused by a reference counting issue in the binding code.

xk-huang changed the title from [BUG REPORT] "an illegal memory access was encountered" with joblib launcher to [BUG REPORT] "an illegal memory access was encountered" and "nanobind leak" on Sep 1, 2022

bathal1 commented Sep 1, 2022

Hi,

Thanks for the report. I haven't tested cholespy on multiple GPU setups, so it's possible that memory allocation is broken there.

Deleting the cuda_check calls is absolutely not going to solve your issue: it is just a wrapper that checks the return code of CUDA API calls and generates those error messages.

From what you described, it sounds like cholespy only uploads data to GPU 0, so the other devices can't access it. That makes sense, since the module initializes the CUDA context on device 0:

cuda_check(cuDeviceGet(&cu_device, 0));

As a sanity check, if you can control the number of GPUs on which you run your code, could you try setting it to 1 and see if it works then?

It would also be helpful to have a minimal reproducer (if possible) to try to reproduce the issue on my end.


bathal1 commented Sep 1, 2022

Resolving the multiple GPU case will require a few API changes to allow the user to explicitly specify a device. You would then be able to specify it for each thread in your parallel job.

In the meantime, you should be able to work around this by exposing only the desired device to each worker via the CUDA_VISIBLE_DEVICES environment variable:

import os

# Expose only the desired device to this process. This must run before the
# CUDA context is created, i.e. before constructing the solver.
os.environ['CUDA_VISIBLE_DEVICES'] = str(device_id)

# ...
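
For example, with joblib something along these lines should work. This is an untested sketch: the worker function and the round-robin device assignment are only illustrative, and it assumes CUDA has not been initialized in the worker before the variable is set.

import os
from joblib import Parallel, delayed

def run_job(job_idx, n_gpus=4):
    # Must run before any CUDA initialization in this worker process,
    # i.e. before importing torch/cholespy or constructing the solver.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(job_idx % n_gpus)

    import torch
    from cholespy import CholeskySolverF, MatrixType
    # ... build the matrix and right-hand side, construct the solver and call
    # solve() as usual; "device 0" now refers to the GPU selected above.

results = Parallel(n_jobs=4, backend='loky')(delayed(run_job)(i) for i in range(4))

One caveat: loky reuses worker processes, and CUDA_VISIBLE_DEVICES is only read when the CUDA context is first created, so the masking only takes effect for the first CUDA initialization in each worker.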

Steelwall2014 added a commit to Steelwall2014/cholespy that referenced this issue Mar 2, 2024