Issue with cuda lib? #659

vahluw · 2023-05-22T20:04:11Z

vahluw
May 22, 2023

GaNDLF Version

0.0.17-dev

Version information of the GaNDLF package in the virtual environment.

Desktop (please complete the following information):

OS: Linux
Version (including Build information, if any): [e.g. Fedora 22 or Windows 10.1803]

How did you install GaNDLF
Please provide all steps followed during installation.

Dataset description
Describe the data (radiology/histology/so on, dimensions, etc.).
Radiology, 3D, breast images

Describe your question/problem
A clear and concise description of what issue you are facing.
I just pulled the newest code and am starting to have issues:

"python ./gandlf_verifyInstall" had no issues

CUDA version: 11.2
Command: python /home/GaNDLF/gandlf_run -c /home/config_20.yaml -i /home/train.csv -m /home/output -t True -d cuda

I am getting the following output when trying to train:

Traceback (most recent call last):
  File "/cbica/home/ahluwalv/GaNDLF/gandlf_run", line 11, in <module>
    from GANDLF.cli import main_run, copyrightMessage
  File "/gpfs/fs001/cbica/home/ahluwalv/GaNDLF/GANDLF/cli/__init__.py", line 1, in <module>
    from .patch_extraction import patch_extraction
  File "/gpfs/fs001/cbica/home/ahluwalv/GaNDLF/GANDLF/cli/patch_extraction.py", line 7, in <module>
    from GANDLF.data.patch_miner.opm.patch_manager import PatchManager
  File "/gpfs/fs001/cbica/home/ahluwalv/GaNDLF/GANDLF/data/__init__.py", line 1, in <module>
    from torch.utils.data import DataLoader
  File "/cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File "/cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File "/cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

sarthakpati · 2023-05-22T20:50:20Z

sarthakpati
May 22, 2023
Maintainer

How did you install PyTorch? If you used Conda, can you try using pip [ref]?

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116


vahluw · 2023-05-22T21:10:13Z

vahluw
May 22, 2023
Author

What should I do if the newest version of CUDA my computer supports is 11.2?


sarthakpati · 2023-05-22T22:31:21Z

sarthakpati
May 22, 2023
Maintainer

What should I do if the newest version of CUDA my computer supports is 11.2?

If your hardware supports CUDA 11.2, it should support 11.6. Just follow the pip installation (it should install CUDA and all required libraries into your virtual environment).
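To confirm which PyTorch build ends up in the environment after the pip install, a short check along these lines may help. This is only a sketch and not part of GaNDLF: the function name `describe_torch_install` is hypothetical, and it relies solely on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch_install():
    """Return basic facts about the PyTorch build in the active environment."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        # OSError covers shared-library load failures such as the
        # libcublas error shown in the traceback above.
        return {"installed": False, "error": str(exc)}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # True only if a usable driver and GPU are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` reports a version but `cuda_available` is false, the wheel is probably fine and the problem more likely lies with the host driver or the node the job ran on.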


vahluw · 2023-05-22T22:48:41Z

vahluw
May 22, 2023
Author

So that fixed that, but now I'm getting the following:

Looping over training data for penalty calculation: 0%| | 0/2426 [00:00<?, ?it/s]/cbica/projects/DBT_AI/.conda/envs/venv_gandlf_new/lib/python3.8/site-packages/torchio/data/io.py:36: UserWarning: Error loading image with SimpleITK:
Exception thrown in SimpleITK ImageFileReader_Execute: /tmp/SimpleITK-build/ITK-prefix/include/ITK-5.2/itkImportImageContainer.hxx:192:
Failed to allocate memory for image.

Trying NiBabel...
warnings.warn(message)

Looping over training data for penalty calculation: 0%| | 0/2426 [00:00<?, ?it/s]
ERROR:


Geeks-Sid · 2023-05-23T12:40:34Z

Geeks-Sid
May 23, 2023
Maintainer

Hmm, this seems like an interesting error. To help us assist you better, can you please let us know your ITK version and the output of
head -n 2 data.csv?


vahluw · 2023-05-23T13:43:47Z

vahluw
May 23, 2023
Author

subjectID,channel_0,label
0,/cbica/projects/DBT_AI/Data/masks/75712684_PROC_LCC_RC_2/75712684_PROC_LCC_RC_mat.nii.gz,/cbica/projects/DBT_AI/Data/masks/75712684_PROC_LCC_RC_2/75712684_PROC_LCC_RC_mask.nii.gz

ITK version: 3.8.0


sarthakpati · 2023-05-23T18:39:32Z

sarthakpati
May 23, 2023
Maintainer

Can you mention the output of the following command:

# activate gandlf python environment
python -c "import SimpleITK as sitk;image=sitk.ReadImage('/cbica/projects/DBT_AI/Data/masks/75712684_PROC_LCC_RC_2/75712684_PROC_LCC_RC_mat.nii.gz');print(image.GetSize());mask=sitk.ReadImage('/cbica/projects/DBT_AI/Data/masks/75712684_PROC_LCC_RC_2/75712684_PROC_LCC_RC_mask.nii.gz');print(mask.GetSize())"

Also, it would be great if you can post at least the mask so that we can debug further.


vahluw · 2023-05-23T19:29:03Z

vahluw
May 23, 2023
Author

(1996, 2457, 73)
(1996, 2457, 73)

What do you mean by posting the mask?
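For a sense of scale, the reported size (1996, 2457, 73) can be turned into a rough in-memory footprint, which may be relevant to the earlier "Failed to allocate memory for image" warning. The sketch below is an estimate only: the stored pixel type of these files is not stated in the thread, so the bytes-per-voxel values are assumptions, and `estimate_image_bytes` is a hypothetical helper.

```python
from math import prod


def estimate_image_bytes(size, bytes_per_voxel):
    """Uncompressed size in bytes of an image with the given dimensions."""
    return prod(size) * bytes_per_voxel


size = (1996, 2457, 73)  # as printed by image.GetSize() above

# Assumed pixel types; the actual type of the files is not given in the thread.
for name, nbytes in (("uint8", 1), ("int16", 2), ("float32", 4)):
    gib = estimate_image_bytes(size, nbytes) / 2**30
    print(f"{name}: {gib:.2f} GiB")
```

At four bytes per voxel a single volume comes to roughly 1.3 GiB, before counting the mask or any copies made during preprocessing, so the memory available to the job is worth checking.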


vahluw · 2023-05-23T19:30:21Z

vahluw
May 23, 2023
Author

This is also at the top of the error file:
No NVIDIA kernel driver module found, skipping CUDA


sarthakpati · 2023-05-23T19:31:23Z

sarthakpati
May 23, 2023
Maintainer

(1996, 2457, 73)
(1996, 2457, 73)

Hmm, if the piece of code I replied with is giving this output, it means that the IO is working as expected.

What do you mean by posting the mask?

I meant uploading it here for us to debug. But it doesn't matter, since the IO is working correctly (as seen from the output of the command I sent).

No NVIDIA kernel driver module found, skipping CUDA

This is unrelated to GaNDLF and depends on the host machine.


vahluw · 2023-05-23T19:33:48Z

vahluw
May 23, 2023
Author

I tried running the same job with 1/5 of the training data and it ran without an error. However, I'm getting this:
Epoch Final train loss : 1.0
Epoch Final train dice : 0.0
Epoch Final train dice_per_label : [0.0, 0.0]
Epoch Final train iou : 0.34576352043151853
Epoch Final train f1 : 0.5872102342128753
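As background for reading these numbers, here is a minimal sketch of the Dice score on binary masks. It assumes the reported loss is a Dice-based loss (1 minus Dice), which the thread does not state, and it is a plain-Python illustration with a hypothetical `dice_score` helper, not GaNDLF's implementation.

```python
def dice_score(pred, target):
    """Dice overlap between two equal-length binary sequences (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        # Convention chosen for this sketch: two empty masks are a perfect match.
        return 1.0
    return 2.0 * intersection / denominator


target = [0, 0, 1, 1, 0, 0]
empty_pred = [0, 0, 0, 0, 0, 0]  # prediction with no foreground at all

print(dice_score(empty_pred, target))        # 0.0
print(1.0 - dice_score(empty_pred, target))  # 1.0
```

Under that assumption, a Dice of 0.0 with a loss of 1.0 is what a prediction with no overlap with the label would produce.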

4 replies

sarthakpati May 24, 2023
Maintainer

Not technically an error. Please play around with the optimizer/scheduler/learning_rate.

Geeks-Sid May 24, 2023
Maintainer

Also, please check that your preprocessing and augmentation parameters suit your data. The appropriate preprocessing differs between CT, MRI, and RGB images.

Liu7749 Aug 12, 2024

I have a new problem:
When I import anndata or torch, my terminal says:
UnboundLocalError: local variable 'cublas_path' referenced before assignment

sarthakpati Aug 13, 2024
Maintainer

Are you able to run nvidia-smi? If yes, can you post the output?

If nvidia-smi runs, please try creating a completely new virtual environment and follow the installation instructions.
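The same check can be scripted from inside the Python environment, which is handy on clusters where the login node and the compute node differ. This is a sketch using only the standard library; `run_nvidia_smi` is a hypothetical helper name, and it does nothing beyond locating `nvidia-smi` on `PATH` and capturing its output.

```python
import shutil
import subprocess


def run_nvidia_smi(timeout_s=30):
    """Return (found, output); found is False when nvidia-smi is not on PATH."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found on PATH"
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return True, f"nvidia-smi could not be run: {exc}"
    # nvidia-smi writes its table to stdout and driver errors to stderr.
    return True, result.stdout or result.stderr


if __name__ == "__main__":
    found, output = run_nvidia_smi()
    print(f"nvidia-smi found: {found}")
    print(output)
```

Running it inside the same job that launches training shows whether the driver is visible on that particular machine.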
