Failed to build from source (pytorch 1.13.1 + CUDA 11.6) #2321

eric8607242 · 2023-01-04T10:25:26Z

Hello,

Thanks for your attention to my issue.
I am trying to build ColossalAI 0.2.0 from source against PyTorch 1.13.1 with CUDA 11.6.
However, the build always fails at the link step with the following error:

      [1/1] c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/eric8607242/anaconda3/envs/cmdgpt2/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/home/eric8607242/anaconda3/envs/cmdgpt2/lib64 -lcudart -o fused_optim.so
      FAILED: fused_optim.so
      c++ colossal_C_frontend.o multi_tensor_sgd_kernel.cuda.o multi_tensor_scale_kernel.cuda.o multi_tensor_adam.cuda.o multi_tensor_l2norm_kernel.cuda.o multi_tensor_lamb.cuda.o -shared -L/home/eric8607242/anaconda3/envs/cmdgpt2/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/home/eric8607242/anaconda3/envs/cmdgpt2/lib64 -lcudart -o fused_optim.so
      /usr/bin/ld: cannot find -lcudart: No such file or directory
      collect2: error: ld returned 1 exit status

My environment is as follows:

Ubuntu 22.04
CUDA 11.6
PyTorch 1.13.1
GeForce RTX 3090

Thanks for your help. Hope you have a good day.

By the way, I can build ColossalAI successfully with PyTorch 1.12.0.
However, PyTorch 1.12.0 has an illegal GPU memory access issue.
Related links:
pytorch/pytorch#85395
pytorch/pytorch#85005

I would appreciate it if you could publish an official release of ColossalAI 0.2.0 built for PyTorch 1.13.1.

Answered by eric8607242

Jan 5, 2023

eric8607242 · 2023-01-05T03:17:15Z

Hi all,
Thanks for your attention!
I managed to build from source successfully!

The root cause of the build failure was a wrong CUDA path, caused by a missing environment variable.
ColossalAI's setup script takes the CUDA path from torch.utils.cpp_extension.CUDA_HOME.
When no suitable environment variable is set, PyTorch derives torch.utils.cpp_extension.CUDA_HOME from the output of the command "which nvcc". (ref. https://github.com/pytorch/pytorch/blob/master/torch/utils/cpp_extension.py#L90)
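
The lookup order described above can be sketched in a few lines of Python. This is a simplified illustration and not PyTorch's actual code: the function name guess_cuda_home, its parameters, and the /usr/local/cuda default are assumptions made for the example.

```python
import os
import shutil


def guess_cuda_home(environ=None, which=shutil.which, default="/usr/local/cuda"):
    """Simplified sketch of the CUDA path lookup order (not PyTorch's code)."""
    environ = os.environ if environ is None else environ

    # 1. An explicitly set environment variable wins.
    explicit = environ.get("CUDA_HOME") or environ.get("CUDA_PATH")
    if explicit:
        return explicit

    # 2. Otherwise take the first nvcc found on PATH and go two levels up:
    #    <prefix>/bin/nvcc -> <prefix>
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    # 3. Last resort: a conventional install location (assumed), if it exists.
    return default if os.path.exists(default) else None


if __name__ == "__main__":
    print("guessed CUDA home:", guess_cuda_home())
```

With step 2, an nvcc that sits in a conda environment's bin/ directory makes the environment prefix itself the guessed CUDA home, which matches the -L.../envs/cmdgpt2/lib64 flag in the failing link command.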

However, in an Anaconda virtual environment with pytorch-cuda=11.6 installed, there is a second nvcc executable in the environment's bin/ directory, so "which nvcc" finds that one first and returns the wrong path instead of the real CUDA installation.
To fix this, I set the environment variable CUDA_HOME to the correct CUDA path before building, so that PyTorch's CUDA_HOME resolves to the correct path.
With this variable set, I successfully built ColossalAI from source.
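
To see whether the same situation applies on another machine, a small diagnostic along these lines may help. It is a sketch under assumptions: the helper names are made up for this example, and checking only lib64/ mirrors the -L.../lib64 flag in the log above rather than anything ColossalAI documents.

```python
import glob
import os


def find_nvcc_candidates(path=None):
    """Return every executable nvcc on PATH, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, "nvcc")
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


def has_cudart_in_lib64(cuda_home):
    """True if a libcudart file exists under <cuda_home>/lib64."""
    return bool(glob.glob(os.path.join(cuda_home, "lib64", "libcudart*")))


if __name__ == "__main__":
    for nvcc in find_nvcc_candidates():
        # <prefix>/bin/nvcc -> <prefix>, the CUDA home this nvcc would imply
        home = os.path.dirname(os.path.dirname(nvcc))
        print(f"{nvcc} -> {home} (libcudart in lib64: {has_cudart_in_lib64(home)})")
```

If the first nvcc listed implies a prefix without libcudart in lib64/, exporting CUDA_HOME to the prefix of an entry that does contain it, before starting the build, corresponds to the fix described above.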

Thanks for the great work!
Hope you have a nice day.
