
Cmake v3.30.2 cudart link error #154

Open
Birch-san opened this issue Aug 5, 2024 · 6 comments · May be fixed by #155
Labels
build-system Issues related to build system

Comments

@Birch-san

Birch-san commented Aug 5, 2024

As there wasn't a torch 2.4.0 wheel, I tried building NATTEN myself. It didn't go as smoothly as usual.

Most problems were due to cmake giving misleading/incomplete error messages. These are the various errors I hit along the way:
Birch-san/sdxl-play#3 (comment)

Ultimately I think most problems here were just "my gcc and g++ alternatives didn't point anywhere after Ubuntu upgrade", but there is one change I had to make to setup.py to get it to build, and I'm not sure why cmake wasn't able to figure this out automatically, or try it as a guess:

setup.py

  f"-DNATTEN_CUDA_ARCH_LIST={cuda_arch_list_str}",
+ f"-DCUDA_CUDART_LIBRARY=/usr/local/cuda/lib64/libcudart.so",

Perhaps things have changed because newer cmake versions remove the FindCUDA module?

CMake Warning (dev) at CMakeLists.txt:11 (find_package):
  Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
  --help-policy CMP0146" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

This warning is for project developers.  Use -Wno-dev to suppress it.

Anyway, passing in the CUDA_CUDART_LIBRARY option persuaded it to try compiling.

Unfortunately it looks like that wasn't what it wanted… linking failed at the end of all of that.

/home/birch/git/sdxl-play/venv-311/lib/python3.11/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/natten.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC  -std=c++17 -shared -Wl,-soname,natten/libnatten.cpython-311-x86_64-linux-gnu.so -o natten/libnatten.cpython-311-x86_64-linux-gnu.so … -lcudart /usr/local/cuda/lib64/libcudart.so /usr/local/cuda/lib64/libnvToolsExt.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcudadevrt: No such file or directory
/usr/bin/ld: cannot find -lcudart_static: No such file or directory

seems like a perfectly typical value for CUDA_CUDART_LIBRARY though. and the library certainly exists:

ls /usr/local/cuda/lib64/ | grep cudart
libcudart.so
libcudart.so.12
libcudart.so.12.2.53
libcudart_static.a

any idea what I'm doing wrong? the errors don't seem rational…

@Birch-san
Author

I guess the reason CUDA_CUDART_LIBRARY was ineffective is that -lcudart appears in the libraries list in addition to /usr/local/cuda/lib64/libcudart.so.

probably what I really need to do is tell it to add /usr/local/cuda/lib64 as a link directory, so that the linker can find -lcudart -lcudadevrt -lcudart_static in that dir.

just need to remember which cmake convention to use for that…
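To make the failure mode concrete, here is a minimal Python sketch of how a `-lNAME` flag is resolved: the linker walks its search directories looking for `libNAME.so` or `libNAME.a`, and reports "cannot find -lNAME" if none of them has it. The function name `resolve_dash_l` is made up for illustration, and this simplifies what `ld` really does (no default system directories, no versioned sonames).

```python
import os
import tempfile


def resolve_dash_l(name, search_dirs):
    """Return the first lib<name>.so or lib<name>.a found in search_dirs, else None.

    Simplified model of `-l<name>` resolution: directories are tried in order,
    and within each directory the shared library is preferred over the archive.
    """
    for directory in search_dirs:
        for suffix in (".so", ".a"):
            candidate = os.path.join(directory, f"lib{name}{suffix}")
            if os.path.exists(candidate):
                return candidate
    return None


if __name__ == "__main__":
    # Stand-in for /usr/local/cuda/lib64, populated with empty placeholder files.
    with tempfile.TemporaryDirectory() as fake_lib64:
        for filename in ("libcudart.so", "libcudart_static.a"):
            open(os.path.join(fake_lib64, filename), "w").close()

        # Without the directory on the search path, nothing resolves.
        print(resolve_dash_l("cudart", []))
        # With it (the effect of a -L flag), both names resolve.
        print(resolve_dash_l("cudart", [fake_lib64]))
        print(resolve_dash_l("cudart_static", [fake_lib64]))
```

Under this model, passing the full path via CUDA_CUDART_LIBRARY only adds one more input file; it does nothing for the separate `-lcudart` entries, which still need a search directory that contains them.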

@alihassanijr
Member

alihassanijr commented Aug 5, 2024

Apologies for this; I dropped the ball on the 2.4 release; I'll build those wheels tonight.

I've always had bad experience with FindCUDA, and unfortunately it's difficult to link with libtorch through cmake without including theirs, and that's when everything goes wrong. Every time I've figured out a way around it it's been a hack, but somehow torch's docker images and NGC images aren't affected. So I don't think it's anything wrong with your environment, rather just FindCUDA being annoying as usual.

Also, if you know which version of the CUDA toolkit your local torch was compiled with, I can just build that binary first and post the link here -- building wheels takes a while now that 2.4 supports 3 different CTK versions and 5 python versions (together that's 15 CUDA wheels and 5 CPU).
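The wheel count is a plain cross product, which a short Python sketch can spell out. The toolkit and Python tags below are placeholders assumed for illustration; the comment above only gives the counts (3 CTK versions, 5 Python versions), and `wheel_matrix` is a hypothetical helper, not part of NATTEN's build scripts.

```python
from itertools import product

# Assumed, illustrative tags: the thread states only how many there are.
CTK_TAGS = ["cu118", "cu121", "cu124"]
PYTHON_TAGS = ["cp38", "cp39", "cp310", "cp311", "cp312"]


def wheel_matrix(ctk_tags, python_tags):
    """Return (cuda_builds, cpu_builds) as lists of (backend, python_tag) pairs."""
    # One CUDA wheel per (toolkit, python) combination.
    cuda_builds = list(product(ctk_tags, python_tags))
    # One CPU-only wheel per python version.
    cpu_builds = [("cpu", py) for py in python_tags]
    return cuda_builds, cpu_builds


cuda_builds, cpu_builds = wheel_matrix(CTK_TAGS, PYTHON_TAGS)
print(len(cuda_builds), "CUDA wheels,", len(cpu_builds), "CPU wheels")
```

Building the one combination a user actually needs first just means picking a single pair out of `cuda_builds`.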

@Birch-san
Author

no worries, there's always too much to be done!

I'm pretty much done for the night but I think my last idea might get it building locally.

for some reason CXXFLAGS='-L/usr/local/cuda/lib64' env var didn't work, as in:

CXXFLAGS='-L/usr/local/cuda/lib64' CUDACXX=/usr/local/cuda/bin/nvcc NATTEN_CUDA_ARCH=8.9 NATTEN_VERBOSE=1 NATTEN_IS_BUILDING_DIST=1 NATTEN_WITH_CUDA=1 NATTEN_N_WORKERS=8 python setup.py bdist_wheel -d out/wheels/cu121/torch/240

and by "didn't work" I mean that it didn't introduce any -L/usr/local/cuda/lib64 option into:
build/lib.linux-x86_64-cpython-311/CMakeFiles/natten.dir/link.txt

so I modified csrc/CMakeLists.txt:

  if(${NATTEN_WITH_CUDA})
    target_link_libraries(natten PUBLIC c10 torch torch_cpu torch_python cudart c10_cuda torch_cuda)
+   message("Adding to target 'natten', link directory: ${CUDA_TOOLKIT_ROOT_DIR}/lib64")
+   target_link_directories(natten PUBLIC ${CUDA_TOOLKIT_ROOT_DIR}/lib64)

And this seems to have succeeded in adding a -L/usr/local/cuda/lib64 to natten.dir/link.txt.
will see how it goes.
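For checking whether a change like this actually reached the link line, a small Python sketch can pull the `-L` directories out of the generated link command. `library_dirs` is a hypothetical helper name, and it assumes the link command is a single shell-style command line, as in the verbose output earlier in this issue.

```python
import shlex


def library_dirs(link_command):
    """Collect directories passed via -L, in both '-Ldir' and '-L dir' forms."""
    tokens = iter(shlex.split(link_command))
    dirs = []
    for token in tokens:
        if token == "-L":
            # Separate form: the directory is the next token, if there is one.
            following = next(tokens, None)
            if following is not None:
                dirs.append(following)
        elif token.startswith("-L"):
            # Joined form: strip the two-character flag prefix.
            dirs.append(token[2:])
    return dirs


# Usage against the file mentioned above (path as given in this comment):
# with open("build/lib.linux-x86_64-cpython-311/CMakeFiles/natten.dir/link.txt") as f:
#     print(library_dirs(f.read()))
```

An empty list before the patch and a list containing /usr/local/cuda/lib64 after it would match what is described here. Lowercase `-lcudart` and `-Wl,...` tokens are ignored because only the uppercase `-L` prefix is matched.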

=====

if you know which version of CUDA toolkit your local torch was compiled with I can just build that binary first

Thanks! Is this it?

print(torch._C._cuda_getCompiledVersion())
12010

torch.version.cuda
'12.1'

torch.__version__
'2.4.0+cu121'
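The three values above agree with each other, and a short Python sketch shows how. The integer form follows CUDA's usual version encoding (major * 1000 + minor * 10), so 12010 means 12.1. The helper names here are made up for illustration, and the sketch works on the printed values rather than importing torch.

```python
def decode_cuda_version(compiled_version):
    """Split CUDA's integer encoding (major * 1000 + minor * 10) into (major, minor)."""
    major, remainder = divmod(compiled_version, 1000)
    return major, remainder // 10


def wheel_tag(cuda_version):
    """Turn a dotted version such as '12.1' into a wheel-style tag such as 'cu121'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def local_tag(torch_version):
    """Return the part after '+' in a version such as '2.4.0+cu121', or None."""
    _, _, local = torch_version.partition("+")
    return local or None


# Values copied from the outputs above.
major, minor = decode_cuda_version(12010)
assert wheel_tag(f"{major}.{minor}") == wheel_tag("12.1") == local_tag("2.4.0+cu121")
```

Any one of the three is enough to identify the matching wheel; the `+cu121` suffix on `torch.__version__` is the most direct.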

@alihassanijr
Member

Yeah, the FindCUDA module is a big pain; I've sometimes been successful in working around it but never wrote it down 😅 .

Thanks! Is this it?

Yes perfect! I'll post that wheel here when it builds.

@Birch-san
Author

ah! my local build succeeded. NATTEN now working with torch 2.4.0. in the end, all I needed was that target_link_directories() patch. wonder why.

@alihassanijr
Member

alihassanijr commented Aug 5, 2024

Oh nice; feel free to drop the diff here or even open a PR; I wouldn't rule out NATTEN's cmake config doing something wrong.

I guess if the actual issue was a linking error in the end it makes sense; I originally thought FindCUDA was just blocking everything. Anyway I'll try and redo the cmake config soon; I hacked it together one time last year when we made the switch and haven't looked at it since.

@alihassanijr alihassanijr added the build-system Issues related to build system label Nov 14, 2024