[ERROR] in the last step of pip install . #1086

Closed
wplf opened this issue Aug 7, 2024 · 5 comments · Fixed by #1103

Comments

wplf commented Aug 7, 2024

Bug

/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:98:3: error: ‘cudaDriverEntryPointQueryResult’ was not declared in this scope

This bug may be caused by commit "f9dd37f7bd4c08c79bd6b7bece25dfe286225458"

Sadly, I have to develop on v1.7.

Environment

  • torch 2.4
  • CUDA 11.8
  • NVCC 11.8
  • GCC 9.4.0
  • cmake 3.16.3

Description

The build stopped at the last step, [32/33]:

[32/33] Building CUDA object CMakeFiles/transformer_engine.dir/layer_norm/ln_fwd_cuda_kernel.cu.o ninja: build stopped: subcommand failed.

/vepfs/home/lijinliang/miniconda3/envs/megatron/lib/python3.10/site-packages/cmake/data/bin/cmake --build /vepfs/home/lijinliang/projects/TransformerEngine/build/cmake --parallel 32
[1/2] Building CXX object CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
FAILED: CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
/usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/.. -I/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/include -I/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/vepfs/home/lijinliang/projects/TransformerEngine/build/cmake/string_headers -isystem /usr/local/cuda/targets/x86_64-linux/include -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o -MF CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o.d -o CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o -c /vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp: In function ‘void* transformer_engine::cuda_driver::get_symbol(const char*)’:
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:98:3: error: ‘cudaDriverEntryPointQueryResult’ was not declared in this scope
98 | cudaDriverEntryPointQueryResult driver_result;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/../common.h:25,
from /vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:11:
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:99:85: error: ‘driver_result’ was not declared in this scope
99 | NVTE_CHECK_CUDA(cudaGetDriverEntryPoint(symbol, &entry_point, cudaEnableDefault, &driver_result));
| ^~~~~~~~~~~~~
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/.././util/logging.h:36:49: note: in definition of macro ‘NVTE_CHECK_CUDA’
36 | const cudaError_t status_NVTE_CHECK_CUDA = (expr);
| ^~~~
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:100:14: error: ‘driver_result’ was not declared in this scope
100 | NVTE_CHECK(driver_result == cudaDriverEntryPointSuccess,
| ^~~~~~~~~~~~~
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/.././util/logging.h:28:11: note: in definition of macro ‘NVTE_CHECK’
28 | if (!(expr)) {
| ^~~~
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:100:31: error: ‘cudaDriverEntryPointSuccess’ was not declared in this scope; did you mean ‘cudaGetDriverEntryPointFlags’?
100 | NVTE_CHECK(driver_result == cudaDriverEntryPointSuccess,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/vepfs/home/lijinliang/projects/TransformerEngine/transformer_engine/common/util/.././util/logging.h:28:11: note: in definition of macro ‘NVTE_CHECK’
28 | if (!(expr)) {
| ^~~~
ninja: build stopped: subcommand failed.

wplf commented Aug 7, 2024

I wonder whether the CUDA libraries are causing this problem, although NVCC and the CUDA libraries are on my PATH and LD_LIBRARY_PATH.

@timmoon10
Collaborator

Can you try upgrading to CUDA 12.0 or newer?

This error shows up because #970 uses cudaGetDriverEntryPoint to access the CUDA driver. However, the function signature is slightly different in CUDA 11.8 and CUDA 12.0. We should either bump the minimum CUDA version or add some version-dependent logic at:

void *get_symbol(const char *symbol) {
  void *entry_point;
  cudaDriverEntryPointQueryResult driver_result;
  NVTE_CHECK_CUDA(cudaGetDriverEntryPoint(symbol, &entry_point, cudaEnableDefault, &driver_result));
  NVTE_CHECK(driver_result == cudaDriverEntryPointSuccess,
             "Could not find CUDA driver entry point for ", symbol);
  return entry_point;
}
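
For illustration, the "version-dependent logic" option could look roughly like the sketch below. It is not the change that was merged in #1103, and several things in it are assumptions: the names `get_symbol_compat`, `has_driver_query_result`, and `QUERY_RESULT_MIN_CUDART` are hypothetical, plain exceptions stand in for the `NVTE_CHECK` macros, and it assumes CUDA 11.x or 12.x headers. It relies on the CUDA 11.x form of `cudaGetDriverEntryPoint` taking three arguments, with the `cudaDriverEntryPointQueryResult` output parameter appearing in CUDA 12.0. On the 11.x path the only available signals are the returned `cudaError_t` and a non-null pointer, so treating those as "found" is also an assumption.

```cpp
#include <stdexcept>
#include <string>

// Assumed threshold: CUDART_VERSION is major*1000 + minor*10, so 12000 is CUDA 12.0,
// the first release whose headers declare cudaDriverEntryPointQueryResult.
#define QUERY_RESULT_MIN_CUDART 12000

// Hypothetical helper that documents the threshold in testable form.
constexpr bool has_driver_query_result(int cudart_version) {
  return cudart_version >= QUERY_RESULT_MIN_CUDART;
}

// The CUDA-dependent part is only compiled when the runtime headers are available,
// so the snippet still builds on a machine without the toolkit.
#if defined(__has_include)
#if __has_include(<cuda_runtime_api.h>)
#include <cuda_runtime_api.h>

// Hypothetical variant of get_symbol that builds against CUDA 11.x and 12.x.
inline void *get_symbol_compat(const char *symbol) {
  void *entry_point = nullptr;
#if CUDART_VERSION >= QUERY_RESULT_MIN_CUDART
  // CUDA 12.x: the lookup status is reported through a fourth argument.
  cudaDriverEntryPointQueryResult driver_result;
  const cudaError_t status =
      cudaGetDriverEntryPoint(symbol, &entry_point, cudaEnableDefault, &driver_result);
  const bool found = status == cudaSuccess && driver_result == cudaDriverEntryPointSuccess;
#else
  // CUDA 11.x: three-argument form; rely on the return code and the pointer.
  const cudaError_t status = cudaGetDriverEntryPoint(symbol, &entry_point, cudaEnableDefault);
  const bool found = status == cudaSuccess && entry_point != nullptr;
#endif
  if (!found) {
    throw std::runtime_error(std::string("Could not find CUDA driver entry point for ") + symbol);
  }
  return entry_point;
}

#endif  // __has_include(<cuda_runtime_api.h>)
#endif  // defined(__has_include)
```

With CUDA 11.8 headers (`CUDART_VERSION` 11080) the preprocessor selects the three-argument branch, which avoids the undeclared `cudaDriverEntryPointQueryResult` reported in the log above.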

wplf commented Aug 8, 2024

Thank you for your kindness.
I will upgrade CUDA!

@jinghere11

Since I can't upgrade CUDA in a shared environment, can you tell me which version of Transformer Engine would work?

wplf commented Aug 23, 2024

Of course. My version is 1.7.0+4e7caa1; you can check out 4e7caa1 and build from source.
