Does Phos support lambda? #13
Comments
I suspect that …
Perhaps you could print the first few bytes in hex mode. I'm not entirely sure at the moment.
```c
if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
    LOGE(LOG_ERROR, "image is not an ELF!");
    /* debug: dump the first 32 words of the rejected image */
    unsigned long *ptr = (unsigned long *)image;
    for (int i = 0; i < 32; i++) {
        fprintf(stderr, "%lx ", ptr[i]);
    }
    return CUDA_ERROR_INVALID_IMAGE;
}
```
and I get the following output:
```
+00:00:11.642081 ERROR: image is not an ELF! in cpu-client-driver.c:466
6547202f2f0a2f2f 206465746172656e 494449564e207962 43204d56564e2041 a72656c69706d6f 6f43202f2f0a2f2f 422072656c69706d 3a444920646c6975 323939322d4c4320 202f2f0a30333130 6d6f632061647543 6e6f6974616c6970 202c736c6f6f7420 20657361656c6572 3156202c332e3131 a3930312e332e31 6465736142202f2f 4d56564e206e6f20 2f0a312e302e3720 737265762e0a0a2f a332e37206e6f69 207465677261742e 612e0a36385f6d73 735f737365726464 a0a343620657a69 6f6c672e202f2f09 6375646572096c62 6f72705f6e6f6974 6c656e72656b5f64 6c61626f6c672e0a 206e67696c612e20 6e203233752e2034
```
Could this information provide what you need? Should I add any more information?
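(Aside: the dump above prints the buffer as little-endian unsigned longs, so any ASCII content appears byte-swapped. A minimal sketch of a byte-wise dump that would make a text-based image readable directly; `dump_image_head` is just an illustrative name and assumes the same `image` pointer as above:)

```c
#include <ctype.h>
#include <stdio.h>

/* Print the first `len` bytes of the image, keeping printable ASCII and
 * newlines as-is so that a text image (e.g. PTX) can be read directly;
 * everything else is shown as \xNN. */
static void dump_image_head(const void *image, size_t len)
{
    const unsigned char *p = (const unsigned char *)image;
    for (size_t i = 0; i < len; i++) {
        if (isprint(p[i]) || p[i] == '\n')
            fputc(p[i], stderr);
        else
            fprintf(stderr, "\\x%02x", p[i]);
    }
    fputc('\n', stderr);
}
```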
It's a PTX image, but it's strange to see the PyTorch framework load a PTX image directly; I don't know why.
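(Indeed, byte-swapping the dumped words yields a PTX header: `// Generated by NVIDIA NVVM Compiler`, `.version 7.3`, `.target sm_86`, `.address_size 64`, followed by `// .globl reduction_prod_kernel`.)

For context, cuModuleLoadData is documented to accept a NUL-terminated PTX string in addition to cubin/fatbin images, so a client-side check that only admits ELF images will reject PTX that a framework generates at runtime. Below is a minimal, standalone sketch of loading PTX text through the driver API; it is illustrative only (assuming a CUDA 11+ driver and a usable device) and is not PhOS/cricket code:

```c
#include <cuda.h>
#include <stdio.h>

/* A tiny NUL-terminated PTX module: no ELF magic, just ASCII text. */
static const char ptx[] =
    ".version 7.0\n"
    ".target sm_52\n"
    ".address_size 64\n"
    ".visible .entry noop() { ret; }\n";

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The driver JIT-compiles the PTX text here; an interposer that
     * forwards only ELF (cubin) images would reject this call. */
    CUresult rc = cuModuleLoadData(&mod, ptx);
    printf("cuModuleLoadData returned %d\n", (int)rc);

    cuCtxDestroy(ctx);
    return 0;
}
```

That is presumably what PyTorch's NVRTC-based JIT path does as well, which would explain a PTX image arriving at cuModuleLoadData instead of a cubin.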
Is this caused by the JIT? I printed the call stack of the client (Python) process, and frame #12 of the stack is a kernel jitted_gpu_reduce_kernel, but even if I run with PYTORCH_JIT=0 the problem still exists, so I am also confused. The following is the GDB call-stack trace for the Python process:
(gdb) bt
#0 0x00007f1d69269024 in ShmBuffer::FillIn() () from /lib/x86_64-linux-gnu/cricket-client.so
#1 0x00007f1d69268f33 in ShmBuffer::getBytes(char*, int) () from /lib/x86_64-linux-gnu/cricket-client.so
#2 0x00007f1d6926766b in XDRDevice::Getlong(long*) () from /lib/x86_64-linux-gnu/cricket-client.so
#3 0x00007f1d69267b02 in xdrdevice_getlong () from /lib/x86_64-linux-gnu/cricket-client.so
#4 0x00007f1d6891a459 in xdr_u_int () from /usr/lib/x86_64-linux-gnu/libtirpc.so.3
#5 0x00007f1d6891adc7 in xdr_string () from /usr/lib/x86_64-linux-gnu/libtirpc.so.3
#6 0x00007f1d691c2fcf in xdr_str_result () from /lib/x86_64-linux-gnu/cricket-client.so
#7 0x00007f1d6926aa87 in AsyncBatch::Call(unsigned int, int (*)(__rpc_xdr*, ...), void*, int (*)(__rpc_xdr*, ...), void*, timeval, int&, __rpc_xdr*, __rpc_xdr*, _detailed_info*, int, int, DeviceBuffer*, DeviceBuffer*) () from /lib/x86_64-linux-gnu/cricket-client.so
#8 0x00007f1d692686c9 in clnt_device_call(__rpc_client*, unsigned int, int (*)(__rpc_xdr*, ...), void*, int (*)(__rpc_xdr*, ...), void*, timeval) ()
from /lib/x86_64-linux-gnu/cricket-client.so
#9 0x00007f1d691ce370 in rpc_cugeterrorstring_1 () from /lib/x86_64-linux-gnu/cricket-client.so
#10 0x00007f1d69200e8b in cuGetErrorString () from /lib/x86_64-linux-gnu/cricket-client.so
#11 0x00007f1cbd8c8453 in at::cuda::jit::jit_pwise_function(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#12 0x00007f1cbcb0d545 in void at::native::jitted_gpu_reduce_kernel<&at::native::prod_name, long, long, 4, double>(at::TensorIterator&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, at::native::AccumulationBuffer*, long) [clone .constprop.0] ()
from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#13 0x00007f1cbcb1690f in at::native::prod_kernel_cuda(at::TensorIterator&) () from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#14 0x00007f1ce2bfa6a9 in at::native::impl_func_prod(at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>, at::Tensor const&)
() from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007f1ce2bfa772 in at::native::structured_prod_out::impl(at::Tensor const&, long, bool, c10::optional<c10::ScalarType>, at::Tensor const&) ()
from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007f1cbd55e7a4 in at::(anonymous namespace)::wrapper_prod_dim_int(at::Tensor const&, long, bool, c10::optional<c10::ScalarType>) ()
from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007f1cbd55e872 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, long, bool, c10::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_prod_dim_int>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, long, bool, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, long, bool, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long, bool, c10::optional<c10::ScalarType>) ()
from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
When I try the samples in the llama2 examples, pos_cli reports an error:
For convenience of debugging, I replaced the llama2 model with gpt2, and the Python process reports the same error as with llama2:
I used GDB to trace the call stack of the Python process and found that phos fails with the following call stack:
Moreover, I also checked the code at cpu-client-driver.c:466, and the code responsible for this error is related to cuModuleLoadData.
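For example, I wonder whether that check could be relaxed to also recognize PTX text instead of only ELF images. A rough sketch of what I mean (hypothetical helper names, for illustration only, not the actual PhOS code):

```c
#include <ctype.h>
#include <elf.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical classification of the image passed to cuModuleLoadData:
 * a cubin is an ELF object, while JIT-generated PTX is NUL-terminated
 * ASCII text that typically starts with a "//" comment or a directive. */
static bool image_is_elf(const void *image)
{
    return memcmp(image, ELFMAG, SELFMAG) == 0;
}

static bool image_looks_like_ptx(const void *image)
{
    const char *s = (const char *)image;
    for (int i = 0; i < 16 && s[i] != '\0'; i++) {
        if (!isprint((unsigned char)s[i]) && s[i] != '\n' && s[i] != '\t')
            return false;
    }
    return s[0] == '/' || s[0] == '.';
}
```

I don't know how the PTX image would then need to be forwarded and handled on the server side, though.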
So, can Phos support GPU lambdas, and how can I fix this problem?