Does PhOS support lambda? #13

Open
182yzh opened this issue Dec 2, 2024 · 5 comments
182yzh commented Dec 2, 2024

When I try the samples in the llama2 examples, pos_cli reports an error:

root@iZt4n09kz1g7mi4j4b1vckZ:~# pos_cli --start --target daemon
 POS Log  >>>>>>>>>> PhOS Workspace <<<<<<<<<<
 _____  _                      _       ____   _____
|  __ \| |                    (_)     / __ \ / ____|
| |__) | |__   ___   ___ _ __  ___  _| |  | | (___
|  ___/| '_ \ / _ \ / _ \ '_ \| \ \/ / |  | |\___ \
| |    | | | | (_) |  __/ | | | |>  <| |__| |____) |
|_|    |_| |_|\___/ \___|_| |_|_/_/\_\\____/|_____/

 POS Log  PhoenixOS workspace created, welcome!
+00:00:00.286012 INFO:  waiting for RPC requests...
Cache Optimization: Enabled!
Async Optimization: Enabled!
Handler Optimization: Enabled!
xpu remote address: localhost
create shm buffer
Segmentation fault (core dumped)
 POS Warn  failed execution of command cricket-rpc-server 2>&1: exit_code(139)
 POS Warn  failed to start posd

To make debugging easier, I replaced the llama2 model with gpt2, and the Python process reports the same error as with llama2:

+00:00:11.613831 ERROR: image is not an ELF!    in cpu-client-driver.c:466

I then used GDB to inspect the core dump from cricket-rpc-server, and found that PhOS fails with the following call stack:

root@iZt4n09kz1g7mi4j4b1vckZ:~# gdb cricket-rpc-server core-cricket-rpc-ser-635504-1733110269
......
# Some irrelevant information is omitted here

warning: Unexpected size of section `.reg-xstate/635511' in core file.
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `cricket-rpc-server'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/635511' in core file.
#0  0x00007f19407346f5 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f18b8fde000 (LWP 635511))]
(gdb) bt
#0  0x00007f19407346f5 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f19584a7e25 in xdr_string () from /usr/lib/x86_64-linux-gnu/libtirpc.so.3
#2  0x000055c329ff541f in xdr_str_result ()
#3  0x000055c32a042752 in dispatch(int, __rpc_xdr*, __rpc_xdr*) ()
#4  0x000055c32a044be8 in svc_run::{lambda(int)#1}::operator()(int) const ()
#5  0x000055c32a045898 in void std::__invoke_impl<void, svc_run::{lambda(int)#1}, int>(std::__invoke_other, svc_run::{lambda(int)#1}&&, int&&) ()
#6  0x000055c32a045816 in std::__invoke_result<svc_run::{lambda(int)#1}, int>::type std::__invoke<svc_run::{lambda(int)#1}, int>(std::__invoke_result&&, (svc_run::{lambda(int)#1}&&)...) ()
#7  0x000055c32a045785 in void std::thread::_Invoker<std::tuple<svc_run::{lambda(int)#1}, int> >::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) ()
#8  0x000055c32a045740 in std::thread::_Invoker<std::tuple<svc_run::{lambda(int)#1}, int> >::operator()() ()
#9  0x000055c32a045724 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<svc_run::{lambda(int)#1}, int> > >::_M_run() ()
#10 0x00007f19408a9793 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f195816d609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f19406cb133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb)

Moreover, I also checked the code at cpu-client-driver.c:466; the code responsible for this error is in cuModuleLoadData:

CUresult cuModuleLoadData(CUmodule* module, const void* image)
{
    int proc = 1026;
    cpu_time_start(totals, proc);
    enum clnt_stat retval;
    ptr_result result;
    mem_data mem;

    if (image == NULL) {
        LOGE(LOG_ERROR, "image is NULL!");
        return CUDA_ERROR_INVALID_IMAGE;
    }
    Elf64_Ehdr *ehdr = (Elf64_Ehdr*)image;

    if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
        ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
        ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
        ehdr->e_ident[EI_MAG3] != ELFMAG3) {
        LOGE(LOG_ERROR, "image is not an ELF!");
        return CUDA_ERROR_INVALID_IMAGE;
    }

    // TODO: [POS] how many bytes should we copy?
    // LOGE(LOG_WARNING, 
    //     "!!! e_shoff: %u, end of sh: %u, "
    //     "e_phoff: %u, end of ph: %u\n",
    //     ehdr->e_shoff,
    //     ehdr->e_shoff + ehdr->e_shnum * ehdr->e_shentsize,
    //     ehdr->e_phoff + ehdr->e_phnum * ehdr->e_phentsize
    // );
    // mem.mem_data_len = ehdr->e_shoff + ehdr->e_shnum * ehdr->e_shentsize;
    mem.mem_data_len = ehdr->e_phoff + ehdr->e_phnum * ehdr->e_phentsize;
    mem.mem_data_val = (uint8_t*)image;

    LOGE(LOG_DEBUG, "image_size = %#0zx", mem.mem_data_len);
    
    if (elf2_parameter_info(mem.mem_data_val, mem.mem_data_len) != 0) {
        LOGE(LOG_ERROR, "could not get kernel infos from memory");
        return CUDA_ERROR_INVALID_IMAGE;
    }

    retval = rpc_cumoduleloaddata_1(mem, &result, clnt);
    LOGE(LOG_DEBUG, "[rpc] %s(%p) = %d, result %p\n", __FUNCTION__, image, result.err, (void*)result.ptr_result_u.ptr);
    if (retval != RPC_SUCCESS) {
        fprintf(stderr, "[rpc] %s failed.", __FUNCTION__);
        return CUDA_ERROR_UNKNOWN;
    }
    if (module != NULL) {
       *module = (CUmodule)result.ptr_result_u.ptr;
    }
    cpu_time_end(totals, proc);
    return result.err;
}

So, can PhOS support GPU lambdas, and how can I fix this problem?


913887524gsd commented Dec 2, 2024

I suspect that image might refer to a PTX image rather than a CUBIN image. My previous work has focused on GPU virtualization, so this is just an assumption based on potential input format differences that this API might encounter.
Btw, it's strange to see this API receive a PTX image; I haven't encountered this situation before.
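
For what it's worth, here is a rough way to tell the two formats apart (just a heuristic sketch on my side, not anything taken from the PhOS/cricket code): a cubin is an ELF object and starts with the ELF magic, while PTX is plain NUL-terminated ASCII text that usually begins with "//" comments or a ".version" directive.

#include <ctype.h>
#include <elf.h>
#include <string.h>

/* Heuristic sketch (my assumption): classify the image passed to
 * cuModuleLoadData by looking at its first bytes. */
static const char *guess_image_kind(const void *image)
{
    const unsigned char *p = (const unsigned char *)image;

    /* cubin: an ELF object, starts with "\x7fELF" */
    if (memcmp(p, ELFMAG, SELFMAG) == 0)
        return "cubin (ELF)";

    /* PTX: printable text, typically "//" or ".version" at the start */
    if (isprint(p[0]) && isprint(p[1]) && isprint(p[2]) && isprint(p[3]))
        return "probably PTX text";

    return "unknown (maybe a fatbin container?)";
}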

@913887524gsd

Perhaps you could print the first few bytes in hex mode. I'm not entirely sure at the moment.
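
Something like this would do (a throw-away sketch; dumping byte by byte keeps the output in file order, with no little-endian word swapping):

#include <stdio.h>

/* Throw-away debug helper (sketch): dump the first n bytes of the image
 * one byte at a time, 16 bytes per line. */
static void dump_head(const void *image, size_t n)
{
    const unsigned char *p = (const unsigned char *)image;
    for (size_t i = 0; i < n; i++) {
        fprintf(stderr, "%02x ", p[i]);
        if ((i + 1) % 16 == 0)
            fprintf(stderr, "\n");
    }
    fprintf(stderr, "\n");
}

/* e.g. call dump_head(image, 256) right before the ELF magic check */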


182yzh commented Dec 2, 2024

I added the following code at cpu-client-driver.c:466, where the error happens:

    if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
        ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
        ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
        ehdr->e_ident[EI_MAG3] != ELFMAG3) {
        LOGE(LOG_ERROR, "image is not an ELF!");
        unsigned long* ptr = (unsigned long*)image;
        for (int i = 0; i < 32; i++) {
            fprintf(stderr, "%lx ", ptr[i]);
        }
        return CUDA_ERROR_INVALID_IMAGE;
    }

and I get the following output:

+00:00:11.642081 ERROR: image is not an ELF!    in cpu-client-driver.c:466
6547202f2f0a2f2f 206465746172656e 494449564e207962 43204d56564e2041 a72656c69706d6f 6f43202f2f0a2f2f 422072656c69706d 3a444920646c6975 323939322d4c4320 202f2f0a30333130 6d6f632061647543 6e6f6974616c6970 202c736c6f6f7420 20657361656c6572 3156202c332e3131 a3930312e332e31 6465736142202f2f 4d56564e206e6f20 2f0a312e302e3720 737265762e0a0a2f a332e37206e6f69 207465677261742e 612e0a36385f6d73 735f737365726464 a0a343620657a69 6f6c672e202f2f09 6375646572096c62 6f72705f6e6f6974 6c656e72656b5f64 6c61626f6c672e0a 206e67696c612e20 6e203233752e2034

Does this information provide what you need? Should I add anything more?


913887524gsd commented Dec 2, 2024

Here is the hex dump:

00000000   2F 2F 0A 2F  2F 20 47 65  6E 65 72 61  74 65 64 20  62 79 20 4E  56 49 44 49  //.// Generated by NVIDI
00000018   41 20 4E 56  56 4D 20 43  6F 6D 70 69  6C 65 72 0A  2F 2F 0A 2F  2F 20 43 6F  A NVVM Compiler.//.// Co
00000030   6D 70 69 6C  65 72 20 42  75 69 6C 64  20 49 44 3A  20 43 4C 2D  32 39 39 32  mpiler Build ID: CL-2992
00000048   30 31 33 30  0A 2F 2F 20  43 75 64 61  20 63 6F 6D  70 69 6C 61  74 69 6F 6E  0130.// Cuda compilation
00000060   20 74 6F 6F  6C 73 2C 20  72 65 6C 65  61 73 65 20  31 31 2E 33  2C 20 56 31   tools, release 11.3, V1
00000078   31 2E 33 2E  31 30 39 0A  2F 2F 20 42  61 73 65 64  20 6F 6E 20  4E 56 56 4D  1.3.109.// Based on NVVM
00000090   20 37 2E 30  2E 31 0A 2F  2F 0A 0A 2E  76 65 72 73  69 6F 6E 20  37 2E 33 0A   7.0.1.//...version 7.3.
000000A8   2E 74 61 72  67 65 74 20  73 6D 5F 38  36 0A 2E 61  64 64 72 65  73 73 5F 73  .target sm_86..address_s
000000C0   69 7A 65 20  36 34 0A 0A  09 2F 2F 20  2E 67 6C 6F  62 6C 09 72  65 64 75 63  ize 64...// .globl.reduc
000000D8   74 69 6F 6E  5F 70 72 6F  64 5F 6B 65  72 6E 65 6C  0A 2E 67 6C  6F 62 61 6C  tion_prod_kernel..global
000000F0   20 2E 61 6C  69 67 6E 20  34 20 2E 75  33 32 20 6E                             .align 4 .u32 n

Here is the content:

  1 //
  2 // Generated by NVIDIA NVVM Compiler
  3 //
  4 // Compiler Build ID: CL-29920130
  5 // Cuda compilation tools, release 11.3, V11.3.109
  6 // Based on NVVM 7.0.1
  7 //
  8
  9 .version 7.3
 10 .target sm_86
 11 .address_size 64
 12
 13         // .globl       reduction_prod_kernel
 14 .global .align 4 .u32 n

It's a PTX image, but it's strange to see the PyTorch framework load a PTX image directly; I don't know why...
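
For context, the CUDA driver API documents that cuModuleLoadData accepts either a cubin/fatbin image or a NUL-terminated PTX string, so the ELF-header arithmetic in cpu-client-driver.c can only be valid for the cubin case. A possible direction (just a sketch of my idea, untested, and elf2_parameter_info() would also need a PTX-aware path) is to fall back to strlen() when the image is not ELF:

#include <elf.h>
#include <string.h>

/* Sketch (my assumption, not a tested PhOS patch): size the RPC payload
 * for cuModuleLoadData according to what the image actually is. */
static size_t module_image_size(const void *image)
{
    const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *)image;

    if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0) {
        /* cubin (ELF): copy up to the end of the program headers,
         * as the existing code already does */
        return ehdr->e_phoff + ehdr->e_phnum * ehdr->e_phentsize;
    }

    /* otherwise assume NUL-terminated PTX text; include the terminator
     * so the server side can still parse it as a string */
    return strlen((const char *)image) + 1;
}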


182yzh commented Dec 2, 2024

Is this caused by JIT? I printed the client (Python) process call stack, and frame #12 is a kernel jitted_gpu_reduce_kernel, but even if I run with PYTORCH_JIT=0, the problem still exists. I also feel confused...

The following is the GDB call stack trace for the Python process:

0x00007f1d69269024 in ShmBuffer::FillIn() () from /lib/x86_64-linux-gnu/cricket-client.so
(gdb) bt
#0  0x00007f1d69269024 in ShmBuffer::FillIn() () from /lib/x86_64-linux-gnu/cricket-client.so
#1  0x00007f1d69268f33 in ShmBuffer::getBytes(char*, int) () from /lib/x86_64-linux-gnu/cricket-client.so
#2  0x00007f1d6926766b in XDRDevice::Getlong(long*) () from /lib/x86_64-linux-gnu/cricket-client.so
#3  0x00007f1d69267b02 in xdrdevice_getlong () from /lib/x86_64-linux-gnu/cricket-client.so
#4  0x00007f1d6891a459 in xdr_u_int () from /usr/lib/x86_64-linux-gnu/libtirpc.so.3
#5  0x00007f1d6891adc7 in xdr_string () from /usr/lib/x86_64-linux-gnu/libtirpc.so.3
#6  0x00007f1d691c2fcf in xdr_str_result () from /lib/x86_64-linux-gnu/cricket-client.so
#7  0x00007f1d6926aa87 in AsyncBatch::Call(unsigned int, int (*)(__rpc_xdr*, ...), void*, int (*)(__rpc_xdr*, ...), void*, timeval, int&, __rpc_xdr*, __rpc_xdr*, _detailed_info*, int, int, DeviceBuffer*, DeviceBuffer*) () from /lib/x86_64-linux-gnu/cricket-client.so
#8  0x00007f1d692686c9 in clnt_device_call(__rpc_client*, unsigned int, int (*)(__rpc_xdr*, ...), void*, int (*)(__rpc_xdr*, ...), void*, timeval) ()
   from /lib/x86_64-linux-gnu/cricket-client.so
#9  0x00007f1d691ce370 in rpc_cugeterrorstring_1 () from /lib/x86_64-linux-gnu/cricket-client.so
#10 0x00007f1d69200e8b in cuGetErrorString () from /lib/x86_64-linux-gnu/cricket-client.so
#11 0x00007f1cbd8c8453 in at::cuda::jit::jit_pwise_function(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#12 0x00007f1cbcb0d545 in void at::native::jitted_gpu_reduce_kernel<&at::native::prod_name, long, long, 4, double>(at::TensorIterator&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, double, at::native::AccumulationBuffer*, long) [clone .constprop.0] ()
   from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#13 0x00007f1cbcb1690f in at::native::prod_kernel_cuda(at::TensorIterator&) () from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#14 0x00007f1ce2bfa6a9 in at::native::impl_func_prod(at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>, at::Tensor const&)
    () from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#15 0x00007f1ce2bfa772 in at::native::structured_prod_out::impl(at::Tensor const&, long, bool, c10::optional<c10::ScalarType>, at::Tensor const&) ()
   from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#16 0x00007f1cbd55e7a4 in at::(anonymous namespace)::wrapper_prod_dim_int(at::Tensor const&, long, bool, c10::optional<c10::ScalarType>) ()
   from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007f1cbd55e872 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, long, bool, c10::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_prod_dim_int>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, long, bool, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, long, bool, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, long, bool, c10::optional<c10::ScalarType>) ()
   from /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
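
From this trace, frame #11 (at::cuda::jit::jit_pwise_function) seems to be PyTorch's NVRTC-based "jiterator", which compiles the reduction kernel at runtime and, as far as I can tell, hands the resulting PTX text straight to cuModuleLoadData; if that is right it would explain why the intercepted API sees PTX instead of an ELF cubin, and also why PYTORCH_JIT=0 does not help (my understanding is that it only disables TorchScript, not this ATen-level path). A minimal sketch of that general flow, with a made-up dummy kernel rather than PyTorch's real source, looks like this:

#include <cuda.h>
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the runtime-compilation flow (error checking omitted):
 * compile CUDA C++ with NVRTC, fetch the PTX text, and load it with
 * cuModuleLoadData -- the same driver call PhOS intercepts. */
int main(void)
{
    const char *src =
        "extern \"C\" __global__ void dummy(float *x) { x[0] = 1.0f; }\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "dummy.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    char *ptx = malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);                 /* NUL-terminated PTX text */
    nvrtcDestroyProgram(&prog);

    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx);            /* PTX, not an ELF cubin */

    printf("loaded module %p from %zu bytes of PTX\n", (void *)mod, ptx_size);

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(ptx);
    return 0;
}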
