Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: can't hipMemPoolExportPointer signal memory? #104

Open
Epliz opened this issue Nov 7, 2024 · 9 comments
Open

[Issue]: can't hipMemPoolExportPointer signal memory? #104

Epliz opened this issue Nov 7, 2024 · 9 comments

Comments

@Epliz
Copy link

Epliz commented Nov 7, 2024

Problem Description

Hi,
I have allocated 8 bytes of signal memory with hipExtMallocWithFlags and am trying to share it with another process (as it seems like the stream wait value APIs would be perfect for cheaper sync between processes than events). But hipMemPoolExportPointer returns an error.

Is there any allocation size at which it would work?

Would it be possible for you to help me to get it working or if there is a limitation at the moment, to make it work?

Best regards,
Epliz

Operating System

Ubuntu 24.04

CPU

Intel xeon XE9680

GPU

MI300x x8

ROCm Version

ROCm 6.2.2

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@darren-amd
Copy link

Hi @Epliz,

From my understanding, it seems that you are trying to use hipExtMallocWithFlags with hipMemPoolExportPointer, however, this would result in an error, as when you call hipExtMallocWithFlags, this would not be allocated in a memory pool and therefore trying to export a pointer to the data with hipExtMallocWithFlags would error out. Please see the documentation for memory pools here: Memory Pool Documentation. Specifcally, please see the function hipMallocFromPoolAsync which you can use to allocate memory in a memory pool. You can find an example of exporting and importing pointers from a memory pool here: Memory Pool Example.

Please give that a read and let me know if you have any questions, thanks!

@Epliz
Copy link
Author

Epliz commented Nov 13, 2024

Hi @darren-amd ,

Thank you for your answer!
I looked into what you pointed at but:

  1. I don't see how to pass the signal memory flags anywhere when creating a pool, or when doing an allocation from it
  2. the memory return by hipExtMallocWithFlags, when allocating signal memory, seems to be CPU memory (by looking at the memory type attribute obtained using hipPointerGetAttributes). I don't think mem pools support CPU memory at all?

It would be great to support both points:

  1. supporting CPU memory pools
  2. adding the ability to pass flags when creating a memory pool so that it can be a pool for signal memory

@darren-amd
Copy link

darren-amd commented Dec 2, 2024

Hi @Epliz,

Good questions, firstly, I don't see a particular way to pass in a signal flag to memory allocated in a pool. For the second question, hipExtMallocWithFlags should allocate on an accelerator and should be in GPU memory rather than CPU.

However, I believe hipIpcOpenMemHandle (found here) would be more appropriate for your use case. It will allow you to open an interprocess memory handle exported from another process and returns a device pointer usable in the local process, allowing you to share pointers between different processes. Please give that a try and let me know, as it should support signal memory.

@Epliz
Copy link
Author

Epliz commented Dec 17, 2024

Hi @darren-amd ,
Could you please re-ooen te ticket? I still plan on trying what you suggested when I have time (probably during the end of year break).
I have some doubts that hipIpcOpenMemHandle will work as the allocated memory for signals seems to be in CPU mem.
Sorry for taking time to reply.

@darren-amd
Copy link

Hi @Epliz,

Of course, let me know if you have any questions!

@darren-amd darren-amd reopened this Dec 17, 2024
@Epliz
Copy link
Author

Epliz commented Dec 19, 2024

Hi @darren-amd ,

Here is a small program I tried:

#include <iostream>

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

int main(int argc, char** argv) {

  int device_count;

  if (hipGetDeviceCount(&device_count) != hipSuccess) {
    std::cout<<"Failed to get the numnber of GPUs"<<std::endl;
    return -1;
  }

  std::cout<<"Devices "<<device_count<<std::endl;

  hipError_t error;

  int supported;
  if (hipDeviceGetAttribute(&supported, hipDeviceAttributeCanUseStreamWaitValue, 0) != hipSuccess) {
    std::cout<<"Error getting the the property"<<std::endl;
    return -2;
  }

  std::cout<<"wait stream value: "<<supported<<std::endl;

  int* signal = nullptr;
  // apparently needs to allocate 8 bytes
  if ((error = hipExtMallocWithFlags((void**) &signal, 8, hipMallocSignalMemory)) != hipSuccess) {
    std::cout<<"Failed allocating signal memory"<<std::endl;
    return -3;
  }
  std::cout<<"signal ptr: "<<signal<<std::endl;

  hipPointerAttribute_t attributes;
  if (hipPointerGetAttributes (&attributes, signal) != hipSuccess) {
    std::cout<<"Failed getting attributes"<<std::endl;
    return -4;
  }

  std::cout<<"Memory type "<<attributes.type<<std::endl;

  hipIpcMemHandle_t mem_handle;
  if (hipIpcGetMemHandle (&mem_handle, signal) != hipSuccess) {
    std::cout<<"Failed getting mem handle"<<std::endl;
    return -5;
  }

  std::cout<<"got mem handle"<<std::endl;

  return 0;
}

Let me know if you see anything wrong, but for me, it fails when getting the memory handle:

Devices 2
wait stream value: 1
signal ptr: 0x755512bffd08
Memory type 1
Failed getting mem handle

And as I said before, the memory type seems to be CPU.

Best,
Epliz

@darren-amd
Copy link

Hi @Epliz,

It does seem that using hipExtMallocWithFlags with hipMallocSignalMemory is allocating on the host rather than device.

Is there a particular reason you need hipMallocSignalMemory? Allocating with this flag is only necessary in specific use cases (such as Stream Memory Operations), so if it isn't needed you could use hipMalloc instead.

@Epliz
Copy link
Author

Epliz commented Jan 7, 2025

Hi @darren-amd ,

I precisely want to use stream memory operations with that signal memory to implement multi-gpu operations like all-reduce.

I have determined that in single process micro-benchmarks that is the best approach, compared to using events. (The stream memory operations APIs could probably be enhanced further though, to control cache flushing during write ops - I might open a feature request for that).

I have also micro-benchmarked the single-machine multi process case for events, and saw there that interprocess events are particularly bad due to the fact that they use stream callbacks internally (and therefore preventing the cpu from queueing commands ahead on the gpu). I will probably open a ticket about that suboptimal implementation of interprocess events at some point, as it can for sure be made much better.

Best,
Epliz

@darren-amd
Copy link

Hi @Epliz,

I had a chat with our internal team and the current recommendation is to not use stream memory operations in favor of using HIP events instead. The HIP stream memory operations are still in beta and can be prone to errors/changes that may make code unstable. Additionally, stream waits may negatively affect kernel performance if they are executing at the same time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants