
Can GPU memory be restored from same address? #53

Open
lianghao208 opened this issue Nov 28, 2024 · 3 comments


lianghao208 commented Nov 28, 2024

I checked the GPU memory restore code: https://github.com/RWTH-ACS/cricket/blob/master/cpu/cr.c#L303

It restores the GPU memory by calling cudaMalloc to re-allocate the device memory.

I think the newly created device memory address is not the same as the original device memory address (the memory checkpointed on the original machine).

But on the CPU/host side, the process still holds the original device address pointer after the restore. How does the restored process use the newly created device memory when it accesses it through the original device address pointer?

n-eiling (Member) commented Dec 2, 2024

When restoring resources such as memory addresses, the resources in the checkpoint file are mapped onto the newly created ones, so we essentially replace the memory addresses with the new ones. In my experiments, calling cudaMalloc with the same parameters in the same order as during the original run also leads to CUDA returning the same memory addresses. However, Cricket does not rely on this, because we generally cannot know whether anything else is running on the same GPU.
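
For illustration, here is a minimal sketch of what such an old → new address mapping could look like at restore time. The names (`mem_map_entry`, `restore_allocations`, `cr_translate_ptr`) are hypothetical and not Cricket's actual code in cpu/cr.c:

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* Hypothetical mapping entry: one per allocation recorded in the checkpoint. */
struct mem_map_entry {
    void  *old_ptr;   /* device pointer recorded at checkpoint time */
    void  *new_ptr;   /* device pointer returned by cudaMalloc at restore time */
    size_t size;
};

/* Re-allocate every checkpointed region; a real restore path would also
 * copy the saved contents back into the new allocations. */
static int restore_allocations(struct mem_map_entry *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cudaMalloc(&map[i].new_ptr, map[i].size) != cudaSuccess)
            return -1;
        /* e.g. cudaMemcpy(map[i].new_ptr, saved_contents[i], map[i].size,
         *                 cudaMemcpyHostToDevice); */
    }
    return 0;
}

/* Translate a checkpointed device pointer into the corresponding pointer in
 * the newly allocated memory (also handles pointers into the middle of a region). */
static void *cr_translate_ptr(const struct mem_map_entry *map, size_t n, void *old)
{
    for (size_t i = 0; i < n; i++) {
        char *base = (char *)map[i].old_ptr;
        if ((char *)old >= base && (char *)old < base + map[i].size)
            return (char *)map[i].new_ptr + ((char *)old - base);
    }
    return old; /* not a checkpointed device pointer; leave unchanged */
}
```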

lianghao208 (Author) commented Dec 4, 2024

> So we essentially replace the memory addresses with the new ones

Thanks for the reply.
Does the whole procedure look like this?

  1. cudaMalloc creates new memory and returns new memory addresses.
  2. When the restored process tries to use the memory, Cricket intercepts the memory access and maps it to the newly created address.

If so, I wonder how Cricket intercepts the memory access and maps it to the newly created address, since there is no CUDA API that explicitly exposes which memory addresses will be accessed (e.g. cuLaunchKernel).
@n-eiling

n-eiling (Member) commented Dec 5, 2024

There are different kinds of memory addresses this is relevant for.
For CUDA resources such as cudaStream, cublasHandle, etc., we map the addresses to the new ones inside the API wrappers. These handles cannot sensibly be used outside of the CUDA APIs, so it is not a problem that the pointer values do not point to actual memory.
During kernel execution, the kernel gets the memory address of data either via a kernel parameter or via a global variable.
We can directly influence both and replace the memory addresses in the parameter or the global variable before launching the kernel.
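
As a rough sketch of the kernel-parameter case: a wrapper around cuLaunchKernel could rewrite every argument known to be a device pointer before forwarding the launch. Here `wrapped_cuLaunchKernel`, `param_count`, `param_is_device_ptr`, `mem_map`, `mem_map_len`, and `cr_translate_ptr` are hypothetical helpers (the mapping table is the same idea as in the earlier sketch), not Cricket's real API:

```c
#include <stddef.h>
#include <cuda.h>

struct mem_map_entry;                        /* old -> new mapping, see earlier sketch */
extern struct mem_map_entry *mem_map;        /* hypothetical global mapping table */
extern size_t mem_map_len;
void  *cr_translate_ptr(const struct mem_map_entry *map, size_t n, void *old);
size_t param_count(CUfunction f);                   /* hypothetical: number of kernel parameters */
int    param_is_device_ptr(CUfunction f, size_t i); /* hypothetical: parameter type info */

/* Wrapper around cuLaunchKernel: rewrite every kernel parameter that is known
 * to be a device pointer so it points into the newly allocated memory. */
CUresult wrapped_cuLaunchKernel(CUfunction f,
                                unsigned gx, unsigned gy, unsigned gz,
                                unsigned bx, unsigned by, unsigned bz,
                                unsigned sharedMemBytes, CUstream stream,
                                void **kernelParams, void **extra)
{
    size_t n = param_count(f);
    for (size_t i = 0; kernelParams != NULL && i < n; i++) {
        if (param_is_device_ptr(f, i)) {
            /* kernelParams[i] points at the argument value in host memory,
             * so the device pointer stored there can be rewritten in place. */
            void **arg = (void **)kernelParams[i];
            *arg = cr_translate_ptr(mem_map, mem_map_len, *arg);
        }
    }
    return cuLaunchKernel(f, gx, gy, gz, bx, by, bz,
                          sharedMemBytes, stream, kernelParams, extra);
}
```

Device pointers stored in global (`__device__`) variables could be patched in a similar way, e.g. by locating the symbol with cuModuleGetGlobal and writing the translated pointer back before the launch.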
