Hi Zach! Thanks for stopping by. My reading of the DLPack spec found that it had little to say on the topic of synchronization, so I opted to be conservative and over-synchronize via the host thread, which is certainly safe, if pessimistic. If we (the collective users of the DLPack spec) can agree on a better synchronization protocol, it's probably not a big deal to implement it. There's a similar discussion happening elsewhere around the same question. I agree that my natural inclination would also be to include a cudaEvent. I'm open to suggestions!
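For illustration, here is a minimal sketch in PyTorch terms of the difference between the two approaches (the stream and event objects here are just for the example, not any existing DLPack protocol):

```python
import torch

# Some asynchronous GPU work enqueued on the current stream.
t = torch.randn(1024, device="cuda")
t = t * 2 + 1

# Conservative approach: block the host thread until the data is
# ready. Safe, but it is a CPU synchronization point.
torch.cuda.synchronize()

# Stream-ordered alternative: record an event after the producing
# work and make only the consumer's stream wait on it; the CPU is
# free to keep enqueueing work.
ready = torch.cuda.Event()
ready.record(torch.cuda.current_stream())
consumer = torch.cuda.Stream()
consumer.wait_event(ready)
with torch.cuda.stream(consumer):
    out = t.sum()  # ordered after the producer's work, GPU-side only
```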
---
I am prototyping with jax alongside pytorch, using xla to accelerate some fusible kernels inside a larger program. So far, I have code that looks like:
https://gist.github.com/zdevito/dff820e2053b29b1f688ad8db1da5f35
This enables jax (really just the XLA APIs) to operate on pytorch data on the GPU and produce new values without needing any copies.
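For readers without access to the gist, the interop is roughly of this shape (a minimal sketch, not the gist's actual code; exact DLPack APIs vary across jax versions):

```python
import jax
import jax.dlpack
import jax.numpy as jnp
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

# A PyTorch tensor living on the GPU.
t = torch.arange(8, dtype=torch.float32, device="cuda")

# Zero-copy handoff of the buffer to JAX via a DLPack capsule.
x = jax.dlpack.from_dlpack(to_dlpack(t))

# Run an XLA-compiled kernel on it.
y = jax.jit(lambda v: jnp.sin(v) * 2.0)(x)

# Zero-copy handoff of the result back to PyTorch.
t_out = from_dlpack(jax.dlpack.to_dlpack(y))
```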
However, when `buffer_to_dlpack_managed_tensor` is called, it effectively calls `block_until_ready`, waiting on the CUDA stream to ensure the data is ready. When using jax/xla to generate kernels inside a larger program, this isn't great for performance because it inserts CPU synchronization points into a program that really only needs ordering of kernels on the GPU. A similar scenario exists for translating into xla, where technically we have to sync the PyTorch work before calling the XLA kernels.

I was looking to get your thoughts on ways around this. One way would be for the dlpack-style functions to take or return cudaEvent objects that indicate when the data is ready. Naturally this only works for CUDA, and there might need to be a similar mechanism for CPUs.
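To make the cudaEvent idea concrete, here is a hedged sketch of what such a protocol could look like on the PyTorch side; `export_with_event` and `import_with_event` are hypothetical helpers for illustration, not an existing API:

```python
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

def export_with_event(t: torch.Tensor):
    """Hypothetical producer side: return a DLPack capsule plus a CUDA
    event recorded on the producing stream, instead of blocking the host."""
    ready = torch.cuda.Event()
    ready.record(torch.cuda.current_stream())  # fires once t's data is ready
    return to_dlpack(t), ready

def import_with_event(capsule, ready: torch.cuda.Event,
                      consumer: torch.cuda.Stream) -> torch.Tensor:
    """Hypothetical consumer side: order the consumer's stream after the
    producer's event. Pure GPU-side ordering; no CPU synchronization."""
    consumer.wait_event(ready)
    return from_dlpack(capsule)
```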