
WIP: PythonBackend/cl_run_kernel rewrite #28

Closed
simonius opened this issue Sep 28, 2018 · 3 comments

Comments

@simonius

Hi!
I started to write a PyOpenCL-based backend and realized that all arrays are created in host memory. Afterwards they are copied to the OpenCL device, used by the kernel, and copied back by cl_run_kernel.
I would like to write functions that just create references to the arrays in OpenCL device memory and pass those around, because transfers to and from GPUs can be slow. What do you think?
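Roughly what I have in mind is sketched below (placeholder names and a toy kernel, not code from the current backend): the host array is uploaded once, the returned `cl.Buffer` handle is passed around between kernel launches, and data is only copied back when explicitly requested.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

def to_device(host_array):
    # Upload once; the returned Buffer is just a handle to device memory.
    return cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=host_array)

def from_device(dev_buf, like):
    # Download only when the host actually needs the data.
    host_array = np.empty_like(like)
    cl.enqueue_copy(queue, host_array, dev_buf)
    return host_array

prg = cl.Program(ctx, """
__kernel void scale(__global float *u, const float factor) {
    int i = get_global_id(0);
    u[i] *= factor;
}
""").build()

u_host = np.linspace(0.0, 1.0, 1024).astype(np.float32)
u_dev = to_device(u_host)  # single host -> device transfer

# The handle can be passed through many kernel launches without copies.
prg.scale(queue, u_host.shape, None, u_dev, np.float32(2.0))
prg.scale(queue, u_host.shape, None, u_dev, np.float32(0.5))

result = from_device(u_dev, u_host)  # single device -> host transfer
```

The point is that `u_dev` can be handed from one function to the next without ever touching host memory until a copy is explicitly requested.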

@ranocha
Member

ranocha commented Sep 29, 2018

This is the same approach I have been using for Julia (unpublished alpha version, cf. #13 and some references linked therein). I don't know in detail how PyOpenCL handles the lifetime of buffers on the GPU. Relying strictly on the OpenCL standard, there might be problems if a kernel is called that does not take a specific buffer as an argument. That's why we are using kernels with partly irrelevant arguments. Thus, doing the same in PyOpenCL should be fine. @philipheinisch can tell you more about that.
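If it helps, my understanding of the "partly irrelevant arguments" trick looks roughly like this in PyOpenCL terms (a sketch only; the kernel and names are made up, not the ones actually used here): the buffer `aux_dev` is listed as a kernel argument even though the kernel never reads it, so the runtime treats it as in use and keeps it valid on the device.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

u = np.ones(256, dtype=np.float32)
u_dev = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u)
aux_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=u)

prg = cl.Program(ctx, """
/* `aux` is deliberately unused: passing the buffer as an argument signals
   to the runtime that it is still in use and must stay valid on the device. */
__kernel void advance(__global float *u, __global const float *aux) {
    int i = get_global_id(0);
    u[i] = 2.0f * u[i];
}
""").build()

prg.advance(queue, u.shape, None, u_dev, aux_dev)
```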

@philipheinisch
Member

In most cases the data is left on the device because multiple kernels are queued back to back without copies. If you want your data to live exclusively on the device, you need to make sure enough memory is available on the device at all times. Due to the way OpenCL handles memory this is not trivial, as @ranocha explained. Additional care has to be taken with divergence cleaning to prevent memory problems if the device runs out of memory or the memory manager thinks it might. It might be possible to avoid some of the host-device copies that are still performed, but their influence on the total runtime is so small that it is not worth risking hard-to-debug memory problems, especially since only the kernel time is relevant if you want to benchmark the performance of the numerical method.

To summarize: it is a good idea to avoid as many copy operations as possible, but purely device-only buffers are tricky.
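For instance, a rough way to check the memory limits up front (just a sketch; the size estimate is made up):

```python
import pyopencl as cl

ctx = cl.create_some_context()
dev = ctx.devices[0]

# Upper bounds reported by the device; the runtime may still migrate or
# evict buffers, so these checks are necessary but not sufficient.
print("global memory:   ", dev.global_mem_size // 2**20, "MiB")
print("max single alloc:", dev.max_mem_alloc_size // 2**20, "MiB")

# Hypothetical footprint: five double-precision arrays of 1024^2 entries
# that should stay resident on the device.
bytes_needed = 5 * 1024**2 * 8
if bytes_needed > dev.global_mem_size:
    raise MemoryError("solution arrays will not fit in device memory")
```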

@simonius
Author

Thanks. The documentation of PyOpenCL is inconclusive on this topic. I wrote both versions.
