WIP: PythonBackend/cl_run_kernel rewrite #28
Hi!
I started to write a PyOpenCL-based backend and realized that all arrays are created in host memory. Afterwards they are copied to the OpenCL device, used by the kernel, and copied back by cl_run_kernel.
I would like to write functions that just create references to the arrays in OpenCL device memory and pass them around, because transfers to and from GPUs can be slow. What do you think?
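For reference, a minimal sketch of the proposed pattern, assuming pyopencl; the kernel source, names, and sizes here are invented for illustration and this is not the actual cl_run_kernel code. The contrast is between copying data in and out around a single kernel call and keeping one device buffer handle that is reused across launches.

```python
import numpy as np
import pyopencl as cl

# Hypothetical kernel just for illustration.
src = """
__kernel void scale(__global float *u, const float a) {
    int i = get_global_id(0);
    u[i] *= a;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()
mf = cl.mem_flags

u_host = np.arange(1024, dtype=np.float32)

# Pattern 1: copy to the device, run one kernel, copy back immediately.
u_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u_host)
prg.scale(queue, u_host.shape, None, u_buf, np.float32(2.0))
cl.enqueue_copy(queue, u_host, u_buf)

# Pattern 2: create the buffer once and pass the handle around;
# the data stays on the device between kernel launches.
u_dev = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u_host)
prg.scale(queue, u_host.shape, None, u_dev, np.float32(2.0))
prg.scale(queue, u_host.shape, None, u_dev, np.float32(0.5))
cl.enqueue_copy(queue, u_host, u_dev)  # copy back only when the host needs the result
queue.finish()
```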
This is the same approach I have been using for Julia (unpublished alpha version, cf. #13 and some references linked therein). I don't know exactly how pyopencl handles the lifetime of buffers on the GPU. Relying strictly on the OpenCL standard, there might be problems if a kernel is called that does not take a specific buffer as an argument; that's why we are using kernels with partly irrelevant arguments. Thus, doing the same in pyopencl should be fine. @philipheinisch could tell you more about that.
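A rough sketch of my reading of the "partly irrelevant arguments" idea, assuming pyopencl; the kernel and buffer names are invented and this is not the actual Julia or Python code. The kernel accepts a buffer it never touches, so that buffer is still attached to every launch it should survive.

```python
import numpy as np
import pyopencl as cl

# `aux` is a deliberately unused ("partly irrelevant") kernel argument:
# passing it keeps the buffer associated with the launch, which is the
# mechanism described in the comment above for keeping device data valid.
src = """
__kernel void advance(__global float *u, __global const float *aux) {
    int i = get_global_id(0);
    u[i] += 1.0f;   /* aux is intentionally not read or written */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()
mf = cl.mem_flags

u = np.zeros(256, dtype=np.float32)
aux = np.ones(256, dtype=np.float32)
u_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u)
aux_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=aux)

prg.advance(queue, u.shape, None, u_buf, aux_buf)  # aux_buf passed although unused
cl.enqueue_copy(queue, u, u_buf)
queue.finish()
```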
In most cases the data is left on the device because multiple kernels are queued back to back without copies. If you want your data to live exclusively on the device, you need to make sure enough memory is available on the device at all times. Due to the way OpenCL handles memory, this is not trivial, as @ranocha explained. Additional care has to be taken with divergence cleaning to prevent memory problems if the device runs out of memory or the memory manager thinks it might. It might be possible to circumvent some of the host-device copies that are still performed, but their influence on the total runtime is so small that it is not worth risking memory problems that may be hard to debug, especially as only the kernel time matters when benchmarking the performance of the numerical method. To summarize: it is a good idea to avoid as many copy instructions as possible, but purely device-resident buffers are tricky.
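To make the "queued back to back without copies" point concrete, a small sketch under assumptions of my own (the two-stage kernel split and all names are invented): only the final result is copied back to the host, while enough device memory has to be available for every buffer involved, including the temporary one.

```python
import numpy as np
import pyopencl as cl

# Two hypothetical stages of a time step; the intermediate result stays in
# device memory and is never copied back to the host between the launches.
src = """
__kernel void stage1(__global const float *u, __global float *tmp) {
    int i = get_global_id(0);
    tmp[i] = 0.5f * u[i];
}
__kernel void stage2(__global float *u, __global const float *tmp) {
    int i = get_global_id(0);
    u[i] += tmp[i];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)            # default in-order queue: launches run back to back
prg = cl.Program(ctx, src).build()
mf = cl.mem_flags

u = np.arange(512, dtype=np.float32)
u_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u)
tmp_buf = cl.Buffer(ctx, mf.READ_WRITE, size=u.nbytes)  # must fit in device memory as well

for _ in range(10):                     # e.g. ten time steps, no host copies inside the loop
    prg.stage1(queue, u.shape, None, u_buf, tmp_buf)
    prg.stage2(queue, u.shape, None, u_buf, tmp_buf)

cl.enqueue_copy(queue, u, u_buf)        # single copy back at the very end
queue.finish()
```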
Thanks. The documentation of PyOpenCL is inconclusive on this topic. I wrote both versions.