The matrix library has basic GPU capabilities: it can use cuBLAS and cuSOLVER in place of CBLAS and LAPACKE when a GPU is available. However, I don't think these libraries provide enough on their own to obtain maximum performance from a GPU. A few operations still haven't been expressed in terms of BLAS / LAPACKE calls, so unless I can figure out a way to do so, the matrix library has to keep host and device memory in sync at all times, which hinders performance. I think there are a few things that can be done to maximize performance:
Implement custom CUDA kernels for operations that cannot be done entirely with BLAS / LAPACKE. That way, all internal memory transfers can be removed from the matrix library and delegated to the user, who can trigger them only when data actually needs to be printed, saved, etc.
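As a rough sketch of the first point, consider an element-wise operation that has no BLAS routine, such as squaring every entry of a matrix. Without a custom kernel, the library would have to copy the matrix to the host, compute there, and copy back; a kernel keeps the whole computation on the device. All names here are hypothetical, not part of the library:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: y[i] = x[i] * x[i].
// There is no BLAS routine for this, so without a custom kernel the
// data would have to round-trip through host memory.
__global__ void square_elements(const double *x, double *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] * x[i];
}

int main(void) {
    const int n = 1024;
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    // ... d_x would be filled by a prior cuBLAS/cuSOLVER result ...

    // One thread per element; result stays in device memory, so no
    // host/device synchronization is needed until the user asks for it.
    square_elements<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```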
Use CUDA streams for cuBLAS, cuSOLVER, and all memory transfers and custom kernels. Especially once all intermediate operations can be done on the GPU, streams should let transfers overlap with computation and increase GPU throughput.
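The stream idea could look roughly like the following: a cuBLAS handle is bound to a stream with `cublasSetStream`, so GEMM calls, custom kernels, and asynchronous copies are queued in order on that stream, and the host only blocks when it actually needs the result. This is a minimal sketch under those assumptions, not the library's implementation:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 512;
    const double alpha = 1.0, beta = 0.0;
    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * n * sizeof(double));
    cudaMalloc(&d_b, n * n * sizeof(double));
    cudaMalloc(&d_c, n * n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, stream);  // cuBLAS work now enqueues on `stream`

    // C = A * B, launched asynchronously on the stream.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_a, n, d_b, n, &beta, d_c, n);

    // The copy is ordered after the GEMM because both are on the same
    // stream; pinned host memory is required for a truly async copy.
    double *h_c;
    cudaMallocHost(&h_c, n * n * sizeof(double));
    cudaMemcpyAsync(h_c, d_c, n * n * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // block only when the host needs the data

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

With several independent operations, each could go on its own stream so transfers for one matrix overlap with compute on another.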