The matrix library has basic GPU capabilities: it can use cuBLAS and cuSOLVER in place of CBLAS and LAPACKE when a GPU is available. However, I don't think these libraries provide enough on their own to obtain maximum performance from a GPU. A few operations still haven't been expressed in terms of BLAS / LAPACKE calls, so unless I can figure out a way to do so, the matrix library has to keep host and device memory in sync at all times, which hinders performance. I think there are a few things that can be done to maximize performance:
Implement custom CUDA kernels for operations that cannot be done entirely with BLAS / LAPACKE. That way, all internal memory transfers can be removed from the matrix library and delegated to the user, who can trigger them only when data actually needs to be printed, saved, etc.
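As a rough sketch of the first point, consider an element-wise operation that has no BLAS routine, such as squaring every entry of a matrix. Without a custom kernel, the library would have to copy the matrix to the host, compute there, and copy back; a kernel keeps the whole computation on the device. All names here are hypothetical, not part of the library:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel: y[i] = x[i] * x[i].
// There is no BLAS routine for this, so without a custom kernel the
// data would have to round-trip through host memory.
__global__ void square_elements(const double *x, double *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] * x[i];
}

int main(void) {
    const int n = 1024;
    double *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    // ... d_x would be filled by a prior cuBLAS/cuSOLVER result ...

    // One thread per element; result stays in device memory, so no
    // host/device synchronization is needed until the user asks for it.
    square_elements<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```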
Use CUDA streams for cuBLAS, cuSOLVER, and all memory transfers and custom kernels. Especially once all intermediate operations can be done on the GPU, streams should let transfers overlap with computation and increase GPU throughput.
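The stream idea could look roughly like the following: a cuBLAS handle is bound to a stream with `cublasSetStream`, so GEMM calls, custom kernels, and asynchronous copies are queued in order on that stream, and the host only blocks when it actually needs the result. This is a minimal sketch under those assumptions, not the library's implementation:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 512;
    const double alpha = 1.0, beta = 0.0;
    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * n * sizeof(double));
    cudaMalloc(&d_b, n * n * sizeof(double));
    cudaMalloc(&d_c, n * n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, stream);  // cuBLAS work now enqueues on `stream`

    // C = A * B, launched asynchronously on the stream.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_a, n, d_b, n, &beta, d_c, n);

    // The copy is ordered after the GEMM because both are on the same
    // stream; pinned host memory is required for a truly async copy.
    double *h_c;
    cudaMallocHost(&h_c, n * n * sizeof(double));
    cudaMemcpyAsync(h_c, d_c, n * n * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // block only when the host needs the data

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

With several independent operations, each could go on its own stream so transfers for one matrix overlap with compute on another.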