Optimize GPU code #3

Open

bentsherman opened this issue Apr 22, 2018 · 0 comments

The matrix library has basic GPU support: it can use CUBLAS and CUSOLVER in place of CBLAS and LAPACKE when a GPU is available. However, I don't think these libraries alone are enough to get maximum performance from a GPU. A few operations still haven't been implemented with BLAS / LAPACKE, so unless I can find a way to do so, the matrix library has to keep host and device memory in sync at all times, which hinders performance. I think there are a few things that can be done to maximize performance:

  • Implement custom CUDA kernels for operations that cannot be done entirely with BLAS / LAPACKE. That way, all internal memory transfers can be removed from the matrix library and delegated to the user, based on when data needs to be printed, saved, etc. (see the kernel sketch after this list).
  • Use CUDA streams for CUBLAS, CUSOLVER, and all memory transfers and custom kernels. Especially once all intermediate operations can be done on the GPU, streams should improve throughput by overlapping transfers with compute (see the stream sketch after this list).
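
As a rough illustration of the first point, here is a minimal sketch of a custom elementwise kernel that operates directly on a matrix's device buffer, so the host copy never has to be touched. The names (`matrix_elem_log`, `d_data`) are hypothetical and not part of the matrix library's actual API.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Elementwise natural log, applied in place to a device buffer.
__global__ void elem_log_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = logf(x[i]);
    }
}

// Apply elementwise log to a matrix stored column-major on the device.
// No host <-> device transfer is needed; syncing is left to the caller.
void matrix_elem_log(float *d_data, int rows, int cols)
{
    int n = rows * cols;
    int block = 256;
    int grid = (n + block - 1) / block;

    elem_log_kernel<<<grid, block>>>(d_data, n);
}
```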
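For the second point, here is a minimal sketch of running a CUBLAS call and the result transfer on a CUDA stream, which avoids the default stream's implicit synchronization and lets copies overlap with work queued on other streams. The surrounding setup (the handle, device buffers `dA`/`dB`/`dC`, and a pinned host buffer `hC`) is assumed for illustration, not taken from the matrix library.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

void sgemm_on_stream(cublasHandle_t handle,
                     const float *dA, const float *dB, float *dC,
                     float *hC, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All CUBLAS calls made with this handle now run on the stream.
    cublasSetStream(handle, stream);

    const float alpha = 1.0f, beta = 0.0f;

    // C = A * B, queued on the stream (asynchronous with respect to the host).
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, dA, n, dB, n,
                &beta, dC, n);

    // Copy the result back on the same stream.
    cudaMemcpyAsync(hC, dC, (size_t)n * n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Block only when the result is actually needed on the host.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```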