Use MKL for verification #63
Conversation
* Migrate cute sgemm_nt_1 to sycl
* Revert "Migrate cute sgemm_nt_1 to sycl"
* Revert "Revert "Migrate cute sgemm_nt_1 to sycl""
* Add cutlass example
* Update examples/14_ampere_tf32_tensorop_gemm/CMakeLists.txt
* Update examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm_cute.cpp
* Update examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm_cute.cu
* Update examples/cute/tutorial/CMakeLists.txt
* Update include/cute/arch/copy_sm80.hpp
* Update include/cute/config.hpp
* Update include/cute/util/debug.hpp
* Update include/cutlass/detail/helper_macros.hpp
* Address comments
* Remove sgemm_nt_1 PoC
* Fix build issues
* Fix code style format
* Remove ENABLE_NVPTX flag
* Cosmetic
* Updating README-sycl.md to capture the 3.5 modifications (codeplaysoftware#16)
* Remove the sgemm_nt_1_sycl PoC (codeplaysoftware#15)
* Applying the comments
* Adding CuTe example in SYCL
* Migrate Cute components to SYCL
* Add cmake configuration
* Update README-sycl.md: fix CUDA version
* Fix typo in macro
* Revert "Updating README-sycl.md to capture the 3.5 modifications (codeplaysoftware#16)" (codeplaysoftware#17); this reverts commit a726bd3

Co-authored-by: Mehdi Goli <[email protected]>
Co-authored-by: aacostadiaz <[email protected]>
Co-authored-by: Alejandro Acosta <[email protected]>
Add XE MMA/copy atom
* Removes the CUDA toolkit dependency when SYCL is enabled. (Co-authored-by: Mehdi Goli <[email protected]>, Atharva Dubey <[email protected]>)
* Fix some issues when trying to build and run the 14_ampere example for SYCL. (Co-authored-by: Muhammad Tanvir <[email protected]>)
* Enables CuTe tests via the SYCL path. (Co-authored-by: Mehdi Goli <[email protected]>, aacostadiaz <[email protected]>)
Add an Intel PVC pipeline to compute GEMM.
Add example for Intel PVC GEMM
Co-authored-by: Alejandro Acosta <[email protected]>
LGTM
```
@@ -42,6 +42,10 @@
 #include "cutlass/util/device_memory.h"
 #include "helper.h"

+#if defined(CUTLASS_ENABLE_MKL)
+# include <oneapi/mkl.hpp>
```
You are building the oneMKL interface for an Nvidia GPU, but you are linking it against the closed-source MKL for Intel even when it is on Nvidia? What is your mechanism for correct linking, to make sure that oneMKL does not run on the Intel CPU instead of calling cuBLAS behind the scenes to reach the GPU? If you are always testing it on the Intel CPU, there is no need to have this:

```
-DENABLE_MKLGPU_BACKEND=${ENABLE_MKLGPU_BACKEND}
-DENABLE_CUBLAS_BACKEND=${ENABLE_CUBLAS_BACKEND}
```
> What is your mechanism to have a correct linking and make sure that the oneMKL does not run on Intel CPU?

From what I understand of the oneMKL interface, dispatch follows the queue: whether it is a GPU or a CPU queue, and, for a GPU queue, which backend is selected depends on the device. Since our queue will never be a CPU queue, it will never go to the CPU BLAS side of things.
oneMKL does not have all the backends unless you provide them; for example, if you don't give a correct path to the backend, there is no GPU or Nvidia backend. The oneMKL interface is a wrapper over three closed-source actual implementations: the oneMKL product, cuBLAS, and rocBLAS. If a backend is not available, it does not link to the correct one. The closed-source oneMKL product is only for Intel CPU and GPU, so when you are running on an Nvidia GPU you don't need the oneMKL product backend at all; similarly, when running on an Intel GPU you don't need the cuBLAS backend at all.
> If the backend is not available it does not link to the correct one. The oneMKL closed-source product is only for Intel CPU and GPU, so when you are running on Nvidia GPU you don't need the oneMKL product backend at all; similarly, when running on Intel GPU you don't need the cuBLAS backend at all.

Only one backend is being built. The CPU backend is explicitly turned off, and either the cuBLAS or the MKL_GPU (closed-source oneMKL) backend is built here, in accordance with the SYCL target being passed.
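As a concrete illustration of building a single backend, a configure line for an Nvidia-only build might look roughly like the following. This is a sketch, not the PR's actual build invocation; the flag names are taken from the oneMKL interface project (the first two also appear verbatim earlier in this thread).

```shell
# Sketch: configure the oneMKL interface library so that only the
# cuBLAS backend is built (CPU and Intel GPU backends disabled).
cmake .. \
  -DENABLE_MKLCPU_BACKEND=OFF \
  -DENABLE_MKLGPU_BACKEND=OFF \
  -DENABLE_CUBLAS_BACKEND=ON
```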
```
#if defined(CUTLASS_ENABLE_MKL)
  auto d_CRef = syclcompat::malloc<TC>(h_C.size());
  auto d_hRef = std::vector<TC>(h_C.size());
  auto queue = syclcompat::get_default_queue();
```
What is the default queue here, and how can you make sure it is an Nvidia GPU queue? This test is set up specifically for an Ampere device and needs to be compiled only for that device, because the architecture is defined at compile time. How do you make sure you get the correct queue on all systems when you use the default here?
> This test that you are running is specifically set for an Ampere device, and needs to be compiled only for that device due to the definition of the architecture at compile time.

This example only builds if SYCL_NVIDIA_TARGET is defined (here).

> How can you make sure the queue is an Nvidia GPU queue?

Since we do not explicitly create a queue, or pass any sort of device selection to syclcompat, I would expect the device selector to be set when invoking the binary, assuming multiple backends on the platform. The reason I use the default_queue from syclcompat is that it is used throughout, for memory allocations etc.
The selector should be passed somehow, probably through SYCLcompat; the SYCL default queue does not guarantee selection of any particular device, and you won't have any control when multiple GPUs are available.
You can create a queue using the GPU device selector and then use syclcompat::set_default_queue. After that, you can add a check like the one in the CUDA example:

```
cudaDeviceProp props;
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (error != cudaSuccess) {
  std::cerr << "cudaGetDeviceProperties() returned an error: "
            << cudaGetErrorString(error) << std::endl;
  return -1;
}
if (!((props.major * 10 + props.minor) >= 80)) {
  std::cerr << "Ampere Tensor Core operations must be run on a machine with compute capability at least 80."
            << std::endl;
  notSupported = true;
}
if (notSupported) {
  // Returning zero so this test passes on older Toolkits. Its actions are no-op.
  return 0;
}
```
Co-authored-by: Mehdi Goli <[email protected]>
Adds support for using oneMKL for verification, and verifies example 14.