Use the ROCm/HIP device to accelerate certain DPLASMA kernels (#57)
* configure: Add the --with-hip option
* hip: Configury
* hip: kernel typedefs
* hip: update for hip-enabled parsec
* hip: detect hipblas and rocsolver
  Conflicts: src/CMakeLists.txt
* hip: precision generator rules
* hip: clean up unused dyld hipblas functions
* hip: Update lapack stagein
* Update for feature/common_gpu parsec branch changes
* Some conflicting updates between hip and common_gpu need more resolution
* hip: stream info registration
* hip: potrf on AMD
* hip:po: Some errors introduced when merging
* Add HIP to the lookahead gpu gemm
* Add HIP to zgemm_summa
* hip: rework of PO and workspaces
* hip: remove unnecessary hipblas init calls
* hip:po: errors in ldam asserts
* hip:po: some of the changes had broken cusolver
* Fix printlog for cuda/hip
* Auto-generate hip stage-in/out functions; use proper error checks instead of asserts
* hip:zgemm_gpu: don't use hipComplex
* Return the proper PARSEC_HOOK_RETURN_ERROR in GPU error cases
* Update for the new device mask for incarnations
* So far only NN gemm can run with HIP
* Use the correct DPLASMA_HAVE_HIP
* Remove weight properties from HIP bodies
* Reorder and unify cuda and hip bodies
* A PARSEC_HAVE_HIP was still present
* Rework zpotrf_U
* hip: add NT/TN/TT cases to gemm_summa
* Update parsec to a version that works with GPUs
* zpotrf_wrapper: uid and handles don't exist when not using a GPU device
* Update dtd for hip/cuda specializations for the dtd workspaces
* Make all gemm_summa the same between hip/cuda
* Use the same controls as parsec for GPU_WITH_CUDA/HIP
* hip: merge error: the device count must be updated in both hip and cuda builds
* hip: printlog hipblascomplex not compatible with creal
* hip: final cleanup

Signed-off-by: Aurelien Bouteiller <[email protected]>
1 parent f4dd66c, commit edf5be3. Showing 51 changed files with 1,373 additions and 438 deletions.
@@ -1,8 +1,9 @@
-[cu*_alloc_failed]
-There was not enough memory available on a CUDA device
+[gpu_alloc_failed]
+There was not enough memory available on a GPU device
 while trying to allocate a %s handle to manage tasks on
-this device, or another CUDA device on the node. The
+this device, or another GPU device on the node. The
 PaRSEC runtime system may be configured to reserve too
-much memory on CUDA devices. Try reducing the amount of
+much memory on GPU devices. Try reducing the amount of
 reserved memory by setting the PaRSEC MCA parameter
-'device_cuda_memory_use' to a lower value.
+'device_cuda_memory_use' (or similar for the type of
+device) to a lower value.
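The help text above points users at the `device_cuda_memory_use` MCA parameter. A minimal sketch of lowering the reservation is shown below; the `PARSEC_MCA_` environment-variable spelling and the `device_hip_memory_use` analogue are assumptions based on PaRSEC's usual MCA conventions, so check the PaRSEC documentation for your build:

```shell
# Assumption: PaRSEC MCA parameters can be overridden via
# PARSEC_MCA_-prefixed environment variables (verify for your install).
# Reserve at most 50% of each CUDA device's memory for PaRSEC:
export PARSEC_MCA_device_cuda_memory_use=50
# The HIP/ROCm analogue would presumably be:
export PARSEC_MCA_device_hip_memory_use=50
```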
@@ -0,0 +1,3 @@
+This directory contains files that are automatically converted from CUDA to HIP using Hipify.
+If your file is not automatically convertible, put it somewhere else.
+
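For context, the conversion such files go through can be sketched as follows. The commented `hipify-perl` line shows a typical Hipify invocation (it requires a ROCm install; the filenames are hypothetical); the `sed` line is only a toy stand-in that illustrates the flavor of the cuda-to-hip identifier renaming, which real Hipify performs far more carefully:

```shell
# Typical Hipify usage (illustrative; filenames are hypothetical):
#   hipify-perl dplasma_cuda_lapack.c > dplasma_hip_lapack.c
# Toy illustration of the cuda->hip identifier renaming:
echo 'cudaMemcpy2DAsync(dst, dpitch, src, spitch, w, h, kind, stream);' \
  | sed 's/cuda/hip/g'
```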
@@ -0,0 +1,165 @@
/*
 * Copyright (c) 2020-2024 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 *
 * $COPYRIGHT
 *
 */

#include "dplasma.h"
#include "dplasmajdf_lapack_dtt.h"

#if defined(DPLASMA_HAVE_CUDA)
#include <cuda.h>
#include <parsec/mca/device/cuda/device_cuda.h>

/* Use cudaMemcpy2DAsync, or a loop of cudaMemcpyAsync calls, for data transfers to the device */
#define USE_COPY_2D

int
dplasma_cuda_lapack_stage_in(parsec_gpu_task_t *gtask,
                             uint32_t flow_mask,
                             parsec_gpu_exec_stream_t *gpu_stream)
{
    cudaError_t ret;
    parsec_data_copy_t *copy_in;
    parsec_data_copy_t *copy_out;
    parsec_device_gpu_module_t *in_elem_dev;
    parsec_cuda_exec_stream_t *cuda_stream = (parsec_cuda_exec_stream_t*)gpu_stream;
    dplasma_data_collection_t *ddc;
    parsec_task_t *task = gtask->ec;
    int elem_sz;
    int i;
    for(i = 0; i < task->task_class->nb_flows; i++) {
        if(flow_mask & (1U << i)) {
            copy_in = task->data[i].data_in;
            copy_out = task->data[i].data_out;
            ddc = (dplasma_data_collection_t*)gtask->flow_dc[i];
            assert(ddc != NULL);
            elem_sz = parsec_datadist_getsizeoftype(ddc->dc_original->mtype);
            in_elem_dev = (parsec_device_gpu_module_t*)parsec_mca_device_get(copy_in->device_index);
            if( (in_elem_dev->super.type == PARSEC_DEV_CUDA) || (ddc->dc_original->storage != PARSEC_MATRIX_LAPACK) ) {
                ret = (cudaError_t)cudaMemcpyAsync( copy_out->device_private,
                                                    copy_in->device_private,
                                                    gtask->flow_nb_elts[i],
                                                    (in_elem_dev->super.type != PARSEC_DEV_CUDA)?
                                                        cudaMemcpyHostToDevice : cudaMemcpyDeviceToDevice,
                                                    cuda_stream->cuda_stream );
                PARSEC_CUDA_CHECK_ERROR( "cudaMemcpyAsync ", ret, { return PARSEC_ERROR; } );
            } else {
#ifdef USE_COPY_2D
                int ldd, nrows, ncols;
                ADTT_INFO_internal(copy_in, ddc, &ldd, &nrows, &ncols);
                size_t dpitch = ddc->dc_original->mb * elem_sz;
                size_t spitch = ldd * elem_sz;
                size_t width  = nrows * elem_sz;
                size_t height = ncols;
                /* copy width bytes, height times, skipping pitch - width bytes every time */
                ret = (cudaError_t)cudaMemcpy2DAsync( copy_out->device_private,
                                                      dpitch, /* dst pitch in bytes */
                                                      copy_in->device_private,
                                                      spitch, /* src pitch in bytes */
                                                      width, height,
                                                      cudaMemcpyHostToDevice,
                                                      cuda_stream->cuda_stream );
                PARSEC_CUDA_CHECK_ERROR( "cudaMemcpy2DAsync ", ret, { return PARSEC_ERROR; } );
#else
                int ldd, nrows, ncols;
                ADTT_INFO_internal(copy_in, ddc, &ldd, &nrows, &ncols);
                int j;
                for(j = 0; j < ncols; j++) {
                    char *src = ((char*)copy_in->device_private)  + j * ldd * elem_sz;
                    char *dst = ((char*)copy_out->device_private) + j * ddc->dc_original->mb * elem_sz;
                    ret = cudaMemcpyAsync( dst,
                                           src,
                                           nrows * elem_sz,
                                           cudaMemcpyHostToDevice,
                                           cuda_stream->cuda_stream );
                    PARSEC_CUDA_CHECK_ERROR( "cudaMemcpyAsync ", ret, { return PARSEC_ERROR; } );
                }
#endif
            }
        }
    }
    return PARSEC_SUCCESS;
}
int
dplasma_cuda_lapack_stage_out(parsec_gpu_task_t *gtask,
                              uint32_t flow_mask,
                              parsec_gpu_exec_stream_t *gpu_stream)
{
    cudaError_t ret;
    parsec_data_copy_t *copy_in;
    parsec_data_copy_t *copy_out;
    parsec_device_gpu_module_t *out_elem_dev;
    parsec_cuda_exec_stream_t *cuda_stream = (parsec_cuda_exec_stream_t*)gpu_stream;
    parsec_task_t *task = gtask->ec;
    dplasma_data_collection_t *ddc;
    int elem_sz;
    int i;
    for(i = 0; i < task->task_class->nb_flows; i++) {
        if(flow_mask & (1U << i)) {
            copy_in = task->data[i].data_out;
            copy_out = copy_in->original->device_copies[0];
            ddc = (dplasma_data_collection_t*)gtask->flow_dc[i];
            assert(ddc != NULL);
            elem_sz = parsec_datadist_getsizeoftype(ddc->dc_original->mtype);
            out_elem_dev = (parsec_device_gpu_module_t*)parsec_mca_device_get(copy_out->device_index);

            if( (out_elem_dev->super.type == PARSEC_DEV_CUDA) || (ddc->dc_original->storage != PARSEC_MATRIX_LAPACK) ) {
                ret = (cudaError_t)cudaMemcpyAsync( copy_out->device_private,
                                                    copy_in->device_private,
                                                    gtask->flow_nb_elts[i],
                                                    (out_elem_dev->super.type != PARSEC_DEV_CUDA)?
                                                        cudaMemcpyDeviceToHost : cudaMemcpyDeviceToDevice,
                                                    cuda_stream->cuda_stream );
                PARSEC_CUDA_CHECK_ERROR( "cudaMemcpyAsync ", ret, { return PARSEC_ERROR; } );
            } else {
#ifdef USE_COPY_2D
                int ldd, nrows, ncols;
                ADTT_INFO_internal(copy_out, ddc, &ldd, &nrows, &ncols);
                size_t dpitch = ldd * elem_sz;
                size_t spitch = ddc->dc_original->mb * elem_sz;
                size_t width  = nrows * elem_sz;
                size_t height = ncols;
                /* copy width bytes, height times, skipping pitch - width bytes every time */
                ret = (cudaError_t)cudaMemcpy2DAsync( copy_out->device_private,
                                                      dpitch, /* dst pitch in bytes */
                                                      copy_in->device_private,
                                                      spitch, /* src pitch in bytes */
                                                      width, height,
                                                      cudaMemcpyDeviceToHost,
                                                      cuda_stream->cuda_stream );
                PARSEC_CUDA_CHECK_ERROR( "cudaMemcpy2DAsync ", ret, { return PARSEC_ERROR; } );
#else
                int ldd, nrows, ncols;
                ADTT_INFO_internal(copy_out, ddc, &ldd, &nrows, &ncols);
                int j;
                for(j = 0; j < ncols; j++) {
                    char *src = ((char*)copy_in->device_private)  + j * ddc->dc_original->mb * elem_sz;
                    char *dst = ((char*)copy_out->device_private) + j * ldd * elem_sz;
                    ret = cudaMemcpyAsync( dst,
                                           src,
                                           nrows * elem_sz,
                                           cudaMemcpyDeviceToHost,
                                           cuda_stream->cuda_stream );
                    PARSEC_CUDA_CHECK_ERROR( "cudaMemcpyAsync ", ret, { return PARSEC_ERROR; } );
                }
#endif
            }
        }
    }
    return PARSEC_SUCCESS;
}
#endif /* defined(DPLASMA_HAVE_CUDA) */