Skip to content

Commit

Permalink
Bsrmm (#56)
Browse files Browse the repository at this point in the history
* started creating skeleton code for bsrmm

* rebase bsrmm to squash commits

clang formatting

Allow library dependencies to be installed from CI (#49)

csrgeam (#46)

* csrgeam API added

* csrgeam tests and benchmark added

* flops, bandwidth and host implementation for csrgeam

* csrgeam unit tests

* removed webbase_1M test

* csrgeam (functional) added

* added tests for invalid sizes

* typos and year

* clang-format

* csrgeam performance scripts

bump version

Replace host code in bsr2csr (#48)

* removed host bsr2csr and csr2bsr code and replaced it with device calls

* clang formatting

Co-authored-by: jsandham <[email protected]>

bump version

added some examples (#50)

* added sparse level 1 examples

* added examples for sparse level 2 and 3

* clang-format

* added sparse extra examples

* bump version

hipclang related fixes (#51)

* hipclang related fixes

* bump version

sanity check for matrix download (#52)

added fallback for unit test matrix downloads (#53)

examples fix (#54)

* header fix for examples

* bump version

got bsrmm working for block dim less than 8

clang formatting

fixing bugs and getting benchmark to work

optimizing and working on kernels for block dimension greater than 8

kernels and code for block dimension greater than 8 and B matrix transposed

expanded loop unrolling up to block dimension 16

clang formatting

Remove gpg check for CI package CentOS install (#57)

updated internal function names (#61)

* renamed internal csrtr to trm

* clang-format

added missing header (#62)

fixes to documentation

remove compile time evaluation of direction to help reduce the number of kernels

clang formatting

small performance improvements to transpose kernel

clang formatting

increase transpose performance

clang formatting

re-ordering row pointer and column arrays for csr2csr_compress (#59)

* re-ordering row pointer and column arrays for csr2csr_compress

* fixing broken tests

* fixing incorrect order in log_trace

* moving deletion of temporary arry to ensure it is always called

Co-authored-by: jsandham <[email protected]>

bump version

Single thread compile in install script (#63)

pyyaml package name fix for centos8 (#60)

* pyyaml package name fix for centos8

* this should also account for rhel8

* bump version

Update README.md

pivot test fix (#65)

* adding device sync in spin loop tests to not overwrite pivots before checking them

* bump version

Removing rock-dkms (#66)

Revert "Single thread compile in install script (#63)" (#69)

Fortran interface (#55)

* fortran interface draft with examples added

* example fix to properly work with return values

* force cmake to add .f90 module to package

* added some more missing level1, level3 and conversion routines

* added few more missing functions to wrapper

* csric0 and csrilu0 fortran examples

* csrgemm_buffer_size binding name fixed

* fortran example fix, stop allows only constant expressions

* fix for string passing

* added enums to fortran; example for aux functions; fixes to pointer arguments

* more examples

* updated fortran example output of csrilu0 and csric0

* updated install.sh script and dockerfiles to install gfortran dependencies

* fix for device pointer mode

* few changes to make it consistent with hipfort

* bump version

ddoti fortran fix (#71)

bsrmv smem sync? (#70)

bump version

mtx pattern fix (#73)

Added centos 8 dependency fixes (#74)

bump version

bsrsv (#72)

* general working version of bsrsv for lower and upper non transposed matrices

* fixing bsr_to_bsc order

* added functionality for transposed matrix

* enabling complex numbers

* optimized bsrsv for BSR dimensions from 2x2 to 32x32

* gfx908

* fortran functions and example

* disabling some unit diagonal tests with nos1 and nos2

* bump version

fortran module fixes (#75)

centos 6 (#76)

* centos6 support

* bump version

Allow library dependencies to be installed from CI (#49)

csrgeam (#46)

* csrgeam API added

* csrgeam tests and benchmark added

* flops, bandwidth and host implementation for csrgeam

* csrgeam unit tests

* removed webbase_1M test

* csrgeam (functional) added

* added tests for invalid sizes

* typos and year

* clang-format

* csrgeam performance scripts

added some examples (#50)

* added sparse level 1 examples

* added examples for sparse level 2 and 3

* clang-format

* added sparse extra examples

* bump version

examples fix (#54)

* header fix for examples

* bump version

Remove gpg check for CI package CentOS install (#57)

added missing header (#62)

re-ordering row pointer and column arrays for csr2csr_compress (#59)

* re-ordering row pointer and column arrays for csr2csr_compress

* fixing broken tests

* fixing incorrect order in log_trace

* moving deletion of temporary arry to ensure it is always called

Co-authored-by: jsandham <[email protected]>

Single thread compile in install script (#63)

Update README.md

Removing rock-dkms (#66)

Revert "Single thread compile in install script (#63)" (#69)

Fortran interface (#55)

* fortran interface draft with examples added

* example fix to properly work with return values

* force cmake to add .f90 module to package

* added some more missing level1, level3 and conversion routines

* added few more missing functions to wrapper

* csric0 and csrilu0 fortran examples

* csrgemm_buffer_size binding name fixed

* fortran example fix, stop allows only constant expressions

* fix for string passing

* added enums to fortran; example for aux functions; fixes to pointer arguments

* more examples

* updated fortran example output of csrilu0 and csric0

* updated install.sh script and dockerfiles to install gfortran dependencies

* fix for device pointer mode

* few changes to make it consistent with hipfort

* bump version

ddoti fortran fix (#71)

bsrmv smem sync? (#70)

bsrsv (#72)

* general working version of bsrsv for lower and upper non transposed matrices

* fixing bsr_to_bsc order

* added functionality for transposed matrix

* enabling complex numbers

* optimized bsrsv for BSR dimensions from 2x2 to 32x32

* gfx908

* fortran functions and example

* disabling some unit diagonal tests with nos1 and nos2

* bump version

fortran module fixes (#75)

centos 6 (#76)

* centos6 support

* bump version

adding fortran example code

fixing fortran compile error

adding bsrmm to fortran_module.f90

fixing fortran example array order

fix fortran compile error

fix fortran compile error

adding cpp example code for bsrmm

clang formatting

working on optimizing kernels

working on optimizing kernels

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

optimizing bsrmm

reverting back to original kernels

optimizing bsrmm

making test2 kernel active for block dim 8

optimizing bsrmm

significant performance improvement for block dimensions 5 to 32

further performance improvements to transpose and non-transpose case

reduce compile times and replaced general kernel

optimizing for n <= 16

Correction to the cmake RUNPATH parameter (#79)

Co-authored-by: Pruthvi Madugundu <[email protected]>

bump version

cmake update (#80)

* cmake update

* disabling OpenMP until this is fixed within hipclang

Csr2bsr optimization (#78)

* optimized csr2bsr_nnz

* rebase csr2bsr_optimization branch to squash commits

Working on optimizing csr2bsr device code

changed blocksize to 16 as this runs twice as fast

clang formatting

removing comments

performance optimizations

clang formatting

improve performance

clang formatting

csr2bsr optimization

added missing header (#62)

re-ordering row pointer and column arrays for csr2csr_compress (#59)

* re-ordering row pointer and column arrays for csr2csr_compress

* fixing broken tests

* fixing incorrect order in log_trace

* moving deletion of temporary arry to ensure it is always called

Co-authored-by: jsandham <[email protected]>

bump version

Single thread compile in install script (#63)

pyyaml package name fix for centos8 (#60)

* pyyaml package name fix for centos8

* this should also account for rhel8

* bump version

Update README.md

pivot test fix (#65)

* adding device sync in spin loop tests to not overwrite pivots before checking them

* bump version

Removing rock-dkms (#66)

Revert "Single thread compile in install script (#63)" (#69)

Fortran interface (#55)

* fortran interface draft with examples added

* example fix to properly work with return values

* force cmake to add .f90 module to package

* added some more missing level1, level3 and conversion routines

* added few more missing functions to wrapper

* csric0 and csrilu0 fortran examples

* csrgemm_buffer_size binding name fixed

* fortran example fix, stop allows only constant expressions

* fix for string passing

* added enums to fortran; example for aux functions; fixes to pointer arguments

* more examples

* updated fortran example output of csrilu0 and csric0

* updated install.sh script and dockerfiles to install gfortran dependencies

* fix for device pointer mode

* few changes to make it consistent with hipfort

* bump version

ddoti fortran fix (#71)

bsrmv smem sync? (#70)

bump version

mtx pattern fix (#73)

Added centos 8 dependency fixes (#74)

bump version

bsrsv (#72)

* general working version of bsrsv for lower and upper non transposed matrices

* fixing bsr_to_bsc order

* added functionality for transposed matrix

* enabling complex numbers

* optimized bsrsv for BSR dimensions from 2x2 to 32x32

* gfx908

* fortran functions and example

* disabling some unit diagonal tests with nos1 and nos2

* bump version

fortran module fixes (#75)

centos 6 (#76)

* centos6 support

* bump version

added missing header (#62)

re-ordering row pointer and column arrays for csr2csr_compress (#59)

* re-ordering row pointer and column arrays for csr2csr_compress

* fixing broken tests

* fixing incorrect order in log_trace

* moving deletion of temporary arry to ensure it is always called

Co-authored-by: jsandham <[email protected]>

Single thread compile in install script (#63)

Update README.md

Removing rock-dkms (#66)

Revert "Single thread compile in install script (#63)" (#69)

Fortran interface (#55)

* fortran interface draft with examples added

* example fix to properly work with return values

* force cmake to add .f90 module to package

* added some more missing level1, level3 and conversion routines

* added few more missing functions to wrapper

* csric0 and csrilu0 fortran examples

* csrgemm_buffer_size binding name fixed

* fortran example fix, stop allows only constant expressions

* fix for string passing

* added enums to fortran; example for aux functions; fixes to pointer arguments

* more examples

* updated fortran example output of csrilu0 and csric0

* updated install.sh script and dockerfiles to install gfortran dependencies

* fix for device pointer mode

* few changes to make it consistent with hipfort

* bump version

ddoti fortran fix (#71)

bsrmv smem sync? (#70)

bsrsv (#72)

* general working version of bsrsv for lower and upper non transposed matrices

* fixing bsr_to_bsc order

* added functionality for transposed matrix

* enabling complex numbers

* optimized bsrsv for BSR dimensions from 2x2 to 32x32

* gfx908

* fortran functions and example

* disabling some unit diagonal tests with nos1 and nos2

* bump version

fortran module fixes (#75)

centos 6 (#76)

* centos6 support

* bump version

Co-authored-by: jsandham <[email protected]>

* reducing number of tests

* removing bank conflicts

* removing duplicate code from rocsparse-functions header

* fixing line in rocspasrse-functions header changed by bad merge

* fix formating from merge

* fix formatting errors from merge

Co-authored-by: jsandham <[email protected]>
  • Loading branch information
jsandham and jsandham authored Jul 6, 2020
1 parent 0da3129 commit 91f7f2b
Show file tree
Hide file tree
Showing 23 changed files with 3,906 additions and 11 deletions.
14 changes: 13 additions & 1 deletion clients/benchmarks/client.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
#include "testing_hybmv.hpp"

// Level3
#include "testing_bsrmm.hpp"
#include "testing_csrmm.hpp"
#include "testing_csrsm.hpp"

Expand Down Expand Up @@ -209,7 +210,7 @@ int main(int argc, char* argv[])
"SPARSE function to test. Options:\n"
" Level1: axpyi, doti, dotci, gthr, gthrz, roti, sctr\n"
" Level2: bsrmv, bsrsv, coomv, csrmv, csrsv, ellmv, hybmv\n"
" Level3: csrmm, csrsm\n"
" Level3: bsrmm, csrmm, csrsm\n"
" Extra: csrgeam, csrgemm\n"
" Preconditioner: csric0, csrilu0\n"
" Conversion: csr2coo, csr2csc, csr2ell, csr2hyb, csr2bsr\n"
Expand Down Expand Up @@ -567,6 +568,17 @@ int main(int argc, char* argv[])
else if(precision == 'z')
testing_hybmv<rocsparse_double_complex>(arg);
}
else if(function == "bsrmm")
{
if(precision == 's')
testing_bsrmm<float>(arg);
else if(precision == 'd')
testing_bsrmm<double>(arg);
else if(precision == 'c')
testing_bsrmm<rocsparse_float_complex>(arg);
else if(precision == 'z')
testing_bsrmm<rocsparse_double_complex>(arg);
}
else if(function == "csrmm")
{
if(precision == 's')
Expand Down
169 changes: 169 additions & 0 deletions clients/common/rocsparse_template_specialization.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1521,6 +1521,175 @@ rocsparse_status rocsparse_hybmv(rocsparse_handle handle,
* level 3 SPARSE
* ===========================================================================
*/
// bsrmm
template <>
rocsparse_status rocsparse_bsrmm(rocsparse_handle handle,
rocsparse_direction dir,
rocsparse_operation trans_A,
rocsparse_operation trans_B,
rocsparse_int mb,
rocsparse_int n,
rocsparse_int kb,
rocsparse_int nnzb,
const float* alpha,
const rocsparse_mat_descr descr,
const float* bsr_val,
const rocsparse_int* bsr_row_ptr,
const rocsparse_int* bsr_col_ind,
rocsparse_int block_dim,
const float* B,
rocsparse_int ldb,
const float* beta,
float* C,
rocsparse_int ldc)
{
return rocsparse_sbsrmm(handle,
dir,
trans_A,
trans_B,
mb,
n,
kb,
nnzb,
alpha,
descr,
bsr_val,
bsr_row_ptr,
bsr_col_ind,
block_dim,
B,
ldb,
beta,
C,
ldc);
}

template <>
rocsparse_status rocsparse_bsrmm(rocsparse_handle handle,
rocsparse_direction dir,
rocsparse_operation trans_A,
rocsparse_operation trans_B,
rocsparse_int mb,
rocsparse_int n,
rocsparse_int kb,
rocsparse_int nnzb,
const double* alpha,
const rocsparse_mat_descr descr,
const double* bsr_val,
const rocsparse_int* bsr_row_ptr,
const rocsparse_int* bsr_col_ind,
rocsparse_int block_dim,
const double* B,
rocsparse_int ldb,
const double* beta,
double* C,
rocsparse_int ldc)
{
return rocsparse_dbsrmm(handle,
dir,
trans_A,
trans_B,
mb,
n,
kb,
nnzb,
alpha,
descr,
bsr_val,
bsr_row_ptr,
bsr_col_ind,
block_dim,
B,
ldb,
beta,
C,
ldc);
}

template <>
rocsparse_status rocsparse_bsrmm(rocsparse_handle handle,
rocsparse_direction dir,
rocsparse_operation trans_A,
rocsparse_operation trans_B,
rocsparse_int mb,
rocsparse_int n,
rocsparse_int kb,
rocsparse_int nnzb,
const rocsparse_float_complex* alpha,
const rocsparse_mat_descr descr,
const rocsparse_float_complex* bsr_val,
const rocsparse_int* bsr_row_ptr,
const rocsparse_int* bsr_col_ind,
rocsparse_int block_dim,
const rocsparse_float_complex* B,
rocsparse_int ldb,
const rocsparse_float_complex* beta,
rocsparse_float_complex* C,
rocsparse_int ldc)
{
return rocsparse_cbsrmm(handle,
dir,
trans_A,
trans_B,
mb,
n,
kb,
nnzb,
alpha,
descr,
bsr_val,
bsr_row_ptr,
bsr_col_ind,
block_dim,
B,
ldb,
beta,
C,
ldc);
}

template <>
rocsparse_status rocsparse_bsrmm(rocsparse_handle handle,
rocsparse_direction dir,
rocsparse_operation trans_A,
rocsparse_operation trans_B,
rocsparse_int mb,
rocsparse_int n,
rocsparse_int kb,
rocsparse_int nnzb,
const rocsparse_double_complex* alpha,
const rocsparse_mat_descr descr,
const rocsparse_double_complex* bsr_val,
const rocsparse_int* bsr_row_ptr,
const rocsparse_int* bsr_col_ind,
rocsparse_int block_dim,
const rocsparse_double_complex* B,
rocsparse_int ldb,
const rocsparse_double_complex* beta,
rocsparse_double_complex* C,
rocsparse_int ldc)
{
return rocsparse_zbsrmm(handle,
dir,
trans_A,
trans_B,
mb,
n,
kb,
nnzb,
alpha,
descr,
bsr_val,
bsr_row_ptr,
bsr_col_ind,
block_dim,
B,
ldb,
beta,
C,
ldc);
}

// csrmm
template <>
rocsparse_status rocsparse_csrmm(rocsparse_handle handle,
Expand Down
10 changes: 10 additions & 0 deletions clients/include/flops.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,16 @@ constexpr double csrsv_gflop_count(rocsparse_int M, rocsparse_int nnz, rocsparse
* level 3 SPARSE
* ===========================================================================
*/
template <typename T>
constexpr double bsrmm_gflop_count(rocsparse_int N,
rocsparse_int nnzb,
rocsparse_int block_dim,
rocsparse_int nnz_C,
bool beta = false)
{
return (3.0 * nnzb * block_dim * block_dim * N + (beta ? nnz_C : 0)) / 1e9;
}

template <typename T>
constexpr double
csrmm_gflop_count(rocsparse_int N, rocsparse_int nnz_A, rocsparse_int nnz_C, bool beta = false)
Expand Down
18 changes: 18 additions & 0 deletions clients/include/gbyte.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -132,6 +132,24 @@ constexpr double
* level 3 SPARSE
* ===========================================================================
*/
template <typename T>
constexpr double bsrmm_gbyte_count(rocsparse_int Mb,
rocsparse_int nnzb,
rocsparse_int block_dim,
rocsparse_int nnz_B,
rocsparse_int nnz_C,
bool beta = false)
{
//reads
size_t reads = (Mb + 1 + nnzb) * sizeof(rocsparse_int)
+ (block_dim * block_dim * nnzb + nnz_B + (beta ? nnz_C : 0)) * sizeof(T);

//writes
size_t writes = nnz_C * sizeof(T);

return (reads + writes) / 1e9;
}

template <typename T>
constexpr double csrmm_gbyte_count(rocsparse_int M,
rocsparse_int nnz_A,
Expand Down
22 changes: 22 additions & 0 deletions clients/include/rocsparse.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -302,6 +302,28 @@ rocsparse_status rocsparse_hybmv(rocsparse_handle handle,
* level 3 SPARSE
* ===========================================================================
*/
// bsrmm
template <typename T>
rocsparse_status rocsparse_bsrmm(rocsparse_handle handle,
rocsparse_direction dir,
rocsparse_operation trans_A,
rocsparse_operation trans_B,
rocsparse_int mb,
rocsparse_int n,
rocsparse_int kb,
rocsparse_int nnzb,
const T* alpha,
const rocsparse_mat_descr descr,
const T* bsr_val,
const rocsparse_int* bsr_row_ptr,
const rocsparse_int* bsr_col_ind,
rocsparse_int block_dim,
const T* B,
rocsparse_int ldb,
const T* beta,
T* C,
rocsparse_int ldc);

// csrmm
template <typename T>
rocsparse_status rocsparse_csrmm(rocsparse_handle handle,
Expand Down
83 changes: 80 additions & 3 deletions clients/include/rocsparse_host.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -1233,6 +1233,83 @@ inline void host_hybmv(rocsparse_int M,
* level 3 SPARSE
* ===========================================================================
*/
template <typename T>
inline void host_bsrmm(rocsparse_int Mb,
rocsparse_int N,
rocsparse_int Kb,
rocsparse_int block_dim,
rocsparse_direction dir,
rocsparse_operation transA,
rocsparse_operation transB,
T alpha,
const std::vector<rocsparse_int>& bsr_row_ptr_A,
const std::vector<rocsparse_int>& bsr_col_ind_A,
const std::vector<T>& bsr_val_A,
const std::vector<T>& B,
rocsparse_int ldb,
T beta,
std::vector<T>& C,
rocsparse_int ldc,
rocsparse_index_base base)
{
if(transA != rocsparse_operation_none)
{
return;
}

if(transB != rocsparse_operation_none && transB != rocsparse_operation_transpose)
{
return;
}

rocsparse_int M = Mb * block_dim;
rocsparse_int K = Kb * block_dim;

#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic, 1024)
#endif
for(rocsparse_int i = 0; i < M; i++)
{
rocsparse_int local_row = i % block_dim;

rocsparse_int row_begin = bsr_row_ptr_A[i / block_dim] - base;
rocsparse_int row_end = bsr_row_ptr_A[i / block_dim + 1] - base;

for(rocsparse_int j = 0; j < N; j++)
{
rocsparse_int idx_C = i + j * ldc;

T sum = static_cast<T>(0);

for(rocsparse_int s = row_begin; s < row_end; s++)
{
for(rocsparse_int t = 0; t < block_dim; t++)
{
rocsparse_int idx_A
= (dir == rocsparse_direction_row)
? block_dim * block_dim * s + block_dim * local_row + t
: block_dim * block_dim * s + block_dim * t + local_row;
rocsparse_int idx_B
= (transB == rocsparse_operation_none)
? j * ldb + block_dim * (bsr_col_ind_A[s] - base) + t
: (block_dim * (bsr_col_ind_A[s] - base) + t) * ldb + j;

sum = std::fma(bsr_val_A[idx_A], B[idx_B], sum);
}
}

if(beta == static_cast<T>(0))
{
C[idx_C] = alpha * sum;
}
else
{
C[idx_C] = std::fma(beta, C[idx_C], alpha * sum);
}
}
}
}

template <typename T>
inline void host_csrmm(rocsparse_int M,
rocsparse_int N,
Expand Down Expand Up @@ -1267,16 +1344,16 @@ inline void host_csrmm(rocsparse_int M,
? (csr_col_ind_A[k] - base + j * ldb)
: (j + (csr_col_ind_A[k] - base) * ldb);

sum = std::fma(alpha * csr_val_A[k], B[idx_B], sum);
sum = std::fma(csr_val_A[k], B[idx_B], sum);
}

if(beta == static_cast<T>(0))
{
C[idx_C] = sum;
C[idx_C] = alpha * sum;
}
else
{
C[idx_C] = std::fma(beta, C[idx_C], sum);
C[idx_C] = std::fma(beta, C[idx_C], alpha * sum);
}
}
}
Expand Down
4 changes: 4 additions & 0 deletions clients/include/rocsparse_template.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,10 @@ Functions:
rocsparse_chybmv: { function: hybmv, <<: *single_precision_complex }
rocsparse_zhybmv: { function: hybmv, <<: *double_precision_complex }

rocsparse_sbsrmm: { function: bsrmm, <<: *single_precision }
rocsparse_dbsrmm: { function: bsrmm, <<: *double_precision }
rocsparse_cbsrmm: { function: bsrmm, <<: *single_precision_complex }
rocsparse_zbsrmm: { function: bsrmm, <<: *double_precision_complex }
rocsparse_scsrmm: { function: csrmm, <<: *single_precision }
rocsparse_dcsrmm: { function: csrmm, <<: *double_precision }
rocsparse_scsrsm_buffer_size: { function: csrsm, <<: *single_precision }
Expand Down
Loading

0 comments on commit 91f7f2b

Please sign in to comment.