Replies: 7 comments 6 replies
-
In fact, anything different from [1] gives the same result.
-
I think we currently do not support setting that.
-
We can check out the generated dispatch function:

```cpp
// Host-side dispatcher generated for the n1024/k4096 int8 x int2 matmul:
// each branch launches the kernel tuned for the smallest M bucket that fits m.
extern "C" void call(int8_t* __restrict__ A, int8_t* __restrict__ B, float* __restrict__ D, int m, cudaStream_t stream = cudaStreamDefault) {
  if (m == 0) return;
  if (m <= 1) {
    // Small batches run on SIMT (CUDA Core) kernels.
    matmul_n1024k4096_i8xi2_simt_opt_m_1<<<dim3(256, 1, m), dim3(32, 4, 1), 0, stream>>>(A, B, D, m);
  }
  else if (m <= 16) {
    matmul_n1024k4096_i8xi2_simt_opt_m_16<<<dim3(128, 1, m), dim3(16, 8, 1), 0, stream>>>(A, B, D, m);
  }
  else if (m <= 32) {
    // Larger batches switch to Tensor Core (tc) kernels.
    matmul_n1024k4096_i8xi2_tcx32x16x512w16x16xp2_opt_m_32<<<dim3(32, (m + 15) / 16, 1), dim3(32, 1, 2), 40960, stream>>>(A, B, D, m);
  }
  else if (m <= 64) {
    matmul_n1024k4096_i8xi2_tcx16x16x512w16x16xp2_opt_m_64<<<dim3(64, (m + 15) / 16, 1), dim3(32, 1, 1), 28672, stream>>>(A, B, D, m);
  }
  else if (m <= 128) {
    matmul_n1024k4096_i8xi2_tcx16x32x512w16x16xp2_opt_m_128<<<dim3(64, (m + 15) / 16, 1), dim3(32, 1, 1), 28672, stream>>>(A, B, D, m);
  }
  else if (m <= 256) {
    matmul_n1024k4096_i8xi2_tcx64x64x64w32x32xp2_opt_m_256<<<dim3(16, (m + 63) / 64, 1), dim3(32, 2, 2), 23552, stream>>>(A, B, D, m);
  }
  else if (m <= 512) {
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_512<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
  else if (m <= 1024) {
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_1024<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
  else {
    // Beyond the largest tuned bucket, fall back to the m=1024 kernel.
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_1024<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
}
```
-
For the dynamic range, the values [2, 4, 8] may be unnecessary, as they can share the same tile configuration as M=1 (which uses the CUDA Cores). Once the batch size exceeds 16, we switch to the Tensor Cores for better performance.
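As a rough sketch of what that implies for the config (using the `bitblas.MatmulConfig`/`bitblas.Matmul` API quoted later in this thread; N/K and the int8 x int2 dtypes follow the kernel names above, while `accum_dtype` is an assumption):

```python
import bitblas

# Sketch: drop [2, 4, 8] from the dynamic range, since those batch sizes
# can reuse the M=1 CUDA Core tile configuration anyway.
config = bitblas.MatmulConfig(
    M=[1, 16, 32, 64, 128, 256, 512, 1024],
    N=1024,                 # matches the n1024k4096 kernels above
    K=4096,
    A_dtype="int8",         # i8 x i2, as in the generated kernel names
    W_dtype="int2",
    accum_dtype="int32",    # assumption; not stated in this thread
    out_dtype="float32",    # D is float* in the dispatcher above
)
matmul = bitblas.Matmul(config=config)
```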
-
For the tuning time, we can utilize the database that bitblas.Linear uses: when bitblas encounters a kernel configuration for the first time, it performs the compilation and stores the result in a database, located by default at ~/.cache/bitblas. The next time it encounters the same configuration, it retrieves the precompiled library directly from the database, bypassing the tuning process. As a result, tuning only occurs the first time a specific model and its initial layer are encountered. If you do not use the database, however, kernel tuning with a large dynamic range (for example, [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]) may take a while each time.
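For illustration, a minimal sketch of the cache behavior described above (not official usage; it simply times two constructions of the same operator, with the `accum_dtype` again an assumption):

```python
import time
import bitblas

config = bitblas.MatmulConfig(
    M=[1, 16, 32, 64, 128, 256, 512, 1024],
    N=1024, K=4096,
    A_dtype="int8", W_dtype="int2",
    accum_dtype="int32", out_dtype="float32",
)

# First construction of this config: compiles/tunes and stores the result
# in the database (default location: ~/.cache/bitblas).
start = time.time()
matmul = bitblas.Matmul(config=config)
print(f"first build took {time.time() - start:.1f}s")

# A later construction with the identical config should skip tuning and
# load the precompiled library from the database instead.
start = time.time()
matmul_again = bitblas.Matmul(config=config)
print(f"second build took {time.time() - start:.1f}s")
```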
-
This is quite troublesome. We're also considering bypassing tuning by saving compilation results for different hardware setups, but this is challenging and may take some time to design and implement.
-
What is the right way to set up BitBLAS to work with a set of batch sizes:
1- A single config with M=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], or
2- A separate config for each M value?
When I set
`bitblas.MatmulConfig(M=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], ...`
it just gets stuck for a long time. Thank you