Replies: 7 comments 6 replies
-
In fact, anything different from [1] gives the same result.
-
I think we currently do not support setting that.
-
We can check out the generated dispatch function:

```cpp
// Host-side dispatcher generated for the n1024/k4096 int8 x int2 matmul:
// each branch launches the kernel tuned for the smallest M bucket that fits m.
extern "C" void call(int8_t* __restrict__ A, int8_t* __restrict__ B, float* __restrict__ D, int m, cudaStream_t stream = cudaStreamDefault) {
  if (m == 0) return;
  if (m <= 1) {
    // Small batches run on SIMT (CUDA Core) kernels.
    matmul_n1024k4096_i8xi2_simt_opt_m_1<<<dim3(256, 1, m), dim3(32, 4, 1), 0, stream>>>(A, B, D, m);
  }
  else if (m <= 16) {
    matmul_n1024k4096_i8xi2_simt_opt_m_16<<<dim3(128, 1, m), dim3(16, 8, 1), 0, stream>>>(A, B, D, m);
  }
  else if (m <= 32) {
    // Larger batches switch to Tensor Core (tc) kernels.
    matmul_n1024k4096_i8xi2_tcx32x16x512w16x16xp2_opt_m_32<<<dim3(32, (m + 15) / 16, 1), dim3(32, 1, 2), 40960, stream>>>(A, B, D, m);
  }
  else if (m <= 64) {
    matmul_n1024k4096_i8xi2_tcx16x16x512w16x16xp2_opt_m_64<<<dim3(64, (m + 15) / 16, 1), dim3(32, 1, 1), 28672, stream>>>(A, B, D, m);
  }
  else if (m <= 128) {
    matmul_n1024k4096_i8xi2_tcx16x32x512w16x16xp2_opt_m_128<<<dim3(64, (m + 15) / 16, 1), dim3(32, 1, 1), 28672, stream>>>(A, B, D, m);
  }
  else if (m <= 256) {
    matmul_n1024k4096_i8xi2_tcx64x64x64w32x32xp2_opt_m_256<<<dim3(16, (m + 63) / 64, 1), dim3(32, 2, 2), 23552, stream>>>(A, B, D, m);
  }
  else if (m <= 512) {
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_512<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
  else if (m <= 1024) {
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_1024<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
  else {
    // Beyond the largest tuned bucket, fall back to the m=1024 kernel.
    matmul_n1024k4096_i8xi2_tcx64x128x64w32x64xp2_opt_m_1024<<<dim3(8, (m + 63) / 64, 1), dim3(32, 2, 2), 35840, stream>>>(A, B, D, m);
  }
}
```
-
For the dynamic range, the values [2, 4, 8] may be unnecessary, as they can share the same tile configuration as M=1 (which uses the CUDA Cores). Once the batch size exceeds 16, we switch to the Tensor Cores for better performance.
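As a rough sketch of what that implies for the config (using the `bitblas.MatmulConfig`/`bitblas.Matmul` API quoted later in this thread; N/K and the int8 x int2 dtypes follow the kernel names above, while `accum_dtype` is an assumption):

```python
import bitblas

# Sketch: drop [2, 4, 8] from the dynamic range, since those batch sizes
# can reuse the M=1 CUDA Core tile configuration anyway.
config = bitblas.MatmulConfig(
    M=[1, 16, 32, 64, 128, 256, 512, 1024],
    N=1024,                 # matches the n1024k4096 kernels above
    K=4096,
    A_dtype="int8",         # i8 x i2, as in the generated kernel names
    W_dtype="int2",
    accum_dtype="int32",    # assumption; not stated in this thread
    out_dtype="float32",    # D is float* in the dispatcher above
)
matmul = bitblas.Matmul(config=config)
```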
-
For the tuning time, we can utilize the database that bitblas.Linear uses: when bitblas encounters a kernel configuration for the first time, it performs the compilation and stores the result in a database, located by default at ~/.cache/bitblas. The next time it encounters the same configuration, it retrieves the precompiled library directly from the database, bypassing the tuning process. As a result, tuning only occurs the first time a specific model and its initial layer are encountered. If you do not use the database, however, kernel tuning with a large dynamic range (for example, [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]) may take a while each time.
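For illustration, a minimal sketch of the cache behavior described above (not official usage; it simply times two constructions of the same operator, with the `accum_dtype` again an assumption):

```python
import time
import bitblas

config = bitblas.MatmulConfig(
    M=[1, 16, 32, 64, 128, 256, 512, 1024],
    N=1024, K=4096,
    A_dtype="int8", W_dtype="int2",
    accum_dtype="int32", out_dtype="float32",
)

# First construction of this config: compiles/tunes and stores the result
# in the database (default location: ~/.cache/bitblas).
start = time.time()
matmul = bitblas.Matmul(config=config)
print(f"first build took {time.time() - start:.1f}s")

# A later construction with the identical config should skip tuning and
# load the precompiled library from the database instead.
start = time.time()
matmul_again = bitblas.Matmul(config=config)
print(f"second build took {time.time() - start:.1f}s")
```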
-
This is quite troublesome. We're also considering bypassing tuning by saving compilation results for different hardware setups, but this is challenging and may take some time to design and implement.
-
What is the right way to set up BitBLAS to work with a set of batch sizes:
1- A single config with M=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], or
2- A separate config for each M value?
When I set
`bitblas.MatmulConfig(M=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024], ...`
it just gets stuck for a long time. Thank you