Question about use_tensor_cores = True or False #520

Open
sleepwalker2017 opened this issue Oct 10, 2024 · 2 comments

Hi, I'm benchmarking flashinfer on H100, and I'm running attention for the decoding stage.

I use q_head = kv_head = 40, which is the standard attention configuration for llama 13B.

I tried both use_tensor_cores = True and use_tensor_cores = False and got nearly the same performance; a rough sketch of my setup and the nsys output for both runs follow.
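
Roughly how I drive the kernel (a minimal sketch assuming the current plan/run interface of flashinfer.BatchDecodeWithPagedKVCacheWrapper; the batch size, sequence length, page size, and workspace size here are placeholders rather than the exact values I benchmarked):

  import torch
  import flashinfer

  num_qo_heads = num_kv_heads = 40          # llama-13B style MHA (no GQA)
  head_dim = 128
  page_size = 16
  batch_size = 64                           # placeholder
  seq_len = 1024                            # placeholder
  pages_per_seq = seq_len // page_size

  # Toggle this flag to switch between the decode kernel (False) and the
  # tensor-core prefill kernel (True).
  use_tensor_cores = False

  workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
  wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
      workspace, "NHD", use_tensor_cores=use_tensor_cores
  )

  # Paged-KV bookkeeping: every request owns pages_per_seq full pages.
  kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
  kv_indices = torch.arange(batch_size * pages_per_seq, dtype=torch.int32, device="cuda")
  kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

  wrapper.plan(
      kv_indptr, kv_indices, kv_last_page_len,
      num_qo_heads, num_kv_heads, head_dim, page_size,
      q_data_type=torch.float16,
  )

  q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
  # NHD paged layout: (num_pages, 2, page_size, num_kv_heads, head_dim)
  kv_cache = torch.randn(
      batch_size * pages_per_seq, 2, page_size, num_kv_heads, head_dim,
      dtype=torch.float16, device="cuda",
  )
  o = wrapper.run(q, kv_cache)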

use_tensor_cores = False (decode kernel):
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…

use_tensor_cores = True (prefill kernel):
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
      0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…

My questions are:

  1. Is this result reliable? Does use_tensor_cores=True invoke the prefill kernel?

  2. I tested the tensor core usage for both kernels with ncu, but found that they both use tensor cores. Why is that?

  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         0.11
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst       166688
    ---------------------------------------------------------------------- ----------- ------------
  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         3.60
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst      4899840
    ---------------------------------------------------------------------- ----------- ------------
yzh119 (Collaborator) commented Oct 10, 2024

Hi @sleepwalker2017, thanks for doing the benchmark!

  1. Yes, use_tensor_cores=True invokes the prefill kernel, and it's reasonable that you get nearly the same performance because decode operations are IO-bound.
  2. That's interesting! I don't manually use tensor cores in the flashinfer decode kernels, but I wasn't sure whether nvcc applies some clever optimization that turns certain reductions into tensor core instructions, so I checked the SASS of the decode kernels (you can inspect SASS with cuobjdump -sass *.o; see the sketch below). I can confirm there are NO HMMA instructions in the decode kernels' generated SASS, whereas the prefill kernels' SASS contains many of them. Perhaps ncu counts some other operations in this pipe.
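
A rough way to automate that check (a sketch only; the object-file paths are placeholders for wherever your compiled flashinfer kernel objects end up, e.g. the AOT build artifacts):

  # Count HMMA (half-precision tensor core) instructions in a kernel's SASS.
  # Requires cuobjdump from the CUDA toolkit on PATH.
  import subprocess

  def count_hmma(obj_path: str) -> int:
      sass = subprocess.run(
          ["cuobjdump", "-sass", obj_path],
          capture_output=True, text=True, check=True,
      ).stdout
      return sum("HMMA" in line for line in sass.splitlines())

  # Placeholder paths: point these at the compiled decode/prefill kernel objects.
  print("decode HMMA count: ", count_hmma("batch_decode.o"))   # expect 0
  print("prefill HMMA count:", count_hmma("batch_prefill.o"))  # expect many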

cyang49 commented Oct 22, 2024

Hi @yzh119, I'm also looking into the decode kernel implementation. I tested it with llama3-8B sizes (num_qo_heads=32, num_kv_heads=8), batch size=64, seq_len=1024, q_data_type=torch.bfloat16, and kv_data_type=torch.float8_e4m3fn.
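
For concreteness, roughly how this configuration is expressed (a sketch assuming the current wrapper API and its q_data_type/kv_data_type plan arguments; page size and workspace size are placeholders):

  import torch
  import flashinfer

  num_qo_heads, num_kv_heads, head_dim = 32, 8, 128   # llama3-8B GQA shapes
  batch_size, seq_len, page_size = 64, 1024, 16       # page_size is a placeholder
  pages_per_seq = seq_len // page_size

  workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
  wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
      workspace, "NHD",
      use_tensor_cores=True,  # force the BatchPrefill... (tensor core) path
  )

  kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
  kv_indices = torch.arange(batch_size * pages_per_seq, dtype=torch.int32, device="cuda")
  kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

  wrapper.plan(
      kv_indptr, kv_indices, kv_last_page_len,
      num_qo_heads, num_kv_heads, head_dim, page_size,
      q_data_type=torch.bfloat16,
      kv_data_type=torch.float8_e4m3fn,   # FP8 KV cache
  )

  q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.bfloat16, device="cuda")
  kv_cache = torch.randn(
      batch_size * pages_per_seq, 2, page_size, num_kv_heads, head_dim,
      dtype=torch.float16, device="cuda",
  ).to(torch.float8_e4m3fn)
  o = wrapper.run(q, kv_cache)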

I noticed that the BatchDecodePagedKV kernel runs slower than its BatchPrefill... counterpart (forced by setting use_tensor_cores=True). I profiled the decode kernel with Nsight Compute and also looked into the code. Here are some of my observations; could you let me know if they are correct?

  • The compute type of the decode kernel is fixed at FP32. For my setting (Q: BF16, KV: FP8), Q, K, and V are all cast to FP32 before the compute (in the compute_qk function).
  • I observed a lot of overhead from the type-cast instructions, which seem to be the main bottleneck of the decode kernel.
  • In contrast, the BatchPrefill... kernel with the same input types computes in BF16 MMA, and Q, K, and V are cast to BF16 before the MMA. Its casting overhead seems smaller, since the fast_dequant_f8f16x4 function appears to be a faster dequantization procedure.

My questions:

  • Why is fast_dequant_f8f16x4 not used in the decode kernel path?
  • Would it make sense to keep the compute type at lower precision (fp8) and avoid the dequantization overhead?
