[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v1) #346
base: llama_fp8_12062024
Conversation
Thank you very much for the integration. Left some comments and suggestions about coding style.
float old;
old = (value >= 0)
          ? __int_as_float(atomicMax((int*)addr, __float_as_int(value)))
          : __uint_as_float(
                atomicMin((unsigned int*)addr, __float_as_uint(value)));

return old;
Suggested change:

- float old;
- old = (value >= 0)
-           ? __int_as_float(atomicMax((int*)addr, __float_as_int(value)))
-           : __uint_as_float(
-                 atomicMin((unsigned int*)addr, __float_as_uint(value)));
- return old;
+ return (value >= 0)
+            ? __int_as_float(atomicMax((int*)addr, __float_as_int(value)))
+            : __uint_as_float(
+                  atomicMin((unsigned int*)addr, __float_as_uint(value)));
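For context, the expression above is the usual trick for implementing an atomic float max on top of integer atomics: non-negative floats order the same way as their signed-int bit patterns, while negative floats order in reverse when viewed as unsigned ints. A minimal self-contained sketch of such a helper is shown below; the wrapper name atomicMaxFloat and its exact signature are illustrative assumptions, not necessarily the PR's code.

// Sketch: atomic max for floats built on integer atomics.
// Non-negative floats compare the same as their signed-int bit patterns,
// so atomicMax on the int view is correct; negative floats compare in
// reverse order when viewed as unsigned ints, so atomicMin is used instead.
__device__ __forceinline__ float atomicMaxFloat(float* addr, float value) {
  return (value >= 0)
             ? __int_as_float(atomicMax((int*)addr, __float_as_int(value)))
             : __uint_as_float(
                   atomicMin((unsigned int*)addr, __float_as_uint(value)));
}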
template <bool is_scale_inverted>
__device__ __forceinline__ FP8_TYPE scaled_fp8_conversion(float const val,
                                                           float const scale) {
  float x = 0.0f;
Kindly give the variable a meaningful name, like scaledValue.
  if constexpr (is_scale_inverted) {
    x = val * scale;
  } else {
    x = val / scale;
If scale is zero, should there be error handling?
    x = val / scale;
  }

  float r = fmax(-FP8_E4M3_MAX, fmin(x, FP8_E4M3_MAX));
Consider naming r as result, or something similarly descriptive.
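Putting the naming and zero-scale suggestions together, a hypothetical revision of the helper could look like the sketch below. The FP8_TYPE alias, the 448.0f constant (the largest finite e4m3 value), and the final static_cast are assumptions for illustration; the PR may use different types and conversion intrinsics.

#include <cuda_fp8.h>

using FP8_TYPE = __nv_fp8_e4m3;         // assumption: NVIDIA fp8 e4m3 type
constexpr float FP8_E4M3_MAX = 448.0f;  // largest finite e4m3 value

template <bool is_scale_inverted>
__device__ __forceinline__ FP8_TYPE scaled_fp8_conversion(float const val,
                                                           float const scale) {
  // Descriptive name instead of `x`.
  float scaledValue = 0.0f;
  if constexpr (is_scale_inverted) {
    scaledValue = val * scale;
  } else {
    // Hypothetical guard for the zero-scale question above:
    // avoid dividing by zero and fall back to 0.0f.
    scaledValue = (scale != 0.0f) ? (val / scale) : 0.0f;
  }
  // Descriptive name instead of `r`: clamp into the representable e4m3 range.
  float result = fmax(-FP8_E4M3_MAX, fmin(scaledValue, FP8_E4M3_MAX));
  return static_cast<FP8_TYPE>(result);
}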
    const scalar_t* __restrict__ input,
    int64_t num_elems) {
  __shared__ float cache[1024];
  int64_t i = blockDim.x * blockIdx.x + threadIdx.x;
Suggested change:

- int64_t i = blockDim.x * blockIdx.x + threadIdx.x;
+ int64_t index = blockDim.x * blockIdx.x + threadIdx.x;
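The snippet above looks like the per-tensor absolute-max reduction used to derive the fp8 scale. For readers following along, here is a minimal sketch of that pattern with the suggested rename applied; the kernel name, the assumption that blockDim.x is a power of two no larger than 1024, the use of the atomicMaxFloat helper from the earlier sketch, and the omission of the final abs-max-to-scale conversion are all illustrative choices, not the PR's exact code.

// Sketch: per-tensor abs-max reduction feeding an fp8 scale computation.
template <typename scalar_t>
__global__ void abs_max_reduction_kernel(float* __restrict__ out_max,
                                         const scalar_t* __restrict__ input,
                                         int64_t num_elems) {
  __shared__ float cache[1024];
  int64_t index = blockDim.x * blockIdx.x + threadIdx.x;

  // Grid-stride loop: each thread keeps a local absolute maximum.
  float local_max = 0.0f;
  while (index < num_elems) {
    local_max = fmaxf(local_max, fabsf(static_cast<float>(input[index])));
    index += static_cast<int64_t>(blockDim.x) * gridDim.x;
  }
  cache[threadIdx.x] = local_max;
  __syncthreads();

  // Tree reduction in shared memory (assumes blockDim.x is a power of two).
  for (unsigned int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
    if (threadIdx.x < offset) {
      cache[threadIdx.x] = fmaxf(cache[threadIdx.x], cache[threadIdx.x + offset]);
    }
    __syncthreads();
  }

  // One atomic per block folds the partial result into the global maximum.
  if (threadIdx.x == 0) {
    atomicMaxFloat(out_max, cache[0]);
  }
}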
@@ -150,6 +160,8 @@ def test_paged_attention(
     num_query_heads, num_kv_heads = num_heads
     query = torch.empty(num_seqs, num_query_heads, head_size, dtype=dtype)
     query.uniform_(-scale, scale)
+    #query = torch.ones_like(query)
Suggested change (remove the commented-out line):

-    #query = torch.ones_like(query)
#print('>>> ref qkout shape',attn_weights.shape)
#print('>>> ref qkout',attn_weights)
#global REF_TENSOR
#REF_TENSOR = attn_weights
Suggested change (remove the commented-out debug lines):

- #print('>>> ref qkout shape',attn_weights.shape)
- #print('>>> ref qkout',attn_weights)
- #global REF_TENSOR
- #REF_TENSOR = attn_weights
SEEDS = [0]
CUDA_DEVICES = [
-     f"cuda:{i}" for i in range(1 if torch.cuda.device_count() == 1 else 2)
+     f"cuda:{i}" for i in range(1 if torch.cuda.device_count() == 1 else 1)
This seems to change the multi-GPU test into a single-GPU test. Are you sure you want to have this change committed?
- NUM_BLOCKS = 1024 * 1024
- PARTITION_SIZE = 512
+ NUM_BLOCKS = 256 * 1024
+ PARTITION_SIZE = 256
Can you explain the reason for changing the values of these two constants? And is this change ROCm-specific?
@@ -101,7 +101,7 @@ def run_cuda_benchmark(num_iters: int, profile: bool = False) -> float:
     start_time = time.perf_counter()

     # Using default kv_scale
-    k_scale = v_scale = 1.0
+    k_scale = v_scale = 0.1
Also, can you explain the change to the default kv-scale?
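For context on what the constant controls: with an fp8 kv-cache, k_scale and v_scale are the per-tensor factors applied when cached values are dequantized back to the compute dtype, roughly as in the sketch below. The helper name and the NVIDIA __nv_fp8_e4m3 type are illustrative assumptions; the actual kernel uses its own vectorized (and on ROCm hardware-specific) conversion path.

#include <cuda_fp8.h>

// Sketch: dequantizing a single fp8_e4m3 kv-cache element with a
// per-tensor scale. With kv_scale = 1.0 the stored codes are used as-is;
// with kv_scale = 0.1 every dequantized value is one tenth of its code.
__device__ __forceinline__ float dequant_kv_element(__nv_fp8_e4m3 cached,
                                                    float kv_scale) {
  return static_cast<float>(cached) * kv_scale;
}

Changing the benchmark default from 1.0 to 0.1 therefore changes the numerical range the kernel is exercised with, not just an arbitrary constant.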
Description
This is a PR to merge the https://github.com/ROCm/vllm/tree/shsanyal_devpa_308_opt optimized attention.cu kernel into the llama_fp8_12062024 branch.

CAVEAT
Currently the attention.cu kernel does not support a block size of 32 and a head size of 64. The vLLM model unit tests are failing because they use small models, e.g. Gemma and Llama, which have a head size of 64.

Performance
The following are benchmark_throughput results for Llama-3.1-70B with fp8 dynamic quantization and a kv-cache-dtype of fp8_e4m3, for a sequence input token length of 2048 and output token length of 2048: