[feature request] qwen model's query logn-scaling attn #836
Comments
Like this; I have finished it. Commit: link. C++ implementation: link.
@Tlntin Hi, thank you for the reply. I am using trtllm release v0.7.0 and Qwen 7B 1.0. I added the following:

inline __device__ float update_rotary_base_dynamic_ntk(
    const int kv_seq_len, const int max_positions, const int embed_dim, const float base, const float scale)
{
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * kv_seq_len / max_positions) + 1.f)) - 1.f;
    return base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
}

inline __device__ void update_rotary_base_n_scale(float& base, float& scale, RotaryScalingType const scale_type,
    const int rot_embed_dim, const int max_positions, const int seq_len)
{
    // only update the base and/or scale if needed based on scale_type
    if (scale_type == RotaryScalingType::kDYNAMIC)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kDYNAMIC_NTK_QWEN)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base_dynamic_ntk(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kLINEAR)
    {
        scale = 1.0f / scale;
    }
}

After seeing the code on the main branch, I am not sure whether my modification is correct; there are other call sites as well.

Nevertheless, what you posted seems to cover only the RoPE base update. Logn attention scaling is still missing. I have tested this and found that removing logn-scaling hurts the performance of the Qwen agent.

I found some commented-out code for logn scaling in your repository, but it does not look compatible with packed tensor mode. When paged attention is enabled, the qkv tensor's shape is [1, num_tokens, qkv_dim], is that right?
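For anyone who wants to cross-check the dynamic NTK base update from the snippet above without a GPU, here is a minimal host-side sketch in plain C++. The names qwen_ntk_alpha and qwen_ntk_base are hypothetical and not part of TensorRT-LLM. The clamp of alpha to 1 is an assumption taken from my reading of the Qwen PyTorch code; the device snippet gets the same effect by only updating the base when seq_len > max_positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Host-side reference for the Qwen-style dynamic NTK alpha (hypothetical helper).
// Assumption: same formula as the device snippet, plus a clamp to 1 so that
// sequences no longer than max_positions leave the base unchanged.
inline double qwen_ntk_alpha(int kv_seq_len, int max_positions)
{
    assert(kv_seq_len > 0 && max_positions > 0);
    const double context_value = std::log2(static_cast<double>(kv_seq_len) / max_positions) + 1.0;
    const double alpha = std::exp2(std::ceil(context_value)) - 1.0;
    return std::max(alpha, 1.0);
}

// Rotary base after NTK scaling: base * alpha^(dim / (dim - 2)).
inline double qwen_ntk_base(int kv_seq_len, int max_positions, int embed_dim, double base)
{
    assert(embed_dim > 2);
    const double alpha = qwen_ntk_alpha(kv_seq_len, max_positions);
    return base * std::pow(alpha, embed_dim / (embed_dim - 2.0));
}
```

With max_positions = 8192, alpha stays at 1 up to 8192 tokens, becomes 3 for 8193 to 16384 tokens, and 7 beyond that, so comparing these values against the kernel output for a few lengths should show quickly whether the new branch is being taken.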
It seems your C++ code may work better, since it is closer to the original PyTorch code! There are two functions involved. I think the logn scaling implementation may be difficult. My logn_scaling code can only be used without the GPT attention plugin, and it did not seem to work well, so I commented it out!
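To make the "scale q before attention" idea concrete for the packed layout discussed above, here is a rough host-side sketch. It assumes a row-major [num_tokens, qkv_dim] buffer in which the first q_dim entries of every row hold q; that layout is an assumption for illustration and is not confirmed anywhere in this thread. scale_q_in_packed_qkv is a hypothetical name, not a TensorRT-LLM function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply only the q slice of each token's fused qkv row by that token's factor.
// Assumed layout (not confirmed by the thread): row-major [num_tokens, qkv_dim],
// with q stored in the first q_dim entries of every row; k and v are untouched.
inline void scale_q_in_packed_qkv(std::vector<float>& qkv, std::size_t qkv_dim, std::size_t q_dim,
    const std::vector<float>& token_scale)
{
    assert(q_dim <= qkv_dim);
    assert(qkv.size() == token_scale.size() * qkv_dim);
    for (std::size_t t = 0; t < token_scale.size(); ++t)
    {
        float* row = qkv.data() + t * qkv_dim;
        for (std::size_t i = 0; i < q_dim; ++i)
        {
            row[i] *= token_scale[t];
        }
    }
}
```

The per-token factor is passed in rather than computed here, because obtaining each token's position is the hard part in packed mode; in a real kernel the same loop body would presumably sit next to the rotary embedding step.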
"The trtllm implementation of Qwen does not support logn-scaling right now, which results in different inference results." @handoku I have the same question. Do you have any update? Thanks.
Sorry, no progress yet. Maybe making the trtllm team more aware of this painful problem, so that they can help solve it, could save us.
@Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found that the output is empty for inputs as long as 6K (smaller than 8K, the training length).
I tested it, and it works well. You need to make the same changes as above.
Did you mean using the changes in your commit?
Yes.
It is supported in today's update.
Qwen uses qwen-style dynamic ntk and logn-scaling to generate better text for long-context input. The trtllm implementation of Qwen does not support logn-scaling right now, which results in low-quality outputs.

I would like to provide an implementation. However, it is a little difficult for me to understand gpt_attention. My naive idea is to multiply the q tensor by the logn tensor before calling gpt_attention. But the seq_len_idx value of every token in the q tensor is needed to calculate log_{seq_len_trained}(seq_len_idx). I don't know how to get the seq_len_idx value, especially in packed tensor mode.

Would you please give some help on this? Is there a convenient way to achieve this (even in a dirty, hard-coded way)?