-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CPU]PageAttn with 4bit-quantization #27992
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: [email protected] <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
a12c86f
to
e56639a
Compare
Signed-off-by: Zhang Yi3 <[email protected]>
Signed-off-by: Zhang Yi3 <[email protected]>
373d50d
to
fe6c311
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great job!
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
@dmitry-gorokhov Could you have a review ? |
wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision"); | ||
wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision"); | ||
wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size"); | ||
wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to align positioning regarding these options.
We already have high-level hint for KV-cache: ov::hint::kv_cache_precision. These new options are rather fine tuning options. So I would propose the following:
- New options shouln't be treated as hints: lets move from the namespace.
ov::hint::kv_cache_precision
should remain major (including positioning to the user) option for KV-Cache quantization control.ov::hint::kv_cache_precision
(like other hints) should impact values of lower level options:ov::hint::key_cache_precision
/ov::hint::value_cache_precision
/ov::hint::key_cache_group_size
/ov::hint::value_cache_group_size
. E.g.ov::hint::kv_cache_precision == u4
will result in(u8/u4/32/32)
config for lower options.- User will have an ability to rewrite the behavior of high-level hint by changing values for low-level properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good. Just to clarify:
- We could have
ov::hint::kv_cache_precision
for coarse control of KV-cache quantization parameters by default. I would deprecate it at some point (not sure what is the best time). ov::hint::key_cache_precision
,ov::hint::value_cache_precision
,ov::hint::key_cache_group_size
,ov::hint::value_cache_group_size
are for fine-grained control of KV-cache quantization and they have higher priority overov::hint::kv_cache_precision
if defined.ov::hint::key_cache_group_size
,ov::hint::value_cache_group_size
should have reasonable defaults, e.g. 32 or 64 what fits the best for runtime.- We should be able to define any of these options via the compilation config and
rt_info/runtime_options
subsection of the IR.
- @yury-gorbachev, shall we discuss and approve this item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmitry-gorokhov If not use hint namespace, do we have a better namespace for this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangYiIntel just ov::key_cache_precision
.
You can use ov::num_streams
as an example - this is low level property which is affected by high-level hints like ov::hint::performance_mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexKoff88 @dmitry-gorokhov Regarding to the default group_size
, since the hidden_state
must be divided by group_size
, if the we set it to 32/64, then what should we do if hidden_state
is not divisible by 32/64, should we fallback group_size
to hidden_state
or just throw a exception ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is not the hint it should throw an exception, in case user sets invalid value.
If no user input is provided for these properties, then default value should be properly adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another leftover here is that if we set default group_size
to 32/64 other than hidden_state
, then OpenVINO.GenAI has to update accordingly, otherwise the U8 KV cache quantization is broken.
CC: @ilya-lavrenov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I bet GenAI doesn't set any specific value for group_size
, which means no user input for these properties. So as I mentioned group_size
default value should be properly asjusted on CPU plugin side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For CB implementation we need to duplicate all this logic related to KV cache as it's maintained outside of plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I bet GenAI doesn't set any specific value for
group_size
, which means no user input for these properties. So as I mentionedgroup_size
default value should be properly asjusted on CPU plugin side.
The problem here with GenAI is that the ContinousBatchingPipeline
allocates the memory for `PageAttention``s key/value cache, it must know the group_size in advacnde to allocate correct memory size for both cache + scale/zp at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/cache_manager.hpp#L57
ov::Tensor key_cache = remote_context.create_tensor(device_config.get_key_cache_precision(),
device_config.get_key_cache_shape());
ov::Tensor value_cache = remote_context.create_tensor(device_config.get_value_cache_precision(),
device_config.get_value_cache_shape());
The cache shape is defined at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/device_config.hpp#L120
m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
ov::Dimension(m_num_kv_heads[layer_id]),
ov::Dimension(m_block_size),
ov::Dimension(m_head_size)});
m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
ov::Dimension(m_num_kv_heads[layer_id]),
ov::Dimension(m_block_size),
ov::Dimension(m_head_size)});
The m_head_size
defined is defined as following, which only considers 1 group per hidden_states
if (m_kv_cache_type == ov::element::u8)
m_head_size += 8;
Therefore ContinousBatchingPipeline
is broken with group_num greater than 1
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/tests/functional/custom/behavior/ov_executable_network/properties.cpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Zhang Yi <[email protected]>
244f7cc
to
c362399
Compare
src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp
Outdated
Show resolved
Hide resolved
Signed-off-by: Zhang Yi <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
7e6ffa2
to
56245d0
Compare
Signed-off-by: Zhang Yi <[email protected]>
fd5df9f
to
84f03a3
Compare
5f81396
to
1549936
Compare
Signed-off-by: Zhang Yi <[email protected]>
1549936
to
7a412f7
Compare
Details:
Tickets: