[CPU]PageAttn with 4bit-quantization #27992

Open
wants to merge 36 commits into
base: master
from

Conversation

zhangYiIntel
Contributor

Details:

  • Add new hint to set group_size for key/value cache
  • Add grouped 4bit sym/asym quantization support for PageAttentionNode
  • Add grouped quantization for U8 quantization for PageAttentionNode
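For readers unfamiliar with grouped quantization, the idea behind the second bullet can be sketched as follows. This is a minimal, generic illustration of asymmetric per-group 4-bit quantization, not the kernel added by this PR: `QuantizedRowU4`, `quantize_u4_grouped`, and `dequantize_u4` are hypothetical names, and the nibble packing order and float scale/zero-point storage are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container for one quantized row; not an OpenVINO type.
struct QuantizedRowU4 {
    std::vector<uint8_t> packed;    // two 4-bit codes per byte, low nibble first (assumed layout)
    std::vector<float> scale;       // one scale per group
    std::vector<float> zero_point;  // one zero point per group
};

// Asymmetric quantization to codes in [0, 15], with one (scale, zero point) pair per group.
QuantizedRowU4 quantize_u4_grouped(const std::vector<float>& row, std::size_t group_size) {
    if (group_size == 0 || group_size % 2 != 0 || row.size() % group_size != 0) {
        throw std::invalid_argument("group_size must be even and divide the row length");
    }
    QuantizedRowU4 out;
    out.packed.assign(row.size() / 2, 0);
    for (std::size_t begin = 0; begin < row.size(); begin += group_size) {
        // Find the value range of this group.
        float lo = row[begin];
        float hi = row[begin];
        for (std::size_t i = begin; i < begin + group_size; ++i) {
            lo = std::min(lo, row[i]);
            hi = std::max(hi, row[i]);
        }
        // Map [lo, hi] onto the 16 available codes; a constant group uses scale 1.
        const float scale = (hi > lo) ? (hi - lo) / 15.0f : 1.0f;
        const float zp = -lo / scale;
        out.scale.push_back(scale);
        out.zero_point.push_back(zp);
        for (std::size_t i = begin; i < begin + group_size; ++i) {
            float q = std::round(row[i] / scale + zp);
            q = std::min(15.0f, std::max(0.0f, q));
            const uint8_t code = static_cast<uint8_t>(q);
            out.packed[i / 2] |= (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
        }
    }
    return out;
}

// Recover an approximate float value for element i of the row.
float dequantize_u4(const QuantizedRowU4& q, std::size_t i, std::size_t group_size) {
    const uint8_t byte = q.packed[i / 2];
    const uint8_t code = (i % 2 == 0) ? static_cast<uint8_t>(byte & 0x0F)
                                      : static_cast<uint8_t>(byte >> 4);
    const std::size_t g = i / group_size;
    return (static_cast<float>(code) - q.zero_point[g]) * q.scale[g];
}
```

A smaller group size gives each group a tighter value range and therefore a smaller quantization error, at the cost of storing more scale/zero-point pairs per row.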

Tickets:

@github-actions github-actions bot added category: inference OpenVINO Runtime library - Inference category: CPU OpenVINO CPU plugin category: Python API OpenVINO Python bindings category: CPP API OpenVINO CPP API bindings labels Dec 10, 2024
@zhangYiIntel zhangYiIntel changed the title Yi3/4bit cache [CPU]PageAttn with 4bit-quantization Dec 10, 2024
Signed-off-by: Zhang Yi3 <[email protected]>
@zhangYiIntel zhangYiIntel marked this pull request as ready for review December 12, 2024 01:57
@zhangYiIntel zhangYiIntel requested review from a team as code owners December 12, 2024 01:57
Contributor

@luo-cheng2021 luo-cheng2021 left a comment

LGTM, great job!

@zhangYiIntel
Contributor Author

@dmitry-gorokhov Could you take a look and review?

src/plugins/intel_cpu/src/config.h
wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision");
wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision");
wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size");
wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size");
Contributor

@dmitry-gorokhov dmitry-gorokhov Jan 2, 2025

We need to align on the positioning of these options.
We already have a high-level hint for the KV cache: ov::hint::kv_cache_precision. These new options are rather fine-tuning options, so I would propose the following:

  1. The new options shouldn't be treated as hints: let's move them out of the hint namespace.
  2. ov::hint::kv_cache_precision should remain the major option (including how it is positioned to the user) for KV-cache quantization control.
  3. ov::hint::kv_cache_precision (like other hints) should impact the values of the lower-level options: ov::hint::key_cache_precision/ov::hint::value_cache_precision/ov::hint::key_cache_group_size/ov::hint::value_cache_group_size. E.g. ov::hint::kv_cache_precision == u4 will result in a (u8/u4/32/32) config for the lower-level options.
  4. The user will be able to override the behavior of the high-level hint by changing the values of the low-level properties.

cc'ed @AlexKoff88 @vladimir-paramuzov @sshlyapn @p-durandin

Contributor

I think it looks good. Just to clarify:

  • We could have ov::hint::kv_cache_precision for coarse control of KV-cache quantization parameters by default. I would deprecate it at some point (not sure when the best time is).
  • ov::hint::key_cache_precision, ov::hint::value_cache_precision, ov::hint::key_cache_group_size, ov::hint::value_cache_group_size are for fine-grained control of KV-cache quantization, and they take priority over ov::hint::kv_cache_precision if defined. ov::hint::key_cache_group_size and ov::hint::value_cache_group_size should have reasonable defaults, e.g. 32 or 64, whichever fits the runtime best.
  • We should be able to define any of these options via the compilation config and rt_info/runtime_options subsection of the IR.
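The priority rules discussed so far could be sketched roughly like this. It is only an illustration of the proposed semantics, not the plugin's implementation: `KvCacheUserConfig`, `KvCacheResolved`, and `resolve_kv_cache` are hypothetical names, precisions are modelled as plain strings, and the baseline used when nothing is set is an assumption. Only the u4 -> (u8/u4/32/32) mapping comes from the review comment.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-ins for the properties under discussion; not OpenVINO API types.
struct KvCacheUserConfig {
    std::optional<std::string> kv_cache_precision;     // high-level hint
    std::optional<std::string> key_cache_precision;    // low-level overrides
    std::optional<std::string> value_cache_precision;
    std::optional<std::size_t> key_cache_group_size;
    std::optional<std::size_t> value_cache_group_size;
};

struct KvCacheResolved {
    std::string key_precision;
    std::string value_precision;
    std::size_t key_group_size;
    std::size_t value_group_size;
};

KvCacheResolved resolve_kv_cache(const KvCacheUserConfig& user) {
    // Assumed baseline when the user sets nothing at all.
    KvCacheResolved r{"u8", "u8", 32, 32};

    // Step 1: the high-level hint picks values for the low-level options.
    if (user.kv_cache_precision) {
        if (*user.kv_cache_precision == "u4") {
            r = {"u8", "u4", 32, 32};  // example mapping from the review comment
        } else {
            // Assumption: any other hint value applies to both key and value caches.
            r.key_precision = *user.kv_cache_precision;
            r.value_precision = *user.kv_cache_precision;
        }
    }

    // Step 2: explicitly set low-level properties take priority over the hint.
    if (user.key_cache_precision) r.key_precision = *user.key_cache_precision;
    if (user.value_cache_precision) r.value_precision = *user.value_cache_precision;
    if (user.key_cache_group_size) r.key_group_size = *user.key_cache_group_size;
    if (user.value_cache_group_size) r.value_group_size = *user.value_cache_group_size;
    return r;
}
```

With this ordering, setting only the hint to u4 yields (u8/u4/32/32), while additionally setting a value-cache group size of 64 changes just that one field.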

Contributor Author

@dmitry-gorokhov If we don't use the hint namespace, is there a better namespace for these options?

Contributor

@zhangYiIntel just ov::key_cache_precision.
You can use ov::num_streams as an example: it is a low-level property that is affected by high-level hints like ov::hint::performance_mode.

Contributor Author

@zhangYiIntel zhangYiIntel Jan 7, 2025

@AlexKoff88 @dmitry-gorokhov Regarding the default group_size: since hidden_state must be divisible by group_size, what should we do if we set the default to 32/64 and hidden_state is not divisible by 32/64? Should we fall back to group_size = hidden_state, or just throw an exception?

Contributor

Given that this is not a hint, it should throw an exception if the user sets an invalid value.
If no user input is provided for these properties, then the default value should be adjusted properly.
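A minimal sketch of that policy, under stated assumptions: `resolve_group_size` is a hypothetical helper, `head_size` stands for the dimension the thread calls hidden_state, and falling back to a single group when the preferred default does not divide it is only one possible reading of "adjusted properly".

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical helper: choose the group size for one KV-cache tensor.
std::size_t resolve_group_size(std::size_t head_size,
                               std::optional<std::size_t> user_group_size,
                               std::size_t preferred_default = 32) {
    if (head_size == 0) {
        throw std::invalid_argument("head_size must be non-zero");
    }
    if (user_group_size) {
        // An explicit user value is not a hint: reject it if it cannot be honoured.
        if (*user_group_size == 0 || head_size % *user_group_size != 0) {
            throw std::invalid_argument("group_size must be a non-zero divisor of head_size");
        }
        return *user_group_size;
    }
    // No user input: use the preferred default when it divides head_size evenly.
    if (preferred_default != 0 && head_size % preferred_default == 0) {
        return preferred_default;
    }
    // Assumed fallback: one group spanning the whole head.
    return head_size;
}
```

For example, a head size of 128 with no user input would resolve to 32, a head size of 80 would fall back to 80, and an explicit value of 32 for a head size of 80 would be rejected.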

Contributor Author

Another leftover here is that if we set the default group_size to 32/64 rather than hidden_state, then OpenVINO.GenAI has to be updated accordingly; otherwise the U8 KV cache quantization is broken.
CC: @ilya-lavrenov

Contributor

Why? I bet GenAI doesn't set any specific value for group_size, which means no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

Contributor

For the CB implementation we need to duplicate all of this KV-cache logic, as the cache is maintained outside of the plugin.

Contributor Author

> Why? I bet GenAI doesn't set any specific value for group_size, which means no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

The problem with GenAI here is that the ContinuousBatchingPipeline allocates the memory for the `PageAttention` key/value cache, so it must know the group_size in advance to allocate the correct memory size for both the cache and the scale/zp at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/cache_manager.hpp#L57

ov::Tensor key_cache = remote_context.create_tensor(device_config.get_key_cache_precision(),
    device_config.get_key_cache_shape());
ov::Tensor value_cache = remote_context.create_tensor(device_config.get_value_cache_precision(),
    device_config.get_value_cache_shape());

The cache shape is defined at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/device_config.hpp#L120

m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
    ov::Dimension(m_num_kv_heads[layer_id]),
    ov::Dimension(m_block_size),
    ov::Dimension(m_head_size)});

m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
    ov::Dimension(m_num_kv_heads[layer_id]),
    ov::Dimension(m_block_size),
    ov::Dimension(m_head_size)});

m_head_size is defined as follows, which only accounts for one group per hidden_states:

if (m_kv_cache_type == ov::element::u8)
    m_head_size += 8;

Therefore ContinuousBatchingPipeline is broken when group_num is greater than 1.
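To make the sizing problem concrete, here is a rough sketch of how the padded head size might be computed once several groups are allowed. It assumes the existing `+= 8` corresponds to one f32 scale plus one f32 zero point stored next to the u8 data, and that every group needs its own pair; the actual layout used by the plugin may differ, and `u8_cache_row_bytes` is a hypothetical helper, not GenAI code.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: bytes needed for one u8-quantized row of head_size
// elements, including per-group quantization parameters.
std::size_t u8_cache_row_bytes(std::size_t head_size, std::size_t group_size) {
    if (group_size == 0 || head_size % group_size != 0) {
        throw std::invalid_argument("group_size must be a non-zero divisor of head_size");
    }
    const std::size_t num_groups = head_size / group_size;
    // Assumption: each group stores one f32 scale and one f32 zero point (8 bytes).
    const std::size_t params_bytes_per_group = 2 * sizeof(float);
    return head_size + num_groups * params_bytes_per_group;
}
```

Under these assumptions, a head size of 128 with a single group gives 136 bytes, which matches the current `m_head_size += 8`, while a group size of 32 would need 160 bytes, so a cache allocated with the old formula would be too small.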
