[CPU]PageAttn with 4bit-quantization #27992

zhangYiIntel · 2024-12-10T08:23:41Z

Details:

- Add a new hint to set group_size for the key/value cache
- Add grouped 4-bit symmetric/asymmetric quantization support for PageAttentionNode
- Add grouped quantization support for U8 in PageAttentionNode
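
To illustrate what "grouped 4-bit asymmetric quantization" means, here is a minimal scalar sketch. It is not the plugin's kernel (the commits in this PR use AVX2/AVX512 paths), and the struct name, function names, and nibble order are hypothetical choices made for the example. Each group of `group_size` values gets its own scale and zero point, and two 4-bit codes are packed per byte.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical container for one quantized row; not the plugin's actual memory layout.
struct QuantizedRowU4 {
    std::vector<uint8_t> packed;  // two 4-bit codes per byte, low nibble first (assumed order)
    std::vector<float> scale;     // one scale per group
    std::vector<float> zp;        // one zero point per group
};

// Asymmetric (scale + zero point) 4-bit quantization, computed independently per group.
QuantizedRowU4 quantize_u4_grouped(const std::vector<float>& src, size_t group_size) {
    if (group_size == 0 || group_size % 2 != 0 || src.size() % group_size != 0)
        throw std::invalid_argument("row size must be divisible by an even group_size");
    QuantizedRowU4 out;
    out.packed.assign(src.size() / 2, 0);
    for (size_t g = 0; g < src.size(); g += group_size) {
        auto mm = std::minmax_element(src.begin() + g, src.begin() + g + group_size);
        const float mn = *mm.first;
        const float mx = *mm.second;
        float scale = (mx - mn) / 15.0f;  // 4 bits give 16 levels: codes 0..15
        if (scale == 0.0f)
            scale = 1.0f;  // constant group: any non-zero scale works
        const float zp = -mn / scale;
        for (size_t i = g; i < g + group_size; ++i) {
            float q = std::round(src[i] / scale + zp);
            q = std::min(15.0f, std::max(0.0f, q));
            const uint8_t code = static_cast<uint8_t>(q);
            const uint8_t shifted = (i % 2 == 0) ? code : static_cast<uint8_t>(code << 4);
            out.packed[i / 2] = static_cast<uint8_t>(out.packed[i / 2] | shifted);
        }
        out.scale.push_back(scale);
        out.zp.push_back(zp);
    }
    return out;
}

// Inverse mapping: unpack each nibble and apply the parameters of its group.
std::vector<float> dequantize_u4_grouped(const QuantizedRowU4& q, size_t group_size) {
    std::vector<float> dst(q.packed.size() * 2);
    for (size_t i = 0; i < dst.size(); ++i) {
        const uint8_t byte = q.packed[i / 2];
        const uint8_t code = (i % 2 == 0) ? static_cast<uint8_t>(byte & 0x0F)
                                          : static_cast<uint8_t>(byte >> 4);
        const size_t g = i / group_size;
        dst[i] = (static_cast<float>(code) - q.zp[g]) * q.scale[g];
    }
    return dst;
}
```

Smaller groups track the local value range more closely, at the cost of storing more scale/zero-point pairs per row. A symmetric variant would drop the zero point and derive the scale from the largest absolute value in the group.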

Tickets:

CVS-151586

Signed-off-by: [email protected] <[email protected]>

Signed-off-by: Zhang Yi3 <[email protected]>

luo-cheng2021

LGTM, great job!

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

zhangYiIntel · 2024-12-20T03:20:17Z

@dmitry-gorokhov Could you please review this?

src/plugins/intel_cpu/src/config.h

dmitry-gorokhov · 2025-01-02T08:09:35Z

src/bindings/python/src/pyopenvino/core/properties/properties.cpp

+    wrap_property_RW(m_hint, ov::hint::key_cache_precision, "key_cache_precision");
+    wrap_property_RW(m_hint, ov::hint::value_cache_precision, "value_cache_precision");
+    wrap_property_RW(m_hint, ov::hint::key_cache_group_size, "key_cache_group_size");
+    wrap_property_RW(m_hint, ov::hint::value_cache_group_size, "value_cache_group_size");


We need to align on the positioning of these options.
We already have a high-level hint for the KV cache: ov::hint::kv_cache_precision. These new options are rather fine-tuning options, so I would propose the following:

1. The new options shouldn't be treated as hints: let's move them out of the ov::hint namespace.

2. ov::hint::kv_cache_precision should remain the major option (including positioning to the user) for KV-cache quantization control.

3. ov::hint::kv_cache_precision (like other hints) should impact the values of the lower-level options: ov::hint::key_cache_precision/ov::hint::value_cache_precision/ov::hint::key_cache_group_size/ov::hint::value_cache_group_size. E.g. ov::hint::kv_cache_precision == u4 will result in a (u8/u4/32/32) config for the lower-level options.

4. Users will be able to override the behavior of the high-level hint by changing the values of the low-level properties.

cc'ed @AlexKoff88 @vladimir-paramuzov @sshlyapn @p-durandin
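
The proposed precedence (the high-level hint sets the low-level values, and explicit low-level values override it) could be sketched as follows. This is only an illustration of the rule described above, not the plugin's config code: the `UserConfig` and `ResolvedConfig` types, the `unset`/`0` "not provided" convention, and the behavior for hints other than u4 are assumptions. Only the u4 mapping to (u8/u4/32/32) comes from the comment.

```cpp
#include <cstddef>

enum class Precision { unset, f32, f16, u8, u4 };

// Hypothetical user-facing settings; `unset` and 0 mean "the user did not provide a value".
struct UserConfig {
    Precision kv_cache_precision = Precision::unset;     // high-level hint
    Precision key_cache_precision = Precision::unset;    // low-level properties
    Precision value_cache_precision = Precision::unset;
    size_t key_cache_group_size = 0;
    size_t value_cache_group_size = 0;
};

// Hypothetical result; `unset` / 0 here mean "left for the plugin to choose".
struct ResolvedConfig {
    Precision key_precision;
    Precision value_precision;
    size_t key_group_size;
    size_t value_group_size;
};

ResolvedConfig resolve_kv_cache_config(const UserConfig& user) {
    // Step 1: the high-level hint provides the starting values.
    // Assumption: hints other than u4 apply the same precision to key and value.
    ResolvedConfig cfg{user.kv_cache_precision, user.kv_cache_precision, 0, 0};
    if (user.kv_cache_precision == Precision::u4) {
        // Mapping quoted in the review comment: u4 -> (u8/u4/32/32).
        cfg = ResolvedConfig{Precision::u8, Precision::u4, 32, 32};
    }
    // Step 2: explicitly set low-level properties take priority over the hint.
    if (user.key_cache_precision != Precision::unset)
        cfg.key_precision = user.key_cache_precision;
    if (user.value_cache_precision != Precision::unset)
        cfg.value_precision = user.value_cache_precision;
    if (user.key_cache_group_size != 0)
        cfg.key_group_size = user.key_cache_group_size;
    if (user.value_cache_group_size != 0)
        cfg.value_group_size = user.value_cache_group_size;
    return cfg;
}
```

With this shape, a user who only sets the hint gets the coarse behavior, while a user who sets, say, only the value-cache group size changes that one field and keeps everything else the hint selected.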

AlexKoff88 · 2025-01-02

I think it looks good. Just to clarify:

1. We could have ov::hint::kv_cache_precision for coarse control of the KV-cache quantization parameters by default. I would deprecate it at some point (not sure when the best time is).

2. ov::hint::key_cache_precision, ov::hint::value_cache_precision, ov::hint::key_cache_group_size, and ov::hint::value_cache_group_size are for fine-grained control of KV-cache quantization, and they take priority over ov::hint::kv_cache_precision if defined. ov::hint::key_cache_group_size and ov::hint::value_cache_group_size should have reasonable defaults, e.g. 32 or 64, whichever fits the runtime best.

3. We should be able to define any of these options via the compilation config and the rt_info/runtime_options subsection of the IR.

@yury-gorbachev, shall we discuss and approve this item?

zhangYiIntel · 2025-01-03

@dmitry-gorokhov If we don't use the hint namespace, is there a better namespace for these options?

dmitry-gorokhov · 2025-01-03

@zhangYiIntel Just ov::key_cache_precision.
You can use ov::num_streams as an example: it is a low-level property that is affected by high-level hints like ov::hint::performance_mode.

zhangYiIntel · 2025-01-07

@AlexKoff88 @dmitry-gorokhov Regarding the default group_size: since hidden_state must be divisible by group_size, what should we do if we set the default to 32/64 and hidden_state is not divisible by 32/64? Should we fall back to group_size = hidden_state, or just throw an exception?

dmitry-gorokhov · 2025-01-07

Given that this is not a hint, it should throw an exception if the user sets an invalid value.
If no user input is provided for these properties, then the default value should be adjusted appropriately.
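
The rule stated here (reject an invalid user value, adapt the default silently) might look like the sketch below. The function name, the `0` = "not provided" convention, and the preferred default of 32 are assumptions for illustration. The comment does not say how the default should be adjusted, so falling back to a single group covering the whole dimension is just one option raised in the question above.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper. `user_group_size == 0` means the user did not set the property.
size_t resolve_group_size(size_t user_group_size,
                          size_t hidden_state_size,
                          size_t preferred_default = 32) {
    if (hidden_state_size == 0)
        throw std::invalid_argument("hidden_state_size must be positive");
    if (user_group_size != 0) {
        // Explicit user input is not a hint: an invalid value is an error.
        if (hidden_state_size % user_group_size != 0)
            throw std::invalid_argument("hidden_state_size is not divisible by group_size");
        return user_group_size;
    }
    // No user input: use the preferred default if it fits.
    if (preferred_default != 0 && hidden_state_size % preferred_default == 0)
        return preferred_default;
    // Assumed adjustment: one group spanning the whole dimension.
    return hidden_state_size;
}
```

For example, with a dimension of 80 and no user value, the sketch returns 80 instead of failing, while an explicit `group_size` of 32 for the same dimension throws.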

zhangYiIntel · 2025-01-07

Another leftover here: if we set the default group_size to 32/64 rather than hidden_state, then OpenVINO.GenAI has to be updated accordingly; otherwise U8 KV-cache quantization is broken.
CC: @ilya-lavrenov

dmitry-gorokhov · 2025-01-07

Why? I bet GenAI doesn't set any specific value for group_size, which means there is no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

ilya-lavrenov · 2025-01-07

For the CB implementation, we need to duplicate all this KV-cache logic, as the cache is maintained outside of the plugin.

zhangYiIntel · 2025-01-07

> Why? I bet GenAI doesn't set any specific value for group_size, which means there is no user input for these properties. So, as I mentioned, the group_size default value should be properly adjusted on the CPU plugin side.

The problem with GenAI is that ContinuousBatchingPipeline allocates the memory for `PageAttention`'s key/value cache, so it must know the group_size in advance to allocate the correct memory size for the cache plus scale/zp, at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/cache_manager.hpp#L57

    ov::Tensor key_cache = remote_context.create_tensor(device_config.get_key_cache_precision(),
                                                        device_config.get_key_cache_shape());
    ov::Tensor value_cache = remote_context.create_tensor(device_config.get_value_cache_precision(),
                                                          device_config.get_value_cache_shape());

The cache shape is defined at https://github.com/openvinotoolkit/openvino.genai/blob/09a542608b560959edb96e628915a1d6bd780c26/src/cpp/src/device_config.hpp#L120

    m_key_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
                                                 ov::Dimension(m_num_kv_heads[layer_id]),
                                                 ov::Dimension(m_block_size),
                                                 ov::Dimension(m_head_size)});
    m_value_cache_shape.push_back(ov::PartialShape{ov::Dimension::dynamic(),
                                                   ov::Dimension(m_num_kv_heads[layer_id]),
                                                   ov::Dimension(m_block_size),
                                                   ov::Dimension(m_head_size)});

m_head_size is defined as follows, which accounts for only one group per hidden_state:

    if (m_kv_cache_type == ov::element::u8)
        m_head_size += 8;

Therefore, ContinuousBatchingPipeline is broken when the number of groups is greater than 1.
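
A small sketch makes the size mismatch concrete. It assumes that the 8 extra bytes in `m_head_size += 8` are the quantization parameters (scale/zp) of one group, and that every additional group needs another 8 bytes. The function name and this generalization are illustrative and not taken from GenAI or the plugin.

```cpp
#include <cstddef>
#include <stdexcept>

// Taken from `m_head_size += 8` above; assumed to be the per-group scale/zp storage.
constexpr size_t kParamBytesPerGroup = 8;

// Bytes needed for one u8-quantized head vector: 1 byte per element plus per-group parameters.
size_t u8_cache_row_bytes(size_t head_size, size_t group_size) {
    if (group_size == 0 || head_size % group_size != 0)
        throw std::invalid_argument("head_size must be divisible by group_size");
    const size_t num_groups = head_size / group_size;
    return head_size + num_groups * kParamBytesPerGroup;
}
```

Under these assumptions, a head size of 128 with a single group needs 136 bytes, which matches the existing `+= 8`. With a group size of 32 the same row needs 160 bytes, so an allocation that only adds 8 would be 24 bytes too small per row.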

src/plugins/intel_cpu/src/config.cpp

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

src/plugins/intel_cpu/src/nodes/paged_attn.cpp

src/plugins/intel_cpu/src/config.cpp

src/plugins/intel_cpu/tests/functional/custom/behavior/ov_executable_network/properties.cpp

Signed-off-by: Zhang Yi <[email protected]>

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp

src/plugins/intel_cpu/src/config.cpp

Signed-off-by: Zhang Yi <[email protected]>

src/plugins/intel_cpu/src/config.cpp

src/plugins/intel_cpu/src/nodes/scaled_attn.cpp

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel added 13 commits December 9, 2024 16:03

[CPU]separate precisions of kv cache

15fcdb8

Signed-off-by: [email protected] <[email protected]>

[CPU]use element as template args

82f843a

[CPU]make quantize grouped

a754404

[CPU]make u8 kernel grouped

2aba224

[CPU]U4 Group size support with reference

fc435f6

Signed-off-by: [email protected] <[email protected]>

[CPU]AVX512 support for u4 kernel

d080e2a

[CPU]Support S4 quantization

78ef4dd

Signed-off-by: [email protected] <[email protected]>

[CPU]use AVX512 to quant s4

3e821ea

[CPU]4-bit quantization with avx2

80b093f

Signed-off-by: [email protected] <[email protected]>

fix build on elder compiler

13a496e

[CPU]fix fp32 inference

92e6cb3

[CPU]set group size via hint

91ebc09

Signed-off-by: Zhang Yi3 <[email protected]>

[CPU]fix code style

685f263

Signed-off-by: Zhang Yi3 <[email protected]>

github-actions bot added the category: inference, category: CPU, category: Python API, and category: CPP API labels Dec 10, 2024

zhangYiIntel changed the title ~~Yi3/4bit cache~~ [CPU]PageAttn with 4bit-quantization Dec 10, 2024

[CPU]fix property test

e56639a

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from a12c86f to e56639a Compare December 11, 2024 02:45

[CPU]add cache precision check

a34ce8b

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel mentioned this pull request Dec 12, 2024

[CB]Support 4-bit cache openvinotoolkit/openvino.genai#1366

Draft

zhangYiIntel marked this pull request as ready for review December 12, 2024 01:57

zhangYiIntel requested review from a team as code owners December 12, 2024 01:57

zhangYiIntel added 2 commits December 12, 2024 09:57

Merge branch 'master' into yi3/4bit-cache

8548773

[CPU]fix code style of config.cpp

fe6c311

Signed-off-by: Zhang Yi3 <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 373d50d to fe6c311 Compare December 12, 2024 03:17

luo-cheng2021 approved these changes Dec 19, 2024

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp (outdated, resolved)

zhangYiIntel added 3 commits December 19, 2024 09:31

apply review comments

f03e23c

Merge branch 'master' into yi3/4bit-cache

99d5c4d

Merge branch 'master' into yi3/4bit-cache

dddb4d9

dmitry-gorokhov reviewed Jan 2, 2025

[CPU]apply review comments

c362399

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 244f7cc to c362399 Compare January 3, 2025 06:19

dmitry-gorokhov reviewed Jan 3, 2025

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/attn_quant.cpp (outdated, resolved)

dmitry-gorokhov reviewed Jan 3, 2025

src/plugins/intel_cpu/src/config.cpp (outdated, resolved)

zhangYiIntel added 3 commits January 3, 2025 16:09

[CPU]remove useless code of s4

28bcf7b

Signed-off-by: Zhang Yi <[email protected]>

Merge branch 'master' into yi3/4bit-cache

94522a2

[CPU]Unify u8/u4 dequant kernel with template arg

56245d0

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 7e6ffa2 to 56245d0 Compare January 5, 2025 05:45

[CPU]Define key/value cache prec/group_size priority

84f03a3

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from fd5df9f to 84f03a3 Compare January 6, 2025 03:28

zhangYiIntel requested a review from dmitry-gorokhov January 6, 2025 03:56

dmitry-gorokhov reviewed Jan 6, 2025

src/plugins/intel_cpu/src/config.cpp (resolved)

dmitry-gorokhov reviewed Jan 6, 2025

src/plugins/intel_cpu/src/nodes/scaled_attn.cpp (resolved)

zhangYiIntel added 2 commits January 6, 2025 15:24

[CPU]fix prec order & check group_size

e0b437e

Merge branch 'master' into yi3/4bit-cache

79df402

dmitry-gorokhov mentioned this pull request Jan 6, 2025

Aarch64 paged attention enablement #27841

Open

zhangYiIntel added 2 commits January 6, 2025 19:46

Merge branch 'master' into yi3/4bit-cache

f196535

[CPU]fix sdpa test

0515410

zhangYiIntel force-pushed the yi3/4bit-cache branch from 5f81396 to 1549936 Compare January 7, 2025 05:53

[CPU]fix group_size in sdpa

7a412f7

Signed-off-by: Zhang Yi <[email protected]>

zhangYiIntel force-pushed the yi3/4bit-cache branch from 1549936 to 7a412f7 Compare January 7, 2025 06:21

luo-cheng2021 mentioned this pull request Jan 7, 2025

[CPU] Change kvcache default type of PagedAttention to u8 for CPU plugin openvinotoolkit/openvino.genai#1206

Closed

[CPU]Change default group_size

594b392
