[CPU] Change kvcache default type of PagedAttention to u8 for CPU plugin #1206

luo-cheng2021 · 2024-11-13T12:03:37Z

Change the kvcache default type of PagedAttention to u8 for the CPU plugin to align with SDPA behaviour.

src/cpp/src/device_config.hpp

ilya-lavrenov · 2024-11-14T08:05:26Z

@luo-cheng2021 could you please also revert #1212 as part of this PR?
In #1212 I hardcoded an OpenVINO commit from before the u8 KV cache migration on CPU to unblock GenAI development.

ilya-lavrenov · 2024-12-30T11:42:41Z

@luo-cheng2021
Could you please rebase this PR on top of current master?
BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

luo-cheng2021 · 2024-12-31T06:56:55Z

> @luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?

The PR openvinotoolkit/openvino#27847 may use a different splitting strategy, which may slightly affect the floating-point error. I think it just happened to delay the occurrence of the error.

ilya-lavrenov · 2024-12-31T07:09:58Z

Please, fix such places in tests as well

openvino.genai/tests/cpp/cache_manager.cpp, lines 18 to 19 in 653b2ae:

    ov::element::Type inference_precision = core.get_property("CPU", ov::hint::inference_precision);
    ov::element::Type kv_cache_type = inference_precision == ov::element::bf16 ? ov::element::bf16 : ov::element::f16;

and
https://github.com/openvinotoolkit/openvino.genai/blob/master/tests/cpp/scheduler.cpp#L21-L43

I think we can introduce a single helper function with a dummy model, so that this logic lives in one place.
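A rough sketch of what such a shared helper could look like is below. It deliberately avoids the OpenVINO headers so it stays self-contained: `Precision`, `ExecutionMode`, `legacy_kv_cache_type` and `default_kv_cache_type` are hypothetical stand-ins, not OpenVINO or GenAI API. The first function mirrors the rule quoted from the tests; the second is only an assumption read off this PR's title and the "use f32 for hint: EXECUTION_MODE_HINT:ACCURACY" commit message.

```cpp
// Hypothetical stand-in for ov::element::Type values.
enum class Precision { f32, f16, bf16, u8 };

// Hypothetical stand-in for the execution mode hint.
enum class ExecutionMode { Performance, Accuracy };

// Rule currently duplicated in the tests: bf16 inference keeps a bf16 cache,
// everything else falls back to f16.
Precision legacy_kv_cache_type(Precision inference_precision) {
    return inference_precision == Precision::bf16 ? Precision::bf16
                                                  : Precision::f16;
}

// Assumed rule after this PR: u8 by default, f32 when accuracy mode is requested.
Precision default_kv_cache_type(ExecutionMode mode) {
    return mode == ExecutionMode::Accuracy ? Precision::f32 : Precision::u8;
}
```

With one function like this, cache_manager.cpp and scheduler.cpp would call the same code, and a change of the default would touch a single place.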

ilya-lavrenov · 2024-12-31T07:50:35Z

> > @luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?
>
> The PR openvinotoolkit/openvino#27847 may use a different splitting strategy, which may slightly affect the floating-point error. I think it just happened to delay the occurrence of the error.

It's still strange that:

PA with fp16 kv_cache compares well with SDPA with int8 kv_cache - current master
both PA and SDPA with int8 kv_cache show differences on multiple models - current PR.

Don't we have some bugs?

luo-cheng2021 · 2024-12-31T08:27:57Z

> > > @luo-cheng2021 Could you please rebase this PR on top of current master? BTW, comparison of CB and Stateful is now passing on GenAI master. Has something changed on CPU / stateful / PA side that improves accuracy?
> >
> > The PR openvinotoolkit/openvino#27847 may use a different splitting strategy, which may slightly affect the floating-point error. I think it just happened to delay the occurrence of the error.
>
> It's still strange that:
>
> PA with fp16 kv_cache compares well with SDPA with int8 kv_cache - current master
>
> both PA and SDPA with int8 kv_cache show differences on multiple models - current PR.
>
> Don't we have some bugs?

According to ticket 157863, the error is caused by accumulated floating-point errors, and the divergence would appear sooner or later. No bugs were found in that ticket.

ilya-lavrenov · 2025-01-06T13:39:27Z

Merging changes from #1485 to check whether tests will pass

BTW, I expect that speculative decoding will fail anyway, because it compares SDPA vs PA and they are not matching.

…elines (#1485)

OpenVINO plugins enable different kinds of optimizations by default, such as KV cache compression to int8 and fp16 inference precision, while in GenAI tests we want to test pipelines and compare them against HF / optimum without extra optimizations: https://github.com/openvinotoolkit/openvino.genai/blob/4db67aecac78885c6d1e302f348c9489e2154388/tests/python_tests/common.py#L318-L325

Hopefully, we can then merge int8 KV cache by default for CB (#1206), because the tests will still compare against FP16 KV cache, while official validation should be responsible for validation against the reference via WWB metrics.

luo-cheng2021 · 2025-01-07T06:59:14Z

The PR "[CPU]PageAttn with 4bit-quantization" will add group quantization for u8/u4 kvcache. After it is merged, the computation of the kvcache size will change, and the default u8 kvcache path for continuous batching must change as well (the default group size will be 32).
So this PR should be merged together with #1366 to match the 4-bit PR changes.
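As a rough illustration of why the size computation changes, the sketch below estimates the bytes needed for a u8 cache under group quantization. The layout is an assumption made for the example, not taken from #1366 or the CPU plugin: each group of `group_size` elements is assumed to carry one f32 scale and one f32 zero point next to its u8 data. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Bytes for one head vector of one token in a u8 cache.
// Assumption: one f32 scale and one f32 zero point per quantization group.
std::size_t u8_bytes_per_head(std::size_t head_size, std::size_t group_size) {
    const std::size_t groups = (head_size + group_size - 1) / group_size;  // round up
    return head_size * sizeof(std::uint8_t) + groups * 2 * sizeof(float);
}

// Bytes for one cache block holding both keys and values.
std::size_t kv_block_bytes(std::size_t num_kv_heads, std::size_t head_size,
                           std::size_t block_size, std::size_t group_size) {
    return 2 * block_size * num_kv_heads * u8_bytes_per_head(head_size, group_size);
}
```

Under this assumed layout, a head size of 128 needs 136 bytes with a single group per head but 160 bytes with a group size of 32, so any code that derives the number of cache blocks from a memory budget has to know the group size.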

luo-cheng2021 · 2025-01-07T09:51:06Z

The function has been merged into #1366.

change kvcache default type to u8 for cpu plugin

ac18dd4

github-actions bot added the category: continuous batching and category: sampling labels Nov 13, 2024

ilya-lavrenov reviewed Nov 13, 2024

src/cpp/src/device_config.hpp (outdated)

use f32 for hint: EXECUTION_MODE_HINT:ACCURACY

ffef13e

ilya-lavrenov added this to the 2025.0 milestone Nov 14, 2024

luo-cheng2021 added 2 commits November 14, 2024 16:10

Merge remote-tracking branch 'upstream/master' into luocheng/pa_kv_default_u8

b9a05f0

Revert "[GHA]: hardcode OpenVINO commit (openvinotoolkit#1212)"

9efab9d

This reverts commit 9243a8f.

github-actions bot added the category: GHA label Nov 14, 2024

ilya-lavrenov removed the category: sampling label Nov 20, 2024

ilya-lavrenov self-assigned this Nov 26, 2024

Merge remote-tracking branch 'upstream/master' into luocheng/pa_kv_default_u8

eb44c6e

github-actions bot removed the category: GHA label Dec 31, 2024

fix ci errors

400e41c

github-actions bot added the no-match-files label Dec 31, 2024

Merge branch 'master' into luocheng/pa_kv_default_u8

9de7c89

ilya-lavrenov mentioned this pull request Jan 6, 2025

[TESTS] Use FP32 inference precision, FP16 KV cache precision for pipelines #1485

Merged

Merge branch 'master' into luocheng/pa_kv_default_u8

07f0382

luo-cheng2021 closed this Jan 7, 2025

luo-cheng2021 reopened this Jan 8, 2025

luo-cheng2021 marked this pull request as ready for review January 8, 2025 01:36
