[CPU] Change kvcache default type of PagedAttention to u8 for CPU plugin #1206
base: master
@luo-cheng2021 could you please also include a revert of #1212 in this PR?
@luo-cheng2021
The PR openvinotoolkit/openvino#27847 may use a different splitting strategy, which can slightly change the floating-point error. I think it just happened to delay the point at which the error shows up.
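To illustrate why a different splitting strategy alone can shift the float error, here is a minimal sketch in plain Python. It is not the plugin's kernel: `split_sum` is a hypothetical helper that reduces the same values in two partial sums, and only the split point changes between the two calls.

```python
def running_sum(values):
    """Sum left to right with a plain accumulator (no compensation)."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def split_sum(values, split):
    """Hypothetical helper: reduce two chunks separately, then combine them."""
    return running_sum(values[:split]) + running_sum(values[split:])


values = [0.1, 0.2, 0.3]

early_split = split_sum(values, 1)  # 0.1 + (0.2 + 0.3)
late_split = split_sum(values, 2)   # (0.1 + 0.2) + 0.3

# Same inputs, same math, different grouping: the results differ in the last bit.
print(early_split, late_split, early_split == late_split)
```

The difference per operation is tiny, but in autoregressive generation such differences accumulate across layers and tokens, so a changed split can move the step at which two implementations start to diverge without fixing or introducing a bug.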
Please fix the same places in the tests as well, e.g. openvino.genai/tests/cpp/cache_manager.cpp, lines 18 to 19 in 653b2ae. I also think we can introduce a single function that builds the dummy model, so that the same code lives in one place.
It's still strange that:
Don't we have some bugs?
According to ticket 157863, the error is caused by accumulated floating-point errors, and the divergence would appear sooner or later. No bugs were found in that ticket.
Merging changes from #1485 to check whether the tests pass. BTW, I expect speculative decoding to fail anyway, because it compares SDPA against PA and their outputs do not match.
…elines (#1485)
OpenVINO plugins enable various optimizations by default, such as KV cache compression to int8 and fp16 inference precision, while in GenAI tests we want to test the pipelines and compare them against HF / optimum without extra optimizations: https://github.com/openvinotoolkit/openvino.genai/blob/4db67aecac78885c6d1e302f348c9489e2154388/tests/python_tests/common.py#L318-L325
Hopefully, we can then merge int8 KV cache by default for CB (#1206), because the tests will still compare against an FP16 KV cache, while official Validation should be responsible for validating against the reference via WWB metrics.
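As a rough sketch of the idea described above (not a copy of the linked helper), test defaults could pin the precisions explicitly so that a change in plugin defaults does not alter what the tests compare. The property names are written as plain strings, and `with_test_defaults` is a hypothetical name; the values are assumptions taken from the description (no fp16 inference precision, FP16 KV cache).

```python
# Assumed test-side defaults: full-precision inference and an uncompressed KV cache.
TEST_DEFAULT_PROPERTIES = {
    "INFERENCE_PRECISION_HINT": "f32",
    "KV_CACHE_PRECISION": "f16",
}


def with_test_defaults(user_properties=None):
    """Hypothetical helper: start from the test defaults, let explicit values win."""
    merged = dict(TEST_DEFAULT_PROPERTIES)
    merged.update(user_properties or {})
    return merged


# A test that wants to exercise the compressed cache can still opt in explicitly.
print(with_test_defaults())
print(with_test_defaults({"KV_CACHE_PRECISION": "u8"}))
```

With defaults pinned like this, switching the plugin default to u8 would only affect tests that opt in to it.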
[CPU]PageAttn with 4bit-quantization will add group quantization for the u8/u4 kvcache. After it is merged, the current computation of the kvcache size will change, and the default u8 kvcache path for continuous batching must be updated (the default group size will be 32).
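To make the size impact concrete, here is a back-of-the-envelope sketch. It assumes that each quantization group stores one f32 scale and one f32 zero point next to its quantized data; that layout, and the names `row_bytes` and `cache_bytes`, are assumptions for illustration and not the plugin's actual code.

```python
def row_bytes(head_size, group_size, data_bits=8, scale_bytes=4, zp_bytes=4):
    """Bytes for one head of one token: quantized data plus per-group metadata."""
    if head_size % group_size != 0:
        raise ValueError("head_size must be a multiple of group_size")
    groups = head_size // group_size
    data = head_size * data_bits // 8
    return data + groups * (scale_bytes + zp_bytes)


def cache_bytes(num_blocks, block_size, num_kv_heads, head_size, num_layers, group_size):
    """Total bytes for the key and value caches across all layers (u8 data)."""
    per_token = num_kv_heads * row_bytes(head_size, group_size)
    return 2 * num_layers * num_blocks * block_size * per_token


# One scale/zero-point pair per token and head (group covers the whole head).
print(row_bytes(128, 128))  # 136
# Group size 32: four scale/zero-point pairs per token and head.
print(row_bytes(128, 32))   # 160
```

Under these assumptions, moving from per-token to group-size-32 quantization grows each u8 row from 136 to 160 bytes for a head size of 128, which is why any code that derives the number of blocks from a memory budget has to be updated together with the group size.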
The function has been merged into #1366.
Change the kvcache default type of PagedAttention to u8 for the CPU plugin to align with SDPA behaviour.