[Snippets][CPU] Disable MHA tokenization in LLM #28601

Conversation

@a-sidorova (Contributor) commented Jan 22, 2025

Details:

  • The second inference in an LLM is usually a single-token inference, which means the `M` dimension of the MatMuls in the SDPA pattern has the value `1` (during model compilation this dimension is dynamic, i.e. unknown). Snippets cannot execute single-token inference efficiently, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We treat the presence of a `ScaledDotProductAttentionWithKVCache` op in the model as a sign that the model is an LLM (a sketch of this check follows below).
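
A minimal sketch of the detection heuristic, assuming the check simply scans the model's ops for the op type name; the helper name and the string-based matching are illustrative, not the plugin's actual implementation:

```cpp
#include <memory>
#include <string>

#include <openvino/core/model.hpp>

// Hypothetical helper: treat the model as an LLM when it contains a
// ScaledDotProductAttentionWithKVCache op anywhere in the graph.
bool contains_sdpa_with_kv_cache(const std::shared_ptr<const ov::Model>& model) {
    for (const auto& op : model->get_ops()) {
        // Matching by type-info name string is an assumption of this sketch;
        // the plugin can compare against the concrete internal op type instead.
        if (std::string(op->get_type_info().name) == "ScaledDotProductAttentionWithKVCache")
            return true;
    }
    return false;
}

// Usage (hypothetical): gate the Snippets MHA tokenization pass on the check.
// if (contains_sdpa_with_kv_cache(model))
//     /* skip MHA tokenization: single-token (M == 1) inference dominates
//        in LLMs and Snippets kernels are inefficient there */;
```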

Tickets:

  • 160634
  • 160978

TODO:

  • Performance validation on LLMs (results are in ticket CVS-160978)

@a-sidorova a-sidorova requested review from a team as code owners January 22, 2025 07:01
@github-actions github-actions bot added the `category: CPU` (OpenVINO CPU plugin) label Jan 22, 2025
@a-sidorova a-sidorova added this to the 2025.1 milestone Jan 22, 2025
@dmitry-gorokhov dmitry-gorokhov self-assigned this Jan 22, 2025
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch 2 times, most recently from 0e62943 to 08aeea7 on January 22, 2025 09:04
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from cad0554 to e432b62 on January 23, 2025 07:13
akladiev pushed a commit that referenced this pull request Jan 23, 2025
(The commit message repeats the PR description above and notes it was cherry-picked from #28601.)
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from e432b62 to 8e13f0c on January 23, 2025 12:02
@mg-intel mg-intel added this pull request to the merge queue Jan 24, 2025
Merged via the queue into openvinotoolkit:master with commit 62e8e08 Jan 24, 2025
180 checks passed