[Snippets][CPU] Disable MHA tokenization in LLM #28601

Conversation

@a-sidorova (Contributor) commented Jan 22, 2025

Details:

  • The second inference in an LLM is usually a single-token inference, which means the `M` dimension of the MatMuls in the SDPA pattern has the value `1` (during model compilation this dimension is dynamic, i.e. unknown). Snippets cannot execute single-token inference efficiently, so we decided to disable MHA tokenization by Snippets in the CPU plugin for LLMs. We treat the presence of a `ScaledDotProductAttentionWithKVCache` op in the model as a sign that the model is an LLM (a sketch of this check follows below).
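
A minimal sketch of the detection heuristic, assuming the check simply scans the model's ops for the op type name; the helper name and the string-based matching are illustrative, not the plugin's actual implementation:

```cpp
#include <memory>
#include <string>

#include <openvino/core/model.hpp>

// Hypothetical helper: treat the model as an LLM when it contains a
// ScaledDotProductAttentionWithKVCache op anywhere in the graph.
bool contains_sdpa_with_kv_cache(const std::shared_ptr<const ov::Model>& model) {
    for (const auto& op : model->get_ops()) {
        // Matching by type-info name string is an assumption of this sketch;
        // the plugin can compare against the concrete internal op type instead.
        if (std::string(op->get_type_info().name) == "ScaledDotProductAttentionWithKVCache")
            return true;
    }
    return false;
}

// Usage (hypothetical): gate the Snippets MHA tokenization pass on the check.
// if (contains_sdpa_with_kv_cache(model))
//     /* skip MHA tokenization: single-token (M == 1) inference dominates
//        in LLMs and Snippets kernels are inefficient there */;
```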

Tickets:

  • 160634
  • 160978

TODO:

  • Performance validation on LLMs (results are in ticket CVS-160978)

@a-sidorova a-sidorova requested review from a team as code owners January 22, 2025 07:01
@github-actions github-actions bot added the `category: CPU` (OpenVINO CPU plugin) label Jan 22, 2025
@a-sidorova a-sidorova added this to the 2025.1 milestone Jan 22, 2025
@dmitry-gorokhov dmitry-gorokhov self-assigned this Jan 22, 2025
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch 2 times, most recently from 0e62943 to 08aeea7 on January 22, 2025 09:04
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from cad0554 to e432b62 on January 23, 2025 07:13
akladiev pushed a commit that referenced this pull request Jan 23, 2025
(The commit message repeats the PR description above and notes it was cherry-picked from #28601.)
@a-sidorova a-sidorova force-pushed the feature/snippets/disable_mha_token_in_llm branch from e432b62 to 8e13f0c on January 23, 2025 12:02
@mg-intel mg-intel added this pull request to the merge queue Jan 24, 2025
Merged via the queue into openvinotoolkit:master with commit 62e8e08 Jan 24, 2025
180 checks passed