@wuxun-zhang

The new kernel implements the method below. Key points:

  • the number of work groups is fixed to the total number of XeCores
  • the KV sequence length of all sequences is split dynamically across all work groups
  • each XeCore gets a balanced share of work units
image
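
To make the splitting idea concrete, here is a rough host-side sketch in Python. It is an illustration only, not the kernel code from this PR: the function name `split_kv_blocks`, the `block_size` of 64, and the chunk layout `(batch_idx, head_idx, first_kv_block, num_kv_blocks)` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical chunk layout: (batch_idx, head_idx, first_kv_block, num_kv_blocks)
Chunk = Tuple[int, int, int, int]


def split_kv_blocks(kv_lens: List[int], num_heads_q: int,
                    num_work_groups: int, block_size: int = 64) -> List[List[Chunk]]:
    """Spread the KV blocks of all (sequence, head) pairs evenly over work groups."""
    # Number of KV blocks for each (sequence, head) pair, in batch-major order.
    tiles = [(b, h, -(-kv_len // block_size))  # ceil division
             for b, kv_len in enumerate(kv_lens)
             for h in range(num_heads_q)]
    total = sum(n for _, _, n in tiles)

    # Every work group gets `base` blocks; the first `extra` groups get one more.
    base, extra = divmod(total, num_work_groups)

    plan: List[List[Chunk]] = []
    t = 0       # index of the tile currently being handed out
    offset = 0  # blocks of that tile already assigned
    for wg in range(num_work_groups):
        quota = base + (1 if wg < extra else 0)
        chunks: List[Chunk] = []
        while quota > 0:
            b, h, n = tiles[t]
            take = min(quota, n - offset)
            if take > 0:
                chunks.append((b, h, offset, take))
            offset += take
            quota -= take
            if offset == n:  # tile exhausted, move on to the next one
                t += 1
                offset = 0
        plan.append(chunks)
    return plan


if __name__ == "__main__":
    # Three sequences with very different KV lengths, 2 query heads, 8 work groups.
    for wg, chunks in enumerate(split_kv_blocks([1000, 200, 50], 2, 8)):
        print(wg, chunks)
```

In this sketch the per-work-group loads differ by at most one block. A single (sequence, head) pair can end up spread over several work groups, so the partial results still have to be combined afterwards; how the kernel does that is not described in this comment.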

As of now there are two limitations:

  • only decode is supported (seq_len_qo == 1)
  • batch_size * num_heads_q <= total number of XeCores
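
The two limitations could be expressed as a simple guard before dispatching to this kernel. The snippet below is a hedged Python sketch, not code from the PR: `decode_split_kv_supported` is a hypothetical helper, and `num_xe_cores` is assumed to be obtained from the device by the caller.

```python
def decode_split_kv_supported(seq_len_qo: int, batch_size: int,
                              num_heads_q: int, num_xe_cores: int) -> bool:
    """Return True if the inputs fall inside the two limitations listed above."""
    # Limitation 1: decode only, i.e. exactly one query token per sequence.
    if seq_len_qo != 1:
        return False
    # Limitation 2: there must be at least one XeCore per (sequence, query head) pair.
    if batch_size * num_heads_q > num_xe_cores:
        return False
    return True
```

A caller would fall back to another kernel whenever this returns False.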

@pengzhao-intel

Maybe document the limitations of this algorithm in the code as well, especially for the one that uses atomics.
