
[WIP] Add HPU support to vLLM v1 - cont. #609

Open

kzawora-intel wants to merge 44 commits into habana_main

Conversation

@kzawora-intel commented Dec 10, 2024

Up-to-date variant of #487 after rebase, now with functional tensor parallelism (TP). The diff will become more readable once #605 is merged. Current status:

  • Implemented v1 HPU attn backend, worker, model_runner and executor
  • VLLM_USE_V1=1 properly selects V1 HPU components (see the usage sketch after this list)
  • V1 HPU executor loads model properly
  • V1 HPU executor allocates KV cache properly
  • V1 HPU model runner is constructed properly and initializes bucketing
  • V1 HPU attention backend gets selected automatically
  • profile_run works on dummy data
  • V1 HPU model_runner prepares input tensors based on SchedulerOutputs (rather than SequenceGroupMetadata)
  • V1 HPU model_runner differentiates prefill and decode sequences
  • V1 HPU model_runner execute_model runs for prefill
  • V1 HPU model_runner execute_model runs for decode
  • V1 HPU model_runner handles mixed-batch scenarios
  • V1 HPU model_runner prefill returns correct results
  • V1 HPU model_runner decode returns correct results (w/ flat PA)
  • V1 HPU model_runner decode returns correct results (w/ contiguous PA)
  • V1 HPU model_runner prefill runs at BS>1
  • V1 standard greedy and random sampling work on HPU
  • Capturing and replaying HPU Graphs work
  • Llama3.1-8B runs on GSM-8k with SOTA accuracy
  • V1 HPU model_runner warmup works properly
  • V1 HPU automatic prefix caching works properly
  • Tensor parallelism works
  • torch.compile works
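
For context, a minimal sketch of how the V1 HPU path above could be exercised end to end. VLLM_USE_V1 is vLLM's standard V1-engine switch; the model name, TP degree, and prompt below are illustrative assumptions, not taken from this PR:

```python
# Minimal sketch: exercising the V1 engine path described above.
# Assumptions (not from this PR): model name, TP degree, and prompt.
import os

# Select V1 components (including the HPU backend on Gaudi); must be set
# before vLLM is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B",  # the model used for the GSM-8k check above
    tensor_parallel_size=2,           # TP is reported functional in this PR
)

# temperature=0.0 -> greedy sampling; temperature>0 -> random sampling,
# both of which the checklist reports working on HPU.
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is 12 * 7?"], params)
print(outputs[0].outputs[0].text)
```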
