[V1] LoRA Support #10957
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
tests/lora/test_baichuan.py (Outdated)
@@ -62,6 +70,7 @@ def test_baichuan_lora(baichuan_lora_files):
        assert output2[i] == expected_lora_output[i]


@pytest.mark.skip_v1
Here and below: skipping tests for V1 with TP > 1, to be enabled after TP support for V1 lands.
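Not from this PR's diff: a hypothetical sketch of how a conftest.py hook could honor the skip_v1 marker (the VLLM_USE_V1 gate and the hook body are assumptions, not the PR's actual wiring).

import os
import pytest

def pytest_collection_modifyitems(config, items):
    # Skip tests marked `skip_v1` only when the V1 engine is selected.
    if os.environ.get("VLLM_USE_V1") != "1":
        return
    skip_v1 = pytest.mark.skip(reason="Not yet supported on V1 (e.g. TP > 1).")
    for item in items:
        if "skip_v1" in item.keywords:
            item.add_marker(skip_v1)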
            tokenizer_name=tokenizer_name,
            tokenizer_mode=tokenizer_mode,
            trust_remote_code=trust_remote_code,
            revision=revision)
vllm/v1/worker/gpu_model_runner.py (Outdated)
@@ -602,269 +633,3 @@ def _get_padded_batch_size(self, batch_size: int) -> Optional[int]:
            if batch_size <= size:
                return size
        return None
Refactor: moved CachedRequestState and InputBatch to input_batch.py. It looked like a good refactor to reduce file size, and in this PR it lets both gpu_model_runner.py and lora_model_runner_mixin.py import these data structures from input_batch.py.
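For illustration only (the import shape is inferred from the new file path above, not copied from the PR), both runners can now share the data structures via:

from vllm.v1.worker.input_batch import CachedRequestState, InputBatch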
vllm/v1/worker/input_batch.py (Outdated)
            max_num_logprobs=self.max_num_logprobs,
        )

    def make_lora_inputs(self, num_scheduled_tokens: np.array) \
Added for LoRA
Thanks for doing this! Left a few early comments. Will look into more details later.
vllm/v1/core/scheduler.py (Outdated)
        if self.lora_config:
            requested_loras = \
                set(req.lora_request.lora_int_id \
                    for req in scheduled_running_reqs \
                    if req.lora_request and \
                       req.lora_request.lora_int_id > 0)
            assert len(requested_loras) <= self.lora_config.max_loras
Can we cache this state and incrementally update it whenever a new request joins or finishes?
I explored this a bit. Tracking the additions and deletions to the running queue in the current code is hard: the updates happen in more than one place (new requests, finished requests, and requests moving between the running and preempted states and back). One way is to replace the append/remove/pop calls with
    self.running.<operation>()
    if lora_config:
        update_active_loras()
A better way is to subclass list, so that after any create/update/delete operation we can update the active LoRAs. That is a considerable change; I believe we can do it after some profiling shows how costly this code is.
For the moment, I think this localized update is nicer, as it doesn't introduce a bunch of "if self.lora_config" checks.
Is there a better way I am missing?
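Not from the PR: a minimal sketch of the subclass-the-list idea, assuming each request may carry a lora_request with a lora_int_id. The RunningQueue name and the refcount scheme are invented for illustration, and only the append/remove/pop mutations mentioned above are covered.

from collections import Counter

class RunningQueue(list):
    # Keep a refcount of active LoRA IDs up to date on every mutation so the
    # scheduler can read the active set in O(1) instead of rescanning the queue.

    def __init__(self):
        super().__init__()
        self._lora_refcount = Counter()

    def _track(self, req, delta):
        lora_id = req.lora_request.lora_int_id if req.lora_request else 0
        if lora_id > 0:
            self._lora_refcount[lora_id] += delta
            if self._lora_refcount[lora_id] <= 0:
                del self._lora_refcount[lora_id]

    def append(self, req):
        self._track(req, +1)
        super().append(req)

    def remove(self, req):
        super().remove(req)
        self._track(req, -1)

    def pop(self, index=-1):
        req = super().pop(index)
        self._track(req, -1)
        return req

    def active_loras(self):
        return set(self._lora_refcount)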
vllm/v1/worker/input_batch.py (Outdated)
        req_lora_mapping = self.request_lora_mapping[:self.num_reqs]
        prompt_lora_mapping = tuple(req_lora_mapping)
        token_lora_mapping = tuple(
            req_lora_mapping.repeat(num_scheduled_tokens))

        active_lora_ids: set[int] = set(np.unique(req_lora_mapping))
        active_lora_requests: set[LoRARequest] = \
            set({lr for lr in self.lora_requests \
                 if lr.lora_int_id in active_lora_ids})
        # Update lora requests
        self.lora_requests = active_lora_requests

        return prompt_lora_mapping, token_lora_mapping, self.lora_requests
How does this work with the punica kernels?
We always use the punica SGMV kernel (as set in lora_mapping = LoRAMapping(token_lora_mapping, ...)). The SGMV kernel codepath merges sequences that have the same lora-id together in compute_meta() (line 28 in 7406274).
I'll profile with both the SGMV and BGMV kernels and choose the best. For now, SGMV looked like a good default/placeholder.
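For intuition only, a rough Python sketch (nothing like the real compute_meta, which feeds a fused Triton kernel) of what merging consecutive tokens with the same lora-id into segments produces:

import torch

def merge_token_lora_mapping(token_lora_ids: torch.Tensor):
    # Collapse runs of consecutive tokens that share a LoRA ID into segments:
    # the kind of (start, length, lora_id) metadata an SGMV-style grouped-GEMM
    # kernel iterates over.
    change = torch.ones_like(token_lora_ids, dtype=torch.bool)
    change[1:] = token_lora_ids[1:] != token_lora_ids[:-1]
    seg_starts = torch.nonzero(change).flatten()
    seg_ends = torch.cat(
        [seg_starts[1:], seg_starts.new_tensor([token_lora_ids.numel()])])
    return seg_starts, seg_ends - seg_starts, token_lora_ids[seg_starts]

# [1, 1, 1, 2, 2, 0] -> starts [0, 3, 5], lengths [3, 2, 1], lora ids [1, 2, 0]
print(merge_token_lora_mapping(torch.tensor([1, 1, 1, 2, 2, 0])))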
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding V0 LoRA: SGMV implements grouped GEMM, which provides better performance for the prefill stage. BGMV implements grouped GEMV, which is better optimized for the decoding stage. If only one can be chosen, SGMV is likely more suitable.
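To make the GEMM-vs-GEMV distinction concrete, a toy loop-based illustration (the real punica kernels are fused GPU kernels; the shapes and the lora_a/lora_b layout here are assumptions, with adapters indexable by LoRA ID):

def apply_lora_sgmv_like(x, lora_a, lora_b, seg_starts, seg_lens, seg_ids, out):
    # Prefill-friendly: one (grouped) GEMM per segment of tokens sharing a
    # LoRA, so long prompts amortize reading the adapter weights.
    for start, length, lora_id in zip(seg_starts, seg_lens, seg_ids):
        if lora_id <= 0:  # 0 / -1 means "no LoRA"
            continue
        xs = x[start:start + length]
        out[start:start + length] += (xs @ lora_a[lora_id]) @ lora_b[lora_id]

def apply_lora_bgmv_like(x, lora_a, lora_b, token_lora_ids, out):
    # Decode-friendly: each token performs its own small GEMV against the
    # adapter selected by its LoRA ID.
    for i, lora_id in enumerate(token_lora_ids):
        if lora_id <= 0:
            continue
        out[i] += (x[i] @ lora_a[lora_id]) @ lora_b[lora_id]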
        logits = lm_head.linear_method.apply(lm_head,
                                             hidden_states,
                                             bias=embedding_bias)

    def _gather_logits(self, logits: torch.Tensor) -> torch.Tensor:
Refactor: introduce _gather_logits(), which LogitsProcessorWithLoRA also uses.
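Not the PR's code, just the shape of the refactor: a sketch assuming the tensor-parallel gather helper is importable from vllm.distributed (the import path is an assumption).

import torch
from vllm.distributed import tensor_model_parallel_gather  # assumed import path

def _gather_logits(logits: torch.Tensor) -> torch.Tensor:
    # One shared helper that gathers the vocab-sharded logits across
    # tensor-parallel ranks, so LogitsProcessor and LogitsProcessorWithLoRA
    # reuse the same code path instead of each duplicating the gather.
    return tensor_model_parallel_gather(logits)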
        return [request.lora_request.lora_int_id]


def generate_block_hash_extra_keys(
Refactor for using prefix caching with LoRA.
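For intuition, a self-contained sketch (hash scheme and function names invented for illustration, not vLLM's actual prefix-caching code) of why the LoRA ID is returned as an extra key for the block hash:

import hashlib
from typing import Any, Optional, Sequence

def block_hash(parent: Optional[int], token_ids: Sequence[int],
               extra_keys: Sequence[Any] = ()) -> int:
    # Folding the LoRA ID into the hash keeps prefixes computed under one
    # adapter from matching blocks cached under another adapter (or the
    # base model), even when the token IDs are identical.
    payload = repr((parent, tuple(token_ids), tuple(extra_keys))).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

# Same tokens, different LoRA -> different cache entries.
assert block_hash(None, [1, 2, 3], [7]) != block_hash(None, [1, 2, 3], [8])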
        del hidden_states, logits
        self.encoder_cache.clear()
        # For profile, have maximum num_reqs and that collectively have
        # maximum num_tokens.
Set up num_scheduled_tokens for initializing LoRA in profile_run. @ywang96, will this change interfere with the multi-modal setup above? Can you point me to a test/command I should run to confirm that it works? Thanks.
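One way to build such an array (a sketch with invented names, not necessarily what the PR does): spread max_num_tokens across max_num_reqs so the profile run exercises both limits at once.

import numpy as np

def dummy_num_scheduled_tokens(max_num_tokens: int, max_num_reqs: int) -> np.ndarray:
    # Every request gets the same base share; the first few absorb the
    # remainder so the total is exactly max_num_tokens.
    base, rem = divmod(max_num_tokens, max_num_reqs)
    tokens = np.full(max_num_reqs, base, dtype=np.int32)
    tokens[:rem] += 1
    assert tokens.sum() == max_num_tokens and len(tokens) == max_num_reqs
    return tokens

print(dummy_num_scheduled_tokens(max_num_tokens=8192, max_num_reqs=256))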
v1/core LGTM
        if self.lora_config:
            requested_loras = set(
                req.lora_request.lora_int_id for req in scheduled_running_reqs
                if req.lora_request and req.lora_request.lora_int_id > 0)
ooc, why is LoRA ID 0 reserved?
This is a V0 requirement. LoRA ID 0 is reserved for requests without LoRA.
Reference: def lora_int_id(self) -> int: (line 469 in 23c1b10).
It is used, for example, in vllm/tests/lora/test_llama_tp.py, def generate_and_test(llm, sql_lora_files): (line 58 in 23c1b10).
The requirement is plumbed down to the kernels:
vllm/vllm/lora/punica_wrapper/utils.py, prompt_mapping: List[int] = [ (line 93 in 23c1b10)
vllm/vllm/lora/ops/sgmv_shrink.py, if lora_index == -1: (line 55 in 23c1b10)
I believe this is an implementation detail and could be moved down to be handled entirely at the kernel level. However, I am not sure whether reserving LoRA ID 0 is a norm among LoRA users.
@jeejeelee any comments?
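To make the convention concrete, a small sketch (helper name and dict layout are illustrative, not the PR's code): LoRA ID 0 means "no LoRA" and is mapped to index -1, which the kernels then skip (cf. the if lora_index == -1 check referenced above).

from typing import Dict, List

def ids_to_slot_indices(lora_int_ids: List[int],
                        slot_of: Dict[int, int]) -> List[int]:
    # lora_int_id == 0 -> no LoRA -> index -1, so the kernel skips the row.
    return [slot_of[i] if i > 0 else -1 for i in lora_int_ids]

# Requests 1 and 3 use adapters; request 2 runs the base model.
print(ids_to_slot_indices([1, 0, 2], slot_of={1: 0, 2: 1}))  # -> [0, -1, 1]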
Ah yeah, now I remember. This did cause me some confusion in the past, so it would be better to change it, but not necessarily in this PR.
Changes:
Benchmarks:
Machine: 1x A100
V1: Throughput: 2.42 requests/s, 1225.95 total tokens/s, 628.29 output tokens/s
V0: Throughput: 5.95 requests/s, 3021.90 total tokens/s, 1548.71 output tokens/s
The performance gap between V0 and V1 is due to CUDA Graphs. Refer to the benchmarks in reference PR #11613.