[Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization #6455

Merged
merged 7 commits into vllm-project:main
Jul 17, 2024

Conversation

wushidonguc
Contributor

Prior to this change, the Llama model initialization in Pipeline Parallel scenarios would create all components and layers for every rank, leading to unnecessary memory overhead. This pull request optimizes memory usage by creating certain components only on the relevant ranks.

The primary changes include:

  1. Creating the embedding layer on the first rank only.
  2. Creating the norm layer on the last rank only.
  3. Creating the lm head, logits processor, and sampler components on the last rank only.

By selectively creating components on the relevant ranks, this optimization reduces the overall memory footprint. Testing with --pipeline-parallel-size 8 showed up to 25% per-rank memory savings for the Llama-3-70B model. This memory optimization can potentially enable higher throughput by allowing more memory to be utilized for serving.

This optimization enables serving larger models and running inference in resource-constrained environments by leveraging Pipeline Parallelism more efficiently.
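
For illustration, a minimal sketch of the rank-gated construction described above (not the exact diff; vLLM import paths and constructor arguments are assumed and simplified here):

    # Illustrative sketch only: create each component solely on the rank that needs it.
    import torch.nn as nn

    from vllm.distributed import get_pp_group
    from vllm.model_executor.layers.layernorm import RMSNorm
    from vllm.model_executor.layers.vocab_parallel_embedding import (
        ParallelLMHead, VocabParallelEmbedding)


    class LlamaModelSketch(nn.Module):

        def __init__(self, config):
            super().__init__()
            pp_group = get_pp_group()

            # 1. Embedding layer only on the first pipeline rank.
            if pp_group.is_first_rank:
                self.embed_tokens = VocabParallelEmbedding(config.vocab_size,
                                                           config.hidden_size)
            else:
                self.embed_tokens = None

            # ... only this rank's slice of the decoder layers would be created here ...

            # 2./3. Final norm and LM head only on the last pipeline rank.
            if pp_group.is_last_rank:
                self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
                self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size)
            else:
                self.norm = None
                self.lm_head = None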

Testing:

  • Benchmarks have been conducted to measure the memory savings and performance impact of this optimization.

Please review the changes and provide feedback or suggestions for further improvements.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger fastcheck CI, which consists of a small and essential subset of tests to quickly catch errors, with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger a full CI run as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@sfc-gh-hazhang
Contributor

looks nice. I'll do a test on H100 setups and confirm improvement.

Comment on lines 280 to 283
        if get_pp_group().is_last_rank:
            self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        else:
            self.norm = None
Member

I suppose the norm should take only a small fraction of GPU memory. Is it worth the effort?

@youkaichao
Member

youkaichao commented Jul 16, 2024

thanks for the contribution! the overall idea looks good to me.

we actually use the minimum of the number of blocks across all processes to determine the final number of blocks, which means this change is only effective if it reduces the lower bound of model memory usage across ranks.

In this case, you:

  • removed lm head for the first pp rank
  • removed both word embedding and lm head for the rest ranks
  • removed word embedding for the last rank

Overall, the minimum memory usage of the model is reduced by about the size of a word embedding or an lm head layer (typically the same size).

It would be better if you could move this logic into common layers, so that the rest of the models can directly benefit from it. I added PPMissingLayer in #6406 recently, which might help. My goal is to reduce the code changes needed to support PP, to make sure the code is not too intrusive and people find it easy to integrate.

One niche case is that you need to take care of models that use tied embeddings.
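
For background on the point about the block minimum, a rough illustration (not vLLM's actual code) of why per-rank savings only matter if they raise the group-wide minimum:

    # Rough illustration only: the usable number of KV-cache blocks is the minimum
    # across all ranks, so freeing memory on a single rank does not help unless the
    # lowest-memory rank also benefits.
    import torch
    import torch.distributed as dist

    def determine_num_gpu_blocks(free_gpu_memory_bytes: int,
                                 block_size_bytes: int) -> int:
        local_blocks = free_gpu_memory_bytes // block_size_bytes
        t = torch.tensor([local_blocks], dtype=torch.int64)
        # Every rank contributes its local count; the group-wide minimum wins.
        dist.all_reduce(t, op=dist.ReduceOp.MIN)
        return int(t.item())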

@andoorve
Collaborator

Thanks for your change! This should bring some improvement. I'll let @sfc-gh-hazhang report back.

@wushidonguc
Contributor Author

wushidonguc commented Jul 16, 2024

@youkaichao Thanks for the feedback and suggestions!

Instead of modifying VocabParallelEmbedding and ParallelLMHead, I propose keeping the pipeline parallelism logic within the model-specific classes. We can introduce helper methods like get_rank_dependent_embedding and get_rank_dependent_lm_head to encapsulate the rank-aware logic. These helpers would return the appropriate instances based on the current rank.

This way, VocabParallelEmbedding and ParallelLMHead remain generic and reusable across models. The pipeline parallelism logic stays in the model-specific classes, making it easier to maintain and extend.

Please let me know if this approach works for you.
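
To make the proposal concrete, a hypothetical sketch of one such helper (the name comes from the comment above; this helper was not part of the merged change):

    # Hypothetical sketch of the proposed model-level helper.
    from typing import Optional

    from vllm.distributed import get_pp_group
    from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead

    def get_rank_dependent_lm_head(config) -> Optional[ParallelLMHead]:
        """Create the LM head only on the last pipeline rank; other ranks get None."""
        if get_pp_group().is_last_rank:
            return ParallelLMHead(config.vocab_size, config.hidden_size)
        return None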

@youkaichao
Member

The pipeline parallelism logic stays in the model-specific classes, making it easier to maintain and extend.

this idea looks good to me. I'd like to make the code change as small as possible, and it would be better if we can make the code intuitive.

@wushidonguc
Contributor Author

@youkaichao I've rebased the branch on the latest main and integrated the PPMissingLayer component to handle weight skipping. No helper functions are added at this stage, but they can be introduced in a separate pull request if deemed necessary.

Please review the changes and provide any feedback or concerns you may have. I tested the integration of PPMissingLayer locally, and it seems to be working as expected.
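
Roughly, the PPMissingLayer-based version replaces the bare None placeholders with a parameter-free placeholder module, so weight loading can skip the tensors that do not exist on a given rank. A hedged sketch (the import path for PPMissingLayer is assumed from #6406, not verified here):

    # Sketch of PPMissingLayer-based placeholders for components absent on this rank.
    import torch.nn as nn

    from vllm.distributed import get_pp_group
    from vllm.model_executor.layers.layernorm import RMSNorm
    from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding
    from vllm.model_executor.models.utils import PPMissingLayer

    class LlamaModelSketch(nn.Module):

        def __init__(self, config):
            super().__init__()
            if get_pp_group().is_first_rank:
                self.embed_tokens = VocabParallelEmbedding(config.vocab_size,
                                                           config.hidden_size)
            else:
                # No parameters, so the weight loader simply skips these tensors.
                self.embed_tokens = PPMissingLayer()
            # ... this rank's decoder layers ...
            if get_pp_group().is_last_rank:
                self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
            else:
                self.norm = PPMissingLayer()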

Comment on lines +373 to +383
        self.lm_head = ParallelLMHead(
            self.unpadded_vocab_size,
            config.hidden_size,
            org_num_embeddings=config.vocab_size,
            padding_size=DEFAULT_VOCAB_PADDING_SIZE
            # We need bigger padding if using lora for kernel
            # compatibility
            if not lora_config else lora_config.lora_vocab_padding_size,
            quant_config=quant_config,
        )
Member

I think you should only put the if-else around this layer; LogitsProcessor and Sampler don't have parameters. Let's make the change as small as possible.

            config.hidden_size,
            org_num_embeddings=config.vocab_size,
        )
        if get_pp_group().is_first_rank:
Member

How about if get_pp_group().is_first_rank or (config.tie_word_embeddings and get_pp_group().is_last_rank)? We need to take care of tied word embeddings as well.

Contributor Author

Done

Comment on lines 388 to 390
        self.logits_processor = LogitsProcessor(self.unpadded_vocab_size,
                                                config.vocab_size, logit_scale)
        self.sampler = Sampler()
Member

Some niche details: I think it would be better to keep these two lines untouched.

Your code should just add an if-else around:

        self.lm_head = ParallelLMHead(
            self.unpadded_vocab_size,
            config.hidden_size,
            org_num_embeddings=config.vocab_size,
            padding_size=DEFAULT_VOCAB_PADDING_SIZE
            # We need bigger padding if using lora for kernel
            # compatibility
            if not lora_config else lora_config.lora_vocab_padding_size,
            quant_config=quant_config,
        )
        if config.tie_word_embeddings:
            self.lm_head.weight = self.model.embed_tokens.weight

Contributor Author

I understand your perspective about keeping those lines untouched. However, placing LogitsProcessor and Sampler within the if-else statement avoids unnecessary object creation for ranks other than the last rank.

@youkaichao
Member

/ready

@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 17, 2024
@youkaichao youkaichao left a comment

The test failure in https://buildkite.com/vllm/ci-aws/builds/5017#0190be33-93a3-48a3-a443-90c80d0b63df can be ignored; I just disabled those tests because they are flaky. We need to wait for @andoorve to fix them.

As long as the other tests pass, we can merge this PR.

thanks for the contribution!

@youkaichao youkaichao merged commit 1d094fd into vllm-project:main Jul 17, 2024
80 of 85 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024