[Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization #6455
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Full CI run is still required to merge this PR so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well. To run full CI, you can do one of these:
Looks nice. I'll do a test on H100 setups and confirm the improvement.
vllm/model_executor/models/llama.py (outdated)

if get_pp_group().is_last_rank:
    self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
else:
    self.norm = None
I suppose norm should take only a small fraction of GPU memory. Is it worth the effort?
Thanks for the contribution! The overall idea looks good to me. We actually use the minimum of the number of blocks across all processes to determine the final block count, which means this change is only effective if it reduces the lower bound of model memory usage across ranks.

Overall, the minimum memory usage of the model is reduced by about the size of a word embedding or an lm head layer (typically the same size). It would be better if you could move this logic into common layers, so that the rest of the models can directly benefit from it. One niche case: you need to take care of models that use tied embeddings.
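To illustrate the point about the minimum block count, here is a minimal sketch (hypothetical helper name, not the actual vLLM code; assumes torch.distributed is already initialized): every rank computes how many KV-cache blocks fit in its own free memory, and the group then agrees on the minimum, so a per-rank saving only matters if it raises that minimum.

import torch
import torch.distributed as dist

def determine_num_gpu_blocks(local_free_bytes: int, block_bytes: int) -> int:
    # Each rank computes how many KV-cache blocks fit in its own free memory.
    local_num_blocks = local_free_bytes // block_bytes
    # All ranks must agree on a single value, so take the minimum across the group.
    num_blocks = torch.tensor([local_num_blocks], dtype=torch.int64)
    dist.all_reduce(num_blocks, op=dist.ReduceOp.MIN)
    return int(num_blocks.item())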
Thanks for your change! This should bring some improvement. I'll let @sfc-gh-hazhang report back.
@youkaichao Thanks for the feedback and suggestions! Instead of modifying the common layers, I propose guarding the creation of these components in the model code itself, based on the pipeline rank. This way, the shared layer implementations stay unchanged and the rank-specific logic remains local to the model. Please let me know if this approach works for you.
This idea looks good to me. I'd like to make the code change as small as possible, and it would be better if we can keep the code intuitive.
Force-pushed from 6708833 to 2de01e3.
@youkaichao I've rebased the branch on the latest main and tested the integration. Please review the changes and provide any feedback or concerns you may have.
self.lm_head = ParallelLMHead(
    self.unpadded_vocab_size,
    config.hidden_size,
    org_num_embeddings=config.vocab_size,
    padding_size=DEFAULT_VOCAB_PADDING_SIZE
    # We need bigger padding if using lora for kernel
    # compatibility
    if not lora_config else lora_config.lora_vocab_padding_size,
    quant_config=quant_config,
)
I think you should only put the if-else for this layer; LogitsProcessor and Sampler don't have parameters. Let's make the change as small as possible.
vllm/model_executor/models/llama.py (outdated)

    config.hidden_size,
    org_num_embeddings=config.vocab_size,
)
if get_pp_group().is_first_rank:
if get_pp_group().is_first_rank or (config.tie_word_embeddings and get_pp_group().is_last_rank)? We need to take care of tied word embeddings as well.
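For illustration, a sketch of what that could look like in context (names taken from the diff above, not necessarily the exact final code; the else branch mirrors how self.norm is handled on non-last ranks):

if get_pp_group().is_first_rank or (config.tie_word_embeddings
                                    and get_pp_group().is_last_rank):
    self.embed_tokens = VocabParallelEmbedding(
        config.vocab_size,
        config.hidden_size,
        org_num_embeddings=config.vocab_size,
    )
else:
    self.embed_tokens = None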
Done
Force-pushed from 0b7e1d3 to be1a370.
vllm/model_executor/models/llama.py (outdated)

self.logits_processor = LogitsProcessor(self.unpadded_vocab_size,
                                        config.vocab_size, logit_scale)
self.sampler = Sampler()
Some niche details: I think it would be better to keep these two lines untouched. Your code should just add the if-else around:
self.lm_head = ParallelLMHead(
self.unpadded_vocab_size,
config.hidden_size,
org_num_embeddings=config.vocab_size,
padding_size=DEFAULT_VOCAB_PADDING_SIZE
# We need bigger padding if using lora for kernel
# compatibility
if not lora_config else lora_config.lora_vocab_padding_size,
quant_config=quant_config,
)
if config.tie_word_embeddings:
self.lm_head.weight = self.model.embed_tokens.weight
I understand your perspective about keeping those lines untouched. However, placing LogitsProcessor and Sampler within the if-else statement avoids unnecessary object creation on ranks other than the last rank.
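For reference, a sketch of the resulting last-rank-only branch, combining the diff above with the tied-embedding handling (names as in the diff, not necessarily the exact final code):

if get_pp_group().is_last_rank:
    self.lm_head = ParallelLMHead(
        self.unpadded_vocab_size,
        config.hidden_size,
        org_num_embeddings=config.vocab_size,
        padding_size=DEFAULT_VOCAB_PADDING_SIZE
        # We need bigger padding if using lora for kernel compatibility
        if not lora_config else lora_config.lora_vocab_padding_size,
        quant_config=quant_config,
    )
    if config.tie_word_embeddings:
        self.lm_head.weight = self.model.embed_tokens.weight
    self.logits_processor = LogitsProcessor(self.unpadded_vocab_size,
                                            config.vocab_size, logit_scale)
    self.sampler = Sampler()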
Force-pushed from be1a370 to 9173c26.
/ready
The test failure in https://buildkite.com/vllm/ci-aws/builds/5017#0190be33-93a3-48a3-a443-90c80d0b63df can be ignored; I just disabled those tests because they are flaky, and we need to wait for @andoorve to fix them.

As long as the other tests pass, we can merge this PR. Thanks for the contribution!
Prior to this change, the Llama model initialization in Pipeline Parallel scenarios would create all components and layers for every rank, leading to unnecessary memory overhead. This pull request optimizes memory usage by creating certain components only on the relevant ranks.
The primary changes include:
- Creating embed_tokens only on the first rank (and also on the last rank when config.tie_word_embeddings is set, so the lm_head weight can be tied).
- Creating the final norm only on the last rank.
- Creating lm_head, LogitsProcessor, and Sampler only on the last rank.
By selectively creating components on the relevant ranks, this optimization reduces the overall memory footprint. Testing with --pipeline-parallel-size 8 showed up to 25% per-rank memory savings for the Llama-3-70B model. This can potentially enable higher throughput by allowing more memory to be utilized for serving, and it makes it easier to serve larger models and run inference in resource-constrained environments by leveraging Pipeline Parallelism more efficiently.
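As a usage sketch (the model name and flag values are illustrative), the optimization applies automatically whenever the engine is started with pipeline parallelism, for example via the OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --pipeline-parallel-size 8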
Testing: verified the per-rank memory savings with --pipeline-parallel-size 8 on Llama-3-70B, as described above.
Please review the changes and provide feedback or suggestions for further improvements.