From b0b2b30790051a58d99bf3b0511d5600f3644229 Mon Sep 17 00:00:00 2001 From: grimoire Date: Tue, 2 Jan 2024 10:41:38 +0800 Subject: [PATCH 1/9] cn doc --- docs/zh_cn/pytorch.md | 80 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 80 insertions(+) create mode 100644 docs/zh_cn/pytorch.md diff --git a/docs/zh_cn/pytorch.md b/docs/zh_cn/pytorch.md new file mode 100644 index 0000000000..122c80710f --- /dev/null +++ b/docs/zh_cn/pytorch.md @@ -0,0 +1,80 @@ +# Pytorch + +`lmdeploy.pytorch` 是 LMDeploy 提供的推理后端之一。与着重于性能的 turbomind 相比,lmdeploy.pytorch 以较小的性能开销为代价,提供了一套更容易开发与扩展的大模型推理实现。 + +## 设计 + +[PLACEHOLDER] + +## API + +lmdeploy.pytorch 可以与 turbomind 共享同样的服务接口,这些服务接口通过 Engine 与 EngineInstance 与 lmdeploy.pytorch 进行交互。 + +EngineInstance 是推理请求的发起者,它会将推理请求组织成特定格式发送给 Engine,以此实现流式推理。EngineInstance 的推理接口是线程安全的,服务发起者可以在不同线程中启动各自的 EngineInstance,Engine 回根据当前资源与推理请求自动进行 batch 化处理。 + +Engine 是推理请求的接收与执行者。它包含如下的组件来完成这项任务: + +- ModelAgent 对象负责模型的加载、缓存管理以及 TensorParallelism 的管理。 +- Scheduler 对象负责 session 的管理,sequence 与 lora adapter 所需要的资源的分配。 +- RequestManager 负责请求的发送与接收,可以通过它与 EngineInstance 交互。 + +## Engine + +为了应对异步推理请求,Engine 在启动后会维护一个线程,循环如下操作: + +1. 通过 RequestManager 读取请求,对各种请求进行分类处理。 +2. Scheduler 规划哪些请求可以被处理,以及它们所需的缓存和 adapters。 +3. ModelAgent 根据步骤 2. 得到的信息为输入分配资源,然后使用 patch 后的模型进行推理 +4. Scheduler 根据推理结果更新请求状态 +5. RequestManager 将输出返回给发送者(EngineInstance),回到步骤 1. + +下面我们将介绍上述步骤中用到的几个重要组件 + +### Scheduler + +在进行大模型的推理时,通常会把 attention 的历史输入 key 和 value 缓存起来,以避免在未来的推理中进行重复计算。这种情况下如果要进行多 batch 的推理,由于不同数据的序列长度可能不同,kv 会进行大量的填充,浪费很多显存资源,也限制了模型的并发推理能力上限。 + +[vLLM](https://docs.vllm.ai) 提了一种 paging 策略,以 page block 为单位为 key value 分配缓存,这样就可以避免由于 padding 导致的显存浪费。 lmdeploy.pytorch 中的 Scheduler 也遵循同样的设计,根据请求的长度合理分配所需的资源,并撤出暂时不使用的资源以保证存储资源的高效利用。 + +lmdeploy.pytorch 还对 [S-LoRA](https://github.com/S-LoRA/S-LoRA) 的支持,S-LoRA 是一种对单模型多 adapter 的支持方案。LoRA 在推理时通常会把 adapter 融合进模型权重当中,同时使用复数个 adapter 会导致显存使用量的激增;S-LoRA 不对 adapter 进行融合,通过使用 unified paging,在推理时动态换入需要使用的 adapter,大幅降低了使用 adapter 的显存开销。Scheduler 中也实现了相关的功能,让用户可以更方便的使用自己的 adapter. + +### ModelAgent + +lmdeploy.pytorch 中对 Tensor Parallelism(TP)进行了支持,不同的 TP 参数对模型的构造、权重处理、分配 cache 都存在影响。ModelAgent 对这些内容进行了封装,让 Engine 不用再关心这部分细节。 + +ModelAgent 有两个重要组件: + +1. patched_model 是更新后的 huggingface 模型,更新后的模型添加了各种功能的支持,包括更高性能的子模块实现、TP、量化等等 +2. 
cache_engine 是缓存的分配与交换模块。它接收来自 scheduler 的交换请求,执行 host-device 间显存交换,adapter 加载等工作 + +## Patching + +为了降低接入模型的门槛,我们实现了一套简单的 patch 机制来简化实现的替换。 + +以 Llama 模型的 LlamaAttention.forward 为例,我们可以重新写一个 forward 的实现: + +```python +class CustomLlamaAttention(nn.Module): + def forward(self, ...): + # custom forward +``` + +然后在 `lmdeploy.pytorch.models.module_map` 中注册模块的映射 + +```python +MODULE_MAP.update({ +'transformers.models.llama.modeling_llama.LlamaAttention': +'qualname.to.CustomLlamaAttention'}) +``` + +经过 patch 后的模型就会使用新的 forward 实现。TP、量化等功能也依赖 patch 机制,这里不做太多展开。 + +## 能力 + +- continuous batching: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 + +- Tensor Parallelism: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 + +- S-LoRA: LoRA adapter 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 + +- Quantization: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8.md](w8a8.md) 了解更多细节。 From 6d64d3e9b4fc99bde6379371c3828afb8e39acf9 Mon Sep 17 00:00:00 2001 From: grimoire Date: Tue, 2 Jan 2024 16:26:24 +0800 Subject: [PATCH 2/9] add contrib --- docs/zh_cn/pytorch.md | 2 +- docs/zh_cn/pytorch_contributing.md | 312 +++++++++++++++++++++++++++++ 2 files changed, 313 insertions(+), 1 deletion(-) create mode 100644 docs/zh_cn/pytorch_contributing.md diff --git a/docs/zh_cn/pytorch.md b/docs/zh_cn/pytorch.md index 122c80710f..a244031981 100644 --- a/docs/zh_cn/pytorch.md +++ b/docs/zh_cn/pytorch.md @@ -4,7 +4,7 @@ ## 设计 -[PLACEHOLDER] +\[PLACEHOLDER\] ## API diff --git a/docs/zh_cn/pytorch_contributing.md b/docs/zh_cn/pytorch_contributing.md new file mode 100644 index 0000000000..36441cb864 --- /dev/null +++ b/docs/zh_cn/pytorch_contributing.md @@ -0,0 +1,312 @@ +# lmdeploy.pytorch 新模型支持 + +lmdeploy.pytorch 被设计用来简化新模型的支持以及原型的开发,新模型的支持依赖于 patch 机制,对原模型做修改以及功能添加,以期可以最大程度上复用模型的原始实现,减少工作量。 + +## 模型支持 + +我们以 transformers 中的 llama 实现来介绍模型支持的流程 + +在开始之前,我们首先要了解一下模型的输入。lmdeploy.pytorch 的输入与标准 transformers 模型的输入略有不同,差异主要体现在如下方面: + +1. 由于支持了 continuous batching,一个 batch 的输入 `input_ids` 会被拼接成一维的长序列,然后 `unsqueeze(0)` 来保证输入维度与 transformers 中相同。这样的输入不会影响 MLP 以及 RMSNorm 等模块的计算。 +2. 由于添加了对 paged attention 的支持,`past_key_value` 不再是原来的大小,而是一组形状为 `[num_blocks, block_size, num_heads, head_dim]` 的 cache 块,num_blocks 为总 block 数量,由可用显存大小决定,block_size 为预设的块大小。这样的输入改变会影响到 LlamaModel 和 LlamaAttention 的计算,因此要对这两个模块的实现进行修改。 +3. 
由于上述输入的改变,模型中需要一些额外的输入来支持推理,比如 batch 中的序列起始位置和长度,kv cache 的 block table 等。这些输入并不在模块的 forward 参数列表中,我们需要维护一个上下文以获得这些输入。 + +上面的输入改动会影响 LlamaModel 和 LlamaAttention,首先我们来实现新的 LlamaModel,这是对原始实现的简化,我们删除了很多检查代码,以避免由于输入改变造成的断言失败,仅保留了最小程度的代码: + +```python +# lmdeploy/pytorch/models/llama.py + +class LlamaModel(nn.Module): + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + """Rewrite implementation of LlamaModel.forward.""" + inputs_embeds = self.embed_tokens(input_ids) + hidden_states = inputs_embeds + + # decoder layers + for idx, decoder_layer in enumerate(self.layers): + past_key_value = past_key_values[idx] + layer_outputs = decoder_layer( + hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = layer_outputs[0] + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=past_key_values, + hidden_states=None, + attentions=None, + ) +``` + +然后是对 LlamaAttention 模块的改写。按顺序实现如下操作: + +1. kqv proj +2. rotary embedding +3. 填充 kv cache +4. MHA 计算 +5. o proj + +continuous batching 和 kv cache 的改动对该模块的影响比较大 + +```python +# lmdeploy/pytorch/models/llama.py +from lmdeploy.pytorch.kernels import apply_rotary_pos_emb, fill_kv_cache, paged_attention_fwd + +class LlamaAttention(nn.Module): + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], + Optional[Tuple[torch.Tensor]]]: + """Rewrite of LlamaAttention.forward.""" + context = self.context.context + history_lengths = context.history_lengths + position_ids_1d = context.position_ids_1d + block_offsets = context.block_offsets + + # qkv proj + query_states = q_proj(hidden_states) + key_states = k_proj(hidden_states) + value_states = v_proj(hidden_states) + query_states = query_states.view(-1, num_heads, head_dim) + key_states = key_states.view(-1, num_kv_heads, head_dim) + value_states = value_states.view(-1, num_kv_heads, head_dim) + + # rotary embedding + max_seq_len = position_ids.size(-1) + kv_seq_len = max_seq_len + max(history_lengths) + if kv_seq_len >= self.rotary_emb.max_seq_len_cached: + cos, sin = self.rotary_emb(value_states, + seq_len=kv_seq_len + 128) + query_states, key_states = apply_rotary_pos_emb( + query_states, + key_states, + self.rotary_emb.cos_cached, + self.rotary_emb.sin_cached, + position_ids, + position_ids_1d, + q_embed=query_states, + k_embed=key_states) + + # fill kv cache + kv_seq_length = context.kv_seq_length + q_seq_length = context.seq_length + q_start_loc = context.q_start_loc + fill_kv_cache(key_states, + value_states, + past_key_value[0], + past_key_value[1], + q_start_loc, + q_seq_length, + block_offsets=block_offsets, + history_lengths=history_lengths, + context=context) + + # attention + attn_output = query_states + 
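        # attn_output 与 query_states 共用同一块缓冲区，作为输出张量传入下面的 paged_attention_fwd
        # kv cache（past_key_value[0] / past_key_value[1]）的布局为 [num_blocks, block_size, num_heads, head_dim]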
block_size = past_key_value[0].size(1) + paged_attention_fwd( + query_states, + past_key_value[0], + past_key_value[1], + attn_output, + block_offsets, + b_start_loc=q_start_loc, + b_seq_len=q_seq_length, + b_kv_seq_len=kv_seq_length, + max_input_len=max_seq_len, + ) + hidden_size = num_heads * head_dim + attn_output = attn_output.reshape(*hidden_states.shape[:-1], hidden_size) + + # o proj + attn_output = o_proj(attn_output) + return attn_output, None, past_key_value +``` + +上面的代码有几处值得注意的地方,首先是 context 对象。我们需要 history_lengths、block_offsets 等参数辅助运算,这些参数无法通过模型的 forward 函数传递进来。因此我们维护了一个 context 对象,把几乎所有可能用到的输入参数都保存在其中,方便在各个模块间共享。context 对象可以通过 `self.context.context` 来访问,结构可以参考 [context-结构](#context-结构)。 + +另一个值得注意的地方就是自定义 kernel,由于输入形式的改变,原来的 LlamaAttention 实现变得不再适用,为了保证推理的速度和正确性,我们在 lmdeploy.pytorch.kernels 中实现了许多自定义的 triton kernel,上面的模块中就用到了 `apply_rotary_pos_emb`,`fill_kv_cache` 和 `paged_attention_fwd` ,分别负责实现 rotary embedding,填充 kv cache 还有 attention 的计算。 + +有了上述的两个模块后,还需要将他们注册到 `lmdeploy/pytorch/models/module_map.py` 中,进行原模块与 patch 模块的映射 + +```python +# lmdeploy/pytorch/models/module_map.py +MODEL_MAP.update({ + 'transformers.models.llama.modeling_llama.LlamaAttention': + 'lmdeploy.pytorch.models.llama.LlamaAttention', + 'transformers.models.llama.modeling_llama.LlamaModel': + 'lmdeploy.pytorch.models.llama.LlamaModel' +}) +``` + +完成注册后,Engine 在启动时就会将这两个模块 patch 成新的实现,完成后续的部署任务。 + +## Tensor 并发支持 + +为了支持 Tensor 并发,需要对模型的权重做切分。让我们试着为上面接入的 Llama 模型添加 TP 的支持。 + +Llama 中涉及到 Tensor 并发的模块是 LlamaAttention 中的 qkvo proj 和 LlamaMLP 中的 gate,up 和 down proj。其中 o_proj 和 down_proj 需要按行切分,剩下的按列切分。我们可以在对应的模块中实现 `_distribution_partition_fn` 函数: + +```python +# lmdeploy/pytorch/models/llama.py +from ..dist_utils import (colwise_parallelize_linear_fn, + rowwise_parallelize_linear_fn) + +class LlamaAttention(nn.Module): + @classmethod + def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module, + device_mesh: DeviceMesh): + """Distribution partition callback.""" + if mod_name in ['q_proj', 'k_proj', 'v_proj']: + colwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + elif mod_name in ['o_proj']: + rowwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + +class LlamaMLP(nn.Module): + @classmethod + def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module, + device_mesh: DeviceMesh): + """Distribution partition callback.""" + if mod_name in ['gate_proj', 'up_proj']: + colwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + elif mod_name in ['down_proj']: + rowwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + +``` + +`_distribute_partition_fn` 会在加载模型权重时被调用,对应的权重会被按照特定的形式分配到对应的设备中。 + +按照目前的方案切分后的权重,需要对 o_proj 和 down_proj 的结果进行 all_reduce 操作才能得到正确的结果。可以选择将 all_reduce 放在模型的 forward 函数中,也可以选择另一种方案,添加 `_distribute_output_fn` 函数: + +```python +# lmdeploy/pytorch/models/llama.py +import torch.distributed as dist + +class LlamaAttention(nn.Module): + @classmethod + def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh): + """Distribution output hook.""" + dist.all_reduce(outputs[0]) + return outputs + +class LlamaMLP(nn.Module): + @classmethod + def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh): + """Distribution output hook.""" + dist.all_reduce(outputs) + return outputs +``` + +最后别忘了将 LlamaMLP 也注册进 module_map 中 + +```python +# lmdeploy/pytorch/models/module_map.py +MODEL_MAP.update({ + 'transformers.models.llama.modeling_llama.LlamaMLP': + 
'lmdeploy.pytorch.models.llama.LlamaMLP' +}) +``` + +这样就可以利用多卡的优势,让更大的模型部署成为可能 + +## 附录 + +### context 结构 + +```python +@dataclass +class StepContext: + """context of Model. + """ + inputs: ModelInputs + block_offsets: torch.LongTensor + position_ids: torch.LongTensor + position_ids_1d: torch.LongTensor + q_start_loc: torch.LongTensor + history_lengths: torch.LongTensor + seq_length: torch.LongTensor + max_seq_length: int + kv_seq_length: torch.LongTensor + kv_caches: List + is_decoding: bool + world_size: int = 1 + json_config: Dict = None + local_adapter_ids: torch.LongTensor = None + global_adapter_ids: torch.LongTensor = None + adapter_offsets: torch.LongTensor = None + max_rank: int = 0 +``` + +### FAQ + +- 如何访问 patch 前的模块? + +有时我们只希望在函数前后加一个 hook 代码,不希望大段的拷贝函数,可以通过 `self.origin_mod` 访问 patch 前的模块。 + +- 非 transformers 官方的模型该如何注册? + +一些模型的实现代码可能是以 remote code 的形式添加的,这样的模块无法通过完整的 qualname 来定位。lmdeploy.pytorch 支持使用缩写的模块名进行注册: + +```python +MODULE_MAP.update({ + 'modeling_internlm.InternLMAttention': + 'lmdeploy.pytorch.models.internlm.PatchedInternLMAttention', +}) +``` + +缩写的优先级会更低,有条件的话还是鼓励使用完整的 qualname 进行注册。 + +- 模块出现同名但不同实现怎么处理? + +目前推荐的做法是同名就映射到同一个实现中,然后在实现内部根据模块的固有参数来判断模型该使用的类型,以 baichuan2 7b/13b 为例: + +```python +class BaichuanModel(nn.Module): + def forward(self, ...): + if self.config.num_hidden_layers == 32: + return forward_7b(...) + else: + return forward_default(...) +``` + +- 如果希望在推理前对模块进行初始化? + +可以实现模块的 `_update_model_fn` 函数,它会在模块的权重都加载完,完成 TP 权重切分后被调用 + +```python +class LlamaAttention: + def _update_model_fn(self): + # ADD YOUR CODE HERE +``` From 9a4e6831d310dddf1c6bc9d1522ffc2272b50909 Mon Sep 17 00:00:00 2001 From: grimoire Date: Wed, 3 Jan 2024 17:44:13 +0800 Subject: [PATCH 3/9] merge main --- README_zh-CN.md | 2 +- .../{pytorch_contributing.md => advance/pytorch_new_model.md} | 0 docs/zh_cn/index.rst | 1 + docs/zh_cn/{ => inference}/pytorch.md | 2 +- 4 files changed, 3 insertions(+), 2 deletions(-) rename docs/zh_cn/{pytorch_contributing.md => advance/pytorch_new_model.md} (100%) rename docs/zh_cn/{ => inference}/pytorch.md (98%) diff --git a/README_zh-CN.md b/README_zh-CN.md index a5df5c57e2..44dbe47bd9 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -103,7 +103,7 @@ LMDeploy TurboMind 引擎拥有卓越的推理能力,在各种规模的模型 - 用户指南 - 推理pipeline - [推理引擎 - TurboMind](./docs/zh_cn/inference/turbomind.md) - - 推理引擎 - PyTorch + - [推理引擎 - PyTorch](./docs/zh_cn/inference/pytorch.md) - [推理服务](./docs/zh_cn/serving/restful_api.md) - [模型量化](./docs/zh_cn/quantization) - 进阶指南 diff --git a/docs/zh_cn/pytorch_contributing.md b/docs/zh_cn/advance/pytorch_new_model.md similarity index 100% rename from docs/zh_cn/pytorch_contributing.md rename to docs/zh_cn/advance/pytorch_new_model.md diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst index a9f9a9f58e..a4535cf527 100644 --- a/docs/zh_cn/index.rst +++ b/docs/zh_cn/index.rst @@ -57,6 +57,7 @@ :caption: 进阶指南 serving/qos.md + advance/pytorch_new_model.md 索引与表格 diff --git a/docs/zh_cn/pytorch.md b/docs/zh_cn/inference/pytorch.md similarity index 98% rename from docs/zh_cn/pytorch.md rename to docs/zh_cn/inference/pytorch.md index a244031981..a210364f15 100644 --- a/docs/zh_cn/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -4,7 +4,7 @@ ## 设计 -\[PLACEHOLDER\] +![pytorch arch](https://github.com/grimoire/lmdeploy/blob/media/lmdeploy_pytorch_arch.png?raw=true) ## API From 8e35681181b0641bfe544a77e8c0e6361264c9e1 Mon Sep 17 00:00:00 2001 From: grimoire Date: Mon, 8 Jan 2024 11:26:07 +0800 Subject: [PATCH 4/9] docs --- README.md | 2 +- 
docs/en/advance/pytorch_new_model.md | 320 ++++++++++++++++++++++++ docs/en/inference/pytorch.md | 126 +++++----- docs/zh_cn/advance/pytorch_new_model.md | 4 +- docs/zh_cn/inference/pytorch.md | 12 +- 5 files changed, 397 insertions(+), 67 deletions(-) create mode 100644 docs/en/advance/pytorch_new_model.md diff --git a/README.md b/README.md index 3a9377f662..d1093a2899 100644 --- a/README.md +++ b/README.md @@ -102,7 +102,7 @@ For detailed user guides and advanced guides, please refer to our [tutorials](ht - User Guide - Inference pipeline - [Inference Engine - TurboMind](docs/en/inference/turbomind.md) - - Inference Engine - PyTorch + - [Inference Engine - PyTorch](docs/zh_cn/inference/pytorch.md) - [Serving](docs/en/serving/restful_api.md) - [Quantization](docs/en/quantization) - Advance Guide diff --git a/docs/en/advance/pytorch_new_model.md b/docs/en/advance/pytorch_new_model.md new file mode 100644 index 0000000000..6baaca1b22 --- /dev/null +++ b/docs/en/advance/pytorch_new_model.md @@ -0,0 +1,320 @@ +# How to support new model in lmdeploy.pytorch + +lmdeploy.pytorch is designed to ease new model deployment and prototype verification. If you are willing to use our engine, here is the tutorial. + +## Support New Model + +Let's start with Llama. + +before we start, let's take a look at the inputs of the model. To support new features in our engine, the inputs are a little bit different from the inputs in transformers. + +1. Continuous batching is used to avoid batch padding, so the `input_ids` would be the concatenation of all input sequence in batch, than `unsqueeze(0)` to match the dimension of origin input_ids. +2. Paged attention is used to reduce the memory usage of key/value cache, `past_key_value` become a big Tensor with shape `[num_blocks, block_size, num_heads, head_dim]`, where num_blocks is the number of page block, block_size is the the size of each block. +3. Extra inputs are necessary to support the inputs above, such as block table, history length. These extra inputs are not listed in arguments of origin forward method. A context object is used to provide these info. + +Because of the change of the inputs above, we need to rewrite forward of `LlamaModel` and `LlamaAttention` to fit the new inputs. 
First, let's rewrite the `LlamaModel`, we only keep the minimal codes to support deployment: + +```python +# lmdeploy/pytorch/models/llama.py + +class LlamaModel(nn.Module): + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPast]: + """Rewrite implementation of LlamaModel.forward.""" + inputs_embeds = self.embed_tokens(input_ids) + hidden_states = inputs_embeds + + # decoder layers + for idx, decoder_layer in enumerate(self.layers): + past_key_value = past_key_values[idx] + layer_outputs = decoder_layer( + hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + output_attentions=output_attentions, + use_cache=use_cache, + ) + hidden_states = layer_outputs[0] + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=past_key_values, + hidden_states=None, + attentions=None, + ) +``` + +For LlamaAttention module, we need to perform following steps: + +1. kqv proj +2. rotary embedding +3. filling kv cache +4. MHA +5. o proj + +```python +# lmdeploy/pytorch/models/llama.py +from lmdeploy.pytorch.kernels import apply_rotary_pos_emb, fill_kv_cache, paged_attention_fwd + +class LlamaAttention(nn.Module): + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + output_attentions: bool = False, + use_cache: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], + Optional[Tuple[torch.Tensor]]]: + """Rewrite of LlamaAttention.forward.""" + context = self.context.context + history_lengths = context.history_lengths + position_ids_1d = context.position_ids_1d + block_offsets = context.block_offsets + + # qkv proj + query_states = q_proj(hidden_states) + key_states = k_proj(hidden_states) + value_states = v_proj(hidden_states) + query_states = query_states.view(-1, num_heads, head_dim) + key_states = key_states.view(-1, num_kv_heads, head_dim) + value_states = value_states.view(-1, num_kv_heads, head_dim) + + # rotary embedding + max_seq_len = position_ids.size(-1) + kv_seq_len = max_seq_len + max(history_lengths) + if kv_seq_len >= self.rotary_emb.max_seq_len_cached: + cos, sin = self.rotary_emb(value_states, + seq_len=kv_seq_len + 128) + query_states, key_states = apply_rotary_pos_emb( + query_states, + key_states, + self.rotary_emb.cos_cached, + self.rotary_emb.sin_cached, + position_ids, + position_ids_1d, + q_embed=query_states, + k_embed=key_states) + + # fill kv cache + kv_seq_length = context.kv_seq_length + q_seq_length = context.seq_length + q_start_loc = context.q_start_loc + fill_kv_cache(key_states, + value_states, + past_key_value[0], + past_key_value[1], + q_start_loc, + q_seq_length, + block_offsets=block_offsets, + history_lengths=history_lengths, + context=context) + + # attention + attn_output = query_states + block_size = past_key_value[0].size(1) + paged_attention_fwd( + query_states, + past_key_value[0], + past_key_value[1], + attn_output, + block_offsets, + 
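            # block_offsets is the per-sequence block table of the paged kv cache;
            # the b_* arguments below describe the packed (continuous-batched) layout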
b_start_loc=q_start_loc, + b_seq_len=q_seq_length, + b_kv_seq_len=kv_seq_length, + max_input_len=max_seq_len, + ) + hidden_size = num_heads * head_dim + attn_output = attn_output.reshape(*hidden_states.shape[:-1], hidden_size) + + # o proj + attn_output = o_proj(attn_output) + return attn_output, None, past_key_value +``` + +Notice that some arguments such as `history_lengths` and `block_offsets` comes from `self.context.context`. As we have mentioned above, continuous batching and paged attention require extra arguments to support them, `context` is the container to store these inputs. If you need more detail about context object, please read [context info](#context-info). + +We replace some operation to our custom triton kernel for two reason. + +1. Custom triton kernel can be used to support new features such as `paged_attention_fwd`. +2. Fuse kernels have better performance than the pure PyTorch implementation. + +Now we have new implementations of two modules, let's register them into `lmdeploy/pytorch/models/module_map.py`. + +```python +# lmdeploy/pytorch/models/module_map.py +MODEL_MAP.update({ + 'transformers.models.llama.modeling_llama.LlamaAttention': + 'lmdeploy.pytorch.models.llama.LlamaAttention', + 'transformers.models.llama.modeling_llama.LlamaModel': + 'lmdeploy.pytorch.models.llama.LlamaModel' +}) +``` + +The rewritten module has been mapped to the origin module. When we create an Engine, ModelAgent would patch the model automatically, then we can perform inference with these new implementation. + +## Support Tensor Parallelism + +If we want to support tensor parallelism(tp), we have partition the weights in the model. Let's try extend the rewrite above. + +In Llama (and most LLM), most Linear layers are involved in the weight partition. Among them: + +- `LlamaAttention`: `q_proj`, `k_proj`, `v_proj` need column wise partition; `o_proj` needs row wise partition. +- `LlamaMLP`: `gate_proj`, `up_proj` need column wise partition; `down_proj` needs row wise partition. + +We can implement `_distribution_partition_fn` in each rewrite modules: + +```python +# lmdeploy/pytorch/models/llama.py +from ..dist_utils import (colwise_parallelize_linear_fn, + rowwise_parallelize_linear_fn) + +class LlamaAttention(nn.Module): + @classmethod + def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module, + device_mesh: DeviceMesh): + """Distribution partition callback.""" + if mod_name in ['q_proj', 'k_proj', 'v_proj']: + colwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + elif mod_name in ['o_proj']: + rowwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + +class LlamaMLP(nn.Module): + @classmethod + def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module, + device_mesh: DeviceMesh): + """Distribution partition callback.""" + if mod_name in ['gate_proj', 'up_proj']: + colwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + elif mod_name in ['down_proj']: + rowwise_parallelize_linear_fn(mod, + device_mesh=device_mesh, + to_local=True) + +``` + +`_distribute_partition_fn` would be called when loading model weights, the weights of special module would be distributed to different devices. + +After partition, we need to perform `all_reduce` on the output of `o_proj` and `down_proj`. 
Of cause you can just put `all_reduce` in the forward method, another option is add an `_distribute_output_fn` call: + +```python +# lmdeploy/pytorch/models/llama.py +import torch.distributed as dist + +class LlamaAttention(nn.Module): + @classmethod + def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh): + """Distribution output hook.""" + dist.all_reduce(outputs[0]) + return outputs + +class LlamaMLP(nn.Module): + @classmethod + def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh): + """Distribution output hook.""" + dist.all_reduce(outputs) + return outputs +``` + +Don't forget to add `LlamaMLP` in `module_map`. + +```python +# lmdeploy/pytorch/models/module_map.py +MODEL_MAP.update({ + 'transformers.models.llama.modeling_llama.LlamaMLP': + 'lmdeploy.pytorch.models.llama.LlamaMLP' +}) +``` + +That's all. Now it is possible to utilize multiple GPUs to deploy LLM. + +## Appendix + +### context info + +```python +@dataclass +class StepContext: + """context of Model. + """ + inputs: ModelInputs + block_offsets: torch.LongTensor + position_ids: torch.LongTensor + position_ids_1d: torch.LongTensor + q_start_loc: torch.LongTensor + history_lengths: torch.LongTensor + seq_length: torch.LongTensor + max_seq_length: int + kv_seq_length: torch.LongTensor + kv_caches: List + is_decoding: bool + world_size: int = 1 + json_config: Dict = None + local_adapter_ids: torch.LongTensor = None + global_adapter_ids: torch.LongTensor = None + adapter_offsets: torch.LongTensor = None + max_rank: int = 0 +``` + +### FAQ + +- How to call origin forward? + +It is a common practice to add hooks to a method instead a full rewrite. You can use `self.origin_mod` to visit the unpatched module. + +- How to register modules in remote code? + +Some modules are contained in remote code, it is hard to locate the module with `qualname`. `lmdeploy.pytorch` support register them with abbreviation: + +```python +MODULE_MAP.update({ + 'modeling_internlm.InternLMAttention': + 'lmdeploy.pytorch.models.internlm.PatchedInternLMAttention', +}) +``` + +> \[!NOTE\] +> +> Abbreviation tends to have a low priority. It is recommend to register modules with `qualname`. + +- How to support different modules with same name? + +You can support them in the same rewrite module, and give them different implement by their attribute, take `baichuan2` 7b/13b as example: + +```python +class BaichuanModel(nn.Module): + def forward(self, ...): + if self.config.num_hidden_layers == 32: + return forward_7b(...) + else: + return forward_default(...) +``` + +- How to do post-initialization for rewrite module? + +Add a `_update_model_fn` method, it will be called after weight loading. + +```python +class LlamaAttention: + def _update_model_fn(self): + # ADD YOUR CODE HERE +``` diff --git a/docs/en/inference/pytorch.md b/docs/en/inference/pytorch.md index e4cd5a9cbe..baf94119f6 100644 --- a/docs/en/inference/pytorch.md +++ b/docs/en/inference/pytorch.md @@ -1,74 +1,82 @@ -# Pytorch +# Architecture of lmdeploy.pytorch -## Chat in command line +`lmdeploy.pytorch` is an inference engine in LMDeploy. It provides a developer friendly framework to users who want to deploy their own model and develop new features. -LMDeploy support chatting with PyTorch models with submodule `lmdeploy.pytorch.chat`. +## Design -This submodule allow user to chat with language model through command line, and optionally accelerate model using backends like deepspeed. 
+![pytorch arch](https://github.com/grimoire/lmdeploy/blob/media/lmdeploy_pytorch_arch.png?raw=true) -**Example 1**: Chat with default setting +## API -```shell -lmdeploy chat torch $PATH_TO_HF_MODEL -``` +`lmdeploy.pytorch` share service interfaces with `Turbomind`, these interfaces perform inference through `Engine` and `EngineInstance` in lmdeploy.pytorch. -**Example 2**: Disable sampling and chat history +EngineInstance is the inference request sender, it will pack the inference request and send the packed request to Engine. EngineInstance is thread-safe, multiple threads can send request through their own EngineInstance simultaneously. Engine will perform batching automatically according to resources usage. -```shell -lmdeploy chat torch \ - $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \ - --temperature 0 --max-history 0 -``` +Engine is the request receiver and executor. It contain modules that support the task as follow: -**Example 3**: Accelerate with deepspeed inference +- `ModelAgent` is a wrapper of the model. It is responsible for loading model/adapters, cache management and tensor parallelism. +- `Scheduler` is the sequence manager. It will decide which sequences and adapters would participated in current step, then allocate resources for them. +- `RequestManager` is responsible for request sending and receiving. It is the bridge between Engine and EngineInstance. -```shell -lmdeploy chat torch \ - $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \ - --accel deepspeed -``` +## Engine + +Engine would response the requests in a sub-thread, looping as following: + +1. Get new requests through RequestManager. These requests would be cached. +2. Scheduler perform scheduling, decide which cached requests should be processed and allocate resources for them. +3. ModelAgent would swap the caches according to the information provided by Scheduler, then performing inference with the patched model. +4. Scheduler update the status of requests according to the inference result of ModelAgent. +5. RequestManager response to the sender (EngineInstance), back to step 1. + +Let's dive deeper into these modules. + +### Scheduler + +It is a common practice to cache history key and value states in LLM inference to prevent redundant computation. Since history lengths are different in batch of sequences, we have to padding the caches so we can perform the batching inference. The padding would waste a lot of memory and limit the performance of the transformer. + +[vLLM](https://docs.vllm.ai) provide a paging based strategy, allocating caches in page blocks to prevent extra memory usage. The Scheduler module in our Engine share the same design, allocating resources according to the sequence length in blocks and evicting unused blocks to support larger batching and longer session length. + +We also support [S-LoRA](https://github.com/S-LoRA/S-LoRA). S-LoRA can be used to support multiple LoRA adapters on limited memory. + +### ModelAgent + +lmdeploy.pytorch support Tensor Parallelism, which would leads to complex model initialization, cache allocation and weight partition. ModelAgent is designed to hide these details so Engine just need to focus on maintaining the pipeline. -Note: to use deepspeed, you need to install deepspeed, and if hope to accelerate InternLM, you need a customized version +ModelAgent is composed of two component: -**Example 4**: Tensor parallel the model on 2 GPUs +1. `patched_model` is the transformer model after patch. 
Compared to the origin model, patched model has more features, such as TP, quantization and high performance kernels. +2. `cache_engine` is the maintainer of caches. It receive command from Scheduler, perform host-device page swap. Only gpu blocks can be used to cache key/value and adapters. -```shell -deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \ - $PATH_TO_LLAMA_MODEL_IN_HF_FORMAT \ - --accel deepspeed \ +## Patching + +In order to ease the deployment of new model, we have develop a tool to patch the modules. + +Let's say, if we want to reimplement the forward of `LlamaAttention.forward`: + +```python +class CustomLlamaAttention(nn.Module): + def forward(self, ...): + # custom forward ``` -This module also allow the following control commands to change generation behaviors during chat. - -- `exit`: terminate and exit chat -- `config set key=value`: change generation config `key` to `value`, e.g. config temperature=0 disable sampling for following chats -- `clear`: clear chat history - -### Simple diagram of components - -```mermaid -graph LR; - subgraph model specific adapter - p((user_input))-->tokenize-->id((input_ids))-->decorate - tmpl_ids((template_ids))-->decorate; - end - subgraph generate - model[CausalLM_model.generate]-->gen_result(("gen_result")) - gen_result-->hid - gen_result-->attn((attention)) - end - subgraph streamer - model-->s[streamer]--value-->decode_single--token-->output - end - subgraph session_manager - prepend_history-->fullid((complete_ids)); - trim-->prepend_history - end - decorate-->prepend_history - hid((history_ids))-->trim; - attn-->trim; - fullid-->model - tokenizer((tokenizer))-->decode_single - tokenizer-->tokenize - p-->genconfig(GenConfig)-->model +Just register the implementation above into `lmdeploy.pytorch.models.module_map`. + +```python +MODULE_MAP.update({ +'transformers.models.llama.modeling_llama.LlamaAttention': +'qualname.to.CustomLlamaAttention'}) ``` + +ModelAgent would load and patch `LlamaAttention` with `CustomLlamaAttention` and leave anything other unchanged. Than you can perform inference with the new implementation. + +## Features + +lmdeploy.pytorch support new features include: + +- Continuous Batching: Since the sequence length in a batch might be different, padding is required to support batching inference. Large padding leads to extra memory usage and useless computation. We use continuous batching, concatenate all sequence into a single long sequence to avoid padding. + +- Tensor Parallelism: The GPU memory usage of LLM might be larger than the memory of a single GPU. Tensor parallelism can be used to fit such model on multiple devices. Each device has parts of the model and can be computed simultaneous, the result would be gathered to ensure the correctness. + +- S-LoRA: LoRA adapter can be used to support training LLM on device with limited memory. It is a common practice to merge adapter into weights of the model before deployment, load multiple adapter in such way would consume a lot of memory. We have support S-LoRA, adapters would be paged and swapped in when necessary, special kernels are developed to support inference with unmerged adapters. Which made it possible to load a lot of different adapters. + +- Quantization: Model quantization perform computation with low precision. lmdeploy.pytorch has support w8a8 quantization. Read [w8a8](../quantization/w8a8.md) for more details. 
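To make the continuous batching idea above concrete, here is a minimal sketch of how variable-length requests could be packed into one sequence together with the bookkeeping that the patched model later reads from its context (`q_start_loc`, `seq_length`). The helper below is illustrative only and is not part of lmdeploy:

```python
import torch

def pack_sequences(seqs):
    """Pack variable-length token-id lists into one continuous batch.

    Returns input_ids of shape [1, total_len] plus per-sequence start
    locations and lengths, mirroring the `q_start_loc`/`seq_length`
    bookkeeping used by the engine.
    """
    seq_length = torch.tensor([len(s) for s in seqs])
    q_start_loc = torch.cumsum(seq_length, dim=0) - seq_length
    input_ids = torch.cat([torch.tensor(s) for s in seqs]).unsqueeze(0)
    return input_ids, q_start_loc, seq_length

# three requests with different lengths, packed without any padding
input_ids, q_start_loc, seq_length = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(input_ids.shape)  # torch.Size([1, 9])
print(q_start_loc)      # tensor([0, 3, 5])
print(seq_length)       # tensor([3, 2, 4])
```

Because nothing is padded, position-independent modules such as the MLP and RMSNorm run on the packed sequence unchanged; only the attention needs the extra bookkeeping.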
diff --git a/docs/zh_cn/advance/pytorch_new_model.md b/docs/zh_cn/advance/pytorch_new_model.md index 36441cb864..0af2e1da93 100644 --- a/docs/zh_cn/advance/pytorch_new_model.md +++ b/docs/zh_cn/advance/pytorch_new_model.md @@ -286,7 +286,9 @@ MODULE_MAP.update({ }) ``` -缩写的优先级会更低,有条件的话还是鼓励使用完整的 qualname 进行注册。 +> \[!NOTE\] +> +> 缩写的优先级会更低,有条件的话还是鼓励使用完整的 qualname 进行注册。 - 模块出现同名但不同实现怎么处理? diff --git a/docs/zh_cn/inference/pytorch.md b/docs/zh_cn/inference/pytorch.md index a210364f15..c8ad790834 100644 --- a/docs/zh_cn/inference/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -1,4 +1,4 @@ -# Pytorch +# lmdeploy.pytorch 架构 `lmdeploy.pytorch` 是 LMDeploy 提供的推理后端之一。与着重于性能的 turbomind 相比,lmdeploy.pytorch 以较小的性能开销为代价,提供了一套更容易开发与扩展的大模型推理实现。 @@ -14,7 +14,7 @@ EngineInstance 是推理请求的发起者,它会将推理请求组织成特 Engine 是推理请求的接收与执行者。它包含如下的组件来完成这项任务: -- ModelAgent 对象负责模型的加载、缓存管理以及 TensorParallelism 的管理。 +- ModelAgent 对象负责模型的加载、缓存管理以及 tensor parallelism 的管理。 - Scheduler 对象负责 session 的管理,sequence 与 lora adapter 所需要的资源的分配。 - RequestManager 负责请求的发送与接收,可以通过它与 EngineInstance 交互。 @@ -44,7 +44,7 @@ lmdeploy.pytorch 中对 Tensor Parallelism(TP)进行了支持,不同的 TP ModelAgent 有两个重要组件: -1. patched_model 是更新后的 huggingface 模型,更新后的模型添加了各种功能的支持,包括更高性能的子模块实现、TP、量化等等 +1. patched_model 是更新后的 transformer 模型,更新后的模型添加了各种功能的支持,包括更高性能的子模块实现、TP、量化等等 2. cache_engine 是缓存的分配与交换模块。它接收来自 scheduler 的交换请求,执行 host-device 间显存交换,adapter 加载等工作 ## Patching @@ -69,12 +69,12 @@ MODULE_MAP.update({ 经过 patch 后的模型就会使用新的 forward 实现。TP、量化等功能也依赖 patch 机制,这里不做太多展开。 -## 能力 +## 特性 -- continuous batching: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 +- Continuous Batching: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 - Tensor Parallelism: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 - S-LoRA: LoRA adapter 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 -- Quantization: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8.md](w8a8.md) 了解更多细节。 +- Quantization: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8](../quantization/w8a8.md) 了解更多细节。 From 4b24014212c6a69f376166303ad4789dcb8f26ae Mon Sep 17 00:00:00 2001 From: grimoire Date: Mon, 8 Jan 2024 11:31:20 +0800 Subject: [PATCH 5/9] bord --- docs/en/advance/pytorch_new_model.md | 8 ++++---- docs/en/inference/pytorch.md | 8 ++++---- docs/zh_cn/advance/pytorch_new_model.md | 8 ++++---- docs/zh_cn/inference/pytorch.md | 8 ++++---- 4 files changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/en/advance/pytorch_new_model.md b/docs/en/advance/pytorch_new_model.md index 6baaca1b22..093f6dcfad 100644 --- a/docs/en/advance/pytorch_new_model.md +++ b/docs/en/advance/pytorch_new_model.md @@ -277,11 +277,11 @@ class StepContext: ### FAQ -- How to call origin forward? +- **How to call origin forward?** It is a common practice to add hooks to a method instead a full rewrite. You can use `self.origin_mod` to visit the unpatched module. -- How to register modules in remote code? +- **How to register modules in remote code?** Some modules are contained in remote code, it is hard to locate the module with `qualname`. `lmdeploy.pytorch` support register them with abbreviation: @@ -296,7 +296,7 @@ MODULE_MAP.update({ > > Abbreviation tends to have a low priority. 
It is recommend to register modules with `qualname`. -- How to support different modules with same name? +- **How to support different modules with same name?** You can support them in the same rewrite module, and give them different implement by their attribute, take `baichuan2` 7b/13b as example: @@ -309,7 +309,7 @@ class BaichuanModel(nn.Module): return forward_default(...) ``` -- How to do post-initialization for rewrite module? +- **How to do post-initialization for rewrite module?** Add a `_update_model_fn` method, it will be called after weight loading. diff --git a/docs/en/inference/pytorch.md b/docs/en/inference/pytorch.md index baf94119f6..af03dfa2f3 100644 --- a/docs/en/inference/pytorch.md +++ b/docs/en/inference/pytorch.md @@ -73,10 +73,10 @@ ModelAgent would load and patch `LlamaAttention` with `CustomLlamaAttention` and lmdeploy.pytorch support new features include: -- Continuous Batching: Since the sequence length in a batch might be different, padding is required to support batching inference. Large padding leads to extra memory usage and useless computation. We use continuous batching, concatenate all sequence into a single long sequence to avoid padding. +- **Continuous Batching**: Since the sequence length in a batch might be different, padding is required to support batching inference. Large padding leads to extra memory usage and useless computation. We use continuous batching, concatenate all sequence into a single long sequence to avoid padding. -- Tensor Parallelism: The GPU memory usage of LLM might be larger than the memory of a single GPU. Tensor parallelism can be used to fit such model on multiple devices. Each device has parts of the model and can be computed simultaneous, the result would be gathered to ensure the correctness. +- **Tensor Parallelism**: The GPU memory usage of LLM might be larger than the memory of a single GPU. Tensor parallelism can be used to fit such model on multiple devices. Each device has parts of the model and can be computed simultaneous, the result would be gathered to ensure the correctness. -- S-LoRA: LoRA adapter can be used to support training LLM on device with limited memory. It is a common practice to merge adapter into weights of the model before deployment, load multiple adapter in such way would consume a lot of memory. We have support S-LoRA, adapters would be paged and swapped in when necessary, special kernels are developed to support inference with unmerged adapters. Which made it possible to load a lot of different adapters. +- **S-LoRA**: LoRA adapter can be used to support training LLM on device with limited memory. It is a common practice to merge adapter into weights of the model before deployment, load multiple adapter in such way would consume a lot of memory. We have support S-LoRA, adapters would be paged and swapped in when necessary, special kernels are developed to support inference with unmerged adapters. Which made it possible to load a lot of different adapters. -- Quantization: Model quantization perform computation with low precision. lmdeploy.pytorch has support w8a8 quantization. Read [w8a8](../quantization/w8a8.md) for more details. +- **Quantization**: Model quantization perform computation with low precision. lmdeploy.pytorch has support w8a8 quantization. Read [w8a8](../quantization/w8a8.md) for more details. 
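As a concrete illustration of the tensor parallelism bullet above, the sketch below splits two chained linear layers across two simulated ranks and shows why the row-wise half needs a sum over ranks, which is the role of the `all_reduce` issued in `_distribute_output_fn`. The shapes, the plain ReLU MLP and the two-rank loop are assumptions made for illustration, not the engine's real code:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)        # [tokens, hidden]
w_up = torch.randn(16, 8)    # column-parallel linear (hidden -> 16)
w_down = torch.randn(8, 16)  # row-parallel linear (16 -> hidden)

# reference result computed on a single device
ref = torch.relu(x @ w_up.t()) @ w_down.t()

partial = []
for rank in range(2):  # pretend we have two GPUs
    w_up_shard = w_up[rank * 8:(rank + 1) * 8]         # column-wise: half of the output features
    w_down_shard = w_down[:, rank * 8:(rank + 1) * 8]  # row-wise: the matching half of the input features
    h = torch.relu(x @ w_up_shard.t())                 # no communication needed here
    partial.append(h @ w_down_shard.t())               # partial result on this rank

out = partial[0] + partial[1]  # what all_reduce(SUM) does across real devices
print(torch.allclose(out, ref, atol=1e-5))  # True
```

Column-wise layers need no communication because each rank produces a disjoint slice of the activation; only the row-wise layer yields partial sums, which is why `o_proj` and `down_proj` are the ones followed by `all_reduce`.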
diff --git a/docs/zh_cn/advance/pytorch_new_model.md b/docs/zh_cn/advance/pytorch_new_model.md index 0af2e1da93..cfdbca097d 100644 --- a/docs/zh_cn/advance/pytorch_new_model.md +++ b/docs/zh_cn/advance/pytorch_new_model.md @@ -271,11 +271,11 @@ class StepContext: ### FAQ -- 如何访问 patch 前的模块? +- **如何访问 patch 前的模块?** 有时我们只希望在函数前后加一个 hook 代码,不希望大段的拷贝函数,可以通过 `self.origin_mod` 访问 patch 前的模块。 -- 非 transformers 官方的模型该如何注册? +- **非 transformers 官方的模型该如何注册?** 一些模型的实现代码可能是以 remote code 的形式添加的,这样的模块无法通过完整的 qualname 来定位。lmdeploy.pytorch 支持使用缩写的模块名进行注册: @@ -290,7 +290,7 @@ MODULE_MAP.update({ > > 缩写的优先级会更低,有条件的话还是鼓励使用完整的 qualname 进行注册。 -- 模块出现同名但不同实现怎么处理? +- **模块出现同名但不同实现怎么处理?** 目前推荐的做法是同名就映射到同一个实现中,然后在实现内部根据模块的固有参数来判断模型该使用的类型,以 baichuan2 7b/13b 为例: @@ -303,7 +303,7 @@ class BaichuanModel(nn.Module): return forward_default(...) ``` -- 如果希望在推理前对模块进行初始化? +- **如果希望在推理前对模块进行初始化?** 可以实现模块的 `_update_model_fn` 函数,它会在模块的权重都加载完,完成 TP 权重切分后被调用 diff --git a/docs/zh_cn/inference/pytorch.md b/docs/zh_cn/inference/pytorch.md index c8ad790834..820e9f933a 100644 --- a/docs/zh_cn/inference/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -71,10 +71,10 @@ MODULE_MAP.update({ ## 特性 -- Continuous Batching: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 +- **Continuous Batching**: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 -- Tensor Parallelism: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 +- **Tensor Parallelism**: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 -- S-LoRA: LoRA adapter 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 +- **S-LoRA: LoRA adapter**: 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 -- Quantization: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8](../quantization/w8a8.md) 了解更多细节。 +- **Quantization**: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8](../quantization/w8a8.md) 了解更多细节。 From a2d1c9c9a8b93bce0c1f64b1716a63be36ad3d32 Mon Sep 17 00:00:00 2001 From: grimoire Date: Mon, 8 Jan 2024 15:11:28 +0800 Subject: [PATCH 6/9] update en index, fix en README link --- README.md | 2 +- docs/en/index.rst | 1 + docs/zh_cn/inference/pytorch.md | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d1093a2899..44309fb45d 100644 --- a/README.md +++ b/README.md @@ -102,7 +102,7 @@ For detailed user guides and advanced guides, please refer to our [tutorials](ht - User Guide - Inference pipeline - [Inference Engine - TurboMind](docs/en/inference/turbomind.md) - - [Inference Engine - PyTorch](docs/zh_cn/inference/pytorch.md) + - [Inference Engine - PyTorch](docs/en/inference/pytorch.md) - [Serving](docs/en/serving/restful_api.md) - [Quantization](docs/en/quantization) - Advance Guide diff --git a/docs/en/index.rst b/docs/en/index.rst index ed9f12cecb..bea379bfd2 100644 --- a/docs/en/index.rst +++ b/docs/en/index.rst @@ -55,6 +55,7 @@ Welcome to LMDeploy's tutorials! 
:caption: Advanced Guide serving/qos.md + advance/pytorch_new_model.md Indices and tables ================== diff --git a/docs/zh_cn/inference/pytorch.md b/docs/zh_cn/inference/pytorch.md index 820e9f933a..736e16ee63 100644 --- a/docs/zh_cn/inference/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -75,6 +75,6 @@ MODULE_MAP.update({ - **Tensor Parallelism**: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 -- **S-LoRA: LoRA adapter**: 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 +- **S-LoRA**: LoRA adapter 可以帮助我们使用有限的显存来调优大模型,S-LoRA 可以帮助我们在有限的显存中同时使用复数个 LoRA 权重,扩展模型的能力。 - **Quantization**: 量化可以帮助我们进一步减少显存占用,提高推理性能。lmdeploy.pytorch 分支中添加了 w8a8 模型量化的支持,可以阅读 [w8a8](../quantization/w8a8.md) 了解更多细节。 From 792ac874bc9eb9cb2a201e14bbf069a68eabe4e1 Mon Sep 17 00:00:00 2001 From: grimoire Date: Mon, 8 Jan 2024 15:55:25 +0800 Subject: [PATCH 7/9] fix typo --- docs/zh_cn/inference/pytorch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zh_cn/inference/pytorch.md b/docs/zh_cn/inference/pytorch.md index 736e16ee63..a1f4eb6d56 100644 --- a/docs/zh_cn/inference/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -71,7 +71,7 @@ MODULE_MAP.update({ ## 特性 -- **Continuous Batching**: 由于输入序列的长度不一样,batching 通常需要打击输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 +- **Continuous Batching**: 由于输入序列的长度不一样,batching 通常需要对输入进行 padding,这种 padding 会导致后续运算的计算量增加、影响速度,也会使得显存的占用大幅增加。遵循许多其他成熟框架的方案,lmdeploy.pytorch 采用了 continuous batching 的方式对输入做了连续化处理,避免了多余的资源占用。 - **Tensor Parallelism**: 大模型可能会占用远超一张显卡的显存量,为了支持这样的大模型的推理,我们实现了 Tensor 并发,模型的权重会被分布在不同的设备中,每张 GPU 设备负责一部分计算,减少了单卡显存占用,也充分利用了多显卡的计算优势。 From bf58f015448f1d8eb9def6edd2af930fb2e015ee Mon Sep 17 00:00:00 2001 From: grimoire Date: Tue, 9 Jan 2024 11:03:34 +0800 Subject: [PATCH 8/9] add link to pytorch_new_model --- docs/en/inference/pytorch.md | 6 +++--- docs/zh_cn/inference/pytorch.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/en/inference/pytorch.md b/docs/en/inference/pytorch.md index af03dfa2f3..303efed5e0 100644 --- a/docs/en/inference/pytorch.md +++ b/docs/en/inference/pytorch.md @@ -8,9 +8,9 @@ ## API -`lmdeploy.pytorch` share service interfaces with `Turbomind`, these interfaces perform inference through `Engine` and `EngineInstance` in lmdeploy.pytorch. +`lmdeploy.pytorch` shares service interfaces with `Turbomind`, and the inference service is implemented by `Engine` and `EngineInstance`. -EngineInstance is the inference request sender, it will pack the inference request and send the packed request to Engine. EngineInstance is thread-safe, multiple threads can send request through their own EngineInstance simultaneously. Engine will perform batching automatically according to resources usage. +EngineInstance is the sender of the inference requests, and it sends the encapsulated request to the Engine to achieve streaming inference. The inference interface of EngineInstance is thread-safe, and EngineInstances in different threads can initiate requests simultaneously. The Engine will automatically perform batch processing based on the current system resources. Engine is the request receiver and executor. 
It contain modules that support the task as follow: @@ -67,7 +67,7 @@ MODULE_MAP.update({ 'qualname.to.CustomLlamaAttention'}) ``` -ModelAgent would load and patch `LlamaAttention` with `CustomLlamaAttention` and leave anything other unchanged. Than you can perform inference with the new implementation. +ModelAgent would load and patch `LlamaAttention` with `CustomLlamaAttention` and leave anything other unchanged. Than you can perform inference with the new implementation. Read [support new model](../advance/pytorch_new_model.md) for more detail about model patching. ## Features diff --git a/docs/zh_cn/inference/pytorch.md b/docs/zh_cn/inference/pytorch.md index a1f4eb6d56..982153f94f 100644 --- a/docs/zh_cn/inference/pytorch.md +++ b/docs/zh_cn/inference/pytorch.md @@ -67,7 +67,7 @@ MODULE_MAP.update({ 'qualname.to.CustomLlamaAttention'}) ``` -经过 patch 后的模型就会使用新的 forward 实现。TP、量化等功能也依赖 patch 机制,这里不做太多展开。 +经过 patch 后的模型就会使用新的 forward 实现。TP、量化等功能也依赖 patch 机制,请阅读 [lmdeploy.pytorch 新模型支持](../advance/pytorch_new_model.md) 了解更多细节。 ## 特性 From abd8acb42906fd4aead252755c415f62e0d31384 Mon Sep 17 00:00:00 2001 From: grimoire Date: Wed, 10 Jan 2024 14:25:04 +0800 Subject: [PATCH 9/9] optimize en --- docs/en/advance/pytorch_new_model.md | 72 ++++++++++++++-------------- docs/en/inference/pytorch.md | 58 +++++++++++----------- 2 files changed, 64 insertions(+), 66 deletions(-) diff --git a/docs/en/advance/pytorch_new_model.md b/docs/en/advance/pytorch_new_model.md index 093f6dcfad..90dd60ab0d 100644 --- a/docs/en/advance/pytorch_new_model.md +++ b/docs/en/advance/pytorch_new_model.md @@ -4,15 +4,19 @@ lmdeploy.pytorch is designed to ease new model deployment and prototype verifica ## Support New Model -Let's start with Llama. +Let's begin with Llama. -before we start, let's take a look at the inputs of the model. To support new features in our engine, the inputs are a little bit different from the inputs in transformers. +Before delving into the details, it's essential to acquaint ourselves with the input specifications of the model. In order to accommodate new features within our engine, there are some deviations from the typical transformer inputs. -1. Continuous batching is used to avoid batch padding, so the `input_ids` would be the concatenation of all input sequence in batch, than `unsqueeze(0)` to match the dimension of origin input_ids. -2. Paged attention is used to reduce the memory usage of key/value cache, `past_key_value` become a big Tensor with shape `[num_blocks, block_size, num_heads, head_dim]`, where num_blocks is the number of page block, block_size is the the size of each block. -3. Extra inputs are necessary to support the inputs above, such as block table, history length. These extra inputs are not listed in arguments of origin forward method. A context object is used to provide these info. +1. To circumvent the need for batch padding, continuous batching is employed. Consequently, the `input_ids` now represents the concatenation of all input sequences in the batch, followed by a `unsqueeze(0)` operation to align with the original `input_ids` dimension. -Because of the change of the inputs above, we need to rewrite forward of `LlamaModel` and `LlamaAttention` to fit the new inputs. First, let's rewrite the `LlamaModel`, we only keep the minimal codes to support deployment: +2. In an effort to optimize memory usage for the key/value cache, we implement paged attention. 
This transforms the `past_key_value` into a substantial tensor with dimensions `[num_blocks, block_size, num_heads, head_dim]`. Here, `num_blocks` denotes the number of page blocks, and `block_size` indicates the size of each block. + +3. Accompanying these changes, additional inputs are imperative to support the modified inputs described above. These include the block table and history length. It's important to note that these supplementary inputs are not explicitly listed as arguments in the original forward method. Instead, a context object is utilized to furnish this essential information. + +Due to the alterations in the input structure mentioned earlier, the forward methods for both `LlamaModel` and `LlamaAttention` modules need to be adjusted. Below are the modified implementations: + +For `LlamaModel`: ```python # lmdeploy/pytorch/models/llama.py @@ -56,13 +60,7 @@ class LlamaModel(nn.Module): ) ``` -For LlamaAttention module, we need to perform following steps: - -1. kqv proj -2. rotary embedding -3. filling kv cache -4. MHA -5. o proj +For LlamaAttention: ```python # lmdeploy/pytorch/models/llama.py @@ -145,14 +143,14 @@ class LlamaAttention(nn.Module): return attn_output, None, past_key_value ``` -Notice that some arguments such as `history_lengths` and `block_offsets` comes from `self.context.context`. As we have mentioned above, continuous batching and paged attention require extra arguments to support them, `context` is the container to store these inputs. If you need more detail about context object, please read [context info](#context-info). +Note: The additional arguments like `history_lengths` and `block_offsets` are accessed from the `context` object, which acts as a container for the necessary inputs required by continuous batching and paged attention. Refer to the [context info](#context-info) for more detail about `context` object. -We replace some operation to our custom triton kernel for two reason. +We have replaced certain operations with our custom Triton kernel for two reasons: -1. Custom triton kernel can be used to support new features such as `paged_attention_fwd`. -2. Fuse kernels have better performance than the pure PyTorch implementation. +1. The custom Triton kernel allows us to incorporate new features, such as `paged_attention_fwd`. +2. Fused kernels offer superior performance compared to the pure PyTorch implementation. -Now we have new implementations of two modules, let's register them into `lmdeploy/pytorch/models/module_map.py`. +Now that we have the updated implementations for the two modules, let's register them in `lmdeploy/pytorch/models/module_map.py`. ```python # lmdeploy/pytorch/models/module_map.py @@ -164,18 +162,18 @@ MODEL_MAP.update({ }) ``` -The rewritten module has been mapped to the origin module. When we create an Engine, ModelAgent would patch the model automatically, then we can perform inference with these new implementation. +In this mapping, the revised modules are associated with their original counterparts. When creating an `Engine`, the `ModelAgent` will automatically patch the model. Subsequently, we can conduct inference using these updated implementations. ## Support Tensor Parallelism -If we want to support tensor parallelism(tp), we have partition the weights in the model. Let's try extend the rewrite above. +If we aim to enable tensor parallelism (TP), it is necessary to partition the weights in the model. 
## Support Tensor Parallelism

-If we want to support tensor parallelism(tp), we have partition the weights in the model. Let's try extend the rewrite above.
+If we aim to enable tensor parallelism (TP), it is necessary to partition the weights in the model. Let's build upon the previously mentioned modifications to accommodate TP in the Llama model:

-In Llama (and most LLM), most Linear layers are involved in the weight partition. Among them:
+In Llama (as well as in most LLMs), the weight partition primarily affects the Linear layers. Specifically, for the following components:

-- `LlamaAttention`: `q_proj`, `k_proj`, `v_proj` need column wise partition; `o_proj` needs row wise partition.
-- `LlamaMLP`: `gate_proj`, `up_proj` need column wise partition; `down_proj` needs row wise partition.
+- In `LlamaAttention`: `q_proj`, `k_proj`, `v_proj` require column-wise partitioning, while `o_proj` necessitates row-wise partitioning.
+- In `LlamaMLP`: `gate_proj` and `up_proj` require column-wise partitioning, while `down_proj` requires row-wise partitioning.

-We can implement `_distribution_partition_fn` in each rewrite modules:
+We can implement `_distribute_partition_fn` in each of the rewritten modules:

```python
# lmdeploy/pytorch/models/llama.py
@@ -212,9 +210,7 @@ class LlamaMLP(nn.Module):
```

-`_distribute_partition_fn` would be called when loading model weights, the weights of special module would be distributed to different devices.
-
-After partition, we need to perform `all_reduce` on the output of `o_proj` and `down_proj`. Of cause you can just put `all_reduce` in the forward method, another option is add an `_distribute_output_fn` call:
+In the process of loading model weights, the `_distribute_partition_fn` is called to distribute the weights of specific modules across different devices. Following the weight partitioning, it becomes necessary to perform `all_reduce` on the output tensors of `o_proj` and `down_proj`. While one option is to include `all_reduce` directly in the forward method, an alternative approach is to introduce the `_distribute_output_fn` call:

```python
# lmdeploy/pytorch/models/llama.py
@@ -235,7 +231,7 @@ class LlamaMLP(nn.Module):
return outputs
```

-Don't forget to add `LlamaMLP` in `module_map`.
+It is essential to remember to add `LlamaMLP` to the `module_map`:

```python
# lmdeploy/pytorch/models/module_map.py
@@ -245,7 +241,7 @@ MODEL_MAP.update({
})
```

-That's all. Now it is possible to utilize multiple GPUs to deploy LLM.
+With these adjustments, the model is now capable of utilizing multiple GPUs for deploying Large Language Models (LLMs). This enables efficient distribution of computations across different devices in a parallelized manner.

## Appendix

@@ -277,13 +273,13 @@ class StepContext:

### FAQ

-- **How to call origin forward?**
+- **How to invoke the original forward method?**

-It is a common practice to add hooks to a method instead a full rewrite. You can use `self.origin_mod` to visit the unpatched module.
+A common approach is to add hooks to a method rather than performing a complete rewrite. To access the unpatched module, you can utilize `self.origin_mod` within the rewritten method.

- **How to register modules in remote code?**

-Some modules are contained in remote code, it is hard to locate the module with `qualname`. `lmdeploy.pytorch` support register them with abbreviation:
+For modules located in remote code, pinpointing them via `qualname` might be challenging. `lmdeploy.pytorch` facilitates registration using abbreviations for such modules:

```python
MODULE_MAP.update({
@@ -294,11 +290,11 @@ MODULE_MAP.update({
})
```

> \[!NOTE\]
>
-> Abbreviation tends to have a low priority. It is recommend to register modules with `qualname`.
+> Although abbreviations are supported, they tend to have lower priority. It is advisable to register modules using their complete `qualname` for more robust and accurate mapping.

-- **How to support different modules with same name?**
+- **How to support different modules with the same name?**

-You can support them in the same rewrite module, and give them different implement by their attribute, take `baichuan2` 7b/13b as example:
+You can accommodate multiple modules with the same name within a single rewrite module by providing distinct implementations based on their attributes. For instance, consider `baichuan2` 7b/13b:

```python
class BaichuanModel(nn.Module):
@@ -309,12 +305,14 @@ class BaichuanModel(nn.Module):
return forward_default(...)
```

-- **How to do post-initialization for rewrite module?**
+- **How to perform post-initialization for a rewrite module?**

-Add a `_update_model_fn` method, it will be called after weight loading.
+To execute tasks after model weight loading, introduce a `_update_model_fn` method in your rewrite module. This method will be automatically called post-initialization:

```python
class LlamaAttention:
    def _update_model_fn(self):
        # ADD YOUR CODE HERE
```
+
+Here, you can include any additional post-initialization steps or configurations needed for your specific use case.

diff --git a/docs/en/inference/pytorch.md b/docs/en/inference/pytorch.md
index 303efed5e0..80323a3719 100644
--- a/docs/en/inference/pytorch.md
+++ b/docs/en/inference/pytorch.md
@@ -1,6 +1,6 @@
# Architecture of lmdeploy.pytorch

-`lmdeploy.pytorch` is an inference engine in LMDeploy. It provides a developer friendly framework to users who want to deploy their own model and develop new features.
+`lmdeploy.pytorch` is an inference engine in LMDeploy that offers a developer-friendly framework to users interested in deploying their own models and developing new features.

## Design

@@ -10,48 +10,48 @@

`lmdeploy.pytorch` shares service interfaces with `Turbomind`, and the inference service is implemented by `Engine` and `EngineInstance`.

-EngineInstance is the sender of the inference requests, and it sends the encapsulated request to the Engine to achieve streaming inference. The inference interface of EngineInstance is thread-safe, and EngineInstances in different threads can initiate requests simultaneously. The Engine will automatically perform batch processing based on the current system resources.
+`EngineInstance` acts as the sender of inference requests, encapsulating and sending requests to the `Engine` to achieve streaming inference. The inference interface of `EngineInstance` is thread-safe, allowing instances in different threads to initiate requests simultaneously. The `Engine` will automatically perform batch processing based on the current system resources.

-Engine is the request receiver and executor. It contain modules that support the task as follow:
+Engine is the request receiver and executor. It contains the following modules:

-- `ModelAgent` is a wrapper of the model. It is responsible for loading model/adapters, cache management and tensor parallelism.
-- `Scheduler` is the sequence manager. It will decide which sequences and adapters would participated in current step, then allocate resources for them.
-- `RequestManager` is responsible for request sending and receiving. It is the bridge between Engine and EngineInstance.
+- `ModelAgent` serves as a wrapper for the model, handling tasks such as loading model/adapters, managing the cache, and implementing tensor parallelism.
+- The `Scheduler` functions as the sequence manager, determining the sequences and adapters to participate in the current step, and subsequently allocating resources for them.
+- `RequestManager` is tasked with sending and receiving requests, acting as the bridge between the `Engine` and `EngineInstance`.

## Engine

-Engine would response the requests in a sub-thread, looping as following:
+The Engine responds to requests in a sub-thread, following this looping sequence:

-1. Get new requests through RequestManager. These requests would be cached.
-2. Scheduler perform scheduling, decide which cached requests should be processed and allocate resources for them.
-3. ModelAgent would swap the caches according to the information provided by Scheduler, then performing inference with the patched model.
-4. Scheduler update the status of requests according to the inference result of ModelAgent.
-5. RequestManager response to the sender (EngineInstance), back to step 1.
+1. Get new requests through `RequestManager`. These requests are cached for now.
+2. The `Scheduler` performs scheduling, deciding which cached requests should be processed and allocating resources for them.
+3. `ModelAgent` swaps the caches according to the information provided by the Scheduler, then performs inference with the patched model.
+4. The `Scheduler` updates the status of requests based on the inference results from `ModelAgent`.
+5. `RequestManager` responds to the sender (`EngineInstance`), and the process returns to step 1.

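
The five steps above can be pictured with a deliberately simplified, runnable sketch of the loop. It is only a toy: the queue, the lists, and the fake one-step "inference" stand in for `RequestManager`, `Scheduler`, and `ModelAgent`; none of the real classes, methods, or data structures are used here.

```python
from collections import deque

# Toy stand-ins: a request queue (request manager), a waiting/running split
# (scheduler) and a fake single-step "model" (model agent). Purely illustrative.
request_queue = deque([
    {'session_id': 0, 'tokens': [1, 2, 3]},
    {'session_id': 1, 'tokens': [4, 5]},
])
waiting, responses = [], []

for _ in range(4):  # a real engine loops in a background thread until shutdown
    # 1. read new requests and cache them
    while request_queue:
        waiting.append(request_queue.popleft())
    # 2. "schedule": pick the requests that will run in this step
    running, waiting = waiting[:8], waiting[8:]
    if not running:
        break
    # 3. "model agent": run one forward step for the whole batch
    outputs = [{'session_id': r['session_id'], 'next_token': max(r['tokens']) + 1}
               for r in running]
    # 4. update request states and 5. send the results back to the senders
    responses.extend(outputs)

print(responses)
```
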
-Let's dive deeper into these modules.
+Now, let's delve deeper into the modules that participate in these steps.

### Scheduler

-It is a common practice to cache history key and value states in LLM inference to prevent redundant computation. Since history lengths are different in batch of sequences, we have to padding the caches so we can perform the batching inference. The padding would waste a lot of memory and limit the performance of the transformer.
+In LLM inference, caching history key and value states is a common practice to prevent redundant computation. However, as history lengths vary in a batch of sequences, we need to pad the caches to enable batching inference. Unfortunately, this padding can lead to significant memory wastage, limiting the transformer's performance.

-[vLLM](https://docs.vllm.ai) provide a paging based strategy, allocating caches in page blocks to prevent extra memory usage. The Scheduler module in our Engine share the same design, allocating resources according to the sequence length in blocks and evicting unused blocks to support larger batching and longer session length.
+[vLLM](https://docs.vllm.ai) employs a paging-based strategy, allocating caches in page blocks to minimize extra memory usage. Our Scheduler module in the Engine shares a similar design, allocating resources based on sequence length in blocks and evicting unused blocks to support larger batching and longer session lengths.

-We also support [S-LoRA](https://github.com/S-LoRA/S-LoRA). S-LoRA can be used to support multiple LoRA adapters on limited memory.
+Additionally, we support [S-LoRA](https://github.com/S-LoRA/S-LoRA), which enables the use of multiple LoRA adapters on limited memory.
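
To illustrate the paging strategy described above, here is a minimal, self-contained sketch of the bookkeeping involved. It is not the real Scheduler or block manager of lmdeploy.pytorch; the class and method names are invented for the example. It only shows how page blocks can be handed out per session and reclaimed on eviction:

```python
class BlockAllocator:
    """Toy page-block bookkeeping: hand out cache blocks per session."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # session_id -> list of block ids

    def allocate(self, session_id: int, num_tokens: int) -> list:
        """Grow the session's block table so it can hold `num_tokens` cached tokens."""
        table = self.block_tables.setdefault(session_id, [])
        needed = -(-num_tokens // self.block_size) - len(table)  # ceil division
        if needed > len(self.free_blocks):
            raise RuntimeError('no free blocks: evict or swap out another session')
        table.extend(self.free_blocks.pop() for _ in range(needed))
        return table

    def evict(self, session_id: int) -> None:
        """Return all blocks of a finished or swapped-out session to the free list."""
        self.free_blocks.extend(self.block_tables.pop(session_id, []))


allocator = BlockAllocator(num_blocks=8, block_size=16)
print(allocator.allocate(session_id=0, num_tokens=40))  # 3 blocks cover 40 tokens
allocator.evict(session_id=0)
```

Because blocks are fixed-size and shared from one free list, no per-sequence padding is needed and memory that a finished session releases is immediately reusable by others.
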
### ModelAgent

-lmdeploy.pytorch support Tensor Parallelism, which would leads to complex model initialization, cache allocation and weight partition. ModelAgent is designed to hide these details so Engine just need to focus on maintaining the pipeline.
+`lmdeploy.pytorch` supports Tensor Parallelism, which leads to complex model initialization, cache allocation, and weight partitioning. ModelAgent is designed to abstract these complexities, allowing the Engine to focus solely on maintaining the pipeline.

-ModelAgent is composed of two component:
+ModelAgent consists of two components:

-1. `patched_model` is the transformer model after patch. Compared to the origin model, patched model has more features, such as TP, quantization and high performance kernels.
-2. `cache_engine` is the maintainer of caches. It receive command from Scheduler, perform host-device page swap. Only gpu blocks can be used to cache key/value and adapters.
+1. **patched_model**: This is the transformer model after patching. In comparison to the original model, the patched model incorporates additional features such as Tensor Parallelism, quantization, and high-performance kernels.
+2. **cache_engine**: This component manages the caches. It receives commands from the Scheduler and performs host-device page swaps. Only GPU blocks are utilized for caching key/value pairs and adapters.

## Patching

-In order to ease the deployment of new model, we have develop a tool to patch the modules.
+In order to facilitate the deployment of a new model, we have developed a tool to patch the modules.

-Let's say, if we want to reimplement the forward of `LlamaAttention.forward`:
+For example, if we want to reimplement the forward method of `LlamaAttention`:

```python
class CustomLlamaAttention(nn.Module):
@@ -59,7 +59,7 @@ class CustomLlamaAttention(nn.Module):
# custom forward
```

-Just register the implementation above into `lmdeploy.pytorch.models.module_map`.
+We register the implementation above into `lmdeploy.pytorch.models.module_map`:

```python
MODULE_MAP.update({
@@ -67,16 +67,16 @@ MODULE_MAP.update({
'qualname.to.CustomLlamaAttention'})
```

-ModelAgent would load and patch `LlamaAttention` with `CustomLlamaAttention` and leave anything other unchanged. Than you can perform inference with the new implementation. Read [support new model](../advance/pytorch_new_model.md) for more detail about model patching.
+`ModelAgent` would then load and patch `LlamaAttention` with `CustomLlamaAttention` while leaving everything else unchanged. You can perform inference with the new implementation. For more details about model patching, please refer to [support new model](../advance/pytorch_new_model.md).

## Features

-lmdeploy.pytorch support new features include:
+`lmdeploy.pytorch` supports new features including:

-- **Continuous Batching**: Since the sequence length in a batch might be different, padding is required to support batching inference. Large padding leads to extra memory usage and useless computation. We use continuous batching, concatenate all sequence into a single long sequence to avoid padding.
+- **Continuous Batching**: As the sequence length in a batch may vary, padding is often necessary for batching inference. However, large padding can lead to additional memory usage and unnecessary computation. To address this, we employ continuous batching, where all sequences are concatenated into a single long sequence to avoid padding. A small sketch of this input layout is given at the end of this section.

-- **Tensor Parallelism**: The GPU memory usage of LLM might be larger than the memory of a single GPU. Tensor parallelism can be used to fit such model on multiple devices. Each device has parts of the model and can be computed simultaneous, the result would be gathered to ensure the correctness.
+- **Tensor Parallelism**: The GPU memory usage of an LLM might exceed the capacity of a single GPU. Tensor parallelism is utilized to accommodate such models on multiple devices. Each device handles parts of the model simultaneously, and the results are gathered to ensure correctness.

-- **S-LoRA**: LoRA adapter can be used to support training LLM on device with limited memory. It is a common practice to merge adapter into weights of the model before deployment, load multiple adapter in such way would consume a lot of memory. We have support S-LoRA, adapters would be paged and swapped in when necessary, special kernels are developed to support inference with unmerged adapters. Which made it possible to load a lot of different adapters.
+- **S-LoRA**: LoRA adapters can be used to fine-tune LLMs on devices with limited memory. While it's common practice to merge adapters into the model weights before deployment, loading multiple adapters in this way can consume a significant amount of memory. We support S-LoRA, where adapters are paged and swapped in when necessary. Special kernels are developed to support inference with unmerged adapters, enabling the loading of various adapters efficiently.

-- **Quantization**: Model quantization perform computation with low precision. lmdeploy.pytorch has support w8a8 quantization. Read [w8a8](../quantization/w8a8.md) for more details.
+- **Quantization**: Model quantization involves performing computations with low precision. `lmdeploy.pytorch` supports w8a8 quantization. For more details, refer to [w8a8](../quantization/w8a8.md).
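
As a closing illustration of the continuous batching layout referenced above, the following self-contained PyTorch snippet shows how a batch of variable-length sequences can be flattened into a single long sequence together with per-sequence lengths and start offsets. The variable names are illustrative only and are not claimed to match the engine's internal fields:

```python
import torch

# Three requests with different prompt lengths.
input_ids = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6]), torch.tensor([7, 8, 9])]

seq_lengths = torch.tensor([len(ids) for ids in input_ids])   # tensor([4, 2, 3])
# One flat sequence, then unsqueeze(0) so the shape still looks like [batch=1, total_len].
flat_ids = torch.cat(input_ids).unsqueeze(0)                  # shape [1, 9], no padding
# Start offset of every request inside the flat sequence.
start_loc = torch.cumsum(seq_lengths, dim=0) - seq_lengths    # tensor([0, 4, 6])

print(flat_ids)
print(seq_lengths, start_loc)
```

With this layout, the per-token modules (MLP, normalization) run unchanged on the flat sequence, while attention uses the lengths and offsets to keep the requests separate.
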