介绍

rtp-llm 是阿里巴巴大模型预测团队开发的 LLM 推理加速引擎。rtp-llm 在阿里巴巴内部被广泛使用，支持了包括淘宝、天猫、闲鱼、菜鸟、高德、饿了么、AE、Lazada 等多个部门的大模型推理业务。
rtp-llm项目是havenask项目的子项目。

特点

实战验证

在众多LLM场景中得到实际应用与检验

淘宝问问
阿里巴巴国际AI平台Aidge
OpenSearch LLM智能问答版
Large Language Model based Long-tail Query Rewriting in Taobao Search

高性能

使用高性能的 CUDA kernel, 包括 PagedAttention、FlashAttention、FlashDecoding 等
WeightOnly INT8 量化，加载时自动量化; WeightOnly INT4量化，支持GPTQ/AWQ
自适应 KVCache 量化
框架上对动态凑批的 overhead 进行了细致优化
对 V100 进行了特别优化

灵活易用

和 HuggingFace 无缝对接，支持 SafeTensors/Pytorch/Megatron 等多种权重格式
单模型实例同时部署多 LoRA 服务
多模态(图片和文本混合输入)
多机/多卡 Tensor 并行
P-tuning 模型

高级加速

剪枝后的不规则模型加载
多轮对话上下文 Prefix Cache
System Prompt Cache
Speculative Decoding
Medusa

使用方法

需求

操作系统: Linux
Python: 3.10
NVIDIA GPU: Compute Capability 7.0 或者更高 (e.g., RTX20xx, RTX30xx, RTX40xx, V100, T4, A10/A30/A100, L4, H100, etc.)

安装和启动

从docker启动

docker版本见镜像发布历史

cd rtp-llm/docker
# IMAGE_NAME =
# if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:{version}_cuda11
# if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:{version}_cuda12
sh ./create_container.sh <CONTAINER_NAME> <IMAGE_NAME>
sh CONTAINER_NAME/sshme.sh

cd ../
# start http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'

从wheel包启动

# Install rtp-llm
cd rtp-llm
# For cuda12 environment, please use requirements_torch_gpu_cuda12.txt
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the corresponding whl from the release version, here's an example for the cuda11 version 0.1.0, for the cuda12 whl package please check the release page.
pip3 install maga_transformer-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl
# start http service

cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
# request to server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'

从源码构建并启动

见源码构建

常见问题

libcufft.so

Error log: OSError: libcufft.so.11: cannot open shared object file: No such file or directory

Resolution: 检查cuda和rtp-llm版本是否匹配
libth_transformer.so

Error log: OSError: /rtp-llm/maga_transformer/libs/libth_transformer.so: cannot open shared object file: No such file or directory

Resolution: 如果是通过whl或者开发镜像安装，测试时请不要在rtp-llm目录下，否则python会用相对路径下的rtp-llm包而非安装好的
Bazel build time out

Error log: ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)

Resolution:
1. 修改pip源，在open_source/deps/pip.bzl里添加extra_pip_args=["--index_url=xxx"]配置
2. 手动安装python的依赖包，尤其是对于pytorch，因为bazel build默认的600秒超时对于pytorch的下载可能是不够的

文档

镜像发布历史
启动服务样例
RWKV-Runner 样例
Python Library 样例
在Aliyun Ecs中使用RTP-LLm
配置参数
内置请求格式
多卡推理
LoRA
PTuning
SystemPrompt
Embedding/Reranker模型部署
多轮会话
多模态
结构化剪枝
量化
投机采样
Roadmap
Contributing
Benchmark&Performance

致谢

我们的项目主要基于FasterTransformer，并在此基础上集成了TensorRT-LLM的部分kernel实现。FasterTransformer和TensorRT-LLM为我们提供了可靠的性能保障。Flash-Attention2和cutlass也在我们持续的性能优化过程中提供了大量帮助。我们的continuous batching和increment decoding参考了vllm的实现；采样参考了transformers，投机采样部分集成了Medusa的实现，多模态部分集成了llava和qwen-vl的实现。感谢这些项目对我们的启发和帮助。

支持模型列表

LLM

Aquila 和 Aquila2(BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B等)
Baichuan 和 Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B)
Bloom (bigscience/bloom, bigscience/bloomz)
ChatGlm (THUDM/chatglm2-6b, THUDM/chatglm3-6b)
Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b等)
GptNeox (EleutherAI/gpt-neox-20b)
GPT BigCode (bigcode/starcoder, bigcode/starcoder2等)
LLaMA 和 LLaMA-2 (meta-llama/Llama-2-7b, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, lmsys/vicuna-33b-v1.3, 01-ai/Yi-34B, xverse/XVERSE-13B等)
MPT (mosaicml/mpt-30b-chat等)
Phi (microsoft/phi-1_5等)
Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, Qwen/Qwen-14B, Qwen/Qwen-14B-Chat, Qwen1.5等)
InternLM (internlm/internlm-7b, internlm/internlm-chat-7b等)
Gemma (google/gemma-it等)
Mixtral (mistralai/Mixtral-8x7B-v0.1等)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_cn.md

README_cn.md

介绍

特点

实战验证

高性能

灵活易用

高级加速

使用方法

需求

安装和启动

从docker启动

从wheel包启动

从源码构建并启动

常见问题

文档

致谢

支持模型列表

LLM

LLM + 多模态

联系我们

钉钉群

微信群

Files

README_cn.md

Latest commit

History

README_cn.md

File metadata and controls

介绍

特点

实战验证

高性能

灵活易用

高级加速

使用方法

需求

安装和启动

从docker启动

从wheel包启动

从源码构建并启动

常见问题

文档

致谢

支持模型列表

LLM

LLM + 多模态

联系我们

钉钉群

微信群