LMDeploy Release V0.6.0a0
Highlights
- Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine
- Add GPTQ-INT4 inference (a usage sketch for 4-bit models follows this list)
- Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
- Optimize the prefilling inference stage of PyTorchEngine
- Distinguish the name of the deployed model from the name of the model's chat template
Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
--model-name customized_chat_template.json
After:
lmdeploy serve api_server /the/path/of/your/awesome/model \
--model-name "the served model name" \
--chat-template customized_chat_template.json
What's Changed
🚀 Features
- support vlm custom image process parameters in openai input format by @irexyc in #2245
- New GEMM kernels for weight-only quantization by @lzhangzz in #2090
- Fix hidden size and support mistral nemo by @AllentDan in #2215
- Support custom logits processors by @AllentDan in #2329 (see the sketch after this list)
- support openbmb/MiniCPM-V-2_6 by @irexyc in #2351
- Support phi3.5 for pytorch engine by @RunningLeon in #2361
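For the custom logits processors feature (#2329), a hedged sketch of possible usage follows. The `logits_processors` field on `GenerationConfig` and the `(input_ids, scores)` callable signature are assumptions drawn from the PR title; check the release docs for the exact interface.

```python
# Hedged sketch: the `logits_processors` field and the (input_ids, scores) -> scores
# callable signature are assumptions; verify against the v0.6.0a0 docs.
import torch
from lmdeploy import pipeline, GenerationConfig

def ban_token(input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # Example processor: forbid token id 42 by pushing its logit to -inf.
    scores[..., 42] = float('-inf')
    return scores

pipe = pipeline('/the/path/of/your/awesome/model')
gen_config = GenerationConfig(logits_processors=[ban_token])
print(pipe(['Hello'], gen_config=gen_config))
```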
💥 Improvements
- Remove deprecated arguments from API and clarify model_name and chat_template_name by @lvhan028 in #1931
- Fix duplicated session_id when pipeline is used by multithreads by @irexyc in #2134
- remove eviction param by @grimoire in #2285
- Remove QoS serving by @AllentDan in #2294
- Support send tool_calls back to internlm2 by @AllentDan in #2147
- Add stream options to control usage by @AllentDan in #2313 (see the client sketch after this list)
- add device type for pytorch engine in cli by @RunningLeon in #2321
- Update error status_code to raise error in openai client by @AllentDan in #2333
- Change to use device instead of device-type in cli by @RunningLeon in #2337
- Add GEMM test utils by @lzhangzz in #2342
- Add environment variable to control SILU fusion by @lzhangzz in #2343
- Use single thread per model instance by @lzhangzz in #2339
- add cache to speed up docker building by @RunningLeon in #2344
- add max_prefill_token_num argument in CLI by @lvhan028 in #2345
- torch engine optimize prefill for long context by @grimoire in #1962
- Refactor turbomind (1/N) by @lzhangzz in #2352
- feat(server): enable `seed` parameter for openai compatible server. by @DearPlanet in #2353
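For the new stream options (#2313), a brief client-side sketch against the OpenAI-compatible server follows; the base URL and served model name are placeholders for a running `api_server`.

```python
# Sketch: ask the OpenAI-compatible server to report token usage on the final
# streamed chunk. Base URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
stream = client.chat.completions.create(
    model='the served model name',
    messages=[{'role': 'user', 'content': 'Hello'}],
    stream=True,
    stream_options={'include_usage': True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
    if chunk.usage:  # populated only on the final chunk when include_usage is set
        print('\n', chunk.usage)
```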
🐞 Bug fixes
- enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
- fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
- Fix internvl2 template and update docs by @irexyc in #2292
- fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
- Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
- fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
- Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
- Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
📚 Documentations
- Reorganize the user guide and update the get_started section by @lvhan028 in #2038
- cancel support baichuan2 7b awq in pytorch engine by @grimoire in #2246
- Add user guide about slora serving by @AllentDan in #2084
🌐 Other
- test prtest image update by @zhulinJulia24 in #2192
- Update python support version by @wuhongsheng in #2290
- fix Windows compile error by @zhyncs in #2303
- fix: follow up #2303 by @zhyncs in #2307
- [ci] benchmark react by @zhulinJulia24 in #2183
- bump version to v0.6.0a0 by @lvhan028 in #2371
New Contributors
- @wuhongsheng made their first contribution in #2290
- @ColorfulDick made their first contribution in #2240
- @DearPlanet made their first contribution in #2353
Full Changelog: v0.5.3...v0.6.0a0