
LMDeploy Release V0.6.0a0

Released by @lvhan028 on 26 Aug 09:12 · 211 commits to main since this release · 97b880b

Highlights

  • Optimize W4A16 quantized model inference by implementing GEMM kernels in the TurboMind Engine
    • Add GPTQ-INT4 inference (a serving sketch follows the examples below)
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Optimize the prefill stage of PyTorchEngine inference
  • Distinguish between the name of the served model and the name of its chat template

Before:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
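
With the two concepts separated, clients of the OpenAI-compatible API refer to the server by the served model name rather than by the chat template file. A minimal sketch, assuming the server above listens on the default port 23333:

# "model" must match the value passed to --model-name
curl http://0.0.0.0:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "the served model name",
          "messages": [{"role": "user", "content": "Hello"}]
        }'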
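
To try the new GPTQ-INT4 path, a GPTQ-quantized checkpoint can be served with the TurboMind Engine on an SM70 (V100) or newer GPU. A minimal sketch, assuming your checkpoint directory holds GPTQ-INT4 weights and that passing gptq to the CLI's --model-format option selects the new kernels:

# --model-format gptq is an assumption based on the new GPTQ-INT4 support
lmdeploy serve api_server /the/path/of/your/gptq/model \
    --model-format gptq \
    --model-name "the served model name"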

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable running VLM with the PyTorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix missing dependencies in the Dockerfile and pip requirements by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from the model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of updating engine_config in TurbomindModelConfig for both tm and hf models by @lvhan028 in #2362

📚 Documentation

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0a0