LMDeploy Release V0.6.0a0
Highlights
- Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine
- Add GPTQ-INT4 inference (a usage sketch for 4-bit models follows this list)
- Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
- Optimize the prefilling inference stage of PyTorchEngine
- Distinguish the name of the deployed model from the name of the model's chat template
Before:
lmdeploy serve api_server /the/path/of/your/awesome/model \
--model-name customized_chat_template.json
After:
lmdeploy serve api_server /the/path/of/your/awesome/model \
--model-name "the served model name" \
--chat-template customized_chat_template.json
What's Changed
🚀 Features
- support vlm custom image process parameters in openai input format by @irexyc in #2245
- New GEMM kernels for weight-only quantization by @lzhangzz in #2090
- Fix hidden size and support mistral nemo by @AllentDan in #2215
- Support custom logits processors by @AllentDan in #2329 (see the sketch after this list)
- support openbmb/MiniCPM-V-2_6 by @irexyc in #2351
- Support phi3.5 for pytorch engine by @RunningLeon in #2361
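For the custom logits processors feature (#2329), a hedged sketch of possible usage follows. The `logits_processors` field on `GenerationConfig` and the `(input_ids, scores)` callable signature are assumptions drawn from the PR title; check the release docs for the exact interface.

```python
# Hedged sketch: the `logits_processors` field and the (input_ids, scores) -> scores
# callable signature are assumptions; verify against the v0.6.0a0 docs.
import torch
from lmdeploy import pipeline, GenerationConfig

def ban_token(input_ids: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # Example processor: forbid token id 42 by pushing its logit to -inf.
    scores[..., 42] = float('-inf')
    return scores

pipe = pipeline('/the/path/of/your/awesome/model')
gen_config = GenerationConfig(logits_processors=[ban_token])
print(pipe(['Hello'], gen_config=gen_config))
```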
💥 Improvements
- Remove deprecated arguments from API and clarify model_name and chat_template_name by @lvhan028 in #1931
- Fix duplicated session_id when pipeline is used by multithreads by @irexyc in #2134
- remove eviction param by @grimoire in #2285
- Remove QoS serving by @AllentDan in #2294
- Support send tool_calls back to internlm2 by @AllentDan in #2147
- Add stream options to control usage by @AllentDan in #2313 (see the client sketch after this list)
- add device type for pytorch engine in cli by @RunningLeon in #2321
- Update error status_code to raise error in openai client by @AllentDan in #2333
- Change to use device instead of device-type in cli by @RunningLeon in #2337
- Add GEMM test utils by @lzhangzz in #2342
- Add environment variable to control SILU fusion by @lzhangzz in #2343
- Use single thread per model instance by @lzhangzz in #2339
- add cache to speed up docker building by @RunningLeon in #2344
- add max_prefill_token_num argument in CLI by @lvhan028 in #2345
- torch engine optimize prefill for long context by @grimoire in #1962
- Refactor turbomind (1/N) by @lzhangzz in #2352
- feat(server): enable `seed` parameter for openai compatible server. by @DearPlanet in #2353
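For the new stream options (#2313), a brief client-side sketch against the OpenAI-compatible server follows; the base URL and served model name are placeholders for a running `api_server`.

```python
# Sketch: ask the OpenAI-compatible server to report token usage on the final
# streamed chunk. Base URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')
stream = client.chat.completions.create(
    model='the served model name',
    messages=[{'role': 'user', 'content': 'Hello'}],
    stream=True,
    stream_options={'include_usage': True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')
    if chunk.usage:  # populated only on the final chunk when include_usage is set
        print('\n', chunk.usage)
```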
🐞 Bug fixes
- enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
- fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
- Fix internvl2 template and update docs by @irexyc in #2292
- fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
- Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
- fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
- Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
- Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
📚 Documentations
- Reorganize the user guide and update the get_started section by @lvhan028 in #2038
- cancel support baichuan2 7b awq in pytorch engine by @grimoire in #2246
- Add user guide about slora serving by @AllentDan in #2084
🌐 Other
- test prtest image update by @zhulinJulia24 in #2192
- Update python support version by @wuhongsheng in #2290
- fix Windows compile error by @zhyncs in #2303
- fix: follow up #2303 by @zhyncs in #2307
- [ci] benchmark react by @zhulinJulia24 in #2183
- bump version to v0.6.0a0 by @lvhan028 in #2371
New Contributors
- @wuhongsheng made their first contribution in #2290
- @ColorfulDick made their first contribution in #2240
- @DearPlanet made their first contribution in #2353
Full Changelog: v0.5.3...v0.6.0a0