Key Features
Details regarding the latest LMI container image_uris can be found here
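For SageMaker users, the container image URI for this release can usually also be looked up with the SageMaker Python SDK. A minimal sketch, assuming your installed `sagemaker` SDK already lists the `djl-lmi` framework alias and the 0.29.0 container version (otherwise refer to the image URI table linked above):

```python
# Look up the LMI (vllm/lmi-dist) container image URI for a region.
# Assumes the installed sagemaker SDK knows the "djl-lmi" framework alias
# and the 0.29.0 version; fall back to the linked image URI table if not.
from sagemaker import image_uris

lmi_image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region="us-east-1",
    version="0.29.0",
)
print(lmi_image_uri)
```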
DJL Serving Changes (applicable to all containers)
- Allows configuring health checks to fail based on various types of error rates
- When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code (see the client-side sketch after this list)
- Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
- HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
- Introduced annotations `@input_formatter` and `@output_formatter` to bring your own script for pre- and post-processing
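As a concrete illustration of the non-streaming error behavior described above, clients can now branch on the HTTP status code instead of scanning a 200 response body for an error message. A minimal client-side sketch using `requests`; the endpoint URL and payload shape are illustrative and depend on how the model is hosted:

```python
import requests

# Illustrative local DJL Serving / LMI endpoint; adjust host, port, and path
# (or use the SageMaker InvokeEndpoint API) for your deployment.
url = "http://localhost:8080/invocations"
payload = {"inputs": "What is Deep Java Library?", "parameters": {"max_new_tokens": 64}}

resp = requests.post(url, json=payload)
if resp.ok:
    # 2xx: successful (non-streaming) generation
    print(resp.json())
else:
    # 4xx/5xx: invocation errors now surface here instead of a 2xx response;
    # the exact status returned for a given failure is configurable.
    print(f"Request failed with HTTP {resp.status_code}: {resp.text}")
```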
LMI Container (vllm, lmi-dist)
- vLLM updated to version 0.5.3.post1
- Added multimodal support for Vision Language Models using the OpenAI Chat Completions schema (see the request sketch after this list)
- More details available here
- Supports Llama 3.1 models
- Supports beam search, `best_of`, and `n` with non-streaming output
- Supports chunked prefill in both vllm and lmi-dist
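To illustrate the multimodal support called out above, requests follow the OpenAI Chat Completions message schema, with image content supplied as an `image_url` part. A minimal sketch; the endpoint URL and image are illustrative:

```python
import requests

# Illustrative endpoint; on SageMaker, send the same JSON body via InvokeEndpoint.
url = "http://localhost:8080/invocations"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Any reachable image URL (or a base64 data URL) works here.
                    "image_url": {"url": "https://resources.djl.ai/images/dog_bike_car.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

For text-only, non-streaming requests, multiple candidate generations can similarly be requested through the `n`/`best_of` parameters mentioned above.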
TensorRT-LLM Container
- TensorRT-LLM updated to version 0.11.0
- [Breaking change] Flan-T5 is now supported with the C++ Triton backend; Flan-T5 support for the TRT-LLM Python backend has been removed.
Transformers NeuronX Container
- Upgraded to Transformers NeuronX 2.19.1
Text Embedding (using the LMI container)
- Various performance improvements
Enhancements
- best_of support in output formatter by @sindhuvahinis in #1992
- Convert cuda env tgi variables to lmi by @sindhuvahinis in #2013
- [serving] Update python typing to support py38 by @tosterberg in #2034
- Refactor lmi_dist and vllm to support best_of with RequestOutput by @sindhuvahinis in #2011
- [metrics] Improve prometheus metric type handling by @frankfliu in #2039
- [serving] Fail ping if error rate exceeds by @frankfliu in #2040
- [serving] Update default max worker to 1 for GPU by @xyang16 in #2048
- [Serving] Implement SageMaker Secure Mode & support for multiple data sources by @ethnzhng in #2042
- AutoAWQ Integration Script by @a-ys in #2038
- [secure-mode] Refactor secure mode plugin by @frankfliu in #2058
- [vllm, lmi-dist] add support for top_n_tokens by @sindhuvahinis in #2051
- [python] refactor rolling batch inference method by @sindhuvahinis in #2090
- Record telemetry including acceptance rate by @zachgk in #2088
- bump up trtllm to 0.10.0 by @ydm-amazon in #2043
- [fix] Set tokenizer on output_formatter for TRT-LLM Handlers by @maaquib in #2100
- [dockerfile] pin datasets to 2.19.1 in trtllm by @sindhuvahinis in #2104
- [serving] make http response codes configurable for exception cases by @siddvenk in #2114
- update flags to prevent deprecation by @lanking520 in #2118
- [docker] Update DJL to 0.29.0-SNAPSHOT by @frankfliu in #2119
- [awscurl] Handles Bedrock special url case by @frankfliu in #2120
- [secure-mode] Add properties allowlist validation by @ethnzhng in #2129
- [secure-mode] add per-model configs to allowlist by @ethnzhng in #2132
- [docker] remove tensorflow native from cpu-full image by @frankfliu in #2136
- [onnx] Allows to customize onnxruntime optimization level by @frankfliu in #2137
- [python] add support for 3p use-case by @siddvenk in #2122
- [python] move parse input functions to input_parser.py by @sindhuvahinis in #2092
- [python] log exception stacktrace for exceptions in python, improve r… by @siddvenk in #2142
- [engine] include lmi recommended entrypoint when model.py exists by @sindhuvahinis in #2148
- [3p][python]add metering and error details to 3p outputs by @siddvenk in #2143
- Support multi node for lmi-dist by @xyang16 in #2125
- [python] refactor input parser to support Request by @sindhuvahinis in #2145
- [python] add max_logprobs vllm configuration to EngineArgs by @sindhuvahinis in #2154
- [python] parse input only when new requests are received by @sindhuvahinis in #2155
- [lmi] remove redundant auto logic from python handler by @siddvenk in #2152
- [python] support multimodal models openai api in vllm by @sindhuvahinis in #2147
- Add stronger typing for chat completions use-cases by @siddvenk in #2161
- [awscurl] Supports full jsonquery syntax by @frankfliu in #2163
- [Neo] Neo compilation/quantization script bugfixes by @a-ys in #2115
- [docker] bump neuron to 2.19 SDK by @tosterberg in #2160
- [python] add input formatter decorator by @sindhuvahinis in #2158
- [docker] bump neuron vllm to 5.0 by @tosterberg in #2169
- [lmi] Upgrade lmi dockerfile for 0.29.0 release by @maaquib in #2156
- add 0.5.1 supported models by @lanking520 in #2151
- [python] update max_logprobs default for vllm 0.5.1 by @sindhuvahinis in #2159
- [wlm] Trim whitespace for model_id by @frankfliu in #2175
- [fix] optimum update stable diffusion support by @tosterberg in #2179
- [serving][python] Support non 200 HTTP response codes for non-streami… by @siddvenk in #2173
- [awscurl] Includes input data in output file by @frankfliu in #2184
- [multimodal] support specifying image_token, inferring default if not … by @siddvenk in #2183
- [Neo] Refactor Neo TRT-LLM partition script by @ethnzhng in #2166
- [Partition] Don't output `option.parallel_loading` when partitioning by @a-ys in #2189
- Introduce pipeline parallel degree config by @nikhil-sk in #2171
- add trtllm container update by @lanking520 in #2191
- [serving] Adds multiple node cluster configuration support by @frankfliu in #2190
- [aot] fix aot partition args, add pipeline parallel by @tosterberg in #2196
- bump up bench to 0.29.0 by @ydm-amazon in #2199
- [serving] Download model while initialize multi-node cluster by @frankfliu in #2198
- [lmi] Dependencies upgrade for 0.29.0 by @maaquib in #2194
- [chat][lmi] use generation prompt in tokenizer to avoid bot prompt re… by @siddvenk in #2195
- [lmi] support multimodal in lmi-dist by @siddvenk in #2182
- Revert "Update max_model_len for llama-3 lora test" by @lanking520 in #2207
- [wlm] Fixes retrieve config.json error by @frankfliu in #2212
- determine bedrock usage based on explicit property rather than inferr… by @siddvenk in #2214
- [post-7/22]Add chunked prefill support in vllm and lmi-dist by @rohithkrn in #2202
- lazy compute input ids by @lanking520 in #2216
- [docker] neuron bump to 2.19.1 by @tosterberg in #2223
- [vLLM][0.5.3] add new configs to engine by @lanking520 in #2220
- [docker] neuron bump transformers for llama3.1 by @tosterberg in #2226
- update the transformers version by @lanking520 in #2227
- [lmi] use vllm wheel with hanging patch by @siddvenk in #2231
- [serving][post 7/24] Fixes tensor_parallel_degree detection on CPU by @frankfliu in #2229
- [multimodal] add additional vlm architectures to lmi config recommender by @siddvenk in #2235
- put back neuron installation by @lanking520 in #2237
- [neo] Fix AttributeError in Neo partition scripts by @a-ys in #2236
- [vLLM] add prefix caching support by @lanking520 in #2239
- Convert model_id to rust artifacts by @xyang16 in #2241
- [python] remove flan-t5 python backend in recommender and add test cases by @sindhuvahinis in #2240
- [cherrypick] 29 new commits from master by @ydm-amazon in #2276
- dockerfile versions non nightly by @ydm-amazon in #2277
- [cherry-pick] add ignore_eos support in chat completions schema (#2281) by @siddvenk in #2283
- [Cherry-pick][TRTLLM] take out cudnn (#2286) by @lanking520 in #2288
- [Cherrypick][python] fix the NoneType error when decoded_token is empty (#2289) by @lanking520 in #2292
Documentation
- Adds SECURITY.md file by @zachgk in #1989
- [docs][lmi] add warning to deepspeed user guide indicating deprecated… by @siddvenk in #1999
- [docs] Update lmi-dist and vllm docs for 0.28.0 release by @maaquib in #2001
- [lmi][docs] fix token format in api schema docs by @siddvenk in #2014
- [docs][lmi] update lmi docs in general for 0.28.0 by @siddvenk in #2003
- [docs] Update trtllm docs for 0.28.0 by @ydm-amazon in #1990
- [doc] custom output formatter schema by @sindhuvahinis in #2018
- [lmi][docs] add deepspeed deprecation notice to lmi docs by @siddvenk in #2024
- [doc] Add LMI Text Embedding Inference user guide by @xyang16 in #2022
- [docs] Fix broken links in user guide by @xyang16 in #2025
- [docs] Omit response array in text embedding doc by @xyang16 in #2026
- [doc] add tgi_compat and see details by @sindhuvahinis in #2031
- [doc] improve output schema doc by @sindhuvahinis in #2028
- [doc] SSE jsonline document by @sindhuvahinis in #2041
- [docs] Updates embedding user guide by @frankfliu in #2047
- [lmi][docs] update to 0.28.0 in lmi docs by @siddvenk in #2063
- [doc] adding release notes for docs.djl.ai by @sindhuvahinis in #2032
- [docs]Update DJL-Serving Read Me by @Varun-Dutta in #2127
- [lmi][doc] update docs for vision language model support by @siddvenk in #2192
- [cherry-pick] [docs][lmi] update user guides for lmi v11 (#2290) by @siddvenk in #2291
- remove usage of SERVING_LOAD_MODELS in examples/docs/tests by @siddvenk in #1825
- Updates DJL version to 0.29.0-SNAPSHOT by @frankfliu in #2029
- [docs] Adds benchmark guide for bedrock by @frankfliu in #2181
- [awscurl] Updates docs for downloading stable version by @frankfliu in #2170
- Update documentation to latest release 0.28.0. by @david-sitsky in #2197
CI + Testing
- [CI] add nvme support and update model by @lanking520 in #2010
- [CI] use split by '\n' to solve the issue by @lanking520 in #2020
- Add ecr login credentials by @vinayburugu in #2019
- Add llama-3 lora test by @rohithkrn in #2027
- [ci] Switch gradle build script from groovy to kotlin by @frankfliu in #1969
- [CI] fix a bug in dataset prep by @lanking520 in #2030
- [ci] Add back missing publish task in gradle by @frankfliu in #2037
- [test] Fixes java-client test dependency by @frankfliu in #2033
- [Post 0.28.0] update pipeline machine to g6 by @lanking520 in #1983
- [test] remove trtllm flan t5 integration test by @sindhuvahinis in #2052
- [IB] use a file to pass to the output by @lanking520 in #2044
- [IB] write template into file system by @lanking520 in #2045
- [ci][fix] don't use env vars for llm integ test as it causes issues w… by @siddvenk in #2068
- [ib] fix docker env file write location by @tosterberg in #2073
- Fix benchmark deb build by @zachgk in #2050
- [IB] Supports forwarding environment variables by @zachgk in #2081
- codeql for python by @siddvenk in #2087
- [CI] LLM Integration Tests through pytest suite by @zachgk in #2023
- [CI] Fix post autoAwq cleanup by @zachgk in #2089
- [ci] fix ground truth for neuron unit test by @tosterberg in #2097
- [ci] fix nightly wheel pip installs by @tosterberg in #2096
- [ci] fix trtllm nightly wheel pip installs by @tosterberg in #2098
- [CI] Inferentia tests through pytest by @zachgk in #2091
- pin numpy to <2 in ci/docker by @siddvenk in #2071
- [ci] add trtllm chat test by @sindhuvahinis in #2102
- [ci] Fix gpt2 integ test failure by @maaquib in #2108
- [CI] fix bugs by @lanking520 in #2111
- [ci] remove HF_MODEL_ID env for lmi_dist_1 test by @sindhuvahinis in #2131
- [ci] Adds formatShell gradle tasks by @frankfliu in #2141
- [CI] Integration tests through pytest by @zachgk in #2109
- [CI] Action for manual pytest runs by @zachgk in #2144
- [ci] Fixes slf4j version conflict by @frankfliu in #2164
- [CI] Supports temp repo in execute job by @zachgk in #2150
- [ci] Updates dependency versions to latest by @frankfliu in #2167
- [docs] Adds document for MAX_NETTY_BUFFER_SIZE env var by @frankfliu in #2177
- [CI] upgrade deps version by @lanking520 in #2180
- [multimodal][ci] do not use tp for paligemma due to issues in vllm by @siddvenk in #2238
- [ci] fix hf lora test for accelerate by @tosterberg in #2208
- [CI] add P4D test missing deps by @lanking520 in #2209
- [ci] Fix pipeline failures by @xyang16 in #2210
- [ci] Add mmlu dataset to correctness testing by @xyang16 in #2211
- [CI] Add integration pytest marks by @zachgk in #2176
- [CI] update cuda version and the pipeline by @lanking520 in #2186
- [ci] fix adapters client testing by @tosterberg in #2188
- add mmlu dataset by @lanking520 in #2185
- [ci] Add correctness testing by @xyang16 in #2200
- [CI] add client logger to output logs by @lanking520 in #2201
- [CI] fix no code pipeline issues by @lanking520 in #2215
- [CI] Miscellaneous CI Fixes by @zachgk in #2218
- take out falcon from the test list by @lanking520 in #2219
- [CI] update the test by @lanking520 in #2206
- [multimodal][ci] adding multimodal tests by @siddvenk in #2234
- [CI] use default github runner by @lanking520 in #2222
- fix typo in test name by @rohithkrn in #2224
- [ci][hotfix] changes to build new vllm wheel with hanging fix by @siddvenk in #2228
- [CI] comment optimum from installation by @lanking520 in #2225
- [ci] Update correctness testing by @xyang16 in #2221
- [Cherrypick][ci] fix hf hub flakiness remove unused prepare (#2278) by @lanking520 in #2294
- Update max num tokens workflow, fix multiple bugs by @ydm-amazon in #2012
- [lmi][hf]remove usage of HF Conversational pipeline as it is deprecated by @siddvenk in #2124
- [python] fix integration test failures by @sindhuvahinis in #2153
- [tnx] neuron rolling batch test suite by @tosterberg in #2172
- add json error testing by @lanking520 in #2178
- Update max_model_len for llama-3 lora test by @rohithkrn in #2205
Bug Fixes
- [serving] Fixes onnx model conversion bug by @frankfliu in #2000
- [docker] Fixes duplicated onnx jar file issue by @frankfliu in #2017
- [python] Fixes typing issue with py3.8 by @frankfliu in #2035
- [docker] Fixes docker build script for version detection by @frankfliu in #2036
- fix bug with duplicate models when HF_MODEL_ID points to model store by @siddvenk in #2054
- [plugin] Fixes plugin scaning bug by @frankfliu in #2059
- [secure-mode] Fix entrypoint control name by @ethnzhng in #2065
- [fix] remove validating quantization in properties_manager.py by @sindhuvahinis in #2067
- [fix] initialize sequence dictionary for default sequence index to pr… by @siddvenk in #2072
- [Neo] Fix Neo Quantization properties output. Add some additional configuration. by @a-ys in #2077
- [aot] Fix aot quantization for weight only quantization by @tosterberg in #2079
- Sm vllm config by @maaquib in #2005
- [Neuron] Fix Neuron compilation logging by @a-ys in #2095
- [fix] align default neuron behavior between model server and handler by @tosterberg in #2116
- [docker] Fixes onnxruntime engine installation by @frankfliu in #2121
- [serving] Fixes snake case pattern by @frankfliu in #2123
- [docker] Install missing pytorch jni package by @frankfliu in #2007
- [awscurl] Fixes missing "text" case in jsonlines output by @frankfliu in #2135
- [docker] Fixes cpu-full docker build script by @frankfliu in #2138
- [docker] Fixes pytorch-gpu docker build script by @frankfliu in #2139
- [docker] Fixes docker -SNAPSHOT version build issue by @frankfliu in #2140
- Fix acceptance history check in speculative telemetry by @zachgk in #2112
- [python] Fix new logprobs computation in vllm_utils by @sindhuvahinis in #2146
- [awscurl] Fixes tokenizer cache directory by @frankfliu in #2168
- fix the cache miss issue by @lanking520 in #2174
- fix the handler by @lanking520 in #2203
- [chat-completions] fix minor issues with response schema by @siddvenk in #2204
- [tnx] fix optimum token selection and sampling by @tosterberg in #2233
New Contributors
- @vinayburugu made their first contribution in #2019
- @Varun-Dutta made their first contribution in #2127
- @david-sitsky made their first contribution in #2197
Full Changelog: v0.28.0...v0.29.0