Key Features
Details regarding the latest LMI container image_uris can be found here
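For SageMaker users, the container image URI for this release can usually also be looked up with the SageMaker Python SDK. A minimal sketch, assuming your installed `sagemaker` SDK already lists the `djl-lmi` framework alias and the 0.29.0 container version (otherwise refer to the image URI table linked above):

```python
# Look up the LMI (vllm/lmi-dist) container image URI for a region.
# Assumes the installed sagemaker SDK knows the "djl-lmi" framework alias
# and the 0.29.0 version; fall back to the linked image URI table if not.
from sagemaker import image_uris

lmi_image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region="us-east-1",
    version="0.29.0",
)
print(lmi_image_uri)
```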
DJL Serving Changes (applicable to all containers)
- Allows configuring health checks to fail based on various types of error rates
- When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code (see the client-side sketch after this list)
- Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
- HTTP Response Codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
- Introduced annotations `@input_formatter` and `@output_formatter` to bring your own script for pre- and post-processing
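As a concrete illustration of the non-streaming error behavior described above, clients can now branch on the HTTP status code instead of scanning a 200 response body for an error message. A minimal client-side sketch using `requests`; the endpoint URL and payload shape are illustrative and depend on how the model is hosted:

```python
import requests

# Illustrative local DJL Serving / LMI endpoint; adjust host, port, and path
# (or use the SageMaker InvokeEndpoint API) for your deployment.
url = "http://localhost:8080/invocations"
payload = {"inputs": "What is Deep Java Library?", "parameters": {"max_new_tokens": 64}}

resp = requests.post(url, json=payload)
if resp.ok:
    # 2xx: successful (non-streaming) generation
    print(resp.json())
else:
    # 4xx/5xx: invocation errors now surface here instead of a 2xx response;
    # the exact status returned for a given failure is configurable.
    print(f"Request failed with HTTP {resp.status_code}: {resp.text}")
```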
LMI Container (vllm, lmi-dist)
- vLLM updated to version 0.5.3.post1
- Added multimodal support for Vision Language Models using the OpenAI Chat Completions schema (see the request sketch after this list)
- More details available here
- Supports Llama 3.1 models
- Supports beam search, `best_of`, and `n` with non-streaming output
- Supports chunked prefill in both vllm and lmi-dist
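To illustrate the multimodal support called out above, requests follow the OpenAI Chat Completions message schema, with image content supplied as an `image_url` part. A minimal sketch; the endpoint URL and image are illustrative:

```python
import requests

# Illustrative endpoint; on SageMaker, send the same JSON body via InvokeEndpoint.
url = "http://localhost:8080/invocations"

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    # Any reachable image URL (or a base64 data URL) works here.
                    "image_url": {"url": "https://resources.djl.ai/images/dog_bike_car.jpg"},
                },
            ],
        }
    ],
    "max_tokens": 128,
}

resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

For text-only, non-streaming requests, multiple candidate generations can similarly be requested through the `n`/`best_of` parameters mentioned above.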
TensorRT-LLM Container
- TensorRT-LLM updated to version 0.11.0
- [Breaking change] Flan-T5 is now supported with the C++ Triton backend; Flan-T5 support for the TRT-LLM Python backend has been removed.
Transformers NeuronX Container
- Upgraded to Transformers NeuronX 2.19.1
Text Embedding (using the LMI container)
- Various performance improvements
Enhancements
- best_of support in output formatter by @sindhuvahinis in #1992
- Convert cuda env tgi variables to lmi by @sindhuvahinis in #2013
- [serving] Update python typing to support py38 by @tosterberg in #2034
- Refactor lmi_dist and vllm to support best_of with RequestOutput by @sindhuvahinis in #2011
- [metrics] Improve prometheus metric type handling by @frankfliu in #2039
- [serving] Fail ping if error rate exceeds by @frankfliu in #2040
- [serving] Update default max worker to 1 for GPU by @xyang16 in #2048
- [Serving] Implement SageMaker Secure Mode & support for multiple data sources by @ethnzhng in #2042
- AutoAWQ Integration Script by @a-ys in #2038
- [secure-mode] Refactor secure mode plugin by @frankfliu in #2058
- [vllm, lmi-dist] add support for top_n_tokens by @sindhuvahinis in #2051
- [python] refactor rolling batch inference method by @sindhuvahinis in #2090
- Record telemetry including acceptance rate by @zachgk in #2088
- bump up trtllm to 0.10.0 by @ydm-amazon in #2043
- [fix] Set tokenizer on output_formatter for TRT-LLM Handlers by @maaquib in #2100
- [dockerfile] pin datasets to 2.19.1 in trtllm by @sindhuvahinis in #2104
- [serving] make http response codes configurable for exception cases by @siddvenk in #2114
- update flags to prevent deprecation by @lanking520 in #2118
- [docker] Update DJL to 0.29.0-SNAPSHOT by @frankfliu in #2119
- [awscurl] Handles Bedrock special url case by @frankfliu in #2120
- [secure-mode] Add properties allowlist validation by @ethnzhng in #2129
- [secure-mode] add per-model configs to allowlist by @ethnzhng in #2132
- [docker] remove tensorflow native from cpu-full image by @frankfliu in #2136
- [onnx] Allows to customize onnxruntime optimization level by @frankfliu in #2137
- [python] add support for 3p use-case by @siddvenk in #2122
- [python] move parse input functions to input_parser.py by @sindhuvahinis in #2092
- [python] log exception stacktrace for exceptions in python, improve r… by @siddvenk in #2142
- [engine] include lmi recommended entrypoint when model.py exists by @sindhuvahinis in #2148
- [3p][python]add metering and error details to 3p outputs by @siddvenk in #2143
- Support multi node for lmi-dist by @xyang16 in #2125
- [python] refactor input parser to support Request by @sindhuvahinis in #2145
- [python] add max_logprobs vllm configuration to EngineArgs by @sindhuvahinis in #2154
- [python] parse input only when new requests are received by @sindhuvahinis in #2155
- [lmi] remove redundant auto logic from python handler by @siddvenk in #2152
- [python] support multimodal models openai api in vllm by @sindhuvahinis in #2147
- Add stronger typing for chat completions use-cases by @siddvenk in #2161
- [awscurl] Supports full jsonquery syntax by @frankfliu in #2163
- [Neo] Neo compilation/quantization script bugfixes by @a-ys in #2115
- [docker] bump neuron to 2.19 SDK by @tosterberg in #2160
- [python] add input formatter decorator by @sindhuvahinis in #2158
- [docker] bump neuron vllm to 5.0 by @tosterberg in #2169
- [lmi] Upgrade lmi dockerfile for 0.29.0 release by @maaquib in #2156
- add 0.5.1 supported models by @lanking520 in #2151
- [python] update max_logprobs default for vllm 0.5.1 by @sindhuvahinis in #2159
- [wlm] Trim whitespace for model_id by @frankfliu in #2175
- [fix] optimum update stable diffusion support by @tosterberg in #2179
- [serving][python] Support non 200 HTTP response codes for non-streami… by @siddvenk in #2173
- [awscurl] Includes input data in output file by @frankfliu in #2184
- [multimodal] support specifying image_token, inferring default if not … by @siddvenk in #2183
- [Neo] Refactor Neo TRT-LLM partition script by @ethnzhng in #2166
- [Partition] Don't output `option.parallel_loading` when partitioning by @a-ys in #2189
- Introduce pipeline parallel degree config by @nikhil-sk in #2171
- add trtllm container update by @lanking520 in #2191
- [serving] Adds multiple node cluster configuration support by @frankfliu in #2190
- [aot] fix aot partition args, add pipeline parallel by @tosterberg in #2196
- bump up bench to 0.29.0 by @ydm-amazon in #2199
- [serving] Download model while initialize multi-node cluster by @frankfliu in #2198
- [lmi] Dependencies upgrade for 0.29.0 by @maaquib in #2194
- [chat][lmi] use generation prompt in tokenizer to avoid bot prompt re… by @siddvenk in #2195
- [lmi] support multimodal in lmi-dist by @siddvenk in #2182
- Revert "Update max_model_len for llama-3 lora test" by @lanking520 in #2207
- [wlm] Fixes retrieve config.json error by @frankfliu in #2212
- determine bedrock usage based on explicit property rather than inferr… by @siddvenk in #2214
- [post-7/22]Add chunked prefill support in vllm and lmi-dist by @rohithkrn in #2202
- lazy compute input ids by @lanking520 in #2216
- [docker] neuron bump to 2.19.1 by @tosterberg in #2223
- [vLLM][0.5.3] add new configs to engine by @lanking520 in #2220
- [docker] neuron bump transformers for llama3.1 by @tosterberg in #2226
- update the transformers version by @lanking520 in #2227
- [lmi] use vllm wheel with hanging patch by @siddvenk in #2231
- [serving][post 7/24] Fixes tensor_parallel_degree detection on CPU by @frankfliu in #2229
- [multimodal] add additional vlm architectures to lmi config recommender by @siddvenk in #2235
- put back neuron installation by @lanking520 in #2237
- [neo] Fix AttributeError in Neo partition scripts by @a-ys in #2236
- [vLLM] add prefix caching support by @lanking520 in #2239
- Convert model_id to rust artifacts by @xyang16 in #2241
- [python] remove flan-t5 python backend in recommender and add test cases by @sindhuvahinis in #2240
- [cherrypick] 29 new commits from master by @ydm-amazon in #2276
- dockerfile versions non nightly by @ydm-amazon in #2277
- [cherry-pick] add ignore_eos support in chat completions schema (#2281) by @siddvenk in #2283
- [Cherry-pick][TRTLLM] take out cudnn (#2286) by @lanking520 in #2288
- [Cherrypick][python] fix the NoneType error when decoded_token is empty (#2289) by @lanking520 in #2292
Documentation
- Adds SECURITY.md file by @zachgk in #1989
- [docs][lmi] add warning to deepspeed user guide indicating deprecated… by @siddvenk in #1999
- [docs] Update lmi-dist and vllm docs for 0.28.0 release by @maaquib in #2001
- [lmi][docs] fix token format in api schema docs by @siddvenk in #2014
- [docs][lmi] update lmi docs in general for 0.28.0 by @siddvenk in #2003
- [docs] Update trtllm docs for 0.28.0 by @ydm-amazon in #1990
- [doc] custom output formatter schema by @sindhuvahinis in #2018
- [lmi][docs] add deepspeed deprecation notice to lmi docs by @siddvenk in #2024
- [doc] Add LMI Text Embedding Inference user guide by @xyang16 in #2022
- [docs] Fix broken links in user guide by @xyang16 in #2025
- [docs] Omit response array in text embedding doc by @xyang16 in #2026
- [doc] add tgi_compat and see details by @sindhuvahinis in #2031
- [doc] improve output schema doc by @sindhuvahinis in #2028
- [doc] SSE jsonline document by @sindhuvahinis in #2041
- [docs] Updates embedding user guide by @frankfliu in #2047
- [lmi][docs] update to 0.28.0 in lmi docs by @siddvenk in #2063
- [doc] adding release notes for docs.djl.ai by @sindhuvahinis in #2032
- [docs]Update DJL-Serving Read Me by @Varun-Dutta in #2127
- [lmi][doc] update docs for vision language model support by @siddvenk in #2192
- [cherry-pick] [docs][lmi] update user guides for lmi v11 (#2290) by @siddvenk in #2291
- remove usage of SERVING_LOAD_MODELS in examples/docs/tests by @siddvenk in #1825
- Updates DJL version to 0.29.0-SNAPSHOT by @frankfliu in #2029
- [docs] Adds benchmark guide for bedrock by @frankfliu in #2181
- [awscurl] Updates docs for downloading stable version by @frankfliu in #2170
- Update documentation to latest release 0.28.0. by @david-sitsky in #2197
CI + Testing
- [CI] add nvme support and update model by @lanking520 in #2010
- [CI] use split by '\n' to solve the issue by @lanking520 in #2020
- Add ecr login credentials by @vinayburugu in #2019
- Add llama-3 lora test by @rohithkrn in #2027
- [ci] Switch gradle build script from groovy to kotlin by @frankfliu in #1969
- [CI] fix a bug in dataset prep by @lanking520 in #2030
- [ci] Add back missing publish task in gradle by @frankfliu in #2037
- [test] Fixes java-client test dependency by @frankfliu in #2033
- [Post 0.28.0] update pipeline machine to g6 by @lanking520 in #1983
- [test] remove trtllm flan t5 integration test by @sindhuvahinis in #2052
- [IB] use a file to pass to the output by @lanking520 in #2044
- [IB] write template into file system by @lanking520 in #2045
- [ci][fix] don't use env vars for llm integ test as it causes issues w… by @siddvenk in #2068
- [ib] fix docker env file write location by @tosterberg in #2073
- Fix benchmark deb build by @zachgk in #2050
- [IB] Supports forwarding environment variables by @zachgk in #2081
- codeql for python by @siddvenk in #2087
- [CI] LLM Integration Tests through pytest suite by @zachgk in #2023
- [CI] Fix post autoAwq cleanup by @zachgk in #2089
- [ci] fix ground truth for neuron unit test by @tosterberg in #2097
- [ci] fix nightly wheel pip installs by @tosterberg in #2096
- [ci] fix trtllm nightly wheel pip installs by @tosterberg in #2098
- [CI] Inferentia tests through pytest by @zachgk in #2091
- pin numpy to <2 in ci/docker by @siddvenk in #2071
- [ci] add trtllm chat test by @sindhuvahinis in #2102
- [ci] Fix gpt2 integ test failure by @maaquib in #2108
- [CI] fix bugs by @lanking520 in #2111
- [ci] remove HF_MODEL_ID env for lmi_dist_1 test by @sindhuvahinis in #2131
- [ci] Adds formatShell gradle tasks by @frankfliu in #2141
- [CI] Integration tests through pytest by @zachgk in #2109
- [CI] Action for manual pytest runs by @zachgk in #2144
- [ci] Fixes slf4j version conflict by @frankfliu in #2164
- [CI] Supports temp repo in execute job by @zachgk in #2150
- [ci] Updates dependency versions to latest by @frankfliu in #2167
- [docs] Adds document for MAX_NETTY_BUFFER_SIZE env var by @frankfliu in #2177
- [CI] upgrade deps version by @lanking520 in #2180
- [multimodal][ci] do not use tp for paligemma due to issues in vllm by @siddvenk in #2238
- [ci] fix hf lora test for accelerate by @tosterberg in #2208
- [CI] add P4D test missing deps by @lanking520 in #2209
- [ci] Fix pipeline failures by @xyang16 in #2210
- [ci] Add mmlu dataset to correctness testing by @xyang16 in #2211
- [CI] Add integration pytest marks by @zachgk in #2176
- [CI] update cuda version and the pipeline by @lanking520 in #2186
- [ci] fix adapters client testing by @tosterberg in #2188
- add mmlu dataset by @lanking520 in #2185
- [ci] Add correctness testing by @xyang16 in #2200
- [CI] add client logger to output logs by @lanking520 in #2201
- [CI] fix no code pipeline issues by @lanking520 in #2215
- [CI] Miscellaneous CI Fixes by @zachgk in #2218
- take out falcon from the test list by @lanking520 in #2219
- [CI] update the test by @lanking520 in #2206
- [multimodal][ci] adding multimodal tests by @siddvenk in #2234
- [CI] use default github runner by @lanking520 in #2222
- fix typo in test name by @rohithkrn in #2224
- [ci][hotfix] changes to build new vllm wheel with hanging fix by @siddvenk in #2228
- [CI] comment optimum from installation by @lanking520 in #2225
- [ci] Update correctness testing by @xyang16 in #2221
- [Cherrypick][ci] fix hf hub flakiness remove unused prepare (#2278) by @lanking520 in #2294
- Update max num tokens workflow, fix multiple bugs by @ydm-amazon in #2012
- [lmi][hf]remove usage of HF Conversational pipeline as it is deprecated by @siddvenk in #2124
- [python] fix integration test failures by @sindhuvahinis in #2153
- [tnx] neuron rolling batch test suite by @tosterberg in #2172
- add json error testing by @lanking520 in #2178
- Update max_model_len for llama-3 lora test by @rohithkrn in #2205
Bug Fixes
- [serving] Fixes onnx model conversion bug by @frankfliu in #2000
- [docker] Fixes duplicated onnx jar file issue by @frankfliu in #2017
- [python] Fixes typing issue with py3.8 by @frankfliu in #2035
- [docker] Fixes docker build script for version detection by @frankfliu in #2036
- fix bug with duplicate models when HF_MODEL_ID points to model store by @siddvenk in #2054
- [plugin] Fixes plugin scaning bug by @frankfliu in #2059
- [secure-mode] Fix entrypoint control name by @ethnzhng in #2065
- [fix] remove validating quantization in properties_manager.py by @sindhuvahinis in #2067
- [fix] initialize sequence dictionary for default sequence index to pr… by @siddvenk in #2072
- [Neo] Fix Neo Quantization properties output. Add some additional configuration. by @a-ys in #2077
- [aot] Fix aot quantization for weight only quantization by @tosterberg in #2079
- Sm vllm config by @maaquib in #2005
- [Neuron] Fix Neuron compilation logging by @a-ys in #2095
- [fix] align default neuron behavior between model server and handler by @tosterberg in #2116
- [docker] Fixes onnxruntime engine installation by @frankfliu in #2121
- [serving] Fixes snake case pattern by @frankfliu in #2123
- [docker] Install missing pytorch jni package by @frankfliu in #2007
- [awscurl] Fixes missing "text" case in jsonlines output by @frankfliu in #2135
- [docker] Fixes cpu-full docker build script by @frankfliu in #2138
- [docker] Fixes pytorch-gpu docker build script by @frankfliu in #2139
- [docker] Fixes docker -SNAPSHOT version build issue by @frankfliu in #2140
- Fix acceptance history check in speculative telemetry by @zachgk in #2112
- [python] Fix new logprobs computation in vllm_utils by @sindhuvahinis in #2146
- [awscurl] Fixes tokenizer cache directory by @frankfliu in #2168
- fix the cache miss issue by @lanking520 in #2174
- fix the handler by @lanking520 in #2203
- [chat-completions] fix minor issues with response schema by @siddvenk in #2204
- [tnx] fix optimum token selection and sampling by @tosterberg in #2233
New Contributors
- @vinayburugu made their first contribution in #2019
- @Varun-Dutta made their first contribution in #2127
- @david-sitsky made their first contribution in #2197
Full Changelog: v0.28.0...v0.29.0