- Overview
- Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server Container Image
- Container Image Contents
- IBM Z Accelerated for NVIDIA Triton™ Inference Server container image usage
- Security and Deployment Guidelines
- IBM Z Accelerated for NVIDIA Triton™ Inference Server Backends
- REST APIs
- Model Validation
- Using the Code Samples
- Additional Topics
- Limitations and Known Issues
- Versions and Release Cadence
- Frequently Asked Questions
- Technical Support
- Licenses
Triton Inference Server is a fast, scalable, open-source AI inference server that standardizes model deployment and execution and is streamlined and optimized for high performance. Triton Inference Server can deploy AI models such as deep learning (DL) and machine learning (ML) models.
A client program initiates an inference request to Triton Inference Server. Inference requests arrive at the server via HTTP/REST, GRPC, or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally performs batching of inference requests and then passes the requests to the backend module corresponding to the model type. The backend performs inferencing using the (pre-processed) inputs to produce the requested outputs. The outputs are then returned to the client program.
Triton Inference Server provides a backend API, exposed as a C API, that allows backend libraries and frameworks to be integrated in an optimized manner. This enables ONNX-MLIR support on IBM Z.
The Triton Inference Server Python backend enables pre-processing, post-processing, and serving of models written in the Python programming language. This enables IBM Snap ML support on IBM Z.
The models being served by Triton Inference Server can be queried and controlled by a dedicated model management API that is available over HTTP/REST or GRPC, or through the C API.
As shown in the architecture diagram above, the model repository is a file-system-based repository of the models that Triton Inference Server will make available for deployment.
On IBM® z16™ and later (running Linux on IBM Z or IBM® z/OS® Container Extensions (IBM zCX)), the Triton Inference Server 2.33.0 Python backend for IBM Snap ML, as well as custom backends such as ONNX-MLIR, will leverage new inference acceleration capabilities that transparently target the IBM Integrated Accelerator for AI through the IBM z Deep Neural Network (zDNN) library. The IBM zDNN library contains a set of primitives that support Deep Neural Networks. These primitives transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. No changes to the original model are needed to take advantage of the new inference acceleration capabilities.
Please visit the section Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image to get started.
Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image requires credentials for the IBM Z and LinuxONE Container Registry, icr.io.
Documentation on obtaining credentials to icr.io is located here.
Once credentials to icr.io are obtained and have been used to log in to the registry, you may pull (download) the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image with the following code block:
docker pull icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:X.Y.Z
In the docker pull command illustrated above, the version specified is X.Y.Z. This is based on the version available in the IBM Z and LinuxONE Container Registry.
To remove the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image, please follow the commands in the code block:
# Find the Image ID from the image listing
docker images
# Remove the image
docker rmi <IMAGE ID>
*Note. This documentation will refer to image/containerization commands in terms of Docker. If you are utilizing Podman, please replace docker with podman when using our example code snippets.
To view a brief overview of the operating system version, software versions and
content installed in the container, as well as any release notes for each
released container image version, please visit the releases
section of this
GitHub Repository, or you can click
here.
For documentation on how to serve models with Triton Inference Server, please visit the official open source Triton Inference Server documentation.
For brief examples on deploying models with Triton Inference Server, please visit our samples section.
Launching and maintaining the IBM Z Accelerated for NVIDIA Triton™ Inference Server revolves around the official quick start tutorial.
This documentation will cover:
- Creating a Model Repository
- Launching IBM Z Accelerated for NVIDIA Triton™ Inference Server
- Deploying the model
Users can follow the steps described in the model repository documentation to create a model repository. The following steps launch the Triton Inference Server.
By default, the services of the IBM Z Accelerated for NVIDIA Triton™ Inference Server Docker container listen on the following ports.
Service Name | Port |
---|---|
HTTP – Triton Inference Server | 8000 |
GRPC – Triton Inference Server | 8001 |
HTTP – Metrics | 8002 |
IBM Z Accelerated for NVIDIA Triton™ Inference Server can be launched by running the following command.
docker run --shm-size 1G --rm \
-p <EXPOSE_HTTP_PORT_NUM>:8000 \
-p <EXPOSE_GRPC_PORT_NUM>:8001 \
-p <EXPOSE_METRICS_PORT_NUM>:8002 \
-v $PWD/models:/models <triton_inference_server_image> tritonserver \
--model-repository=/models
Use the IBM Z Accelerated for NVIDIA Triton™ Inference Server's REST API endpoint to verify that the server and the models are ready for inferencing. From the host system, use curl to access the HTTP endpoint that provides the server status.
curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
The HTTP request returns status 200 if IBM Z Accelerated for NVIDIA Triton™ Inference Server is ready and a non-200 status if it is not ready.
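These readiness checks can also be scripted. Below is a minimal sketch using the tritonclient HTTP client; it assumes `tritonclient[http]` is installed on the client host and uses "model_1" as a placeholder model name.

```python
# Minimal readiness-check sketch using the tritonclient HTTP client.
# Assumes: pip install tritonclient[http]; "model_1" is a placeholder model name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Mirrors `curl localhost:8000/v2/health/live`, `/v2/health/ready`,
# and the per-model readiness endpoint.
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("model_1"))
```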
Once the model is available, either on the system or in a model repository, the model can be deployed automatically by specifying the path to the model location when launching the Triton Inference Server.
Command to start the Triton Inference Server:
docker run --shm-size 1G --rm \
-p <TRITONSERVER_HTTP_PORT_NUM>:8000 \
-p <TRITONSERVER_GRPC_PORT_NUM>:8001 \
-p <TRITONSERVER_METRICS_PORT_NUM>:8002 \
-v $PWD/models:/models <triton_inference_server_image> tritonserver \
--model-repository=/models
Configuring Triton Inference Server for HTTPS and secure gRPC ensures secure and encrypted communication channels, protecting the confidentiality, integrity, and authenticity of client and server data.
HTTPS (Hypertext Transfer Protocol Secure) is a secure version of the HTTP protocol used for communication between clients (such as web browsers) and servers over the internet. It provides encryption and secure data transmission by using SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocols.
Reverse proxy servers can help secure Triton Inference Server communications with HTTPS by protecting backend servers and enabling the secure communication and performance required to handle incoming requests well.
The HTTPS protocol ensures that the data exchanged between the client and server is encrypted and protected from eavesdropping or tampering. By configuring a reverse proxy server with HTTPS, you enable secure communication between the Triton Inference Server and clients, ensuring data confidentiality and integrity.
Triton Inference Server supports the gRPC protocol, a high-performance RPC framework. gRPC provides efficient and fast communication between clients and servers, making it ideal for real-time inferencing scenarios.
Using gRPC with Triton Inference Server offers benefits such as high performance, bidirectional streaming, support for client and server-side streaming, and automatic code generation for client and server interfaces.
SSL/TLS: gRPC supports SSL/TLS to authenticate the server and to encrypt all data exchanged between the gRPC client and the Triton Inference Server. Optional mechanisms are available for clients to provide certificates for mutual authentication.
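As a hedged illustration, a Python gRPC client can connect over SSL/TLS along the following lines. The host name and certificate file paths are placeholders, the keyword arguments follow the tritonclient gRPC client, and the client key and certificate are only needed when mutual authentication is configured.

```python
# Sketch: connecting to a TLS-enabled gRPC endpoint with the tritonclient gRPC client.
# Assumes: pip install tritonclient[grpc]; host and certificate paths are placeholders.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="triton.example.com:8001",       # placeholder host:port
    ssl=True,                            # enable SSL/TLS on the channel
    root_certificates="ca.crt",          # CA bundle used to verify the server
    private_key="client.key",            # client key (mutual TLS only)
    certificate_chain="client.crt",      # client certificate (mutual TLS only)
)
print("server ready:", client.is_server_ready())
```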
- For security and deployment best practices, please visit the common AI Toolkit documentation found here.
IBM Z Accelerated for NVIDIA Triton™ Inference Server supports two backends as of today.
Python backend: Triton Inference Server has a Python backend that allows you to deploy machine learning models written in Python for inference. This backend is known as the "Python backend" or "Python script backend."
More details about the Triton Python backend are documented here.
The Python backend model directory is laid out as shown below:
$ model_1
|-- 1
| |-- model.py
| `-- model.txt
`-- config.pbtxt
Every Python Backend model must provide a config.pbtxt file describing the model configuration.
Below is a sample config.pbtxt for the Python Backend:
max_batch_size: 32
input {
name: "IN0"
data_type: TYPE_FP32
dims: 5
}
output {
name: "OUT0"
data_type: TYPE_FP64
dims: 1
}
backend: "python"
Triton Inference Server exposes some flags to control the execution mode of models through the parameters section in the model's config.pbtxt file.
- Backend: The backend parameter must be provided as "python" when utilizing the Python Backend. For more details related to backends, see here.
backend: "python"
- Inputs and Outputs: Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model. For more details on input and output tensors, check the Triton Inference Server documentation here.
For more options see Model Configuration.
Using the Python backend in Triton is especially useful for deploying custom models or models developed with specific libraries that are not natively supported by Triton's other backends. It provides a flexible way to bring in your own machine learning code and integrate it with the server's inference capabilities. It provides flexibility by allowing you to use any Python machine learning library to define your model's inference logic.
NOTE:
- model.py must be present in the model repository to use the Python backend framework (a minimal model.py sketch is shown after this note).
- Multiple versions are supported; only positive values are supported as model versions.
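A minimal, hypothetical model.py sketch that matches the sample config.pbtxt above (IN0 as TYPE_FP32, OUT0 as TYPE_FP64) is shown below; the reduction logic is a placeholder for your own inference code.

```python
# Minimal Python-backend model.py sketch matching the sample config.pbtxt above.
# The sum-reduction is placeholder logic, not part of the product.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the FP32 input tensor "IN0" declared in config.pbtxt.
            in0 = pb_utils.get_input_tensor_by_name(request, "IN0").as_numpy()
            # Placeholder inference: reduce each row to a single FP64 value.
            out0 = np.sum(in0, axis=-1, keepdims=True).astype(np.float64)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUT0", out0)]
                )
            )
        return responses
```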
ONNX-MLIR backend: a Triton backend that allows the deployment of onnx-mlir or zDLC compiled models (model.so) with the Triton Inference Server. More details about the onnx-mlir backend are documented here.
The onnx-mlir backend model directory is laid out as shown below:
$ model_1
|-- 1
| `-- model.so
`-- config.pbtxt
Every ONNX-MLIR Backend model must provide a config.pbtxt file describing the model configuration.
Below is a sample config.pbtxt for the ONNX-MLIR Backend:
max_batch_size: 32
input {
name: "IN0"
data_type: TYPE_FP32
dims: 5
dims: 5
dims: 1
}
output {
name: "OUT0"
data_type: TYPE_FP64
dims: 1
}
backend: "onnxmlir"
- Backend: The backend parameter must be provided as "onnxmlir" when utilizing the ONNX-MLIR Backend. For more details related to backends, see here.
backend: "onnxmlir"
- Inputs and Outputs: Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model. For more details on input and output tensors, check the Triton Inference Server documentation here.
For more options see Model Configuration.
NOTE: Multiple versions are supported; only positive values are supported as model versions.
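For illustration, a client request against the sample ONNX-MLIR configuration above might look like the sketch below; the model name "model_1", the batch size of 1, and the random input values are assumptions.

```python
# Sketch: HTTP inference request for the sample ONNX-MLIR config above
# (IN0: TYPE_FP32 dims [5, 5, 1]; OUT0: TYPE_FP64 dims [1]; "model_1" is a placeholder).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 5, 5, 1).astype(np.float32)  # leading dim is the batch
in0 = httpclient.InferInput("IN0", list(data.shape), "FP32")
in0.set_data_from_numpy(data)

result = client.infer(
    model_name="model_1",
    inputs=[in0],
    outputs=[httpclient.InferRequestedOutput("OUT0")],
)
print("OUT0:", result.as_numpy("OUT0"))
```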
Snap Machine Learning (Snap ML for short) is a library for training and scoring traditional machine learning models with end-to-end inferencing capabilities. As part of the preprocessing pipeline it supports Normalizer, KBinsDiscretizer, and one-hot encoding transformers. This backend supports importing tree ensemble models that were trained with other frameworks (e.g., scikit-learn, XGBoost, LightGBM), so one can leverage the Integrated On-Chip Accelerator on IBM Z and IBM LinuxONE transparently via Snap ML's accelerated inference engine. For more information on supported models, refer to the Snap ML documentation here.
In order to deploy any model on Triton Inference Server, one should have a model repository and model configuration ready.
Model Configuration: The model configuration (config.pbtxt) in Triton Inference Server defines metadata, optimization settings, and customised parameters for each model. This configuration ensures models are served with optimal performance and tailored behaviour. For more details, see Model Configuration.
Like all Triton backends, models deployed via the Snap ML C++ Backend make use of a specially laid-out "model repository" directory containing at least one serialized model and a "config.pbtxt" configuration file.
Typically, a models directory should look like below:
models
└── test_snapml_model
    ├── 1
    │   ├── model.pmml
    │   └── pipeline.json
    └── config.pbtxt
Given above is a sample for the pmml format. Users should change the model format as per the chosen model framework.
pipeline.json: This file is optional and is required only when pre-processing is chosen.
Every Snap ML C++ Backend model must provide a config.pbtxt file describing the model configuration.
Below is a sample config.pbtxt for the Snap ML C++ Backend:
max_batch_size: 32
input {
name: "IN0"
data_type: TYPE_FP32
dims: 5
}
output {
name: "OUT0"
data_type: TYPE_FP64
dims: 1
}
instance_group {
count: 2
kind: KIND_CPU
}
dynamic_batching {
preferred_batch_size: 32
max_queue_delay_microseconds: 25000
}
parameters {
key: "MODEL_FILE_FORMAT"
value {
string_value: "pmml"
}
}
backend: "ibmsnapml"
- Backend: The backend parameter must be provided as "ibmsnapml" when utilizing the Snap ML C++ Backend.
backend: "ibmsnapml"
- PREPROCESSING_FILE: This configuration file parameter specifies the name of the file to be used for preprocessing. The preprocessing file must be named 'pipeline.json' when preprocessing is selected for end-to-end inferencing. If this parameter is not provided, the backend assumes that preprocessing is not required, even if pipeline.json is present in the model repository.
parameters { key: "PREPROCESSING_FILE" value { string_value: "pipeline.json" } }
Note: The 'pipeline.json' file is the serialised preprocessing pipeline captured during training using the Snap ML API (export_preprocessing_pipeline(pipeline_xgb['preprocessor'], 'pipeline.json')). For more details, visit here.
- SNAPML_TARGET_CLASS: This configuration parameter specifies which model needs to be imported by the Snap ML C++ Backend, as per the Snap ML documentation here.
parameters { key: "SNAPML_TARGET_CLASS" value { string_value: "BoostingMachineClassifier" } }
For example, if the pre-trained model is 'sklearn.ensemble.RandomForestClassifier', then the target Snap ML class should be 'snapml.RandomForestClassifier'. In the case of the C++ backend, the same target class is referred to as 'RandomForestClassifier' as the value for SNAPML_TARGET_CLASS in the config.pbtxt.
- MODEL_FILE_FORMAT: The model file format provided in the configuration file should be as per the Snap ML supported models. Refer to the Snap ML documentation here.
parameters { key: "MODEL_FILE_FORMAT" value { string_value: "pmml" } }
- NUM_OF_PREDICT_THREADS: This defines the number of CPU threads running for each inference call.
parameters { key: "NUM_OF_PREDICT_THREADS" value { string_value: "12" } }
- PREDICT_PROBABILITY: This configuration parameter controls whether the model's prediction is returned in terms of probability. If set to true, the model will return a probability value as the response; the probability value will always be for the positive class label. If set to false (or the parameter is omitted), the model will return the most likely predicted class label without probabilities; this is the default behaviour. Note that PREDICT_PROBABILITY is case sensitive and accepts only 'true' or 'false'. If any other value (such as True/False/1/0) is provided, 'false' will be used by default. Note: the probability will always be for the positive class label.
parameters { key: "PREDICT_PROBABILITY" value { string_value: "true" } }
- Inputs and Outputs: Each model input and output must specify a name, datatype, and shape (for more details on input and output tensors, check the Triton Inference Server documentation here). The name specified for an input or output tensor must match the name expected by the model. An input shape indicates the shape of an input tensor expected by the model and by the Triton inference request. An output shape indicates the shape of an output tensor produced by the model and returned by Triton in response to an inference request. Both input and output shapes must have rank greater than or equal to 1; that is, the empty shape [ ] is not allowed. In the case of preprocessing, the data type of the input must be TYPE_STRING and the output tensor can be either TYPE_FP64 or TYPE_FP32. For more details on the tensor data types supported by Triton Inference Server, visit here.
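For illustration, a client request for the sample Snap ML configuration above (without preprocessing, so IN0 is TYPE_FP32) could look like the sketch below; the model name matches the repository layout shown earlier, and the feature values are placeholders. With PREDICT_PROBABILITY set to "true", OUT0 carries the positive-class probability.

```python
# Sketch: HTTP inference request for the sample Snap ML config above
# (IN0: TYPE_FP32 dims [5]; OUT0: TYPE_FP64 dims [1]; no preprocessing).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

features = np.array([[0.1, 0.2, 0.3, 0.4, 0.5]], dtype=np.float32)  # batch of 1
in0 = httpclient.InferInput("IN0", list(features.shape), "FP32")
in0.set_data_from_numpy(features)

result = client.infer(
    model_name="test_snapml_model",
    inputs=[in0],
    outputs=[httpclient.InferRequestedOutput("OUT0")],
)
print("OUT0:", result.as_numpy("OUT0"))
```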
- Model Management
- Health Check
- Inference
- Logging
- Metrics Collection
- Traces
- Statistics
- Server Metadata
Triton Inference Server operates in one of three model control modes: NONE, EXPLICIT, or POLL. The model control mode determines how Triton Inference Server handles changes to the model repository and which protocols and APIs are available.
More details about model management can be found here
The model-repository extension allows a client to query and control the one or more model repositories being served by Triton Inference Server.
- Index API
POST v2/repository/index
- Load API
POST v2/repository/models/${MODEL_NAME}/load
- Unload API
POST v2/repository/models/${MODEL_NAME}/unload
For more details about the model repository index, load and unload API calls please visit the Triton documentation website link here
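As a brief illustration, these repository endpoints can be called directly over HTTP. The sketch below uses the Python requests library; "my_model" is a placeholder, and the load/unload calls assume the server was started in EXPLICIT model control mode.

```python
# Sketch: repository index / load / unload over HTTP with the requests library.
# "my_model" is a placeholder; load/unload require EXPLICIT model control mode.
import requests

BASE = "http://localhost:8000"

# Index: list the models known to the repository.
print(requests.post(f"{BASE}/v2/repository/index").json())

# Load and unload a specific model.
requests.post(f"{BASE}/v2/repository/models/my_model/load").raise_for_status()
requests.post(f"{BASE}/v2/repository/models/my_model/unload").raise_for_status()
```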
The model configuration extension allows Triton Inference Server to return information about a model's configuration.
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config
For more details about the model configuration API calls please visit the Triton documentation website link here
The per-model metadata endpoint provides the following details:
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
- name : The name of the model.
- versions : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
- platform : The framework/backend for the model. See Platforms.
- inputs : The inputs required by the model.
- outputs : The outputs produced by the model.
For more details about the model configuration API calls please visit the Triton documentation website link here
Health Check API provides status of Triton Inference Server, Model etc.
GET v2/health/live
GET v2/health/ready
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
For more details about the health check API calls please visit the kserve documentation website link here
An inference request is made with an HTTP POST to an inference endpoint.
POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
For more details about the inference API calls please visit the kserve documentation website link here
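For illustration, a raw KServe v2 inference request can be posted as JSON, as in the sketch below; the model name "model_1" and the IN0/OUT0 tensor names are assumptions matching the sample configurations in this document.

```python
# Sketch: raw KServe v2 inference request over HTTP with the requests library.
# "model_1" and the IN0/OUT0 tensor names are placeholders.
import requests

payload = {
    "inputs": [
        {
            "name": "IN0",
            "shape": [1, 5],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4, 0.5],  # flattened tensor contents
        }
    ],
    "outputs": [{"name": "OUT0"}],
}
resp = requests.post("http://localhost:8000/v2/models/model_1/infer", json=payload)
resp.raise_for_status()
print(resp.json())
```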
- Classification
POST /v2/models/<model_name>/infer --data <JSON_Data>
The classification extension allows Triton Inference Server to return an output as a classification index and (optional) label instead of returning the output as raw tensor data. Because this extension is supported, Triton Inference Server reports “classification” in the extensions field of its Server Metadata.
For more details about the classification API calls please visit the Triton documentation website link here
- Binary data
POST /v2/models/<model_name>/infer
The binary tensor data extension allows Triton Inference Server to support tensor data represented in a binary format in the body of an HTTP/REST request.
For more details about the binary data please visit the Triton documentation website link here
Managing and troubleshooting machine learning models on Triton Inference Server can be effectively accomplished by configuring the logging settings and monitoring the logs.
Explore more command line options:
docker run <triton_inference_server_image> tritonserver [options]
GET v2/logging
POST v2/logging
{
  "logging": {
    "log-verbose": false,
    "log-info": true,
    "log-warning": true,
    "log-error": true,
    "log-file": "triton.log"
  }
}
View the logs:
tail -f /path/to/log/directory/triton.log
Logs will be written to (or overwrite) the specified file during the server runtime.
NOTE: Triton Inference Server allows creation of 40 log files in total: 20 for each protocol (HTTP and GRPC).
For more details about the logging API calls please visit the Triton documentation website link here
Triton Inference Server provides Prometheus metrics indicating CPU and request statistics.
GET /metrics
For more details about the metrics collections please visit the Triton documentation website link here
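As a quick illustration, the metrics endpoint can be scraped from the metrics port (8002 in the docker run examples above); the response is Prometheus text format.

```python
# Sketch: scraping the Prometheus metrics endpoint on the metrics port.
import requests

metrics = requests.get("http://localhost:8002/metrics").text
# Print only the first few lines; the full output is Prometheus text format.
print("\n".join(metrics.splitlines()[:10]))
```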
The trace extension enables a client to fetch or configure trace settings for a given model while Triton Inference Server is running.
GET v2[/models/${MODEL_NAME}]/trace/setting
POST v2[/models/${MODEL_NAME}]/trace/setting
For more details about the trace API calls please visit the Triton documentation website link here
GET v2/models[/${MODEL_NAME}[/versions/${MODEL_VERSION}]]/stats
For more details about the statistics API calls please visit the Triton documentation website link here
The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint.
GET v2
For more details about the server metadata API calls please visit the kserve documentation website link here
Various models that were trained on x86 or IBM Z have demonstrated focused optimizations that transparently target the IBM Integrated Accelerator for AI for a number of compute-intensive operations during inferencing.
Note. Models that were trained in the x86 ecosystem may encounter endianness issues.
The purpose of this section is to provide details on how to deploy a Random Forest Classifier model trained with scikit-learn on Triton Inference Server, leveraging Snap ML and the Python backend.
- As per the documentation, create the Python runtime
- Train a scikit-learn Random Forest Classifier (a training sketch is shown after this list). This will generate the model.pmml file
- Create the model folder structure, including the relevant files needed for deployment with Triton Inference Server
- Deploy the model
For more details, the sample model is documented here
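As a rough illustration of the training step above, the following scikit-learn sketch trains a Random Forest Classifier; the dataset, feature count, and hyperparameters are arbitrary placeholders, and exporting the trained model to the PMML file deployed above is handled by the tooling in the linked sample rather than shown here.

```python
# Sketch: training a scikit-learn RandomForestClassifier on placeholder data.
# Export to PMML / the deployed model format is covered by the linked sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print("training accuracy:", clf.score(X, y))
```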
The purpose of this section is to provide details on how to deploy an onnx-mlir compiled model trained with CNTK, a deep learning toolkit, on Triton Inference Server, leveraging the onnx-mlir backend: a Convolutional Neural Network with MNIST.
- Jupyter notebook for Convolutional Neural Network with MNIST can be found here
- Use step 1 to train the model on your own, or download a pre-trained model from here
- Use IBM zDLC to compile the ONNX model into a model.so file
- Import the compiled model into the model repository.
- Create the model folder structure, including the relevant files needed for deployment
- Deploy the model
For more details, the sample model is documented here
See Verified for the list of models from the ONNX Model Zoo that have been built and verified with the IBM Z Deep Learning Compiler.
- Import the compiled model into the model repository.
- Create the model structure including relevant files that are needed to be deployed.
- Deploy the model on to Triton Inference Server.
The supported tensor data types are shown in the following table, along with the size of each type in bytes.
Data Type | Size (bytes) |
---|---|
BOOL | 1 |
UINT8 | 1 |
UINT16 | 2 |
UINT32 | 4 |
UINT64 | 8 |
INT8 | 1 |
INT16 | 2 |
INT32 | 4 |
INT64 | 8 |
FP16 | 2 |
FP32 | 4 |
FP64 | 8 |
BYTES | Variable (max 2^32) |
The Triton response cache is used by Triton to hold inference results generated for previously executed inference requests and to return them as the response when a new inference request hits the cache.
For more details, the Triton response cache is documented here
A repository agent allows you to introduce code that will perform authentication, decryption, conversion, or similar operations when a model is loaded.
For more details, the repository agent (checksum_repository_agent) is documented here
The following examples show how the version policy is applied when models with different versions are available in the model repository.
|-- model_1
| |-- -1
| | `-- model.txt
| `-- config.pbtxt
`-- model_2
|-- -1
| `-- model.so
`-- config.pbtxt
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
| Model     | Version | Status                                                                                                         |
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
| model_1   | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_1/18446744073709551615/model.txt' for model 'model_1' |
| model_2   | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_2/18446744073709551615/model.so' for model 'model_2'  |
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
|-- model_1
| |-- 0
| | |-- model.py
| | `-- model.txt
| `-- config.pbtxt
`-- model_2
|-- 0
| `-- model.so
`-- config.pbtxt
I0718 12:07:48.131561 1 server.cc:653
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| model_1 | 0 | READY |
| model_2 | 0 | READY |
+-----------+---------+--------+
|-- model_1
| |-- 1
| | |-- model.py
| | `-- model.txt
| `-- config.pbtxt
`-- model_2
|-- 1
| `-- model.so
`-- config.pbtxt
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| model_1 | 1 | READY |
| model_2 | 1 | READY |
+-----------+---------+--------+
Each model can have one or more versions. The ModelVersionPolicy property of the model configuration is used to set one of the following policies.
- All: Load all versions of the model. All versions of the model that are available in the model repository are available for inferencing. version_policy: { all: {}}
- Latest: Only the latest 'n' versions of the model in the repository are available for inferencing. The latest versions of the model are the numerically greatest version numbers. version_policy: { latest: { num_versions: 2}}
- Specific: Only the specifically listed versions of the model are available for inferencing. version_policy: { specific: { versions: [1,3]}}
If no version policy is specified, then Latest (with n=1) is used as the default, indicating that only the most recent version of the model is made available by Triton. In all cases, the addition or removal of version subdirectories from the model repository can change which model version is used on subsequent inference requests.
|-- model_1
| |-- -1
| | `-- model.txt
| |-- 4
| | |-- model.py
| | `-- model.txt
`-- model_2
|-- -1
| `-- model.so
|-- 3
`-- model.so
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
| Model     | Version | Status                                                                                                         |
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
| model_1   | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_1/18446744073709551615/model.txt' for model 'model_1' |
| model_1   | 4       | UNLOADING                                                                                                      |
| model_2   | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_2/18446744073709551615/model.so' for model 'model_2'  |
| model_2   | 3       | UNAVAILABLE: unloaded                                                                                          |
+-----------+---------+----------------------------------------------------------------------------------------------------------------+
error: creating server: Internal - failed to load all models
Use version_policy: { latest: { num_versions: 2}} to load the latest 2 versions of the model. The default is the highest version of a model.
|-- model_1
| |-- -1
| | |-- model.py
| | `-- model.txt
| |-- 13
| | |-- model.py
| | `-- model.txt
| |-- 17
| | |-- model.py
| | `-- model.txt
`-- model_2
|-- -1
| `-- model.so
|-- 15
| `-- model.so
|-- 3
`-- model.so
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| model_1 | 13 | READY |
| model_1 | 17 | READY |
| model_2 | 9 | READY |
| model_2 | 15 | READY |
+-----------+---------+--------+
Use version_policy: { specific: { versions: [4]}} to load specific versions of the model – model_1.
Use version_policy: { specific: { versions: [3,9]}} to load specific versions of the model – model_2.
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| model_1 | 4 | READY |
| model_2 | 3 | READY |
| model_2 | 9 | READY |
+-----------+---------+--------+
-- model_1
|-- 1
| |-- model.py
| |-- model.txt
`-- config.pbtxt
`-- model_1
|-- 1
| |-- model.py
| `-- model.txt
|-- 2
| |-- model.py
| `-- model.txt
`-- config.pbtxt
|-- model_1
| |-- 1
| | |-- model.py
| | `-- model.txt
| `-- config.pbtxt
|-- model_2
| |-- 0
| | `-- model.so
| `-- config.pbtxt
|-- model_1
| |-- 1
| | |-- model.py
| | |-- model.txt
| |-- 2
| | |-- model.py
| | |-- model.txt
|-- model_3
| |-- 1
| | |-- model.py
| | |-- model_rfd10.pmml
| | |-- pipeline_rfd10.joblib
| |-- 2
| | |-- model.py
| | |-- model_rfd10.pmml
| | |-- pipeline_rfd10.joblib
- Consumers of Triton Inference Server may, when utilizing Triton Inference Server with a Python backend and an HTTP endpoint on a Big Endian machine, experience errors related to the TYPE_STRING datatype. For more details, see link
- Consumers of Triton Inference Server may face an issue when running Triton Inference Server on a Big Endian machine, specifically related to GRPC calls with BYTES input. It appears that the current configuration may not fully support GRPC calls with BYTES input. For more details, see link
- Users of Triton Inference Server can restrict access to the protocols on a given endpoint by leveraging the configuration option '--grpc-restricted-protocol'. This feature provides fine-grained control over access to various endpoints by specifying protocols and associated restricted keys and values. A similar capability for restricting endpoint access for the HTTP protocol is currently not available. For more details, see link
- Consumers of the Triton server will only be able to create up to 40 log files in total, of which 20 are for the HTTP protocol and 20 for the GRPC protocol. For more details, see link
- Consumers of the Triton server may face an issue when a model has version -1 or when model.py is not present for the Python backend. For more details, see link
IBM Z Accelerated for NVIDIA Triton™ Inference Server will follow the semantic versioning guidelines with a few deviations. Overall IBM Z Accelerated for NVIDIA Triton™ Inference Server follows a continuous release model with a cadence of 1-2 minor releases per year. In general, bug fixes will be applied to the next minor release and not back ported to prior major or minor releases. Major version changes are not frequent and may include features supporting new IBM Z hardware as well as major feature changes in Triton Inference Server that are not likely backward compatible.
Each release version of IBM Z Accelerated for NVIDIA Triton™ Inference Server has the form MAJOR.MINOR.PATCH (X.Y.Z). For example, IBM Z Accelerated for NVIDIA Triton™ Inference Server version 1.2.3 has MAJOR version 1, MINOR version 2, and PATCH version 3. Changes to each number have the following meaning:
All releases with the same major version number will have API compatibility. Major version numbers will remain stable; for instance, 1.Y.Z may last 1 year or more. A new major release will potentially have backwards-incompatible changes: code and data that worked with a previous major release will not necessarily work with the new release.
Note: pybind11 PyRuntimes and any other Python packages for versions of Python that have reached end of life can be removed or updated to a newer stable version without a major release increase.
Minor releases will typically contain new backward compatible features, improvements, and bug fixes.
Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general, these releases are designed to patch bugs.
Feature releases for IBM Z Accelerated for NVIDIA Triton™ Inference Server occur about every 6 months in general. Hence, IBM Z Accelerated for NVIDIA Triton™ Inference Server X.3.0 would generally be released about 6 months after X.2.0. Maintenance releases happen as needed in between feature releases. Major releases do not happen according to a fixed schedule.
Please visit the link here, or read the section titled Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image.
You may run the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image on IBM Linux on Z or IBM® z/OS® Container Extensions (IBM zCX).
Note. The IBM Z Accelerated for NVIDIA Triton™ Inference Server will transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. However, if using the IBM Z Accelerated for NVIDIA Triton™ Inference Server on either an IBM z15® or an IBM z14®, IBM Snap ML or ONNX-MLIR will transparently target the CPU with no changes to the model.
Error Type | Description |
---|---|
Model load errors | These errors occur when the server fails to load the machine learning model. Possible reasons could be incorrect model configuration, incompatible model format, or missing model files. |
Backend errors | Triton supports multiple backends for running models, such as Python and ONNX-MLIR. Errors can occur if there are issues with the backend itself, such as version compatibility problems or unsupported features. |
Input data errors | When sending requests to the Triton server, issues might arise with the input data provided by the client. This could include incorrect data types, shape mismatches, or missing required inputs. Any valid request batched with an invalid request might lead to an inaccurate or invalid response from the inference server for the entire batch. |
Inference errors | Errors during the inference process can happen due to problems with the model's architecture or issues within the model's code. |
Resource errors | Triton uses system resources like CPU and memory to perform inference. Errors can occur if there are resource allocation problems or resource constraints are not handled properly. |
Networking errors | Triton is a server that communicates with clients over the network. Network-related issues such as timeouts, connection problems, or firewall restrictions can lead to errors. |
Configuration errors | Misconfigurations in the Triton server settings or environment variables could result in unexpected behavior or failures. |
Scaling errors | When deploying Triton in a distributed or multi-instance setup, errors can occur due to load balancing issues, communication problems between instances, or synchronization failures. |
Information regarding technical support can be found here.
The International License Agreement for Non-Warranted Programs (ILAN) can be found here
The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.
NVIDIA and Triton are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.
Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein.
IBM, the IBM logo, and ibm.com, IBM z16, IBM z15, IBM z14 are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks can be found here.