Using the IBM Z Accelerated for NVIDIA Triton™ Inference Server Container Image

Table of contents

Overview

Triton Inference Server is a fast, scalable, open-source AI inference server that streamlines and optimizes model deployment and execution by standardizing them for high performance. Triton Inference Server can deploy AI models such as deep learning (DL) and machine learning (ML) models.

A client program initiates the inference request to Triton Inference Server. Inference requests arrive at the server via HTTP/REST, gRPC, or the C API and are then routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model's scheduler optionally batches inference requests and then passes them to the backend module corresponding to the model type. The backend performs inferencing using the (pre-processed) inputs to produce the requested outputs, which are then returned to the client program.

Triton Inference Server provides a backend API that exposes a C API through which backend libraries and frameworks can be integrated in an optimized manner. This is what enables ONNX-MLIR support on IBM Z.

The Triton Inference Server Python backend enables pre-processing, post-processing, and serving of models written in the Python programming language. This is what enables IBM Snap ML support on IBM Z.

The models being served by Triton Inference Server can be queried and controlled by a dedicated model management API that is available via the HTTP/REST or gRPC protocol, or via the C API.

As shown in the architecture diagram above, the model repository is a file-system-based repository of the models that Triton Inference Server will make available for deployment.

On IBM® z16™ and later (running Linux on IBM Z or IBM® z/OS® Container Extensions (IBM zCX)), Triton Inference Server 2.33.0 with the Python backend for IBM Snap ML or a custom backend such as ONNX-MLIR will leverage new inference acceleration capabilities that transparently target the IBM Integrated Accelerator for AI through the IBM z Deep Neural Network (zDNN) library. The IBM zDNN library contains a set of primitives that support Deep Neural Networks. These primitives transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. No changes to the original model are needed to take advantage of the new inference acceleration capabilities.

Please visit the section Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image to get started.

Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image

Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image requires credentials for the IBM Z and LinuxONE Container Registry, icr.io.

Documentation on obtaining credentials to icr.io is located here.


Once credentials to icr.io are obtained and have been used to login to the registry, you may pull (download) the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image with the following code block:

docker pull icr.io/ibmz/ibmz-accelerated-for-nvidia-triton-inference-server:X.Y.Z

In the docker pull command illustrated above, X.Y.Z is a placeholder for the version. Replace it with a version available in the IBM Z and LinuxONE Container Registry.


To remove the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image, please follow the commands in the code block:

# Find the Image ID from the image listing
docker images

# Remove the image
docker rmi <IMAGE ID>

*Note. This documentation will refer to image/containerization commands in terms of Docker. If you are utilizing Podman, please replace docker with podman when using our example code snippets.

Container Image Contents

To view a brief overview of the operating system version, software versions and content installed in the container, as well as any release notes for each released container image version, please visit the releases section of this GitHub Repository, or you can click here.

IBM Z Accelerated for NVIDIA Triton™ Inference Server container image usage

For documentation on how to serve models with Triton Inference Server, please visit the official open-source Triton Inference Server documentation.

For brief examples on deploying models with Triton Inference Server, please visit our samples section

Launch IBM Z Accelerated for NVIDIA Triton™ Inference Server container

Launching and maintaining the IBM Z Accelerated for NVIDIA Triton™ Inference Server revolves around the official quick start tutorial.

This documentation will cover:

Creating a Model Repository

Users can follow the steps described in the model repository documentation to create a model repository. The following steps launch the Triton Inference Server.

Launching IBM Z Accelerated for NVIDIA Triton™ Inference Server

By default, the services of the IBM Z Accelerated for NVIDIA Triton™ Inference Server Docker container listen on the following ports.

| Service Name                   | Port |
|--------------------------------|------|
| HTTP – Triton Inference Server | 8000 |
| GRPC – Triton Inference Server | 8001 |
| HTTP – Metrics                 | 8002 |

IBM Z Accelerated for NVIDIA Triton™ Inference Server can be launched by running the following command.

docker run --shm-size 1G --rm \
    -p <EXPOSE_HTTP_PORT_NUM>:8000 \
    -p <EXPOSE_GRPC_PORT_NUM>:8001 \
    -p <EXPOSE_METRICS_PORT_NUM>:8002 \
    -v $PWD/models:/models <triton_inference_server_image> tritonserver \
    --model-repository=/models

Use the IBM Z Accelerated for NVIDIA Triton™ Inference Server's REST API endpoint to verify whether the server and the models are ready for inferencing. From the host system, use curl to access the HTTP endpoint that provides the server status.

curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain

The HTTP request returns status 200 if IBM Z Accelerated for NVIDIA Triton™ Inference Server is ready and non-200 if it is not ready.
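
As an alternative to curl, the readiness checks can be scripted. Below is a minimal sketch using the tritonclient Python package (an assumption, installable with pip install tritonclient[http]); the server address and the model name are placeholders.

```python
# Minimal readiness-check sketch using the tritonclient HTTP client
# (assumes `pip install tritonclient[http]`; host, port, and the model
# name "model_1" are placeholders).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# True once the server process is up and accepting requests.
print("server live :", client.is_server_live())

# True once the server and its loaded models are ready for inferencing.
print("server ready:", client.is_server_ready())

# Per-model readiness check.
print("model ready :", client.is_model_ready("model_1"))
```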

Security and Deployment Guidelines

Once the model is available either on a system or in a model repository, it can be deployed automatically by specifying the path to the model location while launching the Triton Inference Server.

Command to start the Triton Inference Server

docker run --shm-size 1G --rm \
    -p <TRITONSERVER_HTTP_PORT_NUM>:8000 \
    -p <TRITONSERVER_GRPC_PORT_NUM>:8001 \
    -p <TRITONSERVER_METRICS_PORT_NUM>:8002 \
    -v $PWD/models:/models <triton_inference_server_image>   tritonserver \
    --model-repository=/models

Triton Inference Server using HTTPS/Secure gRPC

Configuring Triton Inference Server for HTTPS and secure gRPC ensures secure and encrypted communication channels, protecting the confidentiality, integrity, and authenticity of client and server data.

HTTPS

HTTPS (Hypertext Transfer Protocol Secure) is a secure version of the HTTP protocol used for communication between clients (such as web browsers) and servers over the internet. It provides encryption and secure data transmission by using SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocols.

Reverse proxy servers can help secure Triton Inference Server communications with HTTPS by protecting backend servers and providing the secure communication and performance required to handle incoming requests.

The HTTPS protocol ensures that the data exchanged between the client and server is encrypted and protected from eavesdropping or tampering. By configuring a reverse proxy server with HTTPS, you enable secure communication between the Triton Inference Server and clients, ensuring data confidentiality and integrity.

Secure gRPC

Triton Inference Server supports the gRPC protocol, a high-performance RPC framework. gRPC provides efficient and fast communication between clients and servers, making it ideal for real-time inferencing scenarios.

Using gRPC with Triton Inference Server offers benefits such as high performance, bidirectional streaming, support for client and server-side streaming, and automatic code generation for client and server interfaces.

SSL/TLS: gRPC supports the use of SSL/TLS to authenticate the server and to encrypt all data exchanged between the gRPC client and the Triton Inference Server. Optional mechanisms are available for clients to provide certificates for mutual authentication.
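
For illustration, a hedged sketch of a TLS-enabled gRPC connection using the tritonclient Python package is shown below; the certificate file names and the endpoint are placeholders, and the exact options should be verified against the tritonclient documentation.

```python
# Sketch of a TLS-enabled gRPC connection to Triton using the tritonclient
# Python package (assumes `pip install tritonclient[grpc]`). File names and
# the endpoint are placeholders for illustration only.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="triton.example.com:8001",   # gRPC endpoint (placeholder)
    ssl=True,                        # enable SSL/TLS on the channel
    root_certificates="ca.crt",      # CA certificate used to verify the server
    private_key="client.key",        # client key (only needed for mutual TLS)
    certificate_chain="client.crt",  # client certificate (only for mutual TLS)
)

print(client.is_server_ready())
```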

  • For security and deployment best practices, please visit the common AI Toolkit documentation found here.

IBM Z Accelerated for NVIDIA Triton™ Inference Server Backends

IBM Z Accelerated for NVIDIA Triton™ Inference Server supports the following backends as of today.

Python Backend

Triton Inference Server has a Python backend that allows you to deploy machine learning models written in Python for inference. This backend is known as the "Python backend" or "Python script backend."

More details about the Triton Python backend are documented here

The format of a Python backend model directory looks like the following:

$ model_1
   |-- 1
   |   |-- model.py
   |   `-- model.txt
   `-- config.pbtxt

Minimal Model Configuration

Every Python backend model must provide a config.pbtxt file describing the model configuration. Below is a sample config.pbtxt for the Python backend:

max_batch_size: 32
input {
  name: "IN0"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "OUT0"
  data_type: TYPE_FP64
  dims: 1
}
backend: "python"

Configuration Parameters:

Triton Inference Server exposes flags to control the execution mode of models through the parameters section in the model's config.pbtxt file.

  • Backend : The backend parameter must be provided as "python" while utilising the Python backend. For more details related to backends, see here

    backend: "python"
    
  • Inputs and Outputs: Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model. For more details on input and output tensors, check the Triton Inference Server documentation here

For more options see Model Configuration.

Using the Python backend in Triton is especially useful for deploying custom models or models developed with specific libraries that are not natively supported by Triton's other backends. It provides a flexible way to bring in your own machine learning code, using any Python machine learning library to define your model's inference logic, and to integrate it with the server's inference capabilities.

NOTE:

  1. model.py must be present in the model repository to use the Python backend framework (a minimal sketch is shown below).
  2. Multiple versions are supported; only positive values are supported as model versions.
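
For illustration only, a minimal model.py for the Python backend could look like the following sketch. The TritonPythonModel structure follows the open-source Triton Python backend interface; the tensor names IN0/OUT0 mirror the sample config.pbtxt above, and the arithmetic is a stand-in for real model logic rather than the actual IBM sample.

```python
# Illustrative-only sketch of a Python backend model.py. The
# TritonPythonModel interface comes from the Triton Python backend;
# the inference logic below is a placeholder, not a real Snap ML model.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the JSON-serialized config.pbtxt content.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input tensor declared as "IN0" in config.pbtxt.
            in0 = pb_utils.get_input_tensor_by_name(request, "IN0").as_numpy()

            # Placeholder logic; a real model would call Snap ML or another
            # Python library here.
            out0 = np.sum(in0, axis=-1, keepdims=True).astype(np.float64)

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUT0", out0)]
                )
            )
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        pass
```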

ONNX-MLIR Backend

A Triton backend that allows the deployment of onnx-mlir or zDLC compiled models (model.so) with the Triton Inference Server. More details about the onnx-mlir backend are documented here

The format of an onnx-mlir backend model directory looks like the following:

$ model_1
    |-- 1
    |   `-- model.so
    `-- config.pbtxt

Minimal Model Configuration

Every ONNX-MLIR backend model must provide a config.pbtxt file describing the model configuration. Below is a sample config.pbtxt for the ONNX-MLIR backend:

max_batch_size: 32
input {
  name: "IN0"
  data_type: TYPE_FP32
  dims: 5
  dims: 5
  dims: 1
}
output {
  name: "OUT0"
  data_type: TYPE_FP64
  dims: 1
}
backend: "onnxmlir"

Configuration Parameters:

  • Backend : The backend parameter must be provided as "onnxmlir" while utilising the ONNX-MLIR backend. For more details related to backends, see here

    backend: "onnxmlir"
    
  • Inputs and Outputs: Each model input and output must specify a name, datatype, and shape. The name specified for an input or output tensor must match the name expected by the model. For more details on input and output tensors, check the Triton Inference Server documentation here

For more options see Model Configuration.

NOTE: Multiple versions are supported; only positive values are supported as model versions.
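
As a hedged illustration, a client request matching the sample configuration above could look like the following. It assumes the tritonclient Python package (pip install tritonclient[http]); the model name model_1 and the server address are placeholders.

```python
# Sketch of an HTTP inference request to a model served by the onnxmlir
# backend (assumes `pip install tritonclient[http]`; the model name "model_1"
# and the server address are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape is [batch, 5, 5, 1] to match max_batch_size plus dims {5, 5, 1}.
data = np.random.rand(1, 5, 5, 1).astype(np.float32)

infer_input = httpclient.InferInput("IN0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(
    model_name="model_1",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUT0")],
)
print(result.as_numpy("OUT0"))
```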

Snap ML C++ Backend

Snap Machine Learning (Snap ML for short) is a library for training and scoring traditional machine learning models with end-to-end inferencing capabilities. As part of the preprocessing pipeline it supports Normalizer, KBinsDiscretizer, and one-hot encoding transformers. This backend supports importing tree ensemble models that were trained with other frameworks (e.g., scikit-learn, XGBoost, LightGBM), so one can leverage the Integrated On-Chip Accelerator on IBM Z and IBM LinuxONE transparently via Snap ML's accelerated inference engine. For more information on supported models, refer to the Snap ML documentation here.

Backend Usage Information

In order to deploy any model on Triton Inference Server, one should have a model repository and model configuration ready.

Model Configuration: The model configuration (config.pbtxt) in Triton Inference Server defines metadata, optimization settings, and customised parameters for each model. This configuration ensures models are served with optimal performance and tailored behaviour. For more details, see Model Configuration.

Models Repository

Like all Triton backends, models deployed via the Snap ML C++ Backend make use of a specially laid-out "model repository" directory containing at least one serialized model and a "config.pbtxt" configuration file.

Typically, a models directory looks like the following:

models
└── test_snapml_model
    ├── 1
    │   ├── model.pmml
    │   └── pipeline.json
    └── config.pbtxt

The example above is for the PMML format. Users should change the model format as per the chosen model framework.

pipeline.json: This file is optional and needs to be provided only when pre-processing is chosen.

Model Configuration

Every Snap ML C++ backend model must provide a config.pbtxt file describing the model configuration. Below is a sample config.pbtxt for the Snap ML C++ backend:

max_batch_size: 32
input {
  name: "IN0"
  data_type: TYPE_FP32
  dims: 5
}
output {
  name: "OUT0"
  data_type: TYPE_FP64
  dims: 1
}
instance_group  {
   count: 2
   kind: KIND_CPU
}
dynamic_batching {
  preferred_batch_size: 32
  max_queue_delay_microseconds: 25000
}
parameters {
  key: "MODEL_FILE_FORMAT"
  value {
    string_value: "pmml"
  }
}
backend: "ibmsnapml"

Configuration Parameters:

  • Backend : The backend parameter must be provided as "ibmsnapml" while utilising the Snap ML C++ backend.

    backend: "ibmsnapml"
    
  • PREPROCESSING_FILE : This configuration parameter specifies the name of the file to be used for preprocessing. The preprocessing file must be named 'pipeline.json' when preprocessing is selected for end-to-end inferencing.

    If this parameter is not provided, the backend assumes that preprocessing is not required, even if pipeline.json is present in the model repository.

    parameters {
      key: "PREPROCESSING_FILE"
      value {
        string_value: "pipeline.json"
      }
    }
    

    Note: The 'pipeline.json' file is the serialised preprocessing pipeline captured during training using the Snap ML API (export_preprocessing_pipeline(pipeline_xgb['preprocessor'], 'pipeline.json')). For more details, visit here.

  • SNAPML_TARGET_CLASS : This configuration parameter specifies which model class is to be imported by the Snap ML C++ backend, as per the Snap ML documentation here.

    parameters {
      key: "SNAPML_TARGET_CLASS"
      value {
        string_value: "BoostingMachineClassifier"
      }
    }
    
    

    For example, if the pre-trained model is 'sklearn.ensemble.RandomForestClassifier', then the target Snap ML class should be 'snapml.RandomForestClassifier'. In the case of the C++ backend, the same target class is referred to as 'RandomForestClassifier' as the value for SNAPML_TARGET_CLASS in config.pbtxt.

  • MODEL_FILE_FORMAT :
    The model file format provided in the configuration file should be one of the Snap ML supported model formats. Refer to the Snap ML documentation here

    parameters {
      key: "MODEL_FILE_FORMAT"
      value {
        string_value: "pmml"
      }
    }
    
    
  • NUM_OF_PREDICT_THREADS :
    This defines the number of CPU threads used for each inference call.

    parameters {
      key: "NUM_OF_PREDICT_THREADS"
      value {
        string_value: "12"
      }
    }
    
    
  • PREDICT_PROBABILITY : This configuration parameter controls whether the model's prediction is returned as a probability.

    If set to true, the model will return a probability value as the response. The probability value will always be for the positive class label. If set to false (or the parameter is omitted), the model will return the most likely predicted class label without probabilities; this is the default behaviour. Note that PREDICT_PROBABILITY is case sensitive and accepts only 'true' or 'false'. If any other value (such as True, False, 1, or 0) is provided, 'false' will be used by default.

    Note: The probability will always be for the positive class label.

    parameters {
     key: "PREDICT_PROBABILITY"
     value {
      string_value: "true"
     }
    }
    
  • Inputs and Outputs: Each model input and output must specify a name, datatype, and shape (for more details on input and output tensors, check the Triton Inference Server documentation here). The name specified for an input or output tensor must match the name expected by the model. An input shape indicates the shape of an input tensor expected by the model and by a Triton inference request. An output shape indicates the shape of an output tensor produced by the model and returned by Triton in response to an inference request. Both input and output shapes must have a rank greater than or equal to 1; that is, the empty shape [ ] is not allowed. In the case of preprocessing, the data type of the input must be TYPE_STRING and the output tensor can be either TYPE_FP64 or TYPE_FP32. For more details on the tensor data types supported by Triton Inference Server, visit here. A client-side sketch for the preprocessing case is shown below.
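
As a hedged illustration of the preprocessing case described above, the sketch below sends a TYPE_STRING input (which appears as the BYTES datatype on the client side). It assumes the tritonclient Python package (pip install tritonclient[http]); the model name, tensor names, input shape, and the comma-separated feature string are placeholders.

```python
# Sketch of sending a TYPE_STRING input to a Snap ML C++ backend model that
# uses preprocessing (end-to-end inferencing). The model name
# "test_snapml_model", the tensor names, the shape, and the feature string
# are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One batch entry containing a comma-separated raw feature row.
rows = np.array([["5.1,3.5,1.4,0.2"]], dtype=object)

# TYPE_STRING in config.pbtxt corresponds to the BYTES datatype on the client.
infer_input = httpclient.InferInput("IN0", list(rows.shape), "BYTES")
infer_input.set_data_from_numpy(rows)

result = client.infer(
    model_name="test_snapml_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUT0")],
)
print(result.as_numpy("OUT0"))
```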

REST APIs

Model Management

Triton Inference Server operates in one of three model control modes: NONE, EXPLICIT, or POLL. The model control mode determines how Triton Inference Server handles changes to the model repository and which protocols and APIs are available.

More details about model management can be found here

Model Repository

The model-repository extension allows a client to query and control the one or more model repositories being served by Triton Inference Server.

  • Index API
    • POST v2/repository/index
  • Load API
    • POST v2/repository/models/${MODEL_NAME}/load
  • unload API
    • POST v2/repository/models/${MODEL_NAME}/unload

For more details about the model repository index, load and unload API calls please visit the Triton documentation website link here
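
As an illustration, these repository endpoints can be exercised directly over HTTP. The sketch below uses the Python requests library (an assumption, not part of the container image); the server address and the model name model_1 are placeholders, and load/unload require the server to run in EXPLICIT model control mode.

```python
# Sketch of calling the model-repository extension endpoints with the Python
# `requests` library (assumed installed). Server address and the model name
# "model_1" are placeholders; the server must run in EXPLICIT model control
# mode for load/unload to be allowed.
import requests

BASE = "http://localhost:8000"

# List the models in the repository along with their state.
index = requests.post(f"{BASE}/v2/repository/index", json={})
print(index.json())

# Load (or reload) a model by name.
requests.post(f"{BASE}/v2/repository/models/model_1/load").raise_for_status()

# Unload a model by name.
requests.post(f"{BASE}/v2/repository/models/model_1/unload").raise_for_status()
```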

Model Configuration

The model configuration extension allows Triton Inference Server to return the model configuration for a given model.

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/config

For more details about the model configuration API calls please visit the Triton documentation website link here

Model Metadata

The per-model metadata endpoint provides the following details:

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]

  • name : The name of the model.
  • versions : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
  • platform : The framework/backend for the model. See Platforms.
  • inputs : The inputs required by the model.
  • outputs : The outputs produced by the model.

For more details about the model metadata API calls please visit the Triton documentation website link here
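
A hedged sketch of fetching these metadata fields over HTTP with the Python requests library (assumed installed) follows; the model name is a placeholder.

```python
# Sketch: fetch per-model metadata over HTTP with `requests` (assumed
# installed). The model name "model_1" is a placeholder.
import requests

resp = requests.get("http://localhost:8000/v2/models/model_1")
resp.raise_for_status()
meta = resp.json()

# Fields described above: name, versions, platform, inputs, outputs.
print(meta["name"], meta.get("versions"), meta["platform"])
for tensor in meta["inputs"] + meta["outputs"]:
    print(tensor["name"], tensor["datatype"], tensor["shape"])
```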

Health Check

The Health Check API provides the status of the Triton Inference Server, models, etc.

GET v2/health/live

GET v2/health/ready

GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready

For more details about the health API calls please visit the KServe documentation website link here

Inference

An inference request is made with an HTTP POST to an inference endpoint.

POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer

For more details about the inference API calls please visit the kserve documentation website link here
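
For illustration, an inference request can also be issued as a plain KServe v2 JSON payload. The sketch below uses the Python requests library (assumed installed); the model name, tensor names, shape, and data are placeholders and must match the deployed model's configuration.

```python
# Sketch of a raw KServe v2 inference request over HTTP/REST using the
# `requests` library (assumed installed). Model name, tensor names, shape,
# and data are placeholders and must match the deployed model's config.pbtxt.
import requests

payload = {
    "inputs": [
        {
            "name": "IN0",
            "shape": [1, 5],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4, 0.5],
        }
    ],
    "outputs": [{"name": "OUT0"}],
}

resp = requests.post("http://localhost:8000/v2/models/model_1/infer", json=payload)
resp.raise_for_status()
print(resp.json()["outputs"])
```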

  • Classification
    • POST /v2/models/<model_name>/infer --data <JSON_Data>

The classification extension allows Triton Inference Server to return an output as a classification index and (optional) label instead of returning the output as raw tensor data. Because this extension is supported, Triton Inference Server reports “classification” in the extensions field of its Server Metadata.

For more details about the classification API calls please visit the Triton documentation website link here

  • Binary data
    • POST /v2/models/<model_name>/infer

The binary tensor data extension allows Triton Inference Server to support tensor data represented in a binary format in the body of an HTTP/REST request.

For more details about the binary data please visit the Triton documentation website link here

Logging

Managing and troubleshooting machine learning models on Triton Inference Server can be effectively accomplished by configuring the logging settings and monitoring the logs.

Explore more command line options:

docker run <triton_inference_server_image>  tritonserver [options]

GET v2/logging

POST v2/logging

 {
   "logging": {
     "log-verbose": false,
     "log-info": true,
     "log-warning": true,
     "log-error": true,
     "log-file": "triton.log"
   }
 }

View the logs:

   tail -f /path/to/log/directory/triton.log

Logs will be written/overwritten into the file mentioned during the server runtime.

NOTE: Triton Inference Server allows the creation of 40 log files in total, 20 for each protocol (HTTP and gRPC).

For more details about the logging API calls please visit the Triton documentation website link here
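
For illustration, the logging settings can also be read and updated over HTTP. The sketch below uses the Python requests library (assumed installed) and avoids hard-coding field names; consult the Triton logging extension documentation for the exact settings schema.

```python
# Sketch of reading and updating Triton's logging settings over HTTP with
# `requests` (assumed installed). Field names come from whatever the server
# returns, so none are hard-coded here; consult the logging extension docs
# for the exact schema.
import requests

BASE = "http://localhost:8000"

# Read the current log settings.
settings = requests.get(f"{BASE}/v2/logging").json()
print(settings)

# Send the (possibly modified) settings back to the server.
resp = requests.post(f"{BASE}/v2/logging", json=settings)
resp.raise_for_status()
print(resp.json())
```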

Metrics Collection

Triton Inference Server provides Prometheus metrics indicating CPU and request statistics.

GET /metrics

For more details about the metrics collections please visit the Triton documentation website link here

Traces

The trace extension enables a client to fetch or configure trace settings for a given model while Triton Inference Server is running.

GET v2[/models/${MODEL_NAME}]/trace/setting

POST v2[/models/${MODEL_NAME}]/trace/setting

For more details about the trace API calls please visit the Triton documentation website link here

Statistics

GET v2/models[/${MODEL_NAME}[/versions/${MODEL_VERSION}]]/stats

For more details about the statistics API calls please visit the Triton documentation website link here

Server Metadata

The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint.

GET v2

For more details about the server metadata API calls please visit the KServe documentation website link here

Model Validation

Various models that were trained on x86 or IBM Z have demonstrated focused optimizations that transparently target the IBM Integrated Accelerator for AI for a number of compute-intensive operations during inferencing.

Note. Models that were trained on the x86 ecosystem may encounter endianness issues.

Using the Code Samples

Python Backend for IBM Snap ML : Random Forest Classifier

The purpose of this section is to provide details on how to deploy a Random Forest Classifier model trained with scikit-learn and deployed on Triton Inference Server leveraging Snap ML and the Python backend.

  1. As per the documentation, create the Python runtime.
  2. Train a scikit-learn Random Forest Classifier in Python. This will generate the model.pmml file.
  3. Create the model folder structure, including the relevant files needed for deployment with Triton Inference Server.
  4. Deploy the model. More details about the sample model are documented here.

ONNX-MLIR Backend : Convolutional Neural Network (CNN) with MNIST

The purpose of this section is to provide details on how to deploy an onnx-mlir model trained with CNTK (a deep learning toolkit) and deployed on Triton Inference Server leveraging the onnx-mlir backend: a Convolutional Neural Network with MNIST.

  1. The Jupyter notebook for the Convolutional Neural Network with MNIST can be found here
  2. Use step 1 to train the model on your own, or download a pre-trained model from here
  3. Use IBM zDLC to transform the ONNX model into a model.so file
  4. Import the compiled model into the model repository.
  5. Create the model folder structure, including the relevant files needed for deployment
  6. Deploy the model

More details about the sample model are documented here

ONNX-MLIR Backend : ONNX Model Zoo

See Verified for the list of models from the ONNX Model Zoo that have been built and verified with the IBM Z Deep Learning Compiler.

  1. Import the compiled model into the model repository.
  2. Create the model structure, including the relevant files needed for deployment.
  3. Deploy the model on Triton Inference Server.

Additional Topics

Supported Tensor Data Types

The supported tensor data types are shown in the following table, along with the size of each type in bytes.

| Data Type | Size (bytes)        |
|-----------|---------------------|
| BOOL      | 1                   |
| UINT8     | 1                   |
| UINT16    | 2                   |
| UINT32    | 4                   |
| UINT64    | 8                   |
| INT8      | 1                   |
| INT16     | 2                   |
| INT32     | 4                   |
| INT64     | 8                   |
| FP16      | 2                   |
| FP32      | 4                   |
| FP64      | 8                   |
| BYTES     | Variable (max 2^32) |

Triton response cache

The Triton response cache is used by Triton to hold inference results generated for previously executed inference requests; a cached result is sent as the response when a new inference request hits the cache.

More details about the Triton response cache are documented here

Repository agent (checksum_repository_agent)

A repository agent allows you to introduce code that performs authentication, decryption, conversion, or similar operations when a model is loaded.

More details about the repository agent (checksum_repository_agent) are documented here

Version policy

The following details show how the version policy is applied when models with different versions are available in the model repository.

Snap ML C++ and ONNX-MLIR backends with the -1 version

|-- model_1
|   |-- -1
|   |   `-- model.txt
|   `-- config.pbtxt
`-- model_2
    |-- -1
    |   `-- model.so
    `-- config.pbtxt

+---------+---------+----------------------------------------------------------------------------------------------------------------+
| Model   | Version | Status                                                                                                         |
+---------+---------+----------------------------------------------------------------------------------------------------------------+
| model_1 | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_1/18446744073709551615/model.txt' for model 'model_1'  |
| model_2 | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_2/18446744073709551615/model.so' for model 'model_2'   |
+---------+---------+----------------------------------------------------------------------------------------------------------------+

Python and onnx-mlir backend with the 0 version

|-- model_1
|   |-- 0
|   |   |-- model.py
|   |   `-- model.txt
|   `-- config.pbtxt
`-- model_2
    |-- 0
    |   `-- model.so
    `-- config.pbtxt

I0718 12:07:48.131561 1 server.cc:653
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| model_1   | 0       | READY  |
| model_2   | 0       | READY  |
+-----------+---------+--------+

Python and onnx-mlir backend with the 1 version

|-- model_1
|   |-- 1
|   |   |-- model.py
|   |   `-- model.txt
|   `-- config.pbtxt
`-- model_2
    |-- 1
    |   `-- model.so
    `-- config.pbtxt

+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| model_1   | 1       | READY  |
| model_2   | 1       | READY  |
+-----------+---------+--------+

Each model can have one or more versions. The ModelVersionPolicy property of the model configuration is used to set one of the following policies.

• All: Load all versions of the model. All versions of the model that are available in the model repository are available for inferencing. version_policy: { all: {}}

• Latest: Only the latest ‘n’ versions of the model in the repository are available for inferencing. The latest versions of the model are the numerically greatest version numbers. version_policy: { latest: { num_versions: 2}}

• Specific: Only the specifically listed versions of the model are available for inferencing. version_policy: { specific: { versions: [1,3]}}

If no version policy is specified, then Latest (with n=1) is used as the default, indicating that only the most recent version of the model is made available by Triton. In all cases, the addition or removal of version subdirectories from the model repository can change which model version is used on subsequent inference requests.

Version Policy check: All

Test backends with multiple versions along with -1

|-- model_1
|   |-- -1
|   |   `-- model.txt
|   |-- 4
|   |   |-- model.py
|   |   `-- model.txt
`-- model_2
    |-- -1
    |   `-- model.so
    |-- 3
        `-- model.so

+---------+---------+----------------------------------------------------------------------------------------------------------------+
| Model   | Version | Status                                                                                                         |
+---------+---------+----------------------------------------------------------------------------------------------------------------+
| model_1 | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_1/18446744073709551615/model.txt' for model 'model_1'  |
| model_1 | 4       | UNLOADING                                                                                                      |
| model_2 | -1      | UNAVAILABLE: Unavailable: unable to find '/models/model_2/18446744073709551615/model.so' for model 'model_2'   |
| model_2 | 3       | UNAVAILABLE: unloaded                                                                                          |
+---------+---------+----------------------------------------------------------------------------------------------------------------+

error: creating server: Internal - failed to load all models

Version Policy check: Latest

Use version_policy: { latest: { num_versions: 2}} to load the latest 2 versions of the model. The default is the highest version of the model.

|-- model_1
|   |-- -1
|   |   |-- model.py
|   |   `-- model.txt
|   |-- 13
|   |   |-- model.py
|   |   `-- model.txt
|   |-- 17
|   |   |-- model.py
|   |   `-- model.txt
`-- model_2
    |-- -1
    |   `-- model.so
    |-- 15
    |   `-- model.so
    |-- 3
        `-- model.so
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| model_1   | 13      | READY  |
| model_1   | 17      | READY  |
| model_2   | 9       | READY  |
| model_2   | 15      | READY  |
+-----------+---------+--------+

Version Policy check: Specific

For model_1, use version_policy: { specific: { versions: [4]}} to load specific versions of the model.

For model_2, use version_policy: { specific: { versions: [3,9]}} to load specific versions of the model.

+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| model_1   | 4       | READY  |
| model_2   | 3       | READY  |
| model_2   | 9       | READY  |
+-----------+---------+--------+

Model management examples

Single-model with Single version

-- model_1
    |-- 1
    |   |-- model.py
    |   |-- model.txt
    `-- config.pbtxt

Single-model with Multi version

`-- model_1
    |-- 1
    |   |-- model.py
    |   `-- model.txt
    |-- 2
    |   |-- model.py
    |   `-- model.txt
    `-- config.pbtxt

Multi-model with Single version

|-- model_1
|   |-- 1
|   |   |-- model.py
|   |   `-- model.txt
|   `-- config.pbtxt
|-- model_2
|   |-- 0
|   |   `-- model.so
|   `-- config.pbtxt

Multi-model with Multi versions

|-- model_1
|   |-- 1
|   |   |-- model.py
|   |   |-- model.txt
|   |-- 2
|   |   |-- model.py
|   |   |-- model.txt
|-- model_3
|   |-- 1
|   |   |-- model.py
|   |   |-- model_rfd10.pmml
|   |   |-- pipeline_rfd10.joblib
|   |-- 2
|   |   |-- model.py
|   |   |-- model_rfd10.pmml
|   |   |-- pipeline_rfd10.joblib

Limitations and Known Issues

  1. Consumers of Triton Inference Server may face errors related to the TYPE_STRING datatype when utilizing Triton Inference Server with a Python backend and an HTTP endpoint on a big-endian machine. For more details, see link

  2. Consumers of Triton Inference Server may face an issue when running Triton Inference Server on a big-endian machine, specifically related to gRPC calls with BYTES input. It appears that the current configuration may not fully support gRPC calls with BYTES input. For more details, see link

  3. Users of Triton Inference Server can restrict access to the protocols on a given endpoint by leveraging the configuration option '--grpc-restricted-protocol'. This feature provides fine-grained control over access to various endpoints by specifying protocols and associated restricted keys and values. A similar capability for restricting endpoint access for the HTTP protocol is currently not available. For more details, see link

  4. Consumers of Triton Inference Server can create up to 40 log files in total, 20 for the HTTP protocol and 20 for the gRPC protocol. For more details, see link

  5. Consumers of Triton Inference Server may face an issue when a model has version -1 or when model.py is not present for the Python backend. For more details, see link

Versions and Release cadence

IBM Z Accelerated for NVIDIA Triton™ Inference Server will follow the semantic versioning guidelines with a few deviations. Overall IBM Z Accelerated for NVIDIA Triton™ Inference Server follows a continuous release model with a cadence of 1-2 minor releases per year. In general, bug fixes will be applied to the next minor release and not back ported to prior major or minor releases. Major version changes are not frequent and may include features supporting new IBM Z hardware as well as major feature changes in Triton Inference Server that are not likely backward compatible.

IBM Z Accelerated for NVIDIA Triton™ Inference Server versions

Each release version of IBM Z Accelerated for NVIDIA Triton™ Inference Server has the form MAJOR.MINOR.PATCH (X.Y.Z). For example, IBM Z Accelerated for NVIDIA Triton™ Inference Server version 1.2.3 has MAJOR version 1, MINOR version 2, and PATCH version 3. Changes to each number have the following meaning:

MAJOR / VERSION

All releases with the same major version number will have API compatibility. Major version numbers will remain stable; for instance, 1.Y.Z may last 1 year or more. A new major release will potentially have backwards-incompatible changes: code and data that worked with a previous major release will not necessarily work with the new release.

Note: pybind11 PyRuntimes and any other Python packages for versions of Python that have reached end of life can be removed or updated to a newer stable version without a major release increase.

MINOR / FEATURE

Minor releases will typically contain new backward compatible features, improvements, and bug fixes.

PATCH / MAINTENANCE

Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general, these releases are designed to patch bugs.

Release cadence

Feature releases for IBM Z Accelerated for NVIDIA Triton™ Inference Server occur about every 6 months in general. Hence, IBM Z Accelerated for NVIDIA Triton™ Inference Server X.3.0 would generally be released about 6 months after X.2.0. Maintenance releases happen as needed in between feature releases. Major releases do not happen according to a fixed schedule.

Frequently Asked Questions

Q: Where can I get the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image?

Please visit this link here. Or read the section titled Downloading the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image.

Q: Where can I run the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image?

You may run the IBM Z Accelerated for NVIDIA Triton™ Inference Server container image on IBM Linux on Z or IBM® z/OS® Container Extensions (IBM zCX).

Note. The IBM Z Accelerated for NVIDIA Triton™ Inference Server will transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. However, if using the IBM Z Accelerated for NVIDIA Triton™ Inference Server on either an IBM z15® or an IBM z14®, IBM Snap ML or ONNX-MLIR will transparently target the CPU with no changes to the model.

Q: What are the different errors that can arise while using Triton Inference Server?

| Error Type | Description |
|------------|-------------|
| Model load errors | These errors occur when the server fails to load the machine learning model. Possible reasons include incorrect model configuration, incompatible model format, or missing model files. |
| Backend errors | Triton supports multiple backends for running models, such as Python and ONNX-MLIR. Errors can occur if there are issues with the backend itself, such as version compatibility problems or unsupported features. |
| Input data errors | When sending requests to the Triton server, issues might arise with the input data provided by the client. This could include incorrect data types, shape mismatches, or missing required inputs. Any valid request batched with an invalid request might lead to an inaccurate or invalid response from the inference server for the entire batch. |
| Inference errors | Errors during the inference process can happen due to problems with the model's architecture or issues within the model's code. |
| Resource errors | Triton uses system resources like CPU and memory to perform inference. Errors can occur if there are resource allocation problems or resource constraints are not handled properly. |
| Networking errors | Triton is a server that communicates with clients over the network. Network-related issues such as timeouts, connection problems, or firewall restrictions can lead to errors. |
| Configuration errors | Misconfigurations in the Triton server settings or environment variables could result in unexpected behavior or failures. |
| Scaling errors | When deploying Triton in a distributed or multi-instance setup, errors can occur due to load balancing issues, communication problems between instances, or synchronization failures. |

Technical Support

Information regarding technical support can be found here.

Licenses

The International License Agreement for Non-Warranted Programs (ILAN) agreement can be found here

The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

NVIDIA and Triton are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein.

IBM, the IBM logo, and ibm.com, IBM z16, IBM z15, IBM z14 are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks can be found here.
