
Torch-TensorRT v2.3.0

narendasan released this 07 Jun 21:49
52ba6f1

Windows Support, Dynamic Shape and Quantization in Dynamo, PyTorch 2.3, CUDA 12.1, TensorRT 10.0

Torch-TensorRT 2.3.0 targets PyTorch 2.3, CUDA 12.1 (builds for CUDA 11.8 are available via the PyTorch package index: https://download.pytorch.org/whl/cu118), and TensorRT 10.0. 2.3.0 adds official support for Windows as a platform. On Windows, only the Dynamo frontend is supported, and users are currently required to use the Python-only runtime (support for the C++ runtime will be added in a future version). This release also adds support for dynamic shapes without recompilation. Users can now also run quantized models with Torch-TensorRT using the NVIDIA TensorRT Model Optimizer toolkit (https://github.com/NVIDIA/TensorRT-Model-Optimizer).

Note: Python 3.12 is not supported, as the Dynamo stack in PyTorch 2.3.0 does not support Python 3.12.

Windows

In this release we introduce Windows support for the Python runtime using the Dynamo paths. Users can now directly optimize PyTorch models with TensorRT on Windows with minimal code changes. This integration enables Python-only optimization in the Torch-TensorRT Dynamo compilation paths (ir="dynamo" and ir="torch_compile").

import torch
import torch_tensorrt
import torchvision.models as models

# Load a pretrained model and move it to the GPU
model = models.resnet18(pretrained=True).eval().to("cuda")
input = torch.randn((1, 3, 224, 224)).to("cuda")

# Compile with the Dynamo frontend; on Windows this uses the Python-only runtime
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input])
trt_mod(input)

Dynamic Shaped Model Compilation in Dynamo

Dynamic shape support has become more robust in v2.3.0. Torch-TensorRT now leverages symbolic information in the graph to calculate intermediate shape ranges, which allows more dynamic shape cases to be supported. For AOT workflows using torch.export, these new features require no changes. For JIT workflows, which previously relied on torch.compile guards to automatically recompile engines when input sizes change, users can now mark dynamic dimensions using the torch APIs (https://pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html). With these APIs, as long as inputs do not violate the specified constraints, the engines will not be recompiled.

AOT workflow
import torch
import torch_tensorrt

# model defined as in the example above
compile_spec = {
    "inputs": [
        torch_tensorrt.Input(
            min_shape=(1, 3, 224, 224),
            opt_shape=(4, 3, 224, 224),
            max_shape=(8, 3, 224, 224),
            dtype=torch.float32,
        )
    ],
    "enabled_precisions": {torch.float},
}
trt_model = torch_tensorrt.compile(model, **compile_spec)
JIT workflow
import torch
import torch_tensorrt

# model defined as in the example above
compile_spec = {"enabled_precisions": {torch.float}}
inputs = torch.randn((4, 3, 224, 224)).to("cuda")
# This indicates that dimension 0 is dynamic and its range is [1, 8]
torch._dynamo.mark_dynamic(inputs, 0, min=1, max=8)
trt_model = torch.compile(model, backend="tensorrt", options=compile_spec)
trt_model(inputs)  # triggers compilation; inputs within the marked range will not recompile

More information can be found here: https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html

Explicit Dynamic Shape support in Converters

Converters now explicitly declare their support for dynamic shapes, and we are progressively adding and verifying dynamic shape support across the converter library. Converter writers can specify support for dynamic shapes using the supports_dynamic_shapes argument of the dynamo_tensorrt_converter decorator.

@dynamo_tensorrt_converter(
    torch.ops.aten.convolution.default,
    capability_validator=lambda conv_node: conv_node.args[7] in ([0], [0, 0], [0, 0, 0]),
    supports_dynamic_shapes=True,
)  # type: ignore[misc]
def aten_ops_convolution(
    ctx: ConversionContext,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    ...

By default, if a converter has not been marked as supporting dynamic shapes, its operator will be run in PyTorch when the user has specified dynamic inputs. This ensures that compilation succeeds with some valid compiled module. However, many operators already support dynamic shapes in an untested fashion, so users can opt to enable the full converter library for dynamic shapes using the assume_dynamic_shape_support flag. This flag treats all converters as supporting dynamic shapes, leading to more operations being run in TensorRT, with the potential drawback that some ops may cause compilation or runtime failures. Future releases will progressively add verified dynamic shape coverage for all Core ATen operators. A minimal sketch of opting in is shown below.
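The following sketch assumes the flag is passed as a keyword argument like other Dynamo compilation settings, and reuses the dynamic-shape Input specification from the AOT workflow above; it is illustrative rather than a verbatim recipe.

import torch
import torch_tensorrt

# model defined as in the examples above
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[torch_tensorrt.Input(min_shape=(1, 3, 224, 224),
                                 opt_shape=(4, 3, 224, 224),
                                 max_shape=(8, 3, 224, 224),
                                 dtype=torch.float32)],
    # Treat every converter as dynamic-shape capable, including unverified ones
    assume_dynamic_shape_support=True,
)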

Quantization in Dynamo

We introduce support for model quantization in FP8. Models quantized using the NVIDIA TensorRT Model Optimizer toolkit are supported. This toolkit inserts quantization nodes into the graph, which are converted and used by TensorRT to quantize the model to lower precision. Although the toolkit supports quantization in various data types, only FP8 is supported in this release.
Please refer to our end-to-end example, Torch Compile VGG16 with FP8 and PTQ, for how to use this.
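As a rough sketch (assuming the Model Optimizer toolkit is installed as the modelopt package and that the user supplies a calibration forward loop; names such as model, input, and calibrate_loop are placeholders), the flow looks roughly like the following. See the linked example for the exact, supported workflow.

import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer

# Insert FP8 quantization nodes into the model, calibrating with user-provided data
mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate_loop)

with torch.no_grad():
    # The linked example additionally wraps export in modelopt's export_torch_mode() context
    exp_program = torch.export.export(model, (input,))
    trt_model = torch_tensorrt.dynamo.compile(
        exp_program,
        inputs=[input],
        enabled_precisions={torch.float8_e4m3fn},
    )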

Engine Version and Hardware Compatibility

We introduce new compilation arguments, hardware_compatible: bool and version_compatible: bool, which enable two key features in TensorRT.

hardware_compatible

Enabling hardware compatibility mode will generate TRT Engines which are compatible with Ampere and newer GPUs. As a result, engines built on one GPU can later be run on others, without requiring recompilation.

version_compatible

Enabling version compatibility mode will generate TRT Engines which are compatible with newer versions of TensorRT. As a result, engines built with one version of TensorRT will be forward compatible with other TRT versions, without needing recompilation.

...
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=[input], hardware_compatible=True, version_compatible=True)
...

New Data Type Support

Torch-TensorRT includes a number of new data types that leverage dedicated hardware on Ampere, Hopper and future architectures.
bfloat16 has been added as a supported type alongside FP16 and FP32 and can be enabled for additional kernel tactic options. Models that contain BF16 weights can now be provided to Torch-TensorRT without modification. FP8 has been added, with support for Hopper and newer architectures, as a new quantization format (see above), similar to INT8. Finally, native support for INT64 inputs and computation has been added. In the past, the truncate_long_and_double feature flag had to be enabled in order to handle INT64 and FLOAT64 computation, inputs, and weights; this flag caused the compiler to truncate any INT64 or FLOAT64 objects to INT32 and FLOAT32, respectively. Now INT64 objects are no longer truncated and remain in INT64. As such, the truncate_long_and_double flag has been renamed to truncate_double, since FLOAT64 truncation is still required; truncate_long_and_double is now deprecated.
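For illustration, a minimal sketch (reusing the hypothetical model and input from the earlier examples) of enabling BF16 as an additional precision and using the renamed flag:

import torch
import torch_tensorrt

# model and input defined as in the examples above
trt_mod = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input],
    # BF16 can now be enabled alongside FP32/FP16 for additional kernel tactics
    enabled_precisions={torch.float32, torch.bfloat16},
    # Replaces the deprecated truncate_long_and_double; FLOAT64 is still truncated to FLOAT32
    truncate_double=True,
)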

What's Changed

New Contributors

Full Changelog: v2.2.0...v2.3.0