Torch-TensorRT v1.3.0
PyTorch 1.13, CUDA 11.7, TensorRT 8.5, Support for Dynamic Batch for Partially Compiled Modules, Engine Profiling, Experimental Unified Runtime for FX and TorchScript Frontends
Torch-TensorRT 1.3.0 targets PyTorch 1.13, CUDA 11.7, cuDNN 8.5 and TensorRT 8.5. This release focuses on adding support for dynamic batch sizes for partially compiled modules using the TorchScript frontend (this is also supported with the FX frontend). It also introduces a new execution profiling utility to understand the execution of specific engine sub-blocks, which can be used in conjunction with PyTorch profiling tools to understand the performance of your model post compilation. Finally, this release introduces a new experimental unified runtime shared by both the TorchScript and FX frontends. This allows you to start using the FX frontend to generate `torch.jit.trace`-able compiled modules.
Dynamic Batch Sizes for Partially Compiled Modules via the TorchScript Frontend
A long-standing limitation of the partitioning system in the TorchScript frontend is the lack of support for dynamic shapes. In this release we address a major subset of these use cases with support for dynamic batch sizes for modules that will be partially compiled. Usage is the same as in the fully compiled workflow: using the `torch_tensorrt.Input` class, you may define the range of shapes that an input may take during runtime. This is represented as a set of three shape sizes: `min`, `opt` and `max`. `min` and `max` define the dynamic range of the input Tensor. `opt` informs TensorRT what size to optimize for, provided there are multiple valid kernels available. TensorRT will select kernels that are valid for the full range of input shapes but most efficient at the `opt` size. In this release, partially compiled module inputs can vary in shape for the highest-order dimension.
For example:
min_shape: (1, 3, 128, 128)
opt_shape: (8, 3, 128, 128)
max_shape: (32, 3, 128, 128)
is a valid shape range. However:
min_shape: (1, 3, 128, 128)
opt_shape: (1, 3, 256, 256)
max_shape: (1, 3, 512, 512)
is still not supported.
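For instance, compiling through the TorchScript frontend with a dynamic batch dimension might look like the following minimal sketch (`MyModel`, the precision and the shapes are placeholders for illustration):

import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # placeholder module for illustration

trt_mod = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 3, 128, 128),
            opt_shape=(8, 3, 128, 128),
            max_shape=(32, 3, 128, 128),
            dtype=torch.float,
        )
    ],
    enabled_precisions={torch.float},
)

# Any batch size between 1 and 32 is now valid at runtime
out = trt_mod(torch.randn(16, 3, 128, 128).cuda())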
Engine Profiling [Experimental]
This release introduces a number of profiling tools to measure the performance of TensorRT sub-blocks in compiled modules. This can be used in conjunction with PyTorch profiling tools to get a picture of the performance of your model. Profiling for any particular sub-block can be enabled by the `enable_profiling()` method of any `__torch__.classes.tensorrt.Engine` attribute, or of any `torch_tensorrt.TRTModuleNext`. The profiler will dump trace files in `/tmp` by default, though this path can be customized either by setting the `profile_path_prefix` of `__torch__.classes.tensorrt.Engine` or via an argument to `torch_tensorrt.TRTModuleNext.enable_profiling(profiling_results_dir="")`. Traces can be visualized using the Perfetto tool (https://perfetto.dev).
Engine layer information can also be accessed using `get_layer_info`, which returns a JSON string describing the layers and fusions that the engine contains.
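As a rough sketch of the workflow (assuming `trt_model` is an FX-compiled module containing `TRTModuleNext` submodules, `example_input` is a suitable CUDA tensor, and the output directory is illustrative):

from torch_tensorrt import TRTModuleNext

# Enable profiling on every TensorRT sub-block in the compiled module
for name, submod in trt_model.named_modules():
    if isinstance(submod, TRTModuleNext):
        submod.enable_profiling(profiling_results_dir="/tmp/trt_profiles")
        # Layer / fusion information for the underlying engine, as a JSON string
        print(submod.get_layer_info())

# Running inference writes trace files to /tmp/trt_profiles,
# which can be opened in Perfetto (https://perfetto.dev)
out = trt_model(example_input)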
Unified Runtime for FX and TorchScript Frontends [Experimental]
In previous versions of Torch-TensorRT, the FX and TorchScript frontends were mostly separate, and each had its own distinct benefits and limitations. Torch-TensorRT 1.3.0 introduces a new unified runtime to support both FX and TorchScript, meaning that you can choose the compilation workflow that makes the most sense for your particular use case, be it pure Python conversion via FX or C++ TorchScript compilation. Both frontends use the same primitives to construct their compiled graphs, whether fully or partially compiled.
Basic Usage
The TorchScript frontend uses the new runtime by default. No additional workflow changes are necessary.
Note: The runtime ABI version was increased to support this feature; as such, models compiled with previous versions of Torch-TensorRT will need to be recompiled.
For the FX frontend, the new runtime can be chosen by setting `use_experimental_fx_rt=True` as part of your compile settings, either via `torch_tensorrt.compile(my_mod, ir="fx", use_experimental_fx_rt=True, explicit_batch_dimension=True)` or via `torch_tensorrt.fx.compile(my_mod, use_experimental_fx_rt=True, explicit_batch_dimension=True)`.
Note: The new runtime only supports the explicit batch dimension.
TRTModuleNext
The FX frontend will return a `torch.nn.Module` containing `torch_tensorrt.TRTModuleNext` submodules instead of `torch_tensorrt.fx.TRTModule`s. The features of these modules are nearly identical, but with a few key improvements:
- `TRTModuleNext` profiling dumps a trace visualizable with Perfetto (see above for more details).
- `TRTModuleNext` modules are `torch.jit.trace`-able, meaning you can save FX compiled modules as TorchScript for Python-less / C++ deployment scenarios. Traced compiled modules have the same deployment instructions as compiled modules produced by the TorchScript frontend.
- `TRTModuleNext` maintains the same serialization workflows `TRTModule` supports as well (state_dict / extra_state, torch.save / torch.load).
Examples
model_fx = model_fx.cuda()
inputs_fx = [i.cuda() for i in inputs_fx]
trt_fx_module_f16 = torch_tensorrt.compile(
model_fx,
ir="fx",
inputs=inputs_fx,
enabled_precisions={torch.float16},
use_experimental_fx_rt=True,
explicit_batch_dimension=True
)
# Save model using torch.save
torch.save(trt_fx_module_f16, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
# Trace and save the FX module in TorchScript
scripted_fx_module = torch.jit.trace(trt_fx_module_f16, example_inputs=inputs_fx)
scripted_fx_module.save("/tmp/scripted_fx_module.ts")
scripted_fx_module = torch.jit.load("/tmp/scripted_fx_module.ts")
... # Get a handle to a TRTModuleNext submodule, e.g. trt_mod
# Extract state dictionary
st = trt_mod.state_dict()
# Load the state dict into a new module
new_trt_mod = TRTModuleNext()
new_trt_mod.load_state_dict(st)
Using TRTModuleNext as an arbitrary TensorRT engine holder
Using TorchScript, you have long been able to embed an arbitrary TensorRT engine from any source in a TorchScript module using `torch_tensorrt.ts.embed_engine_in_new_module`. Now you can do this at the `torch.nn.Module` level by directly using `TRTModuleNext` and access all the benefits enumerated above.
trt_mod = TRTModuleNext(
serialized_engine,
name="TestModule",
input_binding_names=input_names,
output_binding_names=output_names,
)
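Since the result is a standard `torch.nn.Module`, the wrapped engine can be called and traced like any other module; continuing from the construction above (the input shape is hypothetical and must match the engine's input binding):

inp = torch.randn(1, 3, 224, 224).cuda()  # hypothetical input matching the engine's binding
out = trt_mod(inp)

# The module can also be traced and saved as TorchScript for deployment
ts_mod = torch.jit.trace(trt_mod, [inp])
ts_mod.save("/tmp/embedded_engine.ts")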
The intention is for `torch_tensorrt.TRTModuleNext` to replace `torch_tensorrt.fx.TRTModule` as the default TensorRT module implementation in a future release. Feedback on this class and how it is used, on the runtime in general, or on associated features (profiler, engine inspector) is welcome.
What's Changed
- chore: Bump version to 1.2.0a0 by @narendasan in #1044
- feat: Extending nox for cxx11 ABI version by @andi4191 in #1013
- docs: Update the documentation theme to PyTorch by @narendasan in #1063
- Adding Code of Conduct file by @facebook-github-bot in #1061
- Update CONTRIBUTING.md by @frank-wei in #1064
- feat: Optimize hub.py download by @andi4191 in #1022
- Adding an action to automatically assign reviewers and assignees by @narendasan in #1078
- Add PR assigner support by @narendasan in #1080
- (//core): Align with prim::Enter in module fallback by @andi4191 in #991
- (//core): Added a variant for aten::split by @andi4191 in #992
- feat(nox): Replacing session with environment variable by @andi4191 in #1057
- Refactor the internal codebase from fx2trt_oss to torch_tensorrt by @frank-wei in #1104
- format by buildifier by @frank-wei in #1106
- [fx2trt] Modify lower setting class by @frank-wei in #1107
- Modified the notebooks directory's README file by @svenchilton in #1102
- [FX] Sync to OSS by @frank-wei in #1118
- [fx_acc] Add acc_tracer support for torch.mm by @khabinov in #1120
- Added Triton deployment instructions to documentation by @tanayvarshney in #1116
- amending triton deployment docs by @tanayvarshney in #1126
- fix: Update broken repo hyperlink by @lamhoangtung in #1131
- fix: Fix keep_dims functionality for aten::max by @peri044 in #1099
- fix(tests/core/partitioning): Fix tests of refactoring segmentation in partitioning by @peri044 in #1140
- feat(//tests): Update rtol and atol based tolerance for test cases by @andi4191 in #1055
- doc: add the explanation for partition phases on docs by @bowang007 in #1090
- feat (//cpp): Using atol and rtol based tolerance threshold for torchtrtc by @andi4191 in #1052
- CI/CD setup by @frank-wei in #1137
- Update README.md by @frank-wei in #1142
- [fx2trt] Engineholder feature improvement, test fixes by @frank-wei in #1143
- feat (//core/conversion) : Add converter for torch.bitwise_not by @blchu in #1029
- fixed typos by @tanayvarshney in #1098
- [FX] --fx-only does not need to check bazel by @frank-wei in #1147
- [FX] refactor the fx path in compile function by @frank-wei in #1141
- [FX] Create getting_started_with_fx_path.rst by @frank-wei in #1145
- [FX] move example folder by @frank-wei in #1149
- [FX] Sync enhancement done internally at Meta by @yinghai in #1161
- Update config.yml by @frank-wei in #1163
- Use py3 next() syntax by @ptrblck in #1159
- Add missing comma for proper torch versioning in setup.py by @dabauxi in #1164
- [docs] Update link to relative path by @zhiqwang in #1171
- [FX] Changes done internally at Facebook by @frank-wei in #1172
- fix: fix the model name typo error by @bowang007 in #1176
- [FX] Changes done internally at Facebook by @frank-wei in #1178
- [feat]: support slice with dynamic shape by @inocsin in #1110
- [FX] Update getting_started_with_fx_path.rst by @frank-wei in #1184
- [FX] Update README.md by @frank-wei in #1183
- fix: Fix PTQ calibration when there are multiple inputs by @peri044 in #1191
- [FX] Changes done internally at Facebook by @frank-wei in #1194
- [fix]: fix bug in aten::to, when network only have aten::to layer wil… by @inocsin in #1108
- Add .circleci/config.yml by @narendasan in #1153
- feat: Upgrade TRT to 8.4 by @peri044 in #1152
- feat: Update Pytorch version to 1.12 by @peri044 in #1177
- fix: converter renaming already named tensors by @bowang007 in #1167
- feat(//py): Use TensorRT to fill in .so libraries automatically if possible by @narendasan in #1085
- [FX] Changes done internally at Facebook by @frank-wei in #1204
- fix: fix the parsing related model loading bug by @bowang007 in #1148
- feat: support min_block_size != 1 caused fallback nodes re-segmentation by @bowang007 in #1195
- [FX] Changes done internally at Facebook by @frank-wei in #1208
- fix: fix the fallback related issue after merging collection by @bowang007 in #1206
- Add CMake support to build the libraries by @gcuendet in #1058
- Fix typo in EfficientNet-example by @davinnovation in #1217
- fix: fix bug that ListConstruct in TRT subgraph when it's entire graph's output by @bowang007 in #1220
- fix: fix the error that collection input segmented into trt subgraph by @bowang007 in #1225
- feat(//circleci): Adding release automation by @narendasan in #1215
- fix: support int tensor * int scaler in aten::mul by @mfeliz-cruise in #1095
- [FX] Changes done internally at Facebook by @frank-wei in #1221
- Fix errors in unbind and list slice by @mfeliz-cruise in #1088
- Adding a Resnet C++ example by @vinhngx in #1175
- [FX] disable 2 of conv3d and type_as tests by @frank-wei in #1224
- [feat] Add support for integers in aten::abs converter (#35) by @mfeliz-cruise in #1232
- Update PTQ example to fix new compile_spec requirements by @ncomly-nvidia in #1242
- feat: support for grouped inputs by @narendasan in #1201
- feat: Added support for custom torch operators and converters in torchtrtc by @andi4191 in #1219
- Add outputPadding in deconv by @ruoqianguo in #1234
- chore: Apply linting and ignore new bazel dirs by @narendasan in #1223
- added qat-ptq workflow notebook by @tanayvarshney in #1239
- fix: Update cmake for the new collection files by @narendasan in #1246
- chore: ignore dist dir for pre-commit by @narendasan in #1249
- chore: Aligning bazel version for consistency across different docker… by @andi4191 in #1250
- refactor: Changed the hardcoded values to macros for DLA memory sizes by @andi4191 in #1247
- chore: update jetson pytorch baase by @narendasan in #1251
- [feat] Add automatic type promotion to element-wise ops by @mfeliz-cruise in #1240
- Assorted small fixes by @narendasan in #1259
- [FX] remove op_lowering_disallow_list and format revert by @frank-wei in #1261
- fix: fix the "schema not found for node" error by @bowang007 in #1236
- chore: Fix contributing doc by @peri044 in #1268
- feat: support scatter.value and scatter.src by @inocsin in #1252
- Internal workspace workflow by @narendasan in #1269
- Fix typo in README by @davinnovation in #1273
- Support swin/bert with dynamic batch by @Njuapp in #1270
- correct sha256sum of cudnn by @Njuapp in #1278
- Jetson workspace by @narendasan in #1280
- chore(deps): bump @actions/core from 1.8.2 to 1.9.1 in /.github/actions/assigner by @dependabot in #1287
- [FX] Changes done internally at Facebook by @frank-wei in #1288
- chore: Fix dataloader in finetune_qat script by @andi4191 in #1292
- chore: Truncate long and double for ptq CPP path by @andi4191 in #1291
- feat: Add support for aten::square by @mfeliz-cruise in #1286
- fix: fix misleading skipping partitioning msg by @bowang007 in #1289
- fix: Add int support to constant_pad_nd by @mfeliz-cruise in #1283
- fix: Resolve non-determinism in registerSegmentsOutputs by @mfeliz-cruise in #1284
- docs: Update docgen task by @narendasan in #1294
- update fx notebook by @frank-wei in #1297
- [FX] Changes done internally at Facebook by @frank-wei in #1299
- fix(tools): Fix linter to not depend on docker by @narendasan in #1301
- Support multiple indices for aten::index.Tensor by @ruoqianguo in #1309
- chore: Adding CMake to the CI by @narendasan in #1310
- feat: Upgrade Pytorch to 1.12.1 and TensorRT to 8.4.3.1 by @peri044 in #1315
- Fix bug: correct the output shape of aten::index.Tensor by @ruoqianguo in #1314
- feat (//core/conversion) : Add converter for torch.repeat_interleave ( by @blchu in #1313
- chore: Adding NGC build path by @narendasan in #1311
- Update lower.py by @frank-wei in #1324
- fix!: Fixed Windows compilation failures by @andi4191 in #1330
- [feat] Add support for argmax and argmin by @mfeliz-cruise in #1312
- chore: Adding a guideline to build on Windows platform by @andi4191 in #1337
- chore: Fix data loader issues and nox file paths by @peri044 in #1281
- feat(//tools/perf): Refactor perf_run.py, add fx2trt backend support, usage via CLI arguments by @peri044 in #1254
- refactor(//tests) : Refactor the test suite by @peri044 in #1329
- [feat] add support for aten::reciprocal(int) by @mfeliz-cruise in #1308
- [FX] Update getting_started_with_fx_path.rst by @frank-wei in #1342
- Update getting_started_with_fx_path.rst by @frank-wei in #1343
- enable direct call to fx.compile() by @frank-wei in #1344
- fix: add remove_exception pass from torch to fix uninitialized tensor… by @bowang007 in #1345
- chore: apply linting to docs by @narendasan in #1347
- docs: Adding v1.2.0 and v1.1.1 docs by @narendasan in #1349
- Docs for release by @narendasan in #1350
- fix: Fixing pybind error on nightly by @andi4191 in #1285
- Centralizing Partitioning State by @narendasan in #1263
- chore: Fix centralized partititoning by @peri044 in #1367
- chore: Move master to test nightly only by @narendasan in #1370
- [fix] Avoid layer name conflicts in aten::index by @mfeliz-cruise in #1377
- [fix] Fix output dimensions of aten::unbind converter by @mfeliz-cruise in #1373
- Einsum converter by @gs-olive in #1385
- Atan2 converter by @gs-olive in #1381
- [FX] aten2trt and some pass fixes by @frank-wei in #1390
- feat: Add converter for aten::sign unary op by @gs-olive in #1391
- Add support for aten::squeeze without a dim by @mfeliz-cruise in #1393
- [fix] incorrect casting behavior in floor_divide by @mfeliz-cruise in #1392
- chore: minor fixes by @peri044 in #1397
- fix: torch.std and torch.var support multi-dimensional reductions by @gs-olive in #1395
- fix: fix missing float type in shape analysis by @bowang007 in #1399
- feat: Rsqrt lowering pass by @gs-olive in #1394
- Add correct pip install instructions by @msaroufim in #1400
- fix: aten::split behavior with negative indexing by @gs-olive in #1403
- fix: fix compilation stuck bug caused by elimination exception by @bowang007 in #1409
- [FX] Fix clamping float32 boundary values, aten2trt init check-in, fix slice issues by @frank-wei in #1415
- [feat]Add converter for aten::where by @mfeliz-cruise in #1421
- [feat]Add converter support for aten::frobenius_norm by @mfeliz-cruise in #1422
- chore: Update torch installation paths for NGC by @peri044 in #1435
- [feat] Add dependency awareness to torch-trt partitioning by @mfeliz-cruise in #1304
- docs: minor changes in Resnet50 example by @przemb in #1427
- fix: Ensure proper type inheritance in aten::masked_fill by @gs-olive in #1430
- chore: Nox file update from NGC 22.11 release by @peri044 in #1438
- fix: Add check to ensure einsum converter has no more than 2 tensor inputs by @gs-olive in #1439
- [feat] Add partial converter support for aten::linalg_norm by @mfeliz-cruise in #1426
- chore: Lint noxfile.py by @gs-olive in #1443
- fix: CUDA error 710 bugfix by @gs-olive in #1424
- scalar_to_tensor avoid scalar.to() by @Njuapp in #1448
- feat: rewriting param to a Constant if it's a introduced input by @bowang007 in #1298
- feat: support int64 <=> int32 auto conversion by @bowang007 in #1407
- fix: Device casting issues with certain aten operators by @gs-olive in #1416
- feat(//core/partitioning) : Dynamic shapes + fallback by @peri044 in #1414
- [fix] unmangle_cls_name for variable length mangled tags by @mfeliz-cruise in #1454
- fix: Error with aten::div when using truncation with Int32 tensor inputs by @gs-olive in #1442
- fix: fix failed test cases caused by partition API changes by @bowang007 in #1460
- fix: Update floor division schema replacement in lowering by @gs-olive in #1464
- feat: Add functionality to performance tooling by @gs-olive in #1451
- Unifying the FX and TS Frontends by @narendasan in #1404
New Contributors
- @facebook-github-bot made their first contribution in #1061
- @frank-wei made their first contribution in #1064
- @khabinov made their first contribution in #1120
- @blchu made their first contribution in #1029
- @yinghai made their first contribution in #1161
- @ptrblck made their first contribution in #1159
- @dabauxi made their first contribution in #1164
- @zhiqwang made their first contribution in #1171
- @gcuendet made their first contribution in #1058
- @davinnovation made their first contribution in #1217
- @dependabot made their first contribution in #1287
- @msaroufim made their first contribution in #1400
- @przemb made their first contribution in #1427
Full Changelog: v1.1.0...v1.3.0