Torch-TensorRT v1.3.0
PyTorch 1.13, CUDA 11.7, TensorRT 8.5, Support for Dynamic Batch for Partially Compiled Modules, Engine Profiling, Experimental Unified Runtime for FX and TorchScript Frontends
Torch-TensorRT 1.3.0 targets PyTorch 1.13, CUDA 11.7, cuDNN 8.5 and TensorRT 8.5. This release focuses on adding support for dynamic batch sizes for partially compiled modules using the TorchScript frontend (this is also supported with the FX frontend). It also introduces a new execution profiling utility to understand the execution of specific engine sub-blocks, which can be used in conjunction with PyTorch profiling tools to understand the performance of your model post compilation. Finally, this release introduces a new experimental unified runtime shared by both the TorchScript and FX frontends. This allows you to start using the FX frontend to generate `torch.jit.trace`-able compiled modules.
Dynamic Batch Sizes for Partially Compiled Modules via the TorchScript Frontend
A long-standing limitation of the partitioning system in the TorchScript frontend is the lack of support for dynamic shapes. In this release we address a major subset of these use cases with support for dynamic batch sizes for modules that will be partially compiled. Usage is the same as in the fully compiled workflow: using the `torch_tensorrt.Input` class, you may define the range of shapes that an input may take during runtime. This is represented as a set of three shape sizes: `min`, `opt` and `max`. `min` and `max` define the dynamic range of the input Tensor. `opt` informs TensorRT what size to optimize for, provided there are multiple valid kernels available. TensorRT will select kernels that are valid for the full range of input shapes but most efficient at the `opt` size. In this release, partially compiled module inputs can vary in shape for the highest-order dimension.
For example:
min_shape: (1, 3, 128, 128)
opt_shape: (8, 3, 128, 128)
max_shape: (32, 3, 128, 128)
is a valid shape range. However:
min_shape: (1, 3, 128, 128)
opt_shape: (1, 3, 256, 256)
max_shape: (1, 3, 512, 512)
is still not supported.
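For instance, compiling through the TorchScript frontend with a dynamic batch dimension might look like the following minimal sketch (`MyModel`, the precision and the shapes are placeholders for illustration):

import torch
import torch_tensorrt

model = MyModel().eval().cuda()  # placeholder module for illustration

trt_mod = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 3, 128, 128),
            opt_shape=(8, 3, 128, 128),
            max_shape=(32, 3, 128, 128),
            dtype=torch.float,
        )
    ],
    enabled_precisions={torch.float},
)

# Any batch size between 1 and 32 is now valid at runtime
out = trt_mod(torch.randn(16, 3, 128, 128).cuda())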
Engine Profiling [Experimental]
This release introduces a number of profiling tools to measure the performance of TensorRT sub-blocks in compiled modules. This can be used in conjunction with PyTorch profiling tools to get a picture of the performance of your model. Profiling for any particular sub-block can be enabled by the `enable_profiling()` method of any `__torch__.classes.tensorrt.Engine` attribute, or of any `torch_tensorrt.TRTModuleNext`. The profiler will dump trace files in `/tmp` by default, though this path can be customized either by setting the `profile_path_prefix` of `__torch__.classes.tensorrt.Engine` or via an argument to `torch_tensorrt.TRTModuleNext.enable_profiling(profiling_results_dir="")`. Traces can be visualized using the Perfetto tool (https://perfetto.dev).
Engine layer information can also be accessed using `get_layer_info`, which returns a JSON string describing the layers and fusions that the engine contains.
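As a rough sketch of the workflow (assuming `trt_model` is an FX-compiled module containing `TRTModuleNext` submodules, `example_input` is a suitable CUDA tensor, and the output directory is illustrative):

from torch_tensorrt import TRTModuleNext

# Enable profiling on every TensorRT sub-block in the compiled module
for name, submod in trt_model.named_modules():
    if isinstance(submod, TRTModuleNext):
        submod.enable_profiling(profiling_results_dir="/tmp/trt_profiles")
        # Layer / fusion information for the underlying engine, as a JSON string
        print(submod.get_layer_info())

# Running inference writes trace files to /tmp/trt_profiles,
# which can be opened in Perfetto (https://perfetto.dev)
out = trt_model(example_input)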
Unified Runtime for FX and TorchScript Frontends [Experimental]
In previous versions of Torch-TensorRT, the FX and TorchScript frontends were mostly separate, and each had its own distinct benefits and limitations. Torch-TensorRT 1.3.0 introduces a new unified runtime to support both FX and TorchScript, meaning that you can choose the compilation workflow that makes the most sense for your particular use case, be it pure Python conversion via FX or C++ TorchScript compilation. Both frontends use the same primitives to construct their compiled graphs, whether fully or partially compiled.
Basic Usage
The TorchScript frontend uses the new runtime by default. No additional workflow changes are necessary.
Note: The runtime ABI version was increased to support this feature; as such, models compiled with previous versions of Torch-TensorRT will need to be recompiled.
For the FX frontend, the new runtime can be chosen by setting `use_experimental_fx_rt=True` as part of your compile settings, either via `torch_tensorrt.compile(my_mod, ir="fx", use_experimental_fx_rt=True, explicit_batch_dimension=True)` or via `torch_tensorrt.fx.compile(my_mod, use_experimental_fx_rt=True, explicit_batch_dimension=True)`.
Note: The new runtime only supports the explicit batch dimension.
TRTModuleNext
The FX frontend will return a `torch.nn.Module` containing `torch_tensorrt.TRTModuleNext` submodules instead of `torch_tensorrt.fx.TRTModule`s. The features of these modules are nearly identical, but with a few key improvements:
- `TRTModuleNext` profiling dumps a trace visualizable with Perfetto (see above for more details).
- `TRTModuleNext` modules are `torch.jit.trace`-able, meaning you can save FX compiled modules as TorchScript for Python-less / C++ deployment scenarios. Traced compiled modules have the same deployment instructions as compiled modules produced by the TorchScript frontend.
- `TRTModuleNext` maintains the same serialization workflows `TRTModule` supports as well (state_dict / extra_state, torch.save / torch.load).
Examples
model_fx = model_fx.cuda()
inputs_fx = [i.cuda() for i in inputs_fx]
trt_fx_module_f16 = torch_tensorrt.compile(
model_fx,
ir="fx",
inputs=inputs_fx,
enabled_precisions={torch.float16},
use_experimental_fx_rt=True,
explicit_batch_dimension=True
)
# Save model using torch.save
torch.save(trt_fx_module_f16, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
# Trace and save the FX module in TorchScript
scripted_fx_module = torch.jit.trace(trt_fx_module_f16, example_inputs=inputs_fx)
scripted_fx_module.save("/tmp/scripted_fx_module.ts")
scripted_fx_module = torch.jit.load("/tmp/scripted_fx_module.ts")
... # Get a handle to a TRTModuleNext submodule, e.g. trt_mod
# Extract state dictionary
st = trt_mod.state_dict()
# Load the state dict into a new module
new_trt_mod = TRTModuleNext()
new_trt_mod.load_state_dict(st)
Using TRTModuleNext as an arbitrary TensorRT engine holder
Using TorchScript, you have long been able to embed an arbitrary TensorRT engine from any source in a TorchScript module using `torch_tensorrt.ts.embed_engine_in_new_module`. Now you can do this at the `torch.nn.Module` level by directly using `TRTModuleNext` and access all the benefits enumerated above.
trt_mod = TRTModuleNext(
serialized_engine,
name="TestModule",
input_binding_names=input_names,
output_binding_names=output_names,
)
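Since the result is a standard `torch.nn.Module`, the wrapped engine can be called and traced like any other module; continuing from the construction above (the input shape is hypothetical and must match the engine's input binding):

inp = torch.randn(1, 3, 224, 224).cuda()  # hypothetical input matching the engine's binding
out = trt_mod(inp)

# The module can also be traced and saved as TorchScript for deployment
ts_mod = torch.jit.trace(trt_mod, [inp])
ts_mod.save("/tmp/embedded_engine.ts")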
The intention is for `torch_tensorrt.TRTModuleNext` to replace `torch_tensorrt.fx.TRTModule` as the default TensorRT module implementation in a future release. Feedback on this class and how it is used, on the runtime in general, or on associated features (profiler, engine inspector) is welcome.
What's Changed
- chore: Bump version to 1.2.0a0 by @narendasan in #1044
- feat: Extending nox for cxx11 ABI version by @andi4191 in #1013
- docs: Update the documentation theme to PyTorch by @narendasan in #1063
- Adding Code of Conduct file by @facebook-github-bot in #1061
- Update CONTRIBUTING.md by @frank-wei in #1064
- feat: Optimize hub.py download by @andi4191 in #1022
- Adding an action to automatically assign reviewers and assignees by @narendasan in #1078
- Add PR assigner support by @narendasan in #1080
- (//core): Align with prim::Enter in module fallback by @andi4191 in #991
- (//core): Added a variant for aten::split by @andi4191 in #992
- feat(nox): Replacing session with environment variable by @andi4191 in #1057
- Refactor the internal codebase from fx2trt_oss to torch_tensorrt by @frank-wei in #1104
- format by buildifier by @frank-wei in #1106
- [fx2trt] Modify lower setting class by @frank-wei in #1107
- Modified the notebooks directory's README file by @svenchilton in #1102
- [FX] Sync to OSS by @frank-wei in #1118
- [fx_acc] Add acc_tracer support for torch.mm by @khabinov in #1120
- Added Triton deployment instructions to documentation by @tanayvarshney in #1116
- amending triton deployment docs by @tanayvarshney in #1126
- fix: Update broken repo hyperlink by @lamhoangtung in #1131
- fix: Fix keep_dims functionality for aten::max by @peri044 in #1099
- fix(tests/core/partitioning): Fix tests of refactoring segmentation in partitioning by @peri044 in #1140
- feat(//tests): Update rtol and atol based tolerance for test cases by @andi4191 in #1055
- doc: add the explanation for partition phases on docs by @bowang007 in #1090
- feat (//cpp): Using atol and rtol based tolerance threshold for torchtrtc by @andi4191 in #1052
- CI/CD setup by @frank-wei in #1137
- Update README.md by @frank-wei in #1142
- [fx2trt] Engineholder feature improvement, test fixes by @frank-wei in #1143
- feat (//core/conversion) : Add converter for torch.bitwise_not by @blchu in #1029
- fixed typos by @tanayvarshney in #1098
- [FX] --fx-only does not need to check bazel by @frank-wei in #1147
- [FX] refactor the fx path in compile function by @frank-wei in #1141
- [FX] Create getting_started_with_fx_path.rst by @frank-wei in #1145
- [FX] move example folder by @frank-wei in #1149
- [FX] Sync enhancement done internally at Meta by @yinghai in #1161
- Update config.yml by @frank-wei in #1163
- Use py3 next() syntax by @ptrblck in #1159
- Add missing comma for proper torch versioning in setup.py by @dabauxi in #1164
- [docs] Update link to relative path by @zhiqwang in #1171
- [FX] Changes done internally at Facebook by @frank-wei in #1172
- fix: fix the model name typo error by @bowang007 in #1176
- [FX] Changes done internally at Facebook by @frank-wei in #1178
- [feat]: support slice with dynamic shape by @inocsin in #1110
- [FX] Update getting_started_with_fx_path.rst by @frank-wei in #1184
- [FX] Update README.md by @frank-wei in #1183
- fix: Fix PTQ calibration when there are multiple inputs by @peri044 in #1191
- [FX] Changes done internally at Facebook by @frank-wei in #1194
- [fix]: fix bug in aten::to, when network only have aten::to layer wil… by @inocsin in #1108
- Add .circleci/config.yml by @narendasan in #1153
- feat: Upgrade TRT to 8.4 by @peri044 in #1152
- feat: Update Pytorch version to 1.12 by @peri044 in #1177
- fix: converter renaming already named tensors by @bowang007 in #1167
- feat(//py): Use TensorRT to fill in .so libraries automatically if possible by @narendasan in #1085
- [FX] Changes done internally at Facebook by @frank-wei in #1204
- fix: fix the parsing related model loading bug by @bowang007 in #1148
- feat: support min_block_size != 1 caused fallback nodes re-segmentation by @bowang007 in #1195
- [FX] Changes done internally at Facebook by @frank-wei in #1208
- fix: fix the fallback related issue after merging collection by @bowang007 in #1206
- Add CMake support to build the libraries by @gcuendet in #1058
- Fix typo in EfficientNet-example by @davinnovation in #1217
- fix: fix bug that ListConstruct in TRT subgraph when it's entire graph's output by @bowang007 in #1220
- fix: fix the error that collection input segmented into trt subgraph by @bowang007 in #1225
- feat(//circleci): Adding release automation by @narendasan in #1215
- fix: support int tensor * int scaler in aten::mul by @mfeliz-cruise in #1095
- [FX] Changes done internally at Facebook by @frank-wei in #1221
- Fix errors in unbind and list slice by @mfeliz-cruise in #1088
- Adding a Resnet C++ example by @vinhngx in #1175
- [FX] disable 2 of conv3d and type_as tests by @frank-wei in #1224
- [feat] Add support for integers in aten::abs converter (#35) by @mfeliz-cruise in #1232
- Update PTQ example to fix new compile_spec requirements by @ncomly-nvidia in #1242
- feat: support for grouped inputs by @narendasan in #1201
- feat: Added support for custom torch operators and converters in torchtrtc by @andi4191 in #1219
- Add outputPadding in deconv by @ruoqianguo in #1234
- chore: Apply linting and ignore new bazel dirs by @narendasan in #1223
- added qat-ptq workflow notebook by @tanayvarshney in #1239
- fix: Update cmake for the new collection files by @narendasan in #1246
- chore: ignore dist dir for pre-commit by @narendasan in #1249
- chore: Aligning bazel version for consistency across different docker… by @andi4191 in #1250
- refactor: Changed the hardcoded values to macros for DLA memory sizes by @andi4191 in #1247
- chore: update jetson pytorch baase by @narendasan in #1251
- [feat] Add automatic type promotion to element-wise ops by @mfeliz-cruise in #1240
- Assorted small fixes by @narendasan in #1259
- [FX] remove op_lowering_disallow_list and format revert by @frank-wei in #1261
- fix: fix the "schema not found for node" error by @bowang007 in #1236
- chore: Fix contributing doc by @peri044 in #1268
- feat: support scatter.value and scatter.src by @inocsin in #1252
- Internal workspace workflow by @narendasan in #1269
- Fix typo in README by @davinnovation in #1273
- Support swin/bert with dynamic batch by @Njuapp in #1270
- correct sha256sum of cudnn by @Njuapp in #1278
- Jetson workspace by @narendasan in #1280
- chore(deps): bump @actions/core from 1.8.2 to 1.9.1 in /.github/actions/assigner by @dependabot in #1287
- [FX] Changes done internally at Facebook by @frank-wei in #1288
- chore: Fix dataloader in finetune_qat script by @andi4191 in #1292
- chore: Truncate long and double for ptq CPP path by @andi4191 in #1291
- feat: Add support for aten::square by @mfeliz-cruise in #1286
- fix: fix misleading skipping partitioning msg by @bowang007 in #1289
- fix: Add int support to constant_pad_nd by @mfeliz-cruise in #1283
- fix: Resolve non-determinism in registerSegmentsOutputs by @mfeliz-cruise in #1284
- docs: Update docgen task by @narendasan in #1294
- update fx notebook by @frank-wei in #1297
- [FX] Changes done internally at Facebook by @frank-wei in #1299
- fix(tools): Fix linter to not depend on docker by @narendasan in #1301
- Support multiple indices for aten::index.Tensor by @ruoqianguo in #1309
- chore: Adding CMake to the CI by @narendasan in #1310
- feat: Upgrade Pytorch to 1.12.1 and TensorRT to 8.4.3.1 by @peri044 in #1315
- Fix bug: correct the output shape of aten::index.Tensor by @ruoqianguo in #1314
- feat (//core/conversion) : Add converter for torch.repeat_interleave ( by @blchu in #1313
- chore: Adding NGC build path by @narendasan in #1311
- Update lower.py by @frank-wei in #1324
- fix!: Fixed Windows compilation failures by @andi4191 in #1330
- [feat] Add support for argmax and argmin by @mfeliz-cruise in #1312
- chore: Adding a guideline to build on Windows platform by @andi4191 in #1337
- chore: Fix data loader issues and nox file paths by @peri044 in #1281
- feat(//tools/perf): Refactor perf_run.py, add fx2trt backend support, usage via CLI arguments by @peri044 in #1254
- refactor(//tests) : Refactor the test suite by @peri044 in #1329
- [feat] add support for aten::reciprocal(int) by @mfeliz-cruise in #1308
- [FX] Update getting_started_with_fx_path.rst by @frank-wei in #1342
- Update getting_started_with_fx_path.rst by @frank-wei in #1343
- enable direct call to fx.compile() by @frank-wei in #1344
- fix: add remove_exception pass from torch to fix uninitialized tensor… by @bowang007 in #1345
- chore: apply linting to docs by @narendasan in #1347
- docs: Adding v1.2.0 and v1.1.1 docs by @narendasan in #1349
- Docs for release by @narendasan in #1350
- fix: Fixing pybind error on nightly by @andi4191 in #1285
- Centralizing Partitioning State by @narendasan in #1263
- chore: Fix centralized partititoning by @peri044 in #1367
- chore: Move master to test nightly only by @narendasan in #1370
- [fix] Avoid layer name conflicts in aten::index by @mfeliz-cruise in #1377
- [fix] Fix output dimensions of aten::unbind converter by @mfeliz-cruise in #1373
- Einsum converter by @gs-olive in #1385
- Atan2 converter by @gs-olive in #1381
- [FX] aten2trt and some pass fixes by @frank-wei in #1390
- feat: Add converter for aten::sign unary op by @gs-olive in #1391
- Add support for aten::squeeze without a dim by @mfeliz-cruise in #1393
- [fix] incorrect casting behavior in floor_divide by @mfeliz-cruise in #1392
- chore: minor fixes by @peri044 in #1397
- fix: torch.std and torch.var support multi-dimensional reductions by @gs-olive in #1395
- fix: fix missing float type in shape analysis by @bowang007 in #1399
- feat: Rsqrt lowering pass by @gs-olive in #1394
- Add correct pip install instructions by @msaroufim in #1400
- fix: aten::split behavior with negative indexing by @gs-olive in #1403
- fix: fix compilation stuck bug caused by elimination exception by @bowang007 in #1409
- [FX] Fix clamping float32 boundary values, aten2trt init check-in, fix slice issues by @frank-wei in #1415
- [feat]Add converter for aten::where by @mfeliz-cruise in #1421
- [feat]Add converter support for aten::frobenius_norm by @mfeliz-cruise in #1422
- chore: Update torch installation paths for NGC by @peri044 in #1435
- [feat] Add dependency awareness to torch-trt partitioning by @mfeliz-cruise in #1304
- docs: minor changes in Resnet50 example by @przemb in #1427
- fix: Ensure proper type inheritance in aten::masked_fill by @gs-olive in #1430
- chore: Nox file update from NGC 22.11 release by @peri044 in #1438
- fix: Add check to ensure einsum converter has no more than 2 tensor inputs by @gs-olive in #1439
- [feat] Add partial converter support for aten::linalg_norm by @mfeliz-cruise in #1426
- chore: Lint noxfile.py by @gs-olive in #1443
- fix: CUDA error 710 bugfix by @gs-olive in #1424
- scalar_to_tensor avoid scalar.to() by @Njuapp in #1448
- feat: rewriting param to a Constant if it's a introduced input by @bowang007 in #1298
- feat: support int64 <=> int32 auto conversion by @bowang007 in #1407
- fix: Device casting issues with certain aten operators by @gs-olive in #1416
- feat(//core/partitioning) : Dynamic shapes + fallback by @peri044 in #1414
- [fix] unmangle_cls_name for variable length mangled tags by @mfeliz-cruise in #1454
- fix: Error with aten::div when using truncation with Int32 tensor inputs by @gs-olive in #1442
- fix: fix failed test cases caused by partition API changes by @bowang007 in #1460
- fix: Update floor division schema replacement in lowering by @gs-olive in #1464
- feat: Add functionality to performance tooling by @gs-olive in #1451
- Unifying the FX and TS Frontends by @narendasan in #1404
New Contributors
- @facebook-github-bot made their first contribution in #1061
- @frank-wei made their first contribution in #1064
- @khabinov made their first contribution in #1120
- @blchu made their first contribution in #1029
- @yinghai made their first contribution in #1161
- @ptrblck made their first contribution in #1159
- @dabauxi made their first contribution in #1164
- @zhiqwang made their first contribution in #1171
- @gcuendet made their first contribution in #1058
- @davinnovation made their first contribution in #1217
- @dependabot made their first contribution in #1287
- @msaroufim made their first contribution in #1400
- @przemb made their first contribution in #1427
Full Changelog: v1.1.0...v1.3.0