
Commit 367e925
linting error fixes and rebase fix
1 parent: 8cf5d71

2 files changed (+78, -82 lines)

py/torch_tensorrt/dynamo/_compiler.py

Lines changed: 78 additions & 81 deletions
@@ -443,87 +443,84 @@ def compile(
 ) -> torch.fx.GraphModule:
     """Compile an ExportedProgram module for NVIDIA GPUs using TensorRT
 
-    Takes a existing TorchScript module and a set of settings to configure the compiler
-    and will convert methods to JIT Graphs which call equivalent TensorRT engines
-
-    Converts specifically the forward method of a TorchScript Module
-
-    Arguments:
-        exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
-        inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
-            torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
-            to select device type.
-
-            .. code-block:: py
-
-                inputs=[
-                    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
-                    torch_tensorrt.Input(
-                        min_shape=(1, 224, 224, 3),
-                        opt_shape=(1, 512, 512, 3),
-                        max_shape=(1, 1024, 1024, 3),
-                        dtype=torch.int32
-                        format=torch.channel_last
-                    ), # Dynamic input shape for input #2
-                    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
-                ]
-
-    Keyword Arguments:
-        arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
-        kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
-        device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
-
-            device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
-
-        disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
-        assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
-        sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
-        enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
-        debug (bool): Enable debuggable engine
-        capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
-        num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
-        workspace_size (int): Maximum size of workspace given to TensorRT
-        dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
-        dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
-        dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
-        truncate_double (bool): Truncate weights provided in double (float64) to float32
-        calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
-        require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
-        min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
-        torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
-        torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
-        pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
-        max_aux_stream (Optional[int]): Maximum streams in the engine
-        version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
-        optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
-        use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
-        use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
-        enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
-        dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
-        hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
-        timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
-        lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
-        cache_built_engines (bool): Whether to save the compiled TRT engines to storage
-        reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
-        engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
-        engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
-        custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
-        use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
-        use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
-        refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
-        strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
-        immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
-        enable_weight_streaming (bool): Enable weight streaming.
-        tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
-        l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
-<<<<<<< HEAD
-        offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
-=======
-        use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
->>>>>>> c3b62d239 (TensorRT-LLM import fix and aot_joint_export specify as explicit setting in dynamo.compile)
-        **kwargs: Any,
-    Returns:
-        torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
+    Takes a existing TorchScript module and a set of settings to configure the compiler
+    and will convert methods to JIT Graphs which call equivalent TensorRT engines
+
+    Converts specifically the forward method of a TorchScript Module
+
+    Arguments:
+        exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
+        inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
+            torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
+            to select device type.
+
+            .. code-block:: py
+
+                inputs=[
+                    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
+                    torch_tensorrt.Input(
+                        min_shape=(1, 224, 224, 3),
+                        opt_shape=(1, 512, 512, 3),
+                        max_shape=(1, 1024, 1024, 3),
+                        dtype=torch.int32
+                        format=torch.channel_last
+                    ), # Dynamic input shape for input #2
+                    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
+                ]
+
+    Keyword Arguments:
+        arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
+        kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
+        device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
+
+            device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
+
+        disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
+        assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
+        sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
+        enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
+        debug (bool): Enable debuggable engine
+        capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
+        num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
+        workspace_size (int): Maximum size of workspace given to TensorRT
+        dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
+        dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
+        dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
+        truncate_double (bool): Truncate weights provided in double (float64) to float32
+        calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
+        require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
+        min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
+        torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
+        torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
+        pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
+        max_aux_stream (Optional[int]): Maximum streams in the engine
+        version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
+        optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
+        use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
+        use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
+        enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
+        dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
+        hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
+        timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
+        lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
+        cache_built_engines (bool): Whether to save the compiled TRT engines to storage
+        reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
+        engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
+        engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
+        custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
+        use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
+        use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
+        refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
+        strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
+        immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
+        enable_weight_streaming (bool): Enable weight streaming.
+        tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
+        l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
+        offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
+        use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
+        **kwargs: Any,
+    Returns:
+        torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
     """
 
     if debug:
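For context, a minimal usage sketch of the compile API whose docstring is edited above. It is illustrative only and not part of this commit; it assumes a CUDA-capable GPU and that keyword spellings such as enabled_precisions match the installed torch_tensorrt release.

import torch
import torch_tensorrt

# Small eval-mode model on the GPU; any torch.nn.Module works here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).eval().cuda()

example = torch.randn(1, 3, 224, 224).cuda()
exported_program = torch.export.export(model, (example,))

trt_gm = torch_tensorrt.dynamo.compile(
    exported_program,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],     # static shape spec, as in the docstring example
    enabled_precisions={torch.float32, torch.float16},   # dtypes TensorRT may select kernels for (assumed spelling)
    min_block_size=1,        # smallest contiguous convertible op group offloaded to TensorRT
    truncate_double=True,    # demote any float64 weights to float32
)

print(trt_gm(example).shape)  # compiled torch.fx.GraphModule, executed via TensorRT engines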

py/torch_tensorrt/dynamo/conversion/converter_utils.py

Lines changed: 0 additions & 1 deletion
@@ -1048,4 +1048,3 @@ def promote_trt_tensors_to_same_dtype(
     rhs_cast = cast_trt_tensor(ctx, rhs, promoted_dtype, f"{name_prefix}rhs_cast")
 
     return lhs_cast, rhs_cast
-
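The converter_utils.py hunk only drops a trailing blank line, but for orientation: the helper it touches, promote_trt_tensors_to_same_dtype, appears to follow a promote-then-cast pattern, choosing a common dtype for two operands and casting both to it before a binary op. A hypothetical plain-PyTorch sketch of that pattern follows; it is not the torch_tensorrt implementation, which operates on TensorRT tensors via cast_trt_tensor.

import torch

def promote_to_same_dtype(lhs: torch.Tensor, rhs: torch.Tensor):
    # Pick the common dtype both operands can be represented in, then cast both to it.
    promoted_dtype = torch.promote_types(lhs.dtype, rhs.dtype)
    return lhs.to(promoted_dtype), rhs.to(promoted_dtype)

lhs, rhs = promote_to_same_dtype(
    torch.ones(4, dtype=torch.float16),
    torch.ones(4, dtype=torch.float32),
)
assert lhs.dtype == rhs.dtype == torch.float32  # float16 + float32 promotes to float32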
0 commit comments