TRT-LLM loading mechanism tool #3398
base: main
Conversation
Force-pushed 57dbb3f to 3e38e87
f"Ensure the path is correct and the library is compatible", | ||
exc_info=e_os_error, | ||
else: | ||
py_version = f"cp{sys.version_info.major}{sys.version_info.minor}" |
Why do we restrict to cp310 and cp312? It shouldn't matter if we are pulling the wheel and unzipping it ourselves.
On https://pypi.nvidia.com/tensorrt-llm/ I only see wheels tagged cp310 and cp312, so I added the check.
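A minimal sketch of the kind of tag check being discussed, assuming the supported-tag set is maintained by hand; `SUPPORTED_TAGS` and the helper names are illustrative, not taken from the PR:

```python
import sys

# Wheel tags observed on https://pypi.nvidia.com/tensorrt-llm/ at review time (assumption)
SUPPORTED_TAGS = {"cp310", "cp312"}


def current_python_tag() -> str:
    # Build a CPython tag such as "cp310" from the running interpreter,
    # mirroring the f-string in the snippet above.
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


def ensure_wheel_available() -> None:
    tag = current_python_tag()
    if tag not in SUPPORTED_TAGS:
        raise RuntimeError(
            f"No tensorrt-llm wheel published for {tag}; "
            f"supported tags: {sorted(SUPPORTED_TAGS)}"
        )


if __name__ == "__main__":
    ensure_wheel_available()
```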
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_nccl_ops.py 2025-02-27 20:03:00.014038+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_nccl_ops.py 2025-02-27 20:03:24.885031+00:00
@@ -22,11 +22,11 @@
from .harness import DispatchTestCase
class TestGatherNcclOpsConverter(DispatchTestCase):
- @parameterized.expand([(8)])
+ @parameterized.expand([8])
def test_nccl_ops(self, linear_layer_dim):
class DistributedGatherModel(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.fc = torch.nn.Linear(input_dim, input_dim)
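The autofix in the diff above is cosmetic: `(8)` is just the integer `8`, so `[(8)]` and `[8]` are the same one-element list of scalar parameters. A small self-contained illustration (the class and test names here are mine, not from the test file):

```python
import unittest

from parameterized import parameterized


class ParamSpellingDemo(unittest.TestCase):
    # (8) is not a one-element tuple, it is the int 8, so this is a scalar parameter.
    @parameterized.expand([8])
    def test_scalar_param(self, linear_layer_dim):
        self.assertEqual(linear_layer_dim, 8)

    # A genuine one-element tuple needs a trailing comma.
    @parameterized.expand([(8,)])
    def test_tuple_param(self, linear_layer_dim):
        self.assertEqual(linear_layer_dim, 8)


if __name__ == "__main__":
    unittest.main()
```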
Force-pushed 9ba407b to 5f3fdac
Force-pushed 5f3fdac to b66350e
Force-pushed 6e893ed to 77f2145
Force-pushed f30acb7 to 9c238ae
Force-pushed 89d621d to 27aa2f2
Force-pushed cd5fa5a to c3b62d2
…ing in dynamo.compile
TRT-LLM installation utilities and adding test cases
adding the option in _compiler.py
changes in the TRT-LLM loading tool - removing install_wget, install_unzip, install_mpi
Further changes in error logging of the TRT-LLM installation tool
moving the load_tensorrt_llm to dynamo/utils.py
correcting misprint for TRT LLM load
Using python lib for download to make it platform agnostic
dll file path update for windows
correcting the non critical lint error
Including version in versions.txt
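To illustrate the "python lib for download to make it platform agnostic" and Windows ".dll" points above, a rough sketch of a standard-library-only download/extract step; the wheel URL, destination, and library file names are placeholders rather than the tool's actual values:

```python
import platform
import urllib.request
import zipfile
from pathlib import Path

# Placeholder URL; the real utility derives this from the index and the TRT-LLM version.
WHEEL_URL = "https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-<version>-cp310-cp310-linux_x86_64.whl"


def download_and_extract(url: str, dest: Path) -> Path:
    """Fetch a wheel (a zip archive) and unpack it without shelling out to wget/unzip."""
    dest.mkdir(parents=True, exist_ok=True)
    wheel_path = dest / "tensorrt_llm.whl"
    urllib.request.urlretrieve(url, wheel_path)  # pure-Python download, platform agnostic
    with zipfile.ZipFile(wheel_path) as whl:  # wheels are plain zip archives
        whl.extractall(dest)
    return dest


def plugin_library_name() -> str:
    # Shared-library naming differs per platform, hence the Windows .dll path fix;
    # the exact file names here are illustrative.
    if platform.system() == "Windows":
        return "nvinfer_plugin_tensorrt_llm.dll"
    return "libnvinfer_plugin_tensorrt_llm.so"
```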
Force-pushed c3b62d2 to 8cf5d71
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/_compiler.py 2025-05-20 19:26:15.050058+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/_compiler.py 2025-05-20 19:26:40.956710+00:00
@@ -441,91 +441,91 @@
use_distributed_mode_trace: bool = _defaults.USE_DISTRIBUTED_MODE_TRACE,
**kwargs: Any,
) -> torch.fx.GraphModule:
"""Compile an ExportedProgram module for NVIDIA GPUs using TensorRT
- Takes a existing TorchScript module and a set of settings to configure the compiler
- and will convert methods to JIT Graphs which call equivalent TensorRT engines
-
- Converts specifically the forward method of a TorchScript Module
-
- Arguments:
- exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
- inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
- torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
- to select device type.
-
- .. code-block:: py
-
- inputs=[
- torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
- torch_tensorrt.Input(
- min_shape=(1, 224, 224, 3),
- opt_shape=(1, 512, 512, 3),
- max_shape=(1, 1024, 1024, 3),
- dtype=torch.int32
- format=torch.channel_last
- ), # Dynamic input shape for input #2
- torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
- ]
-
- Keyword Arguments:
- arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
- kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
- device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
-
- device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
-
- disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
- assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
- sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
- enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
- debug (bool): Enable debuggable engine
- capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
- num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
- workspace_size (int): Maximum size of workspace given to TensorRT
- dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
- dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
- dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
- truncate_double (bool): Truncate weights provided in double (float64) to float32
- calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
- require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
- min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
- torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
- torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
- pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
- max_aux_stream (Optional[int]): Maximum streams in the engine
- version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
- optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
- use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
- use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
- enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
- dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
- hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
- timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
- lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
- cache_built_engines (bool): Whether to save the compiled TRT engines to storage
- reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
- engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
- engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
- custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
- use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
- use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
- refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
- strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
- immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
- enable_weight_streaming (bool): Enable weight streaming.
- tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
- l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
-<<<<<<< HEAD
- offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
-=======
- use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
->>>>>>> c3b62d239 (TensorRT-LLM import fix and aot_joint_export specify as explicit setting in dynamo.compile)
- **kwargs: Any,
- Returns:
- torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
+ Takes a existing TorchScript module and a set of settings to configure the compiler
+ and will convert methods to JIT Graphs which call equivalent TensorRT engines
+
+ Converts specifically the forward method of a TorchScript Module
+
+ Arguments:
+ exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
+ inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
+ torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
+ to select device type.
+
+ .. code-block:: py
+
+ inputs=[
+ torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
+ torch_tensorrt.Input(
+ min_shape=(1, 224, 224, 3),
+ opt_shape=(1, 512, 512, 3),
+ max_shape=(1, 1024, 1024, 3),
+ dtype=torch.int32
+ format=torch.channel_last
+ ), # Dynamic input shape for input #2
+ torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
+ ]
+
+ Keyword Arguments:
+ arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
+ kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
+ device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
+
+ device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
+
+ disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
+ assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
+ sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
+ enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
+ debug (bool): Enable debuggable engine
+ capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
+ num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
+ workspace_size (int): Maximum size of workspace given to TensorRT
+ dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
+ dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
+ dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
+ truncate_double (bool): Truncate weights provided in double (float64) to float32
+ calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
+ require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
+ min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
+ torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
+ torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
+ pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
+ max_aux_stream (Optional[int]): Maximum streams in the engine
+ version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
+ optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
+ use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
+ use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
+ enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
+ dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
+ hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
+ timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
+ lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
+ cache_built_engines (bool): Whether to save the compiled TRT engines to storage
+ reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
+ engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
+ engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
+ custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
+ use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
+ use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
+ refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
+ strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
+ immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
+ enable_weight_streaming (bool): Enable weight streaming.
+ tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
+ l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
+ <<<<<<< HEAD
+ offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
+ =======
+ use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
+ >>>>>>> c3b62d239 (TensorRT-LLM import fix and aot_joint_export specify as explicit setting in dynamo.compile)
+ **kwargs: Any,
+ Returns:
+ torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
"""
if debug:
set_log_level(logger.parent, logging.DEBUG)
if "truncate_long_and_double" in kwargs.keys():
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/converter_utils.py 2025-05-20 19:26:15.051058+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/conversion/converter_utils.py 2025-05-20 19:26:41.194290+00:00
@@ -1046,6 +1046,5 @@
# Cast both tensors to the promoted dtype
lhs_cast = cast_trt_tensor(ctx, lhs, promoted_dtype, f"{name_prefix}lhs_cast")
rhs_cast = cast_trt_tensor(ctx, rhs, promoted_dtype, f"{name_prefix}rhs_cast")
return lhs_cast, rhs_cast
-
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/_compiler.py 2025-05-20 19:43:10.916779+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/_compiler.py 2025-05-20 19:43:32.838717+00:00
@@ -441,88 +441,88 @@
use_distributed_mode_trace: bool = _defaults.USE_DISTRIBUTED_MODE_TRACE,
**kwargs: Any,
) -> torch.fx.GraphModule:
"""Compile an ExportedProgram module for NVIDIA GPUs using TensorRT
- Takes a existing TorchScript module and a set of settings to configure the compiler
- and will convert methods to JIT Graphs which call equivalent TensorRT engines
-
- Converts specifically the forward method of a TorchScript Module
-
- Arguments:
- exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
- inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
- torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
- to select device type.
-
- .. code-block:: py
-
- inputs=[
- torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
- torch_tensorrt.Input(
- min_shape=(1, 224, 224, 3),
- opt_shape=(1, 512, 512, 3),
- max_shape=(1, 1024, 1024, 3),
- dtype=torch.int32
- format=torch.channel_last
- ), # Dynamic input shape for input #2
- torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
- ]
-
- Keyword Arguments:
- arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
- kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
- device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
-
- device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
-
- disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
- assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
- sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
- enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
- debug (bool): Enable debuggable engine
- capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
- num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
- workspace_size (int): Maximum size of workspace given to TensorRT
- dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
- dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
- dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
- truncate_double (bool): Truncate weights provided in double (float64) to float32
- calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
- require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
- min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
- torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
- torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
- pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
- max_aux_stream (Optional[int]): Maximum streams in the engine
- version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
- optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
- use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
- use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
- enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
- dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
- hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
- timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
- lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
- cache_built_engines (bool): Whether to save the compiled TRT engines to storage
- reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
- engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
- engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
- custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
- use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
- use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
- refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
- strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
- immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
- enable_weight_streaming (bool): Enable weight streaming.
- tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
- l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
- offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
- use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
- **kwargs: Any,
- Returns:
- torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
+ Takes a existing TorchScript module and a set of settings to configure the compiler
+ and will convert methods to JIT Graphs which call equivalent TensorRT engines
+
+ Converts specifically the forward method of a TorchScript Module
+
+ Arguments:
+ exported_program (torch.export.ExportedProgram): Source module, running torch.export on a ``torch.nn.Module``
+ inputs (Tuple[Any, ...]): List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
+ torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
+ to select device type.
+
+ .. code-block:: py
+
+ inputs=[
+ torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
+ torch_tensorrt.Input(
+ min_shape=(1, 224, 224, 3),
+ opt_shape=(1, 512, 512, 3),
+ max_shape=(1, 1024, 1024, 3),
+ dtype=torch.int32
+ format=torch.channel_last
+ ), # Dynamic input shape for input #2
+ torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
+ ]
+
+ Keyword Arguments:
+ arg_inputs (Tuple[Any, ...]): Same as inputs. Alias for better understanding with kwarg_inputs.
+ kwarg_inputs (dict[Any, ...]): Optional, kwarg inputs to the module forward function.
+ device (Union(torch_tensorrt.Device, torch.device, dict)): Target device for TensorRT engines to run on ::
+
+ device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
+
+ disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
+ assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
+ sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
+ enabled_precision (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
+ debug (bool): Enable debuggable engine
+ capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
+ num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
+ workspace_size (int): Maximum size of workspace given to TensorRT
+ dla_sram_size (int): Fast software managed RAM used by DLA to communicate within a layer.
+ dla_local_dram_size (int): Host RAM used by DLA to share intermediate tensor data across operations
+ dla_global_dram_size (int): Host RAM used by DLA to store weights and metadata for execution
+ truncate_double (bool): Truncate weights provided in double (float64) to float32
+ calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)): Calibrator object which will provide data to the PTQ system for INT8 Calibration
+ require_full_compilation (bool): Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
+ min_block_size (int): The minimum number of contiguous TensorRT convertible operations in order to run a set of operations in TensorRT
+ torch_executed_ops (Collection[Target]): Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but ``require_full_compilation`` is True
+ torch_executed_modules (List[str]): List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but ``require_full_compilation`` is True
+ pass_through_build_failures (bool): Error out if there are issues during compilation (only applicable to torch.compile workflows)
+ max_aux_stream (Optional[int]): Maximum streams in the engine
+ version_compatible (bool): Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
+ optimization_level: (Optional[int]): Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.
+ use_python_runtime: (bool): Return a graph using a pure Python runtime, reduces options for serialization
+ use_fast_partitioner: (bool): Use the adjacency based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global paritioner (``False``) if looking for best performance
+ enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
+ dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
+ hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
+ timing_cache_path (str): Path to the timing cache if it exists (or) where it will be saved after compilation
+ lazy_engine_init (bool): Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
+ cache_built_engines (bool): Whether to save the compiled TRT engines to storage
+ reuse_cached_engines (bool): Whether to load the compiled TRT engines from storage
+ engine_cache_dir (Optional[str]): Directory to store the cached TRT engines
+ engine_cache_size (Optional[int]): Maximum hard-disk space (bytes) to use for the engine cache, default is 1GB. If the cache exceeds this size, the oldest engines will be removed by default
+ custom_engine_cache (Optional[BaseEngineCache]): Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
+ use_explicit_typing (bool): This flag enables strong typing in TensorRT compilation which respects the precisions set in the Pytorch model. This is useful when users have mixed precision graphs.
+ use_fp32_acc (bool): This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
+ refit_identical_engine_weights (bool): Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
+ strip_engine_weights (bool): Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
+ immutable_weights (bool): Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, `strip_engine_weights` and `refit_identical_engine_weights` will be ignored.
+ enable_weight_streaming (bool): Enable weight streaming.
+ tiling_optimization_level (str): The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support ["none", "fast", "moderate", "full"].
+ l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
+ offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
+ use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
+ **kwargs: Any,
+ Returns:
+ torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
"""
if debug:
set_log_level(logger.parent, logging.DEBUG)
if "truncate_long_and_double" in kwargs.keys():
Force-pushed 367e925 to 194fb44
TRT-LLM download utility