Overhaul backend function execution for improved performance and flexibility #270

Open
wants to merge 1 commit into main from jhalakp-use-non-dps-exec-func

Conversation

@jhalakpatel (Collaborator) commented Oct 15, 2024

This PR replaces the DPS-style calling convention with a non-DPS approach, eliminating the requirement for call sites to preallocate output buffers. This change lets us skip computing output shapes and allocating output buffers ahead of time, laying the groundwork for data-dependent shapes, where network outputs can have dynamic dimensions.
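For readers unfamiliar with the terminology, here is a minimal, self-contained illustration of the two calling conventions using plain NumPy (not the MLIR-TRT API; the function names are only for illustration):

    import numpy as np

    # DPS (destination-passing style): the caller must know the output shape,
    # preallocate the buffer, and pass it in for the callee to fill.
    def add_dps(a, b, out):
        np.add(a, b, out=out)

    # Non-DPS: the callee allocates and returns its own output, so the caller
    # never needs to know the output shape in advance. This is what makes
    # data-dependent output shapes practical.
    def add_non_dps(a, b):
        return a + b

    a, b = np.ones(4), np.ones(4)
    out = np.empty_like(a)   # DPS: shape must be known up front
    add_dps(a, b, out)
    result = add_non_dps(a, b)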

The underlying compiler stack has been enhanced to avoid allocating oversized buffers and eliminate an extra device-to-device copy operation from TensorRT-allocated memory to MLIR-TRT managed memory.

Additionally, we've improved the copy operation to support copying to host memory. This removes the need to track output device allocations for device-to-host copies: previously, copy outputs had to be device allocations, whereas now they can be allocated on either the device or the host.
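In rough pseudocode (the copy/device helpers here are assumptions for illustration, not the actual API), the difference is:

    # Before: the destination of a copy had to be a device allocation, so a
    # device-to-host transfer required tracking a separate device output buffer.
    # After: the destination can be allocated directly on the host or the device.
    host_out = copy(gpu_tensor, device("cpu"))   # output allocated on the host
    dev_out = copy(gpu_tensor, device("gpu"))    # output allocated on the device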

Tests have been updated to align with the new calling convention, ensuring compatibility and correctness.

@jhalakpatel changed the title from "Update MLIR-TRT execution function to use non-DPS style calling convention" to "Overhaul backend function execution for improved performance and flexibility" on Nov 4, 2024
@jhalakpatel marked this pull request as ready for review on November 4, 2024 22:03
@jhalakpatel force-pushed the jhalakp-use-non-dps-exec-func branch 2 times, most recently from ceeeaf6 to 19e841d on November 4, 2024 22:13
@@ -188,7 +188,7 @@ def eval(self) -> runtime.MemRefValue:
executor = Executor(executable)
# Upon computing the value of this tensor, we switch it to have a `Storage`
# parameter so that it does not need to be computed again.
- data = executor.execute([out.device for out in flat_ir.outputs])
+ data = executor.execute()
executor.stream.synchronize()
assert len(data) == 1, "Expects only one output from mlir_tensorrt.compiler executor"
data = data[0]
Collaborator commented:
Could we also remove the hack a few lines below?

self.trace_tensor.device = flat_ir.outputs[0].device

@jhalakpatel (Author) replied Nov 5, 2024:
There are still a few issues:

  1. MLIR-TensorRT still allocates inputs only on the device.
  2. This PR only fixes the copy operation.

Because inputs are always allocated on the device, there can be a device mismatch between the inputs (always on device) and the outputs (which can now live on either the host or the device):

    def infer_devices(self):
        """
        Infers devices for the operation and updates output tensor devices accordingly.
        """
        assert (
>           self.inputs and len(self.outputs) == 1 and all(inp.device == self.inputs[0].device for inp in self.inputs)
        ), "Default implementation cannot handle cases where there are no inputs, multiple outputs, or multiple inputs with different devices. Please override."
E       AssertionError: Default implementation cannot handle cases where there are no inputs, multiple outputs, or multiple inputs with different devices. Please override.
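As an illustration only (not code from this PR; the class and attribute names are assumptions), an op whose output may live on the host could override the default device inference instead of relying on the shared-device assertion above:

    class Copy(BaseTraceOp):
        def infer_devices(self):
            # The output lives wherever the copy targets (host or device),
            # independent of the inputs, which today are always device-resident,
            # so the default single-device assertion does not apply here.
            self.outputs[0].device = self.target_device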

@jhalakpatel force-pushed the jhalakp-use-non-dps-exec-func branch 2 times, most recently from 5247906 to 1914775 on November 6, 2024 06:03
@jhalakpatel force-pushed the jhalakp-use-non-dps-exec-func branch 2 times, most recently from a44e199 to 4986057 on November 9, 2024 00:13
Overhaul backend function execution for improved performance and flexibility

Other changes:
Fix type constraints tests
Address review comments
@@ -186,10 +186,11 @@ def eval(self) -> runtime.MemRefValue:

compiler = Compiler(trt_builder_opt_level=0)
executable = compiler.compile(mlir, flat_ir=flat_ir)
# Ensure that session and client are available as long as tensor lives.
@jhalakpatel (Author) commented:
Remove this comment.
