Overhaul backend function execution for improved performance and flexibility #270
base: main
Conversation
Force-pushed ba2dd98 to b745b6a
Force-pushed b745b6a to 780e18b
Force-pushed ceeeaf6 to 19e841d
@@ -188,7 +188,7 @@ def eval(self) -> runtime.MemRefValue:
     executor = Executor(executable)
     # Upon computing the value of this tensor, we switch it to have a `Storage`
     # parameter so that it does not need to be computed again.
-    data = executor.execute([out.device for out in flat_ir.outputs])
+    data = executor.execute()
     executor.stream.synchronize()
     assert len(data) == 1, "Expects only one output from mlir_tensorrt.compiler executor"
     data = data[0]
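The hunk above removes the output-device list from `execute()`: under the new non-DPS convention, the executor allocates and returns its own outputs instead of filling buffers the caller planned for. The before/after call shapes can be sketched with toy classes (all names here are invented stand-ins, not the real `mlir_tensorrt.compiler` API):

```python
# Hypothetical stand-ins illustrating the calling-convention change;
# these are NOT the real mlir_tensorrt.compiler classes.

class DPSExecutor:
    """DPS style: the caller decides where outputs live; executor fills them."""
    def execute(self, output_devices):
        # The caller had to know output count and placement up front.
        return [f"buffer on {dev}" for dev in output_devices]

class NonDPSExecutor:
    """Non-DPS style: the executor allocates outputs itself and returns them."""
    def execute(self):
        # Output placement is decided inside; the caller just receives results.
        return ["buffer on gpu:0"]

old_data = DPSExecutor().execute(["gpu:0"])  # pre-change call shape
new_data = NonDPSExecutor().execute()        # post-change call shape
assert len(new_data) == 1  # mirrors the assert in the hunk above
```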
Could we also remove the hack a few lines below?
self.trace_tensor.device = flat_ir.outputs[0].device
There are still a few issues:
- MLIR-TensorRT still allocates inputs only on the device.
- This PR fixes only the copy operation.
Since inputs are always allocated on device, there can now be a device mismatch between the inputs (always on device) and the outputs (which may now live on either host or device).
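The mismatch described above amounts to a broken same-device invariant: device-resident inputs can now feed an op whose copy output is host-resident. A toy check (names invented for illustration):

```python
# Toy invariant check: inputs are always device-resident, but a copy
# output may now be host-resident, breaking ops that assume all of
# their operands share a single device.

def all_same_device(tensors):
    """Return True if every tensor in the list reports the same device."""
    devices = {t["device"] for t in tensors}
    return len(devices) <= 1

inputs = [{"device": "gpu:0"}, {"device": "gpu:0"}]  # always on device
host_copy_output = {"device": "cpu"}                 # now possible after this PR

assert all_same_device(inputs)
assert not all_same_device(inputs + [host_copy_output])
```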
    def infer_devices(self):
        """
        Infers devices for the operation and updates output tensor devices accordingly.
        """
        assert (
>           self.inputs and len(self.outputs) == 1 and all(inp.device == self.inputs[0].device for inp in self.inputs)
        ), "Default implementation cannot handle cases where there are no inputs, multiple outputs, or multiple inputs with different devices. Please override."
E       AssertionError: Default implementation cannot handle cases where there are no inputs, multiple outputs, or multiple inputs with different devices. Please override.
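The failure above is exactly the mixed-device case: once copy outputs may live on host, an op can receive inputs on different devices and the default `infer_devices` refuses to pick one. A minimal reproduction with simplified stand-in classes (not the real Tripy/FlatIR types):

```python
# Minimal stand-ins, invented for illustration only.
class Tensor:
    def __init__(self, device):
        self.device = device

class Op:
    def __init__(self, inputs, outputs):
        self.inputs, self.outputs = inputs, outputs

    def infer_devices(self):
        assert (
            self.inputs
            and len(self.outputs) == 1
            and all(inp.device == self.inputs[0].device for inp in self.inputs)
        ), "Default implementation cannot handle mixed-device inputs. Please override."
        # Default behavior: the single output inherits the shared input device.
        self.outputs[0].device = self.inputs[0].device

# Matching input devices: the default works and propagates the device.
ok = Op([Tensor("gpu:0"), Tensor("gpu:0")], [Tensor(None)])
ok.infer_devices()
assert ok.outputs[0].device == "gpu:0"

# Mixed devices (e.g. a device input alongside a host-copy result):
# the default raises, so such ops must override infer_devices().
mixed = Op([Tensor("gpu:0"), Tensor("cpu")], [Tensor(None)])
try:
    mixed.infer_devices()
    raised = False
except AssertionError:
    raised = True
assert raised
```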
Force-pushed 5247906 to 1914775
Force-pushed a44e199 to 4986057
Commit message: Overhaul backend function execution for improved performance and flexibility

This PR replaces the DPS-style calling convention with a non-DPS approach, eliminating the requirement for call sites to preallocate output buffers. This lets us skip computing output shapes and allocating output buffers in advance, laying the groundwork for data-dependent shapes where network outputs can have dynamic dimensions.

The underlying compiler stack has been enhanced to avoid allocating oversized buffers and to eliminate an extra device-to-device copy from TensorRT-allocated memory to MLIR-TRT-managed memory.

Additionally, the copy operation now supports copying to host memory, which removes the need to track output device allocations for device-to-host copies. Previously, copy outputs were restricted to device allocations; now they can be allocated on either device or host.

Tests have been updated to align with the new calling convention, ensuring compatibility and correctness.

Other changes:
- Fix type constraints tests
- Address review comments
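The motivation for dropping DPS is clearest with data-dependent shapes: when an output's size depends on the data itself, the caller cannot size an output buffer before running the computation, so the callee must allocate. A tiny illustration (names invented for the example):

```python
# Toy illustration of why a non-DPS (callee-allocates) convention helps
# with data-dependent shapes. Not related to any real TensorRT API.

def nonzero_indices(data):
    """Return the indices of non-zero elements.

    The output length depends on the *values* in `data`, so a caller
    cannot preallocate a correctly sized output buffer up front (the
    DPS requirement); the callee allocates and returns it instead.
    """
    return [i for i, x in enumerate(data) if x != 0]

result = nonzero_indices([0, 3, 0, 5])
assert result == [1, 3]
```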
Force-pushed 4986057 to 383c182
@@ -186,10 +186,11 @@ def eval(self) -> runtime.MemRefValue:

     compiler = Compiler(trt_builder_opt_level=0)
     executable = compiler.compile(mlir, flat_ir=flat_ir)
+    # Ensure that session and client are available as long as tensor lives.
Remove this comment.