Commit

Merge remote-tracking branch 'origin/main' into sam2-image
parthchadha committed Nov 12, 2024
2 parents cdc57d1 + 34048f9 commit 474b03d
Showing 22 changed files with 346 additions and 273 deletions.
49 changes: 49 additions & 0 deletions .github/workflows/post-merge-package-index-update.yml
@@ -0,0 +1,49 @@
name: Post-merge package index update

on:
  push:
    branches: [ "main" ]
    paths: ['tripy/docs/packages.html']

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  publish-package-index:
    runs-on: tripy-self-hosted
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    container:
      image: ghcr.io/nvidia/tensorrt-incubator/tripy:latest
      volumes:
        - ${{ github.workspace }}/tripy:/tripy
      options: --gpus all
    steps:
      - uses: actions/checkout@v4

      - name: build-docs
        run: |
          cd /tripy/
          python3 docs/generate_rsts.py
          sphinx-build build/doc_sources build/docs -c docs/ -j 4 -W -n
          cp docs/packages.html build/docs/
      - uses: actions/configure-pages@v5

      - uses: actions/upload-pages-artifact@v3
        with:
          path: "/tripy/build/docs"

      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
73 changes: 41 additions & 32 deletions tripy/README.md
@@ -2,7 +2,7 @@
# Tripy: A Python Programming Model For TensorRT

<!-- Tripy: DOC: OMIT Start -->
[**Installation**](#installation) | [**Quickstart**](#quickstart) | [**Documentation**](https://nvidia.github.io/TensorRT-Incubator/) | [**Examples**](./examples) | [**Contributing**](./CONTRIBUTING.md)
[**Installation**](#installation) | [**Getting Started**](#getting-started) | [**Documentation**](https://nvidia.github.io/TensorRT-Incubator/) | [**Examples**](./examples) | [**Contributing**](./CONTRIBUTING.md)

[![Tripy L1](https://github.com/NVIDIA/TensorRT-Incubator/actions/workflows/tripy-l1.yml/badge.svg)](https://github.com/NVIDIA/TensorRT-Incubator/actions/workflows/tripy-l1.yml)
<!-- Tripy: DOC: OMIT End -->
@@ -66,39 +66,48 @@ To get the latest changes in the repository, you can build Tripy wheels from source

<!-- Tripy: DOC: OMIT End -->

## Quickstart
## Getting Started

In eager mode, Tripy works just like you'd expect:
```py
# doc: no-print-locals
import tripy as tp
a = tp.Tensor([1.0, 2.0])
print(a + 1)
```
We've included several guides in Tripy to make it easy to get started.
We recommend starting with the
[Introduction To Tripy](https://nvidia.github.io/TensorRT-Incubator/pre0_user_guides/00-introduction-to-tripy.html)
guide.
Tripy can also compile functions to generate efficient machine code for faster execution:
To get an idea of the look and feel of Tripy, let's walk through a short code example.
All of the features used in this example are explained in more detail in the
introduction guide mentioned above.

```py
# doc: no-print-locals
def add(a, b):
    return a + b
# When compiling, we need to specify shape and data type constraints on the inputs:
# a is a 1D dynamic shape tensor of shape (d,), where `d` can range from 1 to 5.
# `[1, 2, 5]` indicates a range from 1 to 5, with optimization for `d = 2`.
a_info = tp.InputInfo(shape=([1, 2, 5],), dtype=tp.float32)
# `b` is a 1D tensor of shape (1,).
b_info = tp.InputInfo((1,), dtype=tp.float32)
compiled_add = tp.compile(add, args=[a_info, b_info])
print(compiled_add(tp.Tensor([1., 2., 3.]), tp.Tensor([3.])))
# Define our model:
class Model(tp.Module):
    def __init__(self):
        self.conv = tp.Conv(in_channels=1, out_channels=1, kernel_dims=[3, 3])

    def __call__(self, x):
        x = self.conv(x)
        x = tp.relu(x)
        return x

# Initialize the model and populate weights:
model = Model()
model.load_state_dict(
    {
        "conv.weight": tp.ones((1, 1, 3, 3)),
        "conv.bias": tp.ones((1,)),
    }
)
inp = tp.ones((1, 1, 4, 4))
# Eager mode:
eager_out = model(inp)
# Compiled mode:
compiled_model = tp.compile(
    model,
    args=[tp.InputInfo(shape=(1, 1, 4, 4), dtype=tp.float32)],
)
compiled_out = compiled_model(inp)
```
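
As a quick sanity check that eager and compiled execution agree, the two outputs can be
compared after converting to NumPy via DLPack, the same conversion used in the
introduction guide; this is a minimal sketch, assuming `cupy` and `numpy` are available:

```py
import cupy as cp
import numpy as np

# Both outputs live on the GPU; DLPack lets cupy view them without a copy,
# and `.get()` brings them to the host as numpy arrays for comparison.
assert np.array_equal(
    cp.from_dlpack(eager_out).get(),
    cp.from_dlpack(compiled_out).get(),
)
```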
For more details, see the
[Introduction To Tripy](https://nvidia.github.io/TensorRT-Incubator/pre0_user_guides/00-introduction-to-tripy.html)
guide.
17 changes: 17 additions & 0 deletions tripy/docs/packages.html
@@ -85,6 +85,23 @@ <h1>Package Index</h1>
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.34/mlir_tensorrt_runtime-0.1.34+cuda12.trt102-cp312-cp312-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.34+cuda12.trt102-cp312-cp312-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.34/mlir_tensorrt_runtime-0.1.34+cuda12.trt102-cp39-cp39-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.34+cuda12.trt102-cp39-cp39-linux_x86_64.whl</a><br>

<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp310-cp310-linux_x86_64.whl">mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp310-cp310-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp311-cp311-linux_x86_64.whl">mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp311-cp311-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp312-cp312-linux_x86_64.whl">mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp312-cp312-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp39-cp39-linux_x86_64.whl">mlir_tensorrt_compiler-0.1.36+cuda12.trt102-cp39-cp39-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp310-cp310-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp310-cp310-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp311-cp311-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp311-cp311-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp312-cp312-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp312-cp312-linux_x86_64.whl</a><br>
<a
href="https://github.com/NVIDIA/TensorRT-Incubator/releases/download/mlir-tensorrt-v0.1.36/mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp39-cp39-linux_x86_64.whl">mlir_tensorrt_runtime-0.1.36+cuda12.trt102-cp39-cp39-linux_x86_64.whl</a><br>
</body>

</html>
142 changes: 72 additions & 70 deletions tripy/docs/pre0_user_guides/00-introduction-to-tripy.md
@@ -7,8 +7,6 @@ It aims to be fast, easy to debug, and provide an easy-to-use Pythonic interface

## Your First Tripy Program

But enough talk; let's see some code:

```py
# doc: no-print-locals
a = tp.arange(5)
@@ -18,54 +16,7 @@ assert np.array_equal(cp.from_dlpack(c).get(), np.arange(5, dtype=np.float32) +
```

This should look familiar if you've used linear algebra or deep learning libraries like
NumPy and PyTorch.


### Lazy Evaluation: Putting Off Work

One important point is that Tripy uses a lazy evaluation model; that is,
no computation is performed until a value is actually needed.

In the example above, that means that `c` will not be evaluated until it is used,
such as when we print its values.

In most cases, this is simply an implementation detail that you will not notice.
One exception to this is when attempting to time code. Consider the following code:

```py
# doc: no-print-locals
import time

start = time.time()
a = tp.arange(5)
b = tp.arange(5)
c = a + b + tp.tanh(a)
end = time.time()

print(f"Time to create 'c': {(end - start) * 1000:.3f} ms.")
```

It looks like Tripy is very fast! While Tripy *execution* is very fast, compiling the program
takes some time. The reason the time is so low relative to what we'd expect for initializing
and running the compiler is that *we're not doing that yet*.

The actual compilation and computation only happens when we evaluate `c`:

```py
# doc: no-print-locals
start = time.time()
print(c)
end = time.time()

print(f"Time to print 'c': {(end - start) * 1000:.3f} ms.")
```

That is why the time to print `c` is so much higher than the time to create it.

If we wanted to time individual parts of the model, we would insert calls to `.eval()`;
for example, adding a `c.eval()` prior to checking the end time would tell us how
long it took to compile and run the subgraph that computes `c`.
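
For instance, a minimal sketch of that pattern (reusing `time` and the `a`, `b`, `c`
computation from the snippet above; `tp` is assumed to be imported as elsewhere in this
guide) might look like:

```py
import time

start = time.time()
a = tp.arange(5)
b = tp.arange(5)
c = a + b + tp.tanh(a)
# Force compilation and execution of the subgraph that computes `c`,
# so the measurement below includes the real work:
c.eval()
end = time.time()

print(f"Time to compile and run 'c': {(end - start) * 1000:.3f} ms.")
```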

NumPy and PyTorch. Hopefully, the code above is self-explanatory, so we won't go into details.

## Organizing Code Using Modules

@@ -77,10 +28,10 @@ For example, we can define a Transformer MLP block like so:

```py
class MLP(tp.Module):
    def __init__(self, embedding_size, dtype=tp.float32):
    def __init__(self, embd_size, dtype=tp.float32):
        super().__init__()
        self.c_fc = tp.Linear(embedding_size, 4 * embedding_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(4 * embedding_size, embedding_size, bias=True, dtype=dtype)
        self.c_fc = tp.Linear(embd_size, 4 * embd_size, bias=True, dtype=dtype)
        self.c_proj = tp.Linear(4 * embd_size, embd_size, bias=True, dtype=dtype)

    def __call__(self, x):
        x = self.c_fc(x)
Expand All @@ -92,14 +43,14 @@ class MLP(tp.Module):
To use it, we just need to construct and call it:

```py
mlp = MLP(embedding_size=2)
# doc: no-print-locals mlp
mlp = MLP(embd_size=2)

inp = tp.iota(shape=(1, 2), dim=1, dtype=tp.float32)
out = mlp(inp)
```


## To `compile` Or Not To `compile`
## Compiling Code

All the code we've seen so far has been using Tripy's eager mode. It is also possible to compile
functions or modules ahead of time, which can result in significantly better performance.
@@ -111,37 +62,88 @@ Let's compile the MLP module we defined above as an example:

```py
# doc: no-print-locals
# When we compile, we need to indicate which parameters to the function should be runtime inputs.
# In this case, MLP takes a single input tensor for which we can specify our desired shape and datatype.
# When we compile, we need to indicate which parameters to the function
# should be runtime inputs. In this case, MLP takes a single input tensor
# for which we can specify our desired shape and datatype.
fast_mlp = tp.compile(mlp, args=[tp.InputInfo(shape=(1, 2), dtype=tp.float32)])
```

It is also possible to compile for a range of possible input shapes.
See {func}`tripy.compile` for details.
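
For instance, a minimal sketch of compiling `mlp` with a dynamic first dimension, using
the `[min, opt, max]` form of `tp.InputInfo` shown earlier (the ranges and names below
are purely illustrative):

```py
# The first dimension may be anything from 1 to 8 at runtime; the compiler
# optimizes for 4. The second dimension stays fixed at 2.
dynamic_mlp = tp.compile(
    mlp,
    args=[tp.InputInfo(shape=([1, 4, 8], 2), dtype=tp.float32)],
)

# The same compiled executable now accepts different batch sizes:
out_small = dynamic_mlp(tp.ones((1, 2)))
out_large = dynamic_mlp(tp.ones((8, 2)))
```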

Now let's benchmark the compiled version against eager mode:
```py
# doc: no-print-locals
import time

start = time.time()
out = mlp(inp)
out.eval() # Recall that we need to evaluate in order to actually materialize `out`
# We need to evaluate in order to actually materialize `out`.
# See the section on lazy evaluation below for details.
out.eval()
end = time.time()

eager_time = (end - start) * 1000
print(f"Eager mode time: {eager_time:.4f} ms")

ITERS = 10
start = time.time()
for _ in range(ITERS):
    out = fast_mlp(inp)
    out.eval()
out = fast_mlp(inp)
out.eval()
end = time.time()

compiled_time = ((end - start) / ITERS) * 1000
print(f"Compiled mode average time: {compiled_time:.4f} ms")
compiled_time = (end - start) * 1000
print(f"Compiled mode time: {compiled_time:.4f} ms")
# Make sure compiled mode is actually faster # doc: omit
assert compiled_time < 0.01 * eager_time # doc: omit
```

As you can see, the compiled module is significantly faster than running the module
in eager mode.
For more information on the compiler, compiled functions/modules, and dynamic shapes,
see the [compiler guide](project:./02-compiler.md).

## Things To Note

### Eager Mode: How Does It Work?

If you've used TensorRT before, you may know that it does not support an eager mode.
In order to provide eager mode support in Tripy, we actually need to compile the graph
under the hood.

Although we employ several tricks to make compile times faster when using eager mode,
we do still need to compile, and so eager mode will likely be slower than other
comparable frameworks.

Consequently, we suggest that you use eager mode primarily for debugging and
compiled mode for deployments.

### Lazy Evaluation: Putting Off Work

One important point is that Tripy uses a lazy evaluation model; that is,
no computation is performed until a value is actually needed.

In most cases, this is simply an implementation detail that you will not notice.
One exception to this is when attempting to time code. Consider the following code:

```py
# doc: no-print-locals
import time

start = time.time()
a = tp.arange(5)
b = tp.arange(5)
c = a + b + tp.tanh(a)
end = time.time()

print(f"Time to create 'c': {(end - start) * 1000:.3f} ms.")
```

Given what we said above about eager mode, it seems like Tripy is very fast!
Of course, this is because *we haven't actually done anything yet*.
The actual compilation and execution only happens when we evaluate `c`:

```py
# doc: no-print-locals
start = time.time()
print(c)
end = time.time()

print(f"Time to print 'c': {(end - start) * 1000:.3f} ms.")
```

That is why the time to print `c` is so much higher than the time to create it.
