docs: update docs
dgcnz committed Oct 31, 2024
1 parent fa1535a commit 2d311bf
Showing 7 changed files with 156 additions and 46 deletions.
Binary file added docs/src/part2/arch.png
23 changes: 18 additions & 5 deletions docs/src/part2/choosing.md
@@ -2,11 +2,17 @@

Our task in this chapter is to choose a candidate architecture that allows us to use a pre-trained vision foundation model as its backbone's feature extractor.


## State of the Art

A brief glimpse into the literature gives us some promising picks but also some fundamental questions. The first thing we find is that there is no clear winner between CNN-based and ViT-based models, especially when we factor latency/efficiency into the equation. Neither family has a clear best architectural variant (e.g., vanilla ViT vs. EVA's TrV, ResNet vs. ResNeXt), and sometimes the backbone's architecture itself is modified to better suit the task at hand (e.g., Swin is a ViT with hierarchical features, useful for dense prediction tasks). Furthermore, some backbones are finetuned on task-specific datasets, which improves task-specific performance at the expense of generality.

:::{tip}
More generally, the pretraining objective also matters: {cite}`park2023` shows that contrastive learning favors image classification, while masked image modelling favors dense prediction (object detection).
:::


{numref}`Table {number} <sota>` categorizes these model choices and summarizes their performance on the COCO dataset. However, as described before, these comparisons are often not fair, and a comprehensive evaluation would have to be done to determine the best backbone across all main tasks, and thus the best candidate as a vision foundation model. This question was tackled by {cite}`botb` last year (2023), but its results are already outdated, as the most popular ViT-based foundation models (DINOv2 {cite}`dinov2`, EVA-02 {cite}`eva02`) were released afterwards. In any case, we want a model that is meant for general use, which narrows down the search.


```{table} State of the Art of Object Detection models
@@ -39,9 +45,16 @@
```

## Final decision

To finally arrive at a decision, it is useful to think back to the original motivation for using VFMs: to leverage the knowledge acquired by a model pre-trained with extensive data and compute. To keep ourselves future-proof, we chose **DINOv2** {cite}`dinov2` as the backbone, as it has the most support from the community and from the authors at Meta. With the same reasoning, we chose the **ViTDet** {cite}`vitdet` adapter, which allows us to use almost any decoder head. To stay at the state of the art, we chose the **DINO** {cite}`dinodetr` decoder.

:::{figure-md} arch
<img src="arch.png" alt="arch">

Model architecture: DINOv2 backbone, ViTDet adapter, DINO decoder.
:::
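
To make the composition concrete, the sketch below shows how the three pieces fit together at the module level. It is a conceptual outline only; the class and argument names are invented for illustration and are not taken from the actual detrex config.

```python
import torch.nn as nn


class DetectorSketch(nn.Module):
    """Conceptual wiring of the chosen architecture (illustrative only)."""

    def __init__(self, backbone: nn.Module, adapter: nn.Module, decoder: nn.Module):
        super().__init__()
        self.backbone = backbone  # DINOv2 ViT: image -> single-scale patch tokens
        self.adapter = adapter    # ViTDet: single-scale tokens -> multi-scale feature pyramid
        self.decoder = decoder    # DINO: multi-scale features -> boxes and classes

    def forward(self, images):
        tokens = self.backbone(images)
        pyramid = self.adapter(tokens)
        return self.decoder(pyramid)
```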



[^1]: With TensorRT FP16.
82 changes: 74 additions & 8 deletions docs/src/part2/training.md
@@ -1,3 +1,18 @@
---
jupytext:
formats: md:myst
text_representation:
extension: .md
format_name: myst
kernelspec:
display_name: Python 3
language: python
name: python3
mystnb:
execution_mode: force
---


# Training the Decoder

```{contents}
```

@@ -7,7 +22,7 @@

Now we have a working model with a pre-trained backbone, but we still need to train the decoder.

## Testing the Training Script

To test the training script locally with a single 16GB GPU, we can do a couple of things: reduce the batch size, use a smaller model, and enable mixed precision training:

```bash
WANDB_MODE=offline python -m scripts.train_net --num-gpus=1 \
@@ -32,14 +47,65 @@
```

## Training Setup

The full training recipe can be found at `projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py`, which is mostly based on the original recipe for ViT + ViTDet + DINO found at `detrex/projects/dino/configs/dino-vitdet/dino_vitdet_base_4scale_12ep.py`. If you want to create a training recipe for 50 epochs or use a larger DINOv2 variant, you can find appropriate recipes in that same folder.

As an example, we can check the optimizer and learning rate scheduler configuration for our recipe.

```{code-cell} python
:tags: [remove-cell]
import sys; from pathlib import Path
__DIRS = list(Path().cwd().resolve().parents) + [Path().cwd().resolve()]
WDIR = next(p for p in __DIRS if (p / ".project-root").exists())
sys.path.append(str(WDIR))
%cd {WDIR}
```

```{code-cell} python
:tags: [hide-cell, remove-output]
import detectron2
from detectron2.config import LazyConfig, instantiate, LazyCall
from omegaconf import OmegaConf
```

```{code-cell} python
:tags: [remove-output]
cfg = LazyConfig.load("projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py")
```
```{code-cell} python
print(OmegaConf.to_yaml(cfg["optimizer"]))
```

```{code-cell} python
print(OmegaConf.to_yaml(cfg["lr_multiplier"]["scheduler"]))
```

We can observe that this model is trained with AdamW at a constant learning rate of `1e-4` for the first 11 epochs, which then decays to `1e-5` for the last epoch, where each epoch is `7500` steps.
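
For reference, a schedule like this can be written with the detectron2/fvcore param schedulers roughly as follows. This is a sketch assuming the detrex convention of 7500 iterations per epoch; the exact values (including the warmup length) live in the recipe and in `detrex/detrex/config/configs/common/coco_schedule.py`.

```python
from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler
from fvcore.common.param_scheduler import MultiStepParamScheduler

ITERS_PER_EPOCH = 7500                 # COCO at the reference batch size
total_iters = 12 * ITERS_PER_EPOCH

# Multiplier applied to the base lr of 1e-4: 1.0 for the first 11 epochs,
# then 0.1 (i.e. 1e-5) for the final epoch.
lr_multiplier = L(WarmupParamScheduler)(
    scheduler=L(MultiStepParamScheduler)(
        values=[1.0, 0.1],
        milestones=[11 * ITERS_PER_EPOCH, total_iters],
    ),
    warmup_length=250 / total_iters,   # assumed short linear warmup
    warmup_factor=0.001,
)
```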


The final training command is thus:

```sh
python -m scripts.train_net \
--config-file=projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py \
--num-gpus=4 \
train.amp.enabled=False
```

You can activate automatic mixed precision training by setting `train.amp.enabled=True`.
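
Under the hood, enabling that flag makes the trainer run the forward pass inside an autocast region and scale the loss before backpropagation (detectron2's `AMPTrainer` takes care of this). A minimal hand-rolled equivalent, shown only to illustrate what the flag does, looks roughly like this:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    # run the forward pass in fp16 where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss_dict = model(batch)          # detection models return a dict of losses
        loss = sum(loss_dict.values())
    # scale the loss to avoid fp16 gradient underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```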

## Training Results

{numref}`boxap` and {numref}`loss` show the validation BoxAP and the training loss over 12 epochs, respectively.

TODO:
- Mention the little bump at the end from the learning rate scheduler (2eps)
- Mention that the model is not saturated

::::{grid} 2
:::{grid-item-card}
@@ -62,11 +128,11 @@
## Predicting performance at 50 epochs

TODO:
- Mention that the model is trained for 12eps and 50eps, but the 50ep run is the one used in evaluations
- Let's fit some curves and forecast performance at 50eps (see the sketch below)
- Mention the little accuracy increase at the last 10eps of the training
- Mention that the normal ViT config can be used as reference: detrex/projects/dino/configs/dino-vitdet/dino_vitdet_base_4scale_50ep.py
- lr scheduler information can be found at: detrex/detrex/config/configs/common/coco_schedule.py
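
As a starting point for that forecast, one can fit a saturating curve to the per-epoch validation BoxAP of the 12-epoch run and extrapolate it to 50 epochs. The sketch below uses scipy's `curve_fit`; the `ap_observed` values are synthetic placeholders and must be replaced with the BoxAP numbers logged in wandb.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # simple exponential-saturation model: AP approaches `a` as epochs grow
    return a - b * np.exp(-c * x)

epochs = np.arange(1, 13)
# synthetic demo values only; replace with the real per-epoch validation BoxAP
ap_observed = saturating(epochs, 52.0, 35.0, 0.35) + np.random.default_rng(0).normal(0, 0.2, epochs.size)

params, _ = curve_fit(saturating, epochs, ap_observed, p0=(50.0, 30.0, 0.3), maxfev=10_000)
print("forecast BoxAP at 50 epochs:", saturating(50, *params))
```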

::::{grid} 2
:::{grid-item-card}
45 changes: 38 additions & 7 deletions docs/src/part3/compilation.ipynb
@@ -725,7 +725,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The working script to export and compile our model with the TensorRT backend is `scripts.export_tensorrt`.\n",
"Before compilation, download the trained model weights from HuggingFace and place them on `artifacts/model_final.pth` or configure the path in the config file. To download the weights, run the following command:\n",
"\n",
"```sh\n",
"!wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth ⁠\n",
"```\n",
"\n",
"The main script to compile our model with the TensorRT backend is `scripts.export_tensorrt`.\n",
"\n",
"The easiest way to specify a compilation target, is by adding a config file at `scripts/config/export_tensorrt`. For example, if we want to compile our model's, we can use the config file located at `scripts/config/export_tensorrt/dinov2.yaml` as follows:\n",
"\n",
@@ -800,11 +806,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this script is a useful entrypoint, the challenge when compiling a model lies in making the models' source code compatible with both TorchDynamo and the backend of choice (TensorRT in this case). This is a bit harder to explain because during the debugging procedure, you'll attempt many possible fixes that are informed by insights of the codebase's state at that time, many of which will be deemed unsuccessful or unnecessary. For example, you might find a way to solve a bug which will itself be fixed by another more important bug. Furthermore, one bug might appear/disappear with newer versions of the libararies. \n",
"Although this script is a useful entrypoint, the challenge when compiling a model lies in making the models' source code compatible with both TorchDynamo and the backend of choice (TensorRT in this case). This is a bit harder to explain because during the debugging procedure, you'll attempt many possible fixes that are informed by insights of the codebase's state at that time, many of which will be deemed unsuccessful or unnecessary. For example, you might find a way to solve a bug which will itself be fixed by another more important bug. Furthermore, one bug might appear/disappear with newer versions of the libraries. \n",
"\n",
"Because of this, I'll cover two apparently similar but very different case studies and share some of the relevant insights and tricks in the following two sections:\n",
"1. DinoV2 + ViTDet + DINO: Successful compilation, minimal final rewrites.\n",
"2. ViT + ViTDet + Cascade Mask RCNN: Almost successful, many final rewrites."
"2. ViT + ViTDet + Cascade Mask RCNN: Almost successful, many final rewrites.\n",
"\n",
"To follow the thought process in a single notebook, I've added flags throughout the model's code to activate or deactivate the most important fixes. To see *all* the changes, you can check all the differences between my forks of `detectron2`, `detrex` and the original repositories."
]
},
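{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, the core of what `scripts.export_tensorrt` drives is roughly the following. This is a minimal sketch under the assumption of an already-wrapped, traceable `nn.Module` and fixed input shapes; `MyDetector` is a placeholder, and the real script also takes care of the detectron2-specific wrapping and the config handling.\n",
"\n",
"```python\n",
"import torch\n",
"import torch_tensorrt\n",
"\n",
"model = MyDetector().eval().cuda()  # placeholder for the wrapped detector\n",
"inputs = [torch.randn(1, 3, 1024, 1024, device=\"cuda\")]\n",
"\n",
"# export to an ATen-level graph, then lower it with the TensorRT backend\n",
"exported = torch.export.export(model, tuple(inputs))\n",
"trt_model = torch_tensorrt.dynamo.compile(\n",
"    exported,\n",
"    inputs=inputs,\n",
"    enabled_precisions={torch.float32, torch.half},\n",
"    debug=False,  # True prints unsupported nodes and partitioning info\n",
")\n",
"```"
]
},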
{
@@ -1109,6 +1117,23 @@
"#### [✅] Rewriting code for non-tensor constants"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second solution is to rewrite the code to keep `spatial_shapes` as a list of tuples. This works because PyTorch automatically considers lists and integers as constants. \n",
"\n",
"The disadvantages of this approach are:\n",
"- It's a bit more intrusive and error-prone.\n",
"- We will have to disable the deformable attention cuda kernel because it expects a tensor `spatial_shapes`. Maybe the kernel could be rewritten, but TensorRT is already good enough at optimizing the python implementation.\n",
"\n",
"The advantages are:\n",
"- It's more robust in comparison with the first solution. We don't have to rewrite PyTorch's source code nor wait until they fix the issue.\n",
"\n",
"\n",
"We can test this, by setting `model.transformer.specialize_with_list`:"
]
},
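{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of why the list form helps (a toy sketch, not the actual detrex code; the function and shapes are made up): when `spatial_shapes` is a list of python integer tuples, the compiler burns those values into the graph as constants, so the split sizes below are static.\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"@torch.compile(fullgraph=True, dynamic=False)\n",
"def pool_levels(x: torch.Tensor, spatial_shapes: list[tuple[int, int]]) -> torch.Tensor:\n",
"    # (h, w) are python ints, so they are specialized as compile-time constants\n",
"    # and the split below has static output shapes\n",
"    splits = [h * w for h, w in spatial_shapes]\n",
"    levels = x.split(splits, dim=1)\n",
"    return torch.stack([lvl.mean(dim=1) for lvl in levels], dim=1)\n",
"\n",
"x = torch.randn(1, 4 * 4 + 2 * 2, 256)\n",
"out = pool_levels(x, [(4, 4), (2, 2)])  # passing a tensor here would add graph inputs instead\n",
"```"
]
},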
{
"cell_type": "code",
"execution_count": 17,
@@ -1674,13 +1699,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This new error is tricky, but we can pinpoint its location by looking at the name of the node: `ForeignNode[model.backbone.net.blocks.0.norm1/native_layer_norm_weight...]`. After cross-referencing the operators we see in the node with the source code, we find out that the culprit is the window attention module. We can disable it and use only global attention to bypass this error."
"This new error states that TensorRT can't find an implementation for a fused node. I'm unsure as to why this happens, but we can fix it by rewriting the code. To pinpoint the source location we can look at the name of the node: `ForeignNode[model.backbone.net.blocks.0.norm1/native_layer_norm_weight...]` and cross-reference the operators we see with the source code. For example, we know that there's unsupported code in the `detectron2.VisionTransformer` blocks because that's the class of `model.backbone.net.blocks[i]`. \n",
"\n",
"Specifically, the culprit here is the usage of window attention. We can disable it and use only global attention to bypass this error and try to compile again."
]
},
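{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small trick that helps with this cross-referencing (a sketch; the exact prefix depends on how the model is wrapped, and `model` is assumed to be the detector instantiated from the config) is to map the node name back to the owning module:\n",
"\n",
"```python\n",
"# look up the module that owns the prefix seen in the fused node's name\n",
"modules = dict(model.named_modules())\n",
"print(type(modules[\"backbone.net.blocks.0\"]))        # the ViT block whose code we need to inspect\n",
"print(type(modules[\"backbone.net.blocks.0.norm1\"]))  # its first LayerNorm\n",
"```"
]
},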
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"metadata": {
"tags": [
"remove-output"
]
},
"outputs": [],
"source": [
"cfg = LazyConfig.load(\"detrex/detectron2/projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_b_100ep.py\")\n",
@@ -1748,13 +1779,13 @@
"\n",
"This is where we stop. This framework-specific bugs are hard to debug and fix as they often are bugs in the compiler itself. In my experience with the previous case study, these bugs fixed themselves by rewriting the model in order to avoid graph partitioning alltogether. We can obtain the unsupported nodes by feeding `debug=True` to `torch_tensorrt.dynamo.compile`.\n",
"\n",
"For this model, the unsupported nodes after the non-maximum-suppresion rewrites are:\n",
"For this model, the unsupported nodes after the removing the filtering steps (non-maximum-suppresion, etc) are:\n",
"- `torch.ops.aten.nonzero.default`\n",
"- `torch.ops.aten.index.Tensor`\n",
"- `torch.ops.torchvision.roi_align.default`\n",
"- `torch.ops.aten.index_put.default`\n",
"\n",
"However, we've already rewritten essential parts of the model and my guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same the same shape as that of the global attention, so the pre-trained model likely already needs finetuning."
"However, we've already rewritten essential parts of the model and my guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same the same shape as that of the global attention, so the pre-trained model likely already needs finetuning or might not even work anymore."
]
},
{
41 changes: 25 additions & 16 deletions docs/src/part3/results.md
@@ -1,9 +1,13 @@
# Benchmarks and Results

## Running the benchmarks

Download the model:
```bash
wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth
```

Before running the benchmarks, make sure you have compiled your desired model.
```bash
python -m scripts.export_tensorrt --config-name dinov2 amp_dtype=fp32 trt.enabled_precisions="[fp32, bf16, fp16]"
# ...
```

@@ -41,48 +45,53 @@

## Results


**Python Runtime, no TensorRT**

| model's precision | amp_dtype | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32 | fp32+fp16 | 66.322 ± 0.927 |
| fp32 | fp32+bf16 | 66.497 ± 1.052 |
| fp32 | fp32 | 76.275 ± 0.587 |

Max memory usage for all configurations is ~1GB.

**Python Runtime, with TensorRT**

| model's precision | trt.enabled_precisions | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32+fp16 | fp32+bf16+fp16 | 15.369 ± 0.023 |
| fp32 | fp32+bf16+fp16 | 23.164 ± 0.031 |
| fp32 | fp32+bf16 | 25.148 ± 0.030 |
| fp32 | fp32 | 38.381 ± 0.022 |

Max memory usage for all configurations is ~500MB except for fp32+fp32 which is ~770MB.

**C++ Runtime, with TensorRT**

| model's precision | trt.enabled_precisions | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32+fp16 | fp32+bf16+fp16 | 15.433 ± 0.029 |
| fp32 | fp32+bf16+fp16 | 23.263 ± 0.027 |
| fp32 | fp32+bf16 | 25.255 ± 0.014 |
| fp32 | fp32 | 38.465 ± 0.029 |


Max memory usage for all configurations is ~500MB except for fp32+fp32 which is ~770MB.

---

Note: For some reason, with the latest version of torch_tensorrt, `bfloat16` precision is not working well: it does not reach the previously measured performance (13-14 ms) and/or compilation fails.

We include the previous results for completeness, in case the issue is resolved in the future.

| Runtime | model's precision | trt.enabled_precisions | latency (ms) | memory (MB) |
| ------- | ----------------- | ---------------------- | ------------ | ----------- |
| cpp+trt | fp32 | fp32+fp16 | 13.984 | 500 |
| cpp+trt | fp32 | fp32+bf16+fp16 | 13.898 | 500 |
| cpp+trt | fp32 | fp32+bf16 | 17.261 | 500 |
| cpp+trt | bf16 | fp32+bf16 | 22.913 | 500 |
| cpp+trt | bf16 | bf16 | 22.938 | 500 |
| cpp+trt | fp32 | fp32 | 37.639 | 770 |


Binary file removed docs/src/simple_net.pt2
11 changes: 1 addition & 10 deletions projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py
@@ -1,6 +1,5 @@
from detrex.config import get_config
from ..models.dino_dinov2 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
@@ -54,12 +53,4 @@
dataloader.evaluator.output_dir = train.output_dir

# logger
train.wandb.enabled=True
