docs: update docs
dgcnz committed Oct 31, 2024
1 parent fa1535a commit 2d311bf
Showing 7 changed files with 156 additions and 46 deletions.
Binary file added docs/src/part2/arch.png
23 changes: 18 additions & 5 deletions docs/src/part2/choosing.md
@@ -2,11 +2,17 @@

Our task in this chapter is to choose a candidate architecture that allows us to use a pre-trained vision foundation model as its backbone's feature extractor.


## State of the Art

A brief glimpse into the literature gives us some promising picks but also some fundamental questions. The first thing we find is that there is no clear winner between CNN-based and ViT-based models, especially when we factor latency/efficiency into the equation. Neither family has a clear best architectural variant (e.g., vanilla ViT vs. EVA's TrV, ResNet vs. ResNeXt), and sometimes the backbone's architecture itself is modified to better suit the task at hand (e.g., Swin is a ViT with hierarchical features, useful for dense prediction tasks). Furthermore, some backbones are finetuned on task-specific datasets, which improves task-specific performance at the expense of generality.

:::{tip}
More generally, the pretraining objective also matters: {cite}`park2023` shows that contrastive learning favors image classification, while masked image modelling favors dense prediction (object detection).
:::


{numref}`Table {number} <sota>` categorizes these model choices and summarizes their performance on the COCO dataset. However, as described before, these comparisons are often not fair, and a comprehensive evaluation would have to be done to determine the best backbone across all main tasks, and thus the best candidate as a vision foundation model. This question was tackled by {cite}`botb` last year (2023), but its results are already outdated, as the most popular ViT-based foundation models (DINOv2 {cite}`dinov2`, EVA-02 {cite}`eva02`) were released afterwards. In any case, we want a model that is meant for general use, which narrows down the search.


```{table} State of the Art of Object Detection models
@@ -39,9 +45,16 @@
```

## Final decision

To finally arrive at a decision, it is useful to think back to the original motivation for using VFMs: to leverage the knowledge acquired by a model pre-trained with extensive data and compute. To keep ourselves future-proof, we chose **DINOv2** {cite}`dinov2` as the backbone, as it has the most support from the community and from the authors at Meta. With the same reasoning, we chose the **ViTDet** {cite}`vitdet` adapter, which allows us to use almost any decoder head. To stay at the state of the art, we chose the **DINO** {cite}`dinodetr` decoder.

:::{figure-md} arch
<img src="arch.png" alt="arch">

Model architecture: DINOv2 backbone, ViTDet adapter, DINO decoder.
:::
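
To make the composition concrete, the sketch below shows how the three pieces fit together at the module level. It is a conceptual outline only; the class and argument names are invented for illustration and are not taken from the actual detrex config.

```python
import torch.nn as nn


class DetectorSketch(nn.Module):
    """Conceptual wiring of the chosen architecture (illustrative only)."""

    def __init__(self, backbone: nn.Module, adapter: nn.Module, decoder: nn.Module):
        super().__init__()
        self.backbone = backbone  # DINOv2 ViT: image -> single-scale patch tokens
        self.adapter = adapter    # ViTDet: single-scale tokens -> multi-scale feature pyramid
        self.decoder = decoder    # DINO: multi-scale features -> boxes and classes

    def forward(self, images):
        tokens = self.backbone(images)
        pyramid = self.adapter(tokens)
        return self.decoder(pyramid)
```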



[^1]: With TensorRT FP16.
82 changes: 74 additions & 8 deletions docs/src/part2/training.md
@@ -1,3 +1,18 @@
---
jupytext:
formats: md:myst
text_representation:
extension: .md
format_name: myst
kernelspec:
display_name: Python 3
language: python
name: python3
mystnb:
execution_mode: force
---


# Training the Decoder

```{contents}
```

@@ -7,7 +22,7 @@

Now we have a working model with a pre-trained backbone, but we still need to train the decoder.

## Testing the Training Script

To test the training script locally with a single 16GB GPU, we can do a couple of things: reduce the batch size, use a smaller model, and enable mixed precision training:

```bash
WANDB_MODE=offline python -m scripts.train_net --num-gpus=1 \
@@ -32,14 +47,65 @@
```

## Training Setup

The full training recipe can be found at `projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py`, which is mostly based on the original recipe for ViT + ViTDet + DINO found at `detrex/projects/dino/configs/dino-vitdet/dino_vitdet_base_4scale_12ep.py`. If you want to create a training recipe for 50 epochs or use a larger DINOv2 variant, you can find appropriate recipes in that same folder.

As an example, we can check the optimizer and learning rate scheduler configuration for our recipe.

```{code-cell} python
:tags: [remove-cell]
import sys; from pathlib import Path
__DIRS = list(Path().cwd().resolve().parents) + [Path().cwd().resolve()]
WDIR = next(p for p in __DIRS if (p / ".project-root").exists())
sys.path.append(str(WDIR))
%cd {WDIR}
```

```{code-cell} python
:tags: [hide-cell, remove-output]
import detectron2
from detectron2.config import LazyConfig, instantiate, LazyCall
from omegaconf import OmegaConf
```

```{code-cell} python
:tags: [remove-output]
cfg = LazyConfig.load("projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py")
```
```{code-cell} python
print(OmegaConf.to_yaml(cfg["optimizer"]))
```

```{code-cell} python
print(OmegaConf.to_yaml(cfg["lr_multiplier"]["scheduler"]))
```

We can observe that this model is trained with AdamW at a constant learning rate of `1e-4` for the first 11 epochs, which then decays to `1e-5` for the last epoch, where each epoch is `7500` steps.
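
For reference, a schedule like this can be written with the detectron2/fvcore param schedulers roughly as follows. This is a sketch assuming the detrex convention of 7500 iterations per epoch; the exact values (including the warmup length) live in the recipe and in `detrex/detrex/config/configs/common/coco_schedule.py`.

```python
from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler
from fvcore.common.param_scheduler import MultiStepParamScheduler

ITERS_PER_EPOCH = 7500                 # COCO at the reference batch size
total_iters = 12 * ITERS_PER_EPOCH

# Multiplier applied to the base lr of 1e-4: 1.0 for the first 11 epochs,
# then 0.1 (i.e. 1e-5) for the final epoch.
lr_multiplier = L(WarmupParamScheduler)(
    scheduler=L(MultiStepParamScheduler)(
        values=[1.0, 0.1],
        milestones=[11 * ITERS_PER_EPOCH, total_iters],
    ),
    warmup_length=250 / total_iters,   # assumed short linear warmup
    warmup_factor=0.001,
)
```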


The final training command is thus:

```sh
python -m scripts.train_net \
--config-file=projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py \
--num-gpus=4 \
train.amp.enabled=False
```

You can activate automatic mixed precision training by setting `train.amp.enabled=True`.
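
Under the hood, enabling that flag makes the trainer run the forward pass inside an autocast region and scale the loss before backpropagation (detectron2's `AMPTrainer` takes care of this). A minimal hand-rolled equivalent, shown only to illustrate what the flag does, looks roughly like this:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    # run the forward pass in fp16 where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss_dict = model(batch)          # detection models return a dict of losses
        loss = sum(loss_dict.values())
    # scale the loss to avoid fp16 gradient underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```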

## Training Results

{numref}`boxap` and {numref}`loss` show the validation BoxAP and the training loss over 12 epochs, respectively.

TODO:
- Mention the little bump at the end from the learning rate scheduler (2eps)
- Mention that the model is not saturated

::::{grid} 2
:::{grid-item-card}
@@ -62,11 +128,11 @@
## Predicting performance at 50 epochs

TODO:
- Mention that the model is trained for 12eps and 50eps, but the 50ep run is the one used in evaluations
- Let's fit some curves and forecast performance at 50eps (see the sketch below)
- Mention the little accuracy increase at the last 10eps of the training
- Mention that the normal ViT config can be used as reference: detrex/projects/dino/configs/dino-vitdet/dino_vitdet_base_4scale_50ep.py
- lr scheduler information can be found at: detrex/detrex/config/configs/common/coco_schedule.py
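
As a starting point for that forecast, one can fit a saturating curve to the per-epoch validation BoxAP of the 12-epoch run and extrapolate it to 50 epochs. The sketch below uses scipy's `curve_fit`; the `ap_observed` values are synthetic placeholders and must be replaced with the BoxAP numbers logged in wandb.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # simple exponential-saturation model: AP approaches `a` as epochs grow
    return a - b * np.exp(-c * x)

epochs = np.arange(1, 13)
# synthetic demo values only; replace with the real per-epoch validation BoxAP
ap_observed = saturating(epochs, 52.0, 35.0, 0.35) + np.random.default_rng(0).normal(0, 0.2, epochs.size)

params, _ = curve_fit(saturating, epochs, ap_observed, p0=(50.0, 30.0, 0.3), maxfev=10_000)
print("forecast BoxAP at 50 epochs:", saturating(50, *params))
```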

::::{grid} 2
:::{grid-item-card}
45 changes: 38 additions & 7 deletions docs/src/part3/compilation.ipynb
@@ -725,7 +725,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The working script to export and compile our model with the TensorRT backend is `scripts.export_tensorrt`.\n",
"Before compilation, download the trained model weights from HuggingFace and place them on `artifacts/model_final.pth` or configure the path in the config file. To download the weights, run the following command:\n",
"\n",
"```sh\n",
"!wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth ⁠\n",
"```\n",
"\n",
"The main script to compile our model with the TensorRT backend is `scripts.export_tensorrt`.\n",
"\n",
"The easiest way to specify a compilation target, is by adding a config file at `scripts/config/export_tensorrt`. For example, if we want to compile our model's, we can use the config file located at `scripts/config/export_tensorrt/dinov2.yaml` as follows:\n",
"\n",
@@ -800,11 +806,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this script is a useful entrypoint, the challenge when compiling a model lies in making the models' source code compatible with both TorchDynamo and the backend of choice (TensorRT in this case). This is a bit harder to explain because during the debugging procedure, you'll attempt many possible fixes that are informed by insights of the codebase's state at that time, many of which will be deemed unsuccessful or unnecessary. For example, you might find a way to solve a bug which will itself be fixed by another more important bug. Furthermore, one bug might appear/disappear with newer versions of the libararies. \n",
"Although this script is a useful entrypoint, the challenge when compiling a model lies in making the models' source code compatible with both TorchDynamo and the backend of choice (TensorRT in this case). This is a bit harder to explain because during the debugging procedure, you'll attempt many possible fixes that are informed by insights of the codebase's state at that time, many of which will be deemed unsuccessful or unnecessary. For example, you might find a way to solve a bug which will itself be fixed by another more important bug. Furthermore, one bug might appear/disappear with newer versions of the libraries. \n",
"\n",
"Because of this, I'll cover two apparently similar but very different case studies and share some of the relevant insights and tricks in the following two sections:\n",
"1. DinoV2 + ViTDet + DINO: Successful compilation, minimal final rewrites.\n",
"2. ViT + ViTDet + Cascade Mask RCNN: Almost successful, many final rewrites."
"2. ViT + ViTDet + Cascade Mask RCNN: Almost successful, many final rewrites.\n",
"\n",
"To follow the thought process in a single notebook, I've added flags throughout the model's code to activate or deactivate the most important fixes. To see *all* the changes, you can check all the differences between my forks of `detectron2`, `detrex` and the original repositories."
]
},
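{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, the core of what `scripts.export_tensorrt` drives is roughly the following. This is a minimal sketch under the assumption of an already-wrapped, traceable `nn.Module` and fixed input shapes; `MyDetector` is a placeholder, and the real script also takes care of the detectron2-specific wrapping and the config handling.\n",
"\n",
"```python\n",
"import torch\n",
"import torch_tensorrt\n",
"\n",
"model = MyDetector().eval().cuda()  # placeholder for the wrapped detector\n",
"inputs = [torch.randn(1, 3, 1024, 1024, device=\"cuda\")]\n",
"\n",
"# export to an ATen-level graph, then lower it with the TensorRT backend\n",
"exported = torch.export.export(model, tuple(inputs))\n",
"trt_model = torch_tensorrt.dynamo.compile(\n",
"    exported,\n",
"    inputs=inputs,\n",
"    enabled_precisions={torch.float32, torch.half},\n",
"    debug=False,  # True prints unsupported nodes and partitioning info\n",
")\n",
"```"
]
},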
{
@@ -1109,6 +1117,23 @@
"#### [✅] Rewriting code for non-tensor constants"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second solution is to rewrite the code to keep `spatial_shapes` as a list of tuples. This works because PyTorch automatically considers lists and integers as constants. \n",
"\n",
"The disadvantages of this approach are:\n",
"- It's a bit more intrusive and error-prone.\n",
"- We will have to disable the deformable attention cuda kernel because it expects a tensor `spatial_shapes`. Maybe the kernel could be rewritten, but TensorRT is already good enough at optimizing the python implementation.\n",
"\n",
"The advantages are:\n",
"- It's more robust in comparison with the first solution. We don't have to rewrite PyTorch's source code nor wait until they fix the issue.\n",
"\n",
"\n",
"We can test this, by setting `model.transformer.specialize_with_list`:"
]
},
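{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of why the list form helps (a toy sketch, not the actual detrex code; the function and shapes are made up): when `spatial_shapes` is a list of python integer tuples, the compiler burns those values into the graph as constants, so the split sizes below are static.\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"@torch.compile(fullgraph=True, dynamic=False)\n",
"def pool_levels(x: torch.Tensor, spatial_shapes: list[tuple[int, int]]) -> torch.Tensor:\n",
"    # (h, w) are python ints, so they are specialized as compile-time constants\n",
"    # and the split below has static output shapes\n",
"    splits = [h * w for h, w in spatial_shapes]\n",
"    levels = x.split(splits, dim=1)\n",
"    return torch.stack([lvl.mean(dim=1) for lvl in levels], dim=1)\n",
"\n",
"x = torch.randn(1, 4 * 4 + 2 * 2, 256)\n",
"out = pool_levels(x, [(4, 4), (2, 2)])  # passing a tensor here would add graph inputs instead\n",
"```"
]
},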
{
"cell_type": "code",
"execution_count": 17,
@@ -1674,13 +1699,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This new error is tricky, but we can pinpoint its location by looking at the name of the node: `ForeignNode[model.backbone.net.blocks.0.norm1/native_layer_norm_weight...]`. After cross-referencing the operators we see in the node with the source code, we find out that the culprit is the window attention module. We can disable it and use only global attention to bypass this error."
"This new error states that TensorRT can't find an implementation for a fused node. I'm unsure as to why this happens, but we can fix it by rewriting the code. To pinpoint the source location we can look at the name of the node: `ForeignNode[model.backbone.net.blocks.0.norm1/native_layer_norm_weight...]` and cross-reference the operators we see with the source code. For example, we know that there's unsupported code in the `detectron2.VisionTransformer` blocks because that's the class of `model.backbone.net.blocks[i]`. \n",
"\n",
"Specifically, the culprit here is the usage of window attention. We can disable it and use only global attention to bypass this error and try to compile again."
]
},
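{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small trick that helps with this cross-referencing (a sketch; the exact prefix depends on how the model is wrapped, and `model` is assumed to be the detector instantiated from the config) is to map the node name back to the owning module:\n",
"\n",
"```python\n",
"# look up the module that owns the prefix seen in the fused node's name\n",
"modules = dict(model.named_modules())\n",
"print(type(modules[\"backbone.net.blocks.0\"]))        # the ViT block whose code we need to inspect\n",
"print(type(modules[\"backbone.net.blocks.0.norm1\"]))  # its first LayerNorm\n",
"```"
]
},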
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"metadata": {
"tags": [
"remove-output"
]
},
"outputs": [],
"source": [
"cfg = LazyConfig.load(\"detrex/detectron2/projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_b_100ep.py\")\n",
@@ -1748,13 +1779,13 @@
"\n",
"This is where we stop. This framework-specific bugs are hard to debug and fix as they often are bugs in the compiler itself. In my experience with the previous case study, these bugs fixed themselves by rewriting the model in order to avoid graph partitioning alltogether. We can obtain the unsupported nodes by feeding `debug=True` to `torch_tensorrt.dynamo.compile`.\n",
"\n",
"For this model, the unsupported nodes after the non-maximum-suppresion rewrites are:\n",
"For this model, the unsupported nodes after the removing the filtering steps (non-maximum-suppresion, etc) are:\n",
"- `torch.ops.aten.nonzero.default`\n",
"- `torch.ops.aten.index.Tensor`\n",
"- `torch.ops.torchvision.roi_align.default`\n",
"- `torch.ops.aten.index_put.default`\n",
"\n",
"However, we've already rewritten essential parts of the model and my guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same the same shape as that of the global attention, so the pre-trained model likely already needs finetuning."
"However, we've already rewritten essential parts of the model and my guess is that if we continued with more rewrites, the resulting model would not be usable. For example, the weights of window attention do not have the same the same shape as that of the global attention, so the pre-trained model likely already needs finetuning or might not even work anymore."
]
},
{
41 changes: 25 additions & 16 deletions docs/src/part3/results.md
@@ -1,9 +1,13 @@
# Benchmarks and Results

## Running the benchmarks

Download the model:
```bash
wget https://huggingface.co/dgcnz/dinov2_vitdet_DINO_12ep/resolve/main/model_final.pth -O artifacts/model_final.pth
```

Before running the benchmarks, make sure you have compiled your desired model.
```bash
python -m scripts.export_tensorrt --config-name dinov2 amp_dtype=fp32 trt.enabled_precisions="[fp32, bf16, fp16]"
# ...
```

@@ -41,48 +45,53 @@

## Results


**Python Runtime, no TensorRT**

| model's precision | amp_dtype | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32 | fp32+fp16 | 66.322 ± 0.927 |
| fp32 | fp32+bf16 | 66.497 ± 1.052 |
| fp32 | fp32 | 76.275 ± 0.587 |

Max memory usage for all configurations is ~1GB.

**Python Runtime, with TensorRT**

| model's precision | trt.enabled_precisions | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32+fp16 | fp32+bf16+fp16 | 15.369 ± 0.023 |
| fp32 | fp32+bf16+fp16 | 23.164 ± 0.031 |
| fp32 | fp32+bf16 | 25.148 ± 0.030 |
| fp32 | fp32 | 38.381 ± 0.022 |

Max memory usage for all configurations is ~500MB except for fp32+fp32 which is ~770MB.

**C++ Runtime, with TensorRT**

| model's precision | trt.enabled_precisions | latency (ms) |
| ----------------- | ---------------------- | -------------- |
| fp32+fp16 | fp32+bf16+fp16 | 15.433 ± 0.029 |
| fp32 | fp32+bf16+fp16 | 23.263 ± 0.027 |
| fp32 | fp32+bf16 | 25.255 ± 0.014 |
| fp32 | fp32 | 38.465 ± 0.029 |


Max memory usage for all configurations is ~500MB except for fp32+fp32 which is ~770MB.

---

Note: For some reason, with the latest version of torch_tensorrt, `bfloat16` precision is not working well: it does not reach the previously measured performance (13-14 ms) and/or compilation fails.

We include the previous results for completeness, in case the issue is resolved in the future.

| Runtime | model's precision | trt.enabled_precisions | latency (ms) | memory (MB) |
| ------- | ----------------- | ---------------------- | ------------ | ----------- |
| cpp+trt | fp32 | fp32+fp16 | 13.984 | 500 |
| cpp+trt | fp32 | fp32+bf16+fp16 | 13.898 | 500 |
| cpp+trt | fp32 | fp32+bf16 | 17.261 | 500 |
| cpp+trt | bf16 | fp32+bf16 | 22.913 | 500 |
| cpp+trt | bf16 | bf16 | 22.938 | 500 |
| cpp+trt | fp32 | fp32 | 37.639 | 770 |


Binary file removed docs/src/simple_net.pt2
11 changes: 1 addition & 10 deletions projects/dino_dinov2/configs/COCO/dino_dinov2_b_12ep.py
@@ -1,6 +1,5 @@
from detrex.config import get_config
from ..models.dino_dinov2 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
@@ -54,12 +53,4 @@
dataloader.evaluator.output_dir = train.output_dir

# logger
train.wandb.enabled=True
