techniques often can be implemented by changing only a few lines of code and can
be applied to a wide range of deep learning models across all domains.

+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * General optimization techniques for PyTorch models
+       * CPU-specific performance optimizations
+       * GPU acceleration strategies
+       * Distributed training optimizations
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 2.0 or later
+       * Python 3.8 or later
+       * CUDA-capable GPU (recommended for GPU optimizations)
+       * Linux, macOS, or Windows operating system
+
+Overview
+--------
+
+Performance optimization is crucial for efficient deep learning model training and inference.
+This tutorial covers a comprehensive set of techniques to accelerate PyTorch workloads across
+different hardware configurations and use cases.
+
General optimizations
---------------------
"""

+import torch
+import torchvision
+
###############################################################################
# Enable asynchronous data loading and augmentation
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
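#
# A minimal illustrative sketch of the idea; the toy dataset and the specific
# ``num_workers``/``pin_memory`` values below are assumptions for the sketch,
# not tuned recommendations:

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for a real, augmented training set
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))

# num_workers > 0 loads and augments batches in background worker processes;
# pin_memory=True speeds up host-to-GPU copies when training on a CUDA device
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)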
@@ -90,7 +118,7 @@
# setting it to zero, for more details refer to the
# `documentation <https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad>`_.
#
-# Alternatively, starting from PyTorch 1.7, call ``model`` or
+# Alternatively, call ``model.zero_grad(set_to_none=True)`` or
# ``optimizer.zero_grad(set_to_none=True)``.

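#
# For illustration, a minimal sketch of the pattern described above; the model,
# optimizer, and data here are placeholders, not part of the original recipe:

import torch

model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(8, 16), torch.randn(8, 4)

# setting gradients to None instead of zeroing them skips the memset and
# avoids a read-modify-write during the backward pass
optimizer.zero_grad(set_to_none=True)
loss = torch.nn.functional.mse_loss(model(inputs), targets)
loss.backward()
optimizer.step()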
###############################################################################
@@ -129,7 +157,7 @@ def gelu(x):
###############################################################################
# Enable channels_last memory format for computer vision models
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# PyTorch 1.5 introduced support for ``channels_last`` memory format for
+# PyTorch supports ``channels_last`` memory format for
# convolutional networks. This format is meant to be used in conjunction with
# `AMP <https://pytorch.org/docs/stable/amp.html>`_ to further accelerate
# convolutional neural networks with
@@ -250,65 +278,6 @@ def gelu(x):
#
# export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD

-###############################################################################
-# Use oneDNN Graph with TorchScript for inference
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# oneDNN Graph can significantly boost inference performance. It fuses some compute-intensive operations such as convolution, matmul with their neighbor operations.
-# In PyTorch 2.0, it is supported as a beta feature for ``Float32`` & ``BFloat16`` data-types.
-# oneDNN Graph receives the model’s graph and identifies candidates for operator-fusion with respect to the shape of the example input.
-# A model should be JIT-traced using an example input.
-# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input.
-# The example code-snippets below are for resnet50, but they can very well be extended to use oneDNN Graph with custom models as well.
-
-# Only this extra line of code is required to use oneDNN Graph
-torch.jit.enable_onednn_fusion(True)
-
-###############################################################################
-# Using the oneDNN Graph API requires just one extra line of code for inference with Float32.
-# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``.
-
-# sample input should be of the same shape as expected inputs
-sample_input = [torch.rand(32, 3, 224, 224)]
-# Using resnet50 from torchvision in this example for illustrative purposes,
-# but the line below can indeed be modified to use custom models as well.
-model = getattr(torchvision.models, "resnet50")().eval()
-# Tracing the model with example input
-traced_model = torch.jit.trace(model, sample_input)
-# Invoking torch.jit.freeze
-traced_model = torch.jit.freeze(traced_model)
-
-###############################################################################
-# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs.
-
-with torch.no_grad():
-    # a couple of warm-up runs
-    traced_model(*sample_input)
-    traced_model(*sample_input)
-    # speedup would be observed after warm-up runs
-    traced_model(*sample_input)
-
-###############################################################################
-# While the JIT fuser for oneDNN Graph also supports inference with ``BFloat16`` datatype,
-# performance benefit with oneDNN Graph is only exhibited by machines with AVX512_BF16
-# instruction set architecture (ISA).
-# The following code snippets serves as an example of using ``BFloat16`` datatype for inference with oneDNN Graph:
-
-# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
-torch._C._jit_set_autocast_mode(False)
-
-with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
-    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
-    import torch.fx.experimental.optimization as optimization
-    # Please note that optimization.fuse need not be called when AMP is not used
-    model = optimization.fuse(model)
-    model = torch.jit.trace(model, (example_input))
-    model = torch.jit.freeze(model)
-    # a couple of warm-up runs
-    model(example_input)
-    model(example_input)
-    # speedup would be observed in subsequent runs.
-    model(example_input)
-

###############################################################################
# Train a model on CPU with PyTorch ``DistributedDataParallel`` (DDP) functionality
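#
# A minimal sketch of CPU-side DDP setup; the ``gloo`` backend choice, the
# environment-variable rendezvous, and the toy model are assumptions made for
# this sketch rather than the tutorial's exact code:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # gloo is the backend commonly used for CPU-only distributed training
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(16, 4))  # gradients are averaged across workers
    # ... usual training loop goes here ...
    dist.destroy_process_group()

# launched with, for example:
# torch.multiprocessing.spawn(run_worker, args=(2,), nprocs=2)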
@@ -426,9 +395,8 @@ def gelu(x):
# * enable AMP (a minimal sketch follows this list)
#
# * Introduction to Mixed Precision Training and AMP:
-# `video <https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be>`_,
# `slides <https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf>`_
-# * native PyTorch AMP is available starting from PyTorch 1.6:
+# * native PyTorch AMP is available:
# `documentation <https://pytorch.org/docs/stable/amp.html>`_,
# `examples <https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples>`_,
# `tutorial <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>`_
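#
# A minimal sketch of an AMP training step; the model, optimizer, data, and
# hyperparameters below are placeholders, not values from the tutorial:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 64).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
data = torch.randn(32, 128, device=device)
target = torch.randn(32, 64, device=device)

optimizer.zero_grad(set_to_none=True)
# run the forward pass in mixed precision; gradients are scaled to avoid underflow
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = torch.nn.functional.mse_loss(model(data), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()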
@@ -536,3 +504,31 @@ def gelu(x):
# approximately constant number of tokens (and variable number of sequences in a
# batch), other models solve imbalance by bucketing samples with similar
# sequence length or even by sorting the dataset by sequence length.
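#
# A minimal sketch of length bucketing with an approximate token budget; the toy
# corpus and the ``max_tokens`` budget are assumptions made for illustration:

import torch

# toy corpus: tensors of varying length standing in for tokenized sequences
sequences = [torch.randint(0, 1000, (int(n),)) for n in torch.randint(5, 100, (256,))]

def batches_by_token_budget(seqs, max_tokens=1024):
    # sort by length so each batch holds sequences of similar length, then cut a
    # new batch whenever the padded token count would exceed the budget
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    batch, longest = [], 0
    for i in order:
        longest = max(longest, len(seqs[i]))
        if batch and longest * (len(batch) + 1) > max_tokens:
            yield batch
            batch, longest = [], len(seqs[i])
        batch.append(i)
    if batch:
        yield batch

for batch_indices in batches_by_token_budget(sequences):
    # pad each bucket to its own longest sequence and feed it to the model
    padded = torch.nn.utils.rnn.pad_sequence(
        [sequences[i] for i in batch_indices], batch_first=True
    )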
+
+###############################################################################
+# Conclusion
+# ----------
+#
+# This tutorial covered a comprehensive set of performance optimization techniques
+# for PyTorch models. The key takeaways include:
+#
+# * **General optimizations**: Enable async data loading, disable gradients for
+#   inference, fuse operations with ``torch.compile``, and use efficient memory formats
+# * **CPU optimizations**: Leverage NUMA controls, optimize OpenMP settings, and
+#   use efficient memory allocators
+# * **GPU optimizations**: Enable Tensor cores, use CUDA graphs, enable cuDNN
+#   autotuner, and implement mixed precision training
+# * **Distributed optimizations**: Use DistributedDataParallel, optimize gradient
+#   synchronization, and balance workloads across devices
+#
+# Many of these optimizations can be applied with minimal code changes and provide
+# significant performance improvements across a wide range of deep learning models.
+#
+# Further Reading
+# ---------------
+#
+# * `PyTorch Performance Tuning Documentation <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
+# * `CUDA Best Practices <https://pytorch.org/docs/stable/notes/cuda.html>`_
+# * `Distributed Training Documentation <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_
+# * `Mixed Precision Training <https://pytorch.org/docs/stable/amp.html>`_
+# * `torch.compile Tutorial <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_