Commit 9e73898

drisspg, svekars, and sekyondaMeta authored
Task T228334710 update tuning guide (#3433)
* Task T228334710 update tuning guide

  stack-info: PR: #3433, branch: drisspg/stack/1

* Update recipes_source/recipes/tuning_guide.py

---------

Co-authored-by: Svetlana Karslioglu <[email protected]>
Co-authored-by: sekyondaMeta <[email protected]>
1 parent 30d8869 commit 9e73898

1 file changed: +59, -63 lines

recipes_source/recipes/tuning_guide.py

Lines changed: 59 additions & 63 deletions
@@ -8,10 +8,38 @@
 techniques often can be implemented by changing only a few lines of code and can
 be applied to a wide range of deep learning models across all domains.
 
+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * General optimization techniques for PyTorch models
+       * CPU-specific performance optimizations
+       * GPU acceleration strategies
+       * Distributed training optimizations
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 2.0 or later
+       * Python 3.8 or later
+       * CUDA-capable GPU (recommended for GPU optimizations)
+       * Linux, macOS, or Windows operating system
+
+Overview
+--------
+
+Performance optimization is crucial for efficient deep learning model training and inference.
+This tutorial covers a comprehensive set of techniques to accelerate PyTorch workloads across
+different hardware configurations and use cases.
+
 General optimizations
 ---------------------
 """
 
+import torch
+import torchvision
+
 ###############################################################################
 # Enable asynchronous data loading and augmentation
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
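
For context on the trailing lines of the hunk above, here is a minimal sketch of the asynchronous data loading settings that section describes, assuming a toy in-memory dataset; the dataset, batch size, and worker count are illustrative placeholders, not code from this commit.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real (possibly augmented) dataset.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,    # load and augment batches in background worker processes
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)
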
@@ -90,7 +118,7 @@
 # setting it to zero, for more details refer to the
 # `documentation <https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad>`_.
 #
-# Alternatively, starting from PyTorch 1.7, call ``model`` or
+# Alternatively, call ``model`` or
 # ``optimizer.zero_grad(set_to_none=True)``.
 
 ###############################################################################
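
For context on the hunk above, a minimal sketch of the ``set_to_none`` idiom it references; the linear model, optimizer, and random data are illustrative placeholders, not code from this commit.

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(3):
    # set_to_none=True frees the gradient tensors instead of filling them with zeros
    optimizer.zero_grad(set_to_none=True)
    loss = model(torch.randn(8, 10)).sum()
    loss.backward()
    optimizer.step()
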
@@ -129,7 +157,7 @@ def gelu(x):
 ###############################################################################
 # Enable channels_last memory format for computer vision models
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# PyTorch 1.5 introduced support for ``channels_last`` memory format for
+# PyTorch supports ``channels_last`` memory format for
 # convolutional networks. This format is meant to be used in conjunction with
 # `AMP <https://pytorch.org/docs/stable/amp.html>`_ to further accelerate
 # convolutional neural networks with
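
For context on the hunk above, a minimal sketch of combining ``channels_last`` with AMP at inference time, assuming a CUDA GPU is available; resnet50 and the input shape are illustrative only, not code from this commit.

import torch
import torchvision

# Move both the model and its inputs to the channels_last (NHWC) memory format.
model = torchvision.models.resnet50().eval().cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)  # convolutions dispatch to NHWC kernels for channels_last inputs
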
@@ -250,65 +278,6 @@ def gelu(x):
 #
 # export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
 
-###############################################################################
-# Use oneDNN Graph with TorchScript for inference
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# oneDNN Graph can significantly boost inference performance. It fuses some compute-intensive operations such as convolution, matmul with their neighbor operations.
-# In PyTorch 2.0, it is supported as a beta feature for ``Float32`` & ``BFloat16`` data-types.
-# oneDNN Graph receives the model’s graph and identifies candidates for operator-fusion with respect to the shape of the example input.
-# A model should be JIT-traced using an example input.
-# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input.
-# The example code-snippets below are for resnet50, but they can very well be extended to use oneDNN Graph with custom models as well.
-
-# Only this extra line of code is required to use oneDNN Graph
-torch.jit.enable_onednn_fusion(True)
-
-###############################################################################
-# Using the oneDNN Graph API requires just one extra line of code for inference with Float32.
-# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``.
-
-# sample input should be of the same shape as expected inputs
-sample_input = [torch.rand(32, 3, 224, 224)]
-# Using resnet50 from torchvision in this example for illustrative purposes,
-# but the line below can indeed be modified to use custom models as well.
-model = getattr(torchvision.models, "resnet50")().eval()
-# Tracing the model with example input
-traced_model = torch.jit.trace(model, sample_input)
-# Invoking torch.jit.freeze
-traced_model = torch.jit.freeze(traced_model)
-
-###############################################################################
-# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs.
-
-with torch.no_grad():
-    # a couple of warm-up runs
-    traced_model(*sample_input)
-    traced_model(*sample_input)
-    # speedup would be observed after warm-up runs
-    traced_model(*sample_input)
-
-###############################################################################
-# While the JIT fuser for oneDNN Graph also supports inference with ``BFloat16`` datatype,
-# performance benefit with oneDNN Graph is only exhibited by machines with AVX512_BF16
-# instruction set architecture (ISA).
-# The following code snippets serves as an example of using ``BFloat16`` datatype for inference with oneDNN Graph:
-
-# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
-torch._C._jit_set_autocast_mode(False)
-
-with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
-    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
-    import torch.fx.experimental.optimization as optimization
-    # Please note that optimization.fuse need not be called when AMP is not used
-    model = optimization.fuse(model)
-    model = torch.jit.trace(model, (example_input))
-    model = torch.jit.freeze(model)
-    # a couple of warm-up runs
-    model(example_input)
-    model(example_input)
-    # speedup would be observed in subsequent runs.
-    model(example_input)
-
 
 ###############################################################################
 # Train a model on CPU with PyTorch ``DistributedDataParallel``(DDP) functionality
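
For context on the trailing lines of the hunk above, a minimal single-process sketch of CPU ``DistributedDataParallel`` with the ``gloo`` backend; the address, port, and tiny model are placeholders for a real multi-process launch, not code from this commit.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings normally supplied by the launcher (e.g., torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)
ddp_model = DDP(model)  # gradients are all-reduced across ranks during backward()

ddp_model(torch.randn(4, 16)).sum().backward()
dist.destroy_process_group()
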
@@ -426,9 +395,8 @@ def gelu(x):
 # * enable AMP
 #
 # * Introduction to Mixed Precision Training and AMP:
-#   `video <https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be>`_,
 #   `slides <https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf>`_
-# * native PyTorch AMP is available starting from PyTorch 1.6:
+# * native PyTorch AMP is available:
 #   `documentation <https://pytorch.org/docs/stable/amp.html>`_,
 #   `examples <https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples>`_,
 #   `tutorial <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>`_
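
For context on the hunk above, a minimal sketch of a native AMP training step in the style of the linked documentation, assuming a CUDA device; the model, optimizer, and data are illustrative placeholders, not code from this commit.

import torch

model = torch.nn.Linear(128, 128).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(torch.randn(32, 128, device="cuda")).sum()
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
    scaler.update()
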
@@ -536,3 +504,31 @@ def gelu(x):
 # approximately constant number of tokens (and variable number of sequences in a
 # batch), other models solve imbalance by bucketing samples with similar
 # sequence length or even by sorting dataset by sequence length.
+
+###############################################################################
+# Conclusion
+# ----------
+#
+# This tutorial covered a comprehensive set of performance optimization techniques
+# for PyTorch models. The key takeaways include:
+#
+# * **General optimizations**: Enable async data loading, disable gradients for
+#   inference, fuse operations with ``torch.compile``, and use efficient memory formats
+# * **CPU optimizations**: Leverage NUMA controls, optimize OpenMP settings, and
+#   use efficient memory allocators
+# * **GPU optimizations**: Enable Tensor cores, use CUDA graphs, enable cuDNN
+#   autotuner, and implement mixed precision training
+# * **Distributed optimizations**: Use DistributedDataParallel, optimize gradient
+#   synchronization, and balance workloads across devices
+#
+# Many of these optimizations can be applied with minimal code changes and provide
+# significant performance improvements across a wide range of deep learning models.
+#
+# Further Reading
+# ---------------
+#
+# * `PyTorch Performance Tuning Documentation <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
+# * `CUDA Best Practices <https://pytorch.org/docs/stable/notes/cuda.html>`_
+# * `Distributed Training Documentation <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_
+# * `Mixed Precision Training <https://pytorch.org/docs/stable/amp.html>`_
+# * `torch.compile Tutorial <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_

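For context on the conclusion added above, which lists ``torch.compile`` under general optimizations, a minimal sketch of compiling a model; the two-layer model and input are illustrative placeholders, not code from this commit.

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
compiled_model = torch.compile(model)  # capture the forward graph and fuse operations

x = torch.randn(16, 64)
out = compiled_model(x)  # first call triggers compilation; later calls reuse the compiled kernels
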