On-by-default testing for the PyTorch 1.12 release
nvFuser is a system designed and developed by NVIDIA specifically, and only, for PyTorch. Even though we maintain a separate project repo, we sync with the upstream PyTorch master repo at a roughly monthly cadence. We realize that this workflow is not optimal and adds friction, especially during release cycles. Our development team is discussing changing the workflow. In the meantime, post 1.12 release, we will try to stick to a more frequent upstream push cadence to alleviate the pain as we refactor the nvFuser integration.
For the 1.12 release, we set out to sanity-check this request (https://github.com/pytorch/pytorch/issues/77709) in order to justify turning nvFuser on by default. The motivation is to ensure that there is no performance or functional regression when we switch the default GPU fuser for TorchScript from NNC to nvFuser.
We are convinced that nvFuser can provide performance gains for certain workflows. During our benchmarking and tuning, we have seen gains for both training and inference workloads on public models when utilizing nvFuser via tools like functorch (AOT Autograd). Some context: https://github.com/rwightman/pytorch-image-models/issues/1244#issuecomment-1130075034
Meanwhile, we realize that it is hard to get big speedups with nvFuser when running with vanilla TorchScript, where autodiff and control flow block nvFuser from fusing large portions of the graph. But turning nvFuser on by default in TorchScript is still beneficial: it exposes nvFuser to broader use cases and tests, ensuring it is up to production standard, and it enables early adoption by next-generation systems that use TorchScript as a runtime.
We have been working with the TorchScript team on nvFuser enablement for a few months. Here is our planning doc: https://docs.google.com/document/d/1ZSxFC8r3uS-uUB-Vr8KCeY_iI4qo0Qf4HmWdF-8RaNk/edit#
We have been adding nvFuser tests to PyTorch master for a while. Our focus has been on OpInfo tests and TorchBench, which exposed many gaps in operator support that have since been patched. David (davidberard98) created this project board on PyTorch's GitHub to track issues and fixes during that time: https://github.com/pytorch/pytorch/projects/30
We also briefly flipped the switch to test nvFuser as the default, which exposed a TorchVision breakage; the on-by-default PR was reverted the following day. During that window, we did not see any other nvFuser-related breakages reported.
In our nightly container, we benchmark the nvFuser devel branch on TIMM via TorchScript as well as TorchDynamo. We have gone through benchmarking and performance tuning of BERT and TIMM models for nvFuser using TorchScript, TorchDynamo, and functorch. Further expanding test coverage for TorchScript is expensive, because getting a JIT-friendly version of a model right is hard. We are hoping to improve that with the on-by-default effort in upstream PyTorch.
```
+ python bert_model.py --jit_script
>>> Eager-Time(us): 250382.397 JIT_Script-Time(us): 236777.930 JIT_Script-Speedup: 1.06
+ python bert_model.py --jit_script --max_fp16_perf
>>> Eager-Time(us): 137933.057 JIT_Script-Time(us): 130717.700 JIT_Script-Speedup: 1.06
+ python bert_model.py --jit_script --inference
>>> Eager-Time(us): 83756.287 JIT_Script-Time(us): 73747.351 JIT_Script-Speedup: 1.14
+ python bert_model.py --jit_script --max_fp16_perf --inference
>>> Eager-Time(us): 47913.626 JIT_Script-Time(us): 41916.519 JIT_Script-Speedup: 1.14
+ python bert_model.py --jit_script
>>> Eager-Time(us): 250725.635 JIT_Script-Time(us): 250578.735 JIT_Script-Speedup: 1.00
+ python bert_model.py --jit_script --max_fp16_perf
>>> Eager-Time(us): 138299.902 JIT_Script-Time(us): 138668.896 JIT_Script-Speedup: 1.00
+ python bert_model.py --jit_script --inference
>>> Eager-Time(us): 83717.633 JIT_Script-Time(us): 81589.709 JIT_Script-Speedup: 1.03
+ python bert_model.py --jit_script --max_fp16_perf --inference
>>> Eager-Time(us): 47980.850 JIT_Script-Time(us): 46898.840 JIT_Script-Speedup: 1.02
```
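The speedup column in the log above is just the ratio of the two reported timings. A small stdlib-only helper (the log format is taken directly from the output shown here) can recompute it:

```python
import re

def speedup_from_log(line):
    """Parse one '>>> Eager-Time(us): ... JIT_Script-Time(us): ...' line
    and return eager_time / jit_time, rounded to two decimals as in the log."""
    eager, jit = (float(v) for v in re.findall(r"Time\(us\): ([\d.]+)", line))
    return round(eager / jit, 2)

line = ">>> Eager-Time(us): 83756.287 JIT_Script-Time(us): 73747.351 JIT_Script-Speedup: 1.14"
print(speedup_from_log(line))  # 1.14, matching the value reported in the log
```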
Results - DGXA100-40g

All runs use the same configuration: `"model": "resnet18"`, `"train_batch_size": 128`, `"train_img_size": 224`, `"param_count": 11.69` (M).

| Run | Mode | train_samples_per_sec | train_step_time (ms) |
|-----|------|----------------------:|---------------------:|
| 1 | TorchScript with nvFuser | 2581.99 | 49.037 |
| 1 | TorchScript with NNC | 2581.79 | 49.053 |
| 1 | Eager | 2580.5 | 49.091 |
| 2 | TorchScript with nvFuser | 3781.19 | 33.337 |
| 2 | TorchScript with NNC | 3790.07 | 33.283 |
| 2 | Eager | 3796.87 | 33.234 |
| 3 | TorchScript with nvFuser | 6545.12 | 19.023 |
| 3 | TorchScript with NNC | 6553.0 | 19.013 |
| 3 | Eager | 6560.52 | 18.974 |
On this model, TorchScript does not provide nvFuser with any fusion opportunities. The Y-axis is Eager Mode execution time divided by nvFuser execution time; a constant value of 1 means there is no significant change in performance.
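This flat curve is consistent with the resnet18 numbers above: dividing nvFuser throughput by Eager throughput for each of the three runs gives ratios within half a percent of 1.0, which a quick stdlib-only check confirms:

```python
# (nvFuser samples/sec, Eager samples/sec) pairs for the three resnet18 runs above
runs = [(2581.99, 2580.5), (3781.19, 3796.87), (6545.12, 6560.52)]

for nvf, eager in runs:
    ratio = nvf / eager
    print(f"nvFuser/Eager throughput ratio: {ratio:.4f}")
    # With no fusion opportunities, nvFuser neither helps nor hurts:
    # every ratio stays within 0.5% of 1.0.
    assert abs(ratio - 1.0) < 0.005
```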