On-by-default testing for the PyTorch 1.12 release
nvFuser is a system designed and developed by NVIDIA specifically, and only, for PyTorch. Even though we maintain a separate project repo, we sync with the upstream PyTorch master repo at a roughly monthly cadence. We realize that this workflow is not optimal and adds friction, especially during release cycles. Our development team is discussing changing the workflow. In the meantime, post 1.12 release, we will try to stick to a more frequent upstream push cadence to alleviate the pain as we refactor the nvFuser integration.
For the 1.12 release, we set out to sanity-check this request (https://github.com/pytorch/pytorch/issues/77709) in order to justify turning nvFuser on by default. The motivation is to ensure that there is no performance or functional regression when we switch the default GPU fuser for TorchScript from NNC to nvFuser.
We are convinced that nvFuser can provide performance gains for certain workflows. During our benchmarking and tuning, we have seen gains for both training and inference workloads on public models when utilizing nvFuser via tools like functorch (AOT Autograd). Some context: https://github.com/rwightman/pytorch-image-models/issues/1244#issuecomment-1130075034
Meanwhile, we realize that it is hard to get big speedups with nvFuser when running with vanilla TorchScript, where autodiff and control flow block nvFuser from fusing large portions of the graph. But turning nvFuser on by default in TorchScript is still beneficial: it exposes nvFuser to broader use cases and tests, ensuring it is up to production standard, and it enables early adoption by next-generation systems that use TorchScript as a runtime.
We have been working with the TorchScript team on nvFuser enablement for a few months. Here is our planning doc: https://docs.google.com/document/d/1ZSxFC8r3uS-uUB-Vr8KCeY_iI4qo0Qf4HmWdF-8RaNk/edit#
We have been adding nvFuser tests to PyTorch master for a while. Our focus has been on OpInfo tests and TorchBench, which exposed many gaps in operator support that have since been patched. David (davidberard98) created this project board on PyTorch's GitHub to track issues and fixes during that time: https://github.com/pytorch/pytorch/projects/30
We also briefly flipped the switch to test nvFuser as the default, which exposed a TorchVision breakage; the on-by-default PR was reverted the following day. During that window, we did not see any other nvFuser-related breakages reported.
In our nightly container, we benchmark the nvFuser devel branch on TIMM via TorchScript as well as TorchDynamo. We have gone through benchmarking and performance tuning of BERT and TIMM models for nvFuser using TorchScript, TorchDynamo, and functorch. Further expanding test coverage for TorchScript is expensive, because getting a JIT-friendly version of a model right is hard. We are hoping to improve that with the on-by-default effort in upstream PyTorch.
```
+ python bert_model.py --jit_script
>>> Eager-Time(us): 250382.397 JIT_Script-Time(us): 236777.930 JIT_Script-Speedup: 1.06
+ python bert_model.py --jit_script --max_fp16_perf
>>> Eager-Time(us): 137933.057 JIT_Script-Time(us): 130717.700 JIT_Script-Speedup: 1.06
+ python bert_model.py --jit_script --inference
>>> Eager-Time(us): 83756.287 JIT_Script-Time(us): 73747.351 JIT_Script-Speedup: 1.14
+ python bert_model.py --jit_script --max_fp16_perf --inference
>>> Eager-Time(us): 47913.626 JIT_Script-Time(us): 41916.519 JIT_Script-Speedup: 1.14
+ python bert_model.py --jit_script
>>> Eager-Time(us): 250725.635 JIT_Script-Time(us): 250578.735 JIT_Script-Speedup: 1.00
+ python bert_model.py --jit_script --max_fp16_perf
>>> Eager-Time(us): 138299.902 JIT_Script-Time(us): 138668.896 JIT_Script-Speedup: 1.00
+ python bert_model.py --jit_script --inference
>>> Eager-Time(us): 83717.633 JIT_Script-Time(us): 81589.709 JIT_Script-Speedup: 1.03
+ python bert_model.py --jit_script --max_fp16_perf --inference
>>> Eager-Time(us): 47980.850 JIT_Script-Time(us): 46898.840 JIT_Script-Speedup: 1.02
```
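The speedup column in the log above is just the ratio of the two reported timings. A small stdlib-only helper (the log format is taken directly from the output shown here) can recompute it:

```python
import re

def speedup_from_log(line):
    """Parse one '>>> Eager-Time(us): ... JIT_Script-Time(us): ...' line
    and return eager_time / jit_time, rounded to two decimals as in the log."""
    eager, jit = (float(v) for v in re.findall(r"Time\(us\): ([\d.]+)", line))
    return round(eager / jit, 2)

line = ">>> Eager-Time(us): 83756.287 JIT_Script-Time(us): 73747.351 JIT_Script-Speedup: 1.14"
print(speedup_from_log(line))  # 1.14, matching the value reported in the log
```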
Results - DGXA100-40g

All runs use the same configuration: `"model": "resnet18"`, `"train_batch_size": 128`, `"train_img_size": 224`, `"param_count": 11.69` (M).

| Run | Mode | train_samples_per_sec | train_step_time (ms) |
|-----|------|----------------------:|---------------------:|
| 1 | TorchScript with nvFuser | 2581.99 | 49.037 |
| 1 | TorchScript with NNC | 2581.79 | 49.053 |
| 1 | Eager | 2580.5 | 49.091 |
| 2 | TorchScript with nvFuser | 3781.19 | 33.337 |
| 2 | TorchScript with NNC | 3790.07 | 33.283 |
| 2 | Eager | 3796.87 | 33.234 |
| 3 | TorchScript with nvFuser | 6545.12 | 19.023 |
| 3 | TorchScript with NNC | 6553.0 | 19.013 |
| 3 | Eager | 6560.52 | 18.974 |
On this model, TorchScript does not provide nvFuser with any fusion opportunities. The Y-axis is Eager Mode execution time divided by nvFuser execution time; a constant value of 1 means there is no significant change in performance.
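This flat curve is consistent with the resnet18 numbers above: dividing nvFuser throughput by Eager throughput for each of the three runs gives ratios within half a percent of 1.0, which a quick stdlib-only check confirms:

```python
# (nvFuser samples/sec, Eager samples/sec) pairs for the three resnet18 runs above
runs = [(2581.99, 2580.5), (3781.19, 3796.87), (6545.12, 6560.52)]

for nvf, eager in runs:
    ratio = nvf / eager
    print(f"nvFuser/Eager throughput ratio: {ratio:.4f}")
    # With no fusion opportunities, nvFuser neither helps nor hurts:
    # every ratio stays within 0.5% of 1.0.
    assert abs(ratio - 1.0) < 0.005
```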