python [model file] [engine: --jit_script|--ltc|--aot_autograd]
If you include no engine option with the model file, only eager mode is timed.
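For example, to time just eager mode for one of the model files listed below:

python simple_model.py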
There are two profiling scripts:
profile_all.sh
: Profiles everything that runs

profile_api_start.sh
: Only profiles after the CUDA API is called to start profiling, after warmup
./scripts/profile_api_start.sh python [model file] [engine: --jit_script|--ltc|--aot_autograd] [--profile_with_nvtx]
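As a reference, below is a minimal sketch of the pattern profile_api_start.sh relies on, per the description above: warmup iterations run unprofiled, then the program calls the CUDA profiler-start API, and iterations can optionally be wrapped in NVTX ranges (what --profile_with_nvtx enables). The toy model, sizes, and the run_step helper are hypothetical, not this repo's actual code.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
inp = torch.randn(64, 1024, device="cuda")

def run_step():
    # Hypothetical single benchmark iteration.
    out = model(inp)
    torch.cuda.synchronize()
    return out

for _ in range(10):                  # warmup iterations: not captured
    run_step()

torch.cuda.profiler.start()          # the CUDA profiler-start API the script waits for
for i in range(10):                  # profiled iterations
    torch.cuda.nvtx.range_push(f"iter_{i}")   # NVTX markup, as with --profile_with_nvtx
    run_step()
    torch.cuda.nvtx.range_pop()
torch.cuda.profiler.stop()
```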
Example timing run and output:

$ python simple_model.py --jit_script
>>> Eager-Time(us): 411.493 JIT_Script-Time(us): 368.355 JIT_Script-Speedup: 1.12
Defaults to FP32 model and input data.
With --amp, model parameters remain in FP32 and input data is in FP16:
python [model file] [engine: --jit_script|--ltc|--aot_autograd] --amp
With --max_fp16_perf, model parameters and input data are both in FP16:
python [model file] [engine: --jit_script|--ltc|--aot_autograd] --max_fp16_perf
or
python [model file] [engine: --jit_script|--ltc|--aot_autograd] --grad_scaler --input_dtype=torch.float16 --model_dtype=torch.float16
This set of options does not work with optimizers that rely on GradScaler to do the unscaling (native PyTorch optimizers), because GradScaler asserts on FP16 weights. For test purposes, simply omit the --grad_scaler flag.
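To make the distinction concrete, here is a minimal sketch of the two precision modes with a toy linear layer and SGD; the model, sizes, and training loop are illustrative assumptions, not this repo's code. The second half notes why GradScaler is dropped when the weights themselves are FP16.

```python
import torch

device = "cuda"  # these benchmarks target GPUs

# --amp style: parameters stay FP32, ops run in reduced precision under autocast,
# and GradScaler handles loss scaling/unscaling.
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()
inp = torch.randn(64, 1024, device=device)

with torch.cuda.amp.autocast():
    loss = model(inp).float().sum()
scaler.scale(loss).backward()
scaler.step(opt)      # unscales the FP32 grads, then steps the optimizer
scaler.update()

# --max_fp16_perf / --model_dtype=torch.float16 style: parameters and inputs are FP16.
# GradScaler refuses to unscale FP16 gradients, which is why --grad_scaler is
# omitted when the weights themselves are FP16.
model_fp16 = torch.nn.Linear(1024, 1024).to(device).half()
opt_fp16 = torch.optim.SGD(model_fp16.parameters(), lr=0.01)
inp_fp16 = torch.randn(64, 1024, device=device, dtype=torch.float16)

loss = model_fp16(inp_fp16).float().sum()
loss.backward()
opt_fp16.step()
```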
You have a choice of 4 different frontend engines. The Eager engine is always run as a comparison point for speedup; if you don't specify an engine, only the Eager engine is run. The other three engines give fusion opportunities to the NVFuser backend for GPUs (a sketch of how each one wraps a model follows the list below).
Engines:
- Eager: default, no switch needed
- JIT Script: --jit_script
- Lazy Tensor Core: --ltc
- AOT Autograd: --aot_autograd
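As referenced above, here is a minimal sketch of how these engines typically wrap a model. It is illustrative only; the model files in this repo do the wrapping for you via the flags, and the toy model plus the commented-out functorch/LTC calls are assumptions that depend on the PyTorch version, not this repo's exact usage.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
inp = torch.randn(64, 1024, device="cuda")

# Eager: just call the module directly (the baseline every run reports).
eager_out = model(inp)

# JIT Script (--jit_script): TorchScript the module; after a few warmup calls the
# profiling executor can hand fusion groups to NVFuser.
scripted = torch.jit.script(model)
for _ in range(3):
    scripted_out = scripted(inp)

# AOT Autograd (--aot_autograd): capture forward+backward graphs for fusion,
# e.g. via functorch (API availability depends on the PyTorch/functorch version):
#   from functorch.compile import memory_efficient_fusion
#   fused = memory_efficient_fusion(model)

# Lazy Tensor Core (--ltc): move model and inputs to the "lazy" device and mark
# step boundaries; initialization details depend on the PyTorch build:
#   lazy_model, lazy_inp = model.to("lazy"), inp.to("lazy")
#   out = lazy_model(lazy_inp); torch._lazy.mark_step()
```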
Models:
- Simple linear layer + relu and SGD Optimizer: simple_model.py
- Simple conv layer + bn + relu and SGD Optimizer: simple_conv_model.py
- Multihead Attention Block with no optimizer: xformer_multihead_attn.py
- Feed Forward Block with no optimizer: xformer_feed_fwd.py
- One Encoder Layer with no optimizer: xformer_1_layer.py
- Full Bert Model (bert-large) with APEX Lamb Optimizer: bert_model.py
- Full Bert Model (bert-large) with Native AdamW Optimizer: bert_model_adam_opt.py
- Bert Model with 1 Layer (bert-large sized) with no optimizer: bert_model_1_layer_no_opt.py
- Full Bert Model (bert-large) with APEX Lamb Optimizer (dynamic variant): dynamic_bert_model.py
- Full Bert Model (bert-large) with Native AdamW Optimizer (dynamic variant): dynamic_bert_model_adam_opt.py
- Bert Model with 1 Layer (bert-large sized) with no optimizer (dynamic variant): dynamic_bert_model_1_layer_no_opt.py
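For example, to compare all three fusion engines on one of the full Bert models (each run also times the Eager engine as the comparison point and reports the speedup over it):

python bert_model.py --jit_script
python bert_model.py --ltc
python bert_model.py --aot_autograd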