A tutorial on making slow TensorFlow training faster
THIS CODE ONLY WORKS ON NVIDIA GPUS
Assuming the dataset is effectively infinite in length, inline preprocessing can cause a CPU bottleneck that reduces training throughput.
These code samples show an unoptimized and an optimized TensorFlow workflow, as sketched below.
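For illustration only, here is a minimal tf.data sketch of the difference (this is not the repo's DALI-based pipeline; `data/images` is a hypothetical path). In the unoptimized variant the GPU idles while the CPU decodes one image at a time; the optimized variant overlaps preprocessing with training:

```python
import tensorflow as tf

# Hypothetical list of JPEG paths; replace with your own files.
file_paths = tf.io.gfile.glob("data/images/*.jpg")

def decode_and_resize(path):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, (224, 224))

# Unoptimized: sequential, inline preprocessing on the CPU.
slow_ds = (tf.data.Dataset.from_tensor_slices(file_paths)
           .map(decode_and_resize)
           .batch(32))

# Optimized: parallel decoding plus prefetching overlaps input
# preparation with the training step running on the GPU.
fast_ds = (tf.data.Dataset.from_tensor_slices(file_paths)
           .map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```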
Requirements:
- x86-64 (AMD64) CPU
- RAM >= 8GiB
- NVIDIA GPU with Compute Capability 7.0 or higher
- GPU memory > 12GiB for the default batch size
Test environment:
- CPU: Intel(R) Xeon(R) Gold 5218R
- GPU: 2x A100 80GB PCIe
- RAM: 255GiB
Optimizations used:
- NVIDIA DALI - GPU-accelerated data loading (first sketch below)
- Mixed precision - better MMA (Matrix Multiply-Accumulate) throughput than TF32 (second sketch below)
- XLA - JIT-compiles and fuses operators for more efficient execution on GPUs (third sketch below)
- (Optional) Multi-GPU training - use more than one GPU for training (fourth sketch below)
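A minimal sketch of the DALI approach, assuming a hypothetical `data/images` directory with one subfolder per class; the exact `output_shapes`/`output_dtypes` must match what your own pipeline produces:

```python
import tensorflow as tf
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.tf import DALIDataset

BATCH = 32

@pipeline_def(batch_size=BATCH, num_threads=4, device_id=0)
def image_pipeline():
    # Read file names on the CPU; decode JPEGs on the GPU via nvJPEG ("mixed").
    jpegs, labels = fn.readers.file(file_root="data/images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.cast(images, dtype=types.FLOAT) / 255.0
    return images, labels.gpu()

# Expose the DALI pipeline to TensorFlow as a tf.data-compatible dataset.
train_ds = DALIDataset(
    pipeline=image_pipeline(),
    batch_size=BATCH,
    output_shapes=((BATCH, 224, 224, 3), (BATCH, 1)),
    output_dtypes=(tf.float32, tf.int32),
    device_id=0)
```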
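Enabling mixed precision in Keras is a one-line policy change; a minimal sketch (the model here is a placeholder, not the repo's model):

```python
import tensorflow as tf

# Run matmuls and convolutions in float16 on Tensor Cores while
# keeping variables (weights) in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1000),
    # Keep the final softmax in float32 so the loss stays numerically stable.
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
```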
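A minimal XLA sketch; `fused_op` is a hypothetical function used only to show the effect of `jit_compile=True`:

```python
import tensorflow as tf

# jit_compile=True asks XLA to compile the function, fusing the
# elementwise ops below into a single GPU kernel and reducing
# kernel-launch overhead.
@tf.function(jit_compile=True)
def fused_op(x):
    return tf.nn.relu(x * 2.0 + 1.0)

print(fused_op(tf.random.normal((1024, 1024))).shape)

# Recent TensorFlow versions also accept jit_compile in Keras:
# model.compile(optimizer="adam", loss=..., jit_compile=True)
```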
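A minimal multi-GPU sketch using `tf.distribute.MirroredStrategy`, the standard Keras approach (the model is a placeholder):

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces gradients after each training step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")
```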
How to use:
- Clone this repo with its submodules:
```bash
git clone --recursive https://github.com/ReturnToFirst/FastTFWorkflow.git
```
- Compare the performance of the unoptimized and optimized workflows by running the notebooks.
after_optimization_multi.ipynb shows the training process with multiple GPUs.
Disclaimer:
Depending on the devices in your machine, performance may decrease.
This optimized code will not necessarily show the best possible performance.
Multi-GPU training does not work in the test environment.
There may be incorrect descriptions or code.