An LLM training/inference engine for cluster and CS (Client-Server) environments
-
🏗️ topology:
- ✅ cluster mode
- 🛑 CS mode
-
🏗️ distributed communication (optimize c10d)
- ✅ group, subgroup
- ✅ fix _store_based_barrier
- ✅ P2P comm
- ✅ collective comm
- 🏗️ timeline for cuda stream sync
-
🏗️ models
- ✅ Pythia7B
- 🏗️ sparse transformer: sparse attention ( $\mathcal{O}(N\log{N})$ ), enabling much longer sequences (len >> 2048)
- 🛑 parallel models
-
🏗️ pipeline parallel
- ✅ staged model
- ✅ sequence pipeline schedule
- 🛑 1F1B, interleaved schedule
-
🏗️ activation recomputation
- ✅ full mode
- 🛑 selective mode
-
🛑 data parallel
-
🛑 tensor parallel
-
🛑 sequence parallel
-
✅ training data (open source)
- ✅ OIG
- ✅ streaming style dataset, w/o padding
-
🏗️ llm training
- ✅ pretrain
- 🛑 RLHF
- 🛑 RLAI
-
🛑 llm evaluation
-
🏗️ model compression
- ✅ empty model init, device map, sequential loading
- ✅ mixed precision training
- ✅ fp16 (GPU), loss scale
- ✅ bf16 (CPU, GPU), no loss scale
- 🏗️ quantize operation
- 🏗️ quantization
- 🛑 pruning
-
🏗️ optimizer
- 🏗️ 8-bit Adam
-
🛑 Adapter
- 🛑 LoRA, QLoRA
-
🏗️ system debug & monitor
- 🏗️ GPU memory profile
- 🏗️ comm data sync debug
- 🏗️ loss convergence
-
🛑 compute graph for distributed computing
Evenly partitioned layout in the cluster environment: world_size = pp_size * dp_size * tp_size
e.g.:
world_size = 12
pp_size = 3
dp_size = 2
tp_size = 2
pp groups: [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
dp groups: [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11]]
tp groups: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
Unevenly partitioned layout (GPUs per pipeline stage), e.g.:
gpus = [1, 3, 3]
world_size = sum(gpus)
pp_size = len(gpus)
dp_size*tp_size = max(gpus)
ppg: [(0,1,4), (0,2,5), (0,3,6)]
dpg: [(1,2,3), (4,5,6)]
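A minimal sketch (hypothetical helper, not the repository's code) that reproduces the even mapping above; ranks are laid out tp-innermost, then dp, then pp.

```python
# Hypothetical helper reproducing the even rank-to-group mapping above.
def build_groups(pp_size: int, dp_size: int, tp_size: int):
    world_size = pp_size * dp_size * tp_size
    stage = dp_size * tp_size  # ranks per pipeline stage

    tp_groups = [list(range(r, r + tp_size))
                 for r in range(0, world_size, tp_size)]
    dp_groups = [[s * stage + d * tp_size + t for d in range(dp_size)]
                 for s in range(pp_size) for t in range(tp_size)]
    pp_groups = [[s * stage + i for s in range(pp_size)]
                 for i in range(stage)]
    return pp_groups, dp_groups, tp_groups

pp, dp, tp = build_groups(pp_size=3, dp_size=2, tp_size=2)
print(pp)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(dp)  # [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11]]
print(tp)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
```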
- training
bash scripts/pretrain_pythia7B.sh
- process group
- main group: distinct from pytorch.distributed's default process group; there can be multiple main groups
- subgroup: a main group can contain multiple subgroups; the pp/dp/tp modes map to different subgroups (see the sketch below)
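A minimal sketch of building these subgroups with torch.distributed, using the rank lists from the example above; `init_subgroups` is an illustrative name. Note that `dist.new_group` is a collective call, so every rank must create every group and keep only the ones it belongs to.

```python
import torch.distributed as dist

def init_subgroups(pp_groups, dp_groups, tp_groups, backend="nccl"):
    dist.init_process_group(backend=backend)          # the default / main group
    rank = dist.get_rank()

    mine = {}
    for name, groups in (("pp", pp_groups), ("dp", dp_groups), ("tp", tp_groups)):
        for ranks in groups:
            g = dist.new_group(ranks=ranks)           # collective: every rank must call this
            if rank in ranks:
                mine[name] = g                        # the subgroup this rank belongs to
    return mine                                       # e.g. {"pp": ..., "dp": ..., "tp": ...}
```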
- operations
- model forward, backward
- activations recomputing
- communication: p2p, collective comm (see the sketch after this list)
- schedule
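A sketch of the two communication styles, assuming the dp subgroup created in the previous snippet; the function names here are illustrative, not the engine's API.

```python
import torch
import torch.distributed as dist

def send_activations(act: torch.Tensor, next_rank: int):
    # p2p: ship this stage's activations to the next pipeline stage
    dist.send(act.contiguous(), dst=next_rank)

def recv_activations(shape, dtype, prev_rank: int, device):
    # p2p: receive activations from the previous stage into a preallocated buffer
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=prev_rank)
    return buf

def allreduce_grads(params, dp_group):
    # collective: average gradients across the data-parallel subgroup
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=dp_group)
            p.grad /= dist.get_world_size(group=dp_group)
```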
- mixed precision training:
- lower precision: fp32 --> fp16(gpu), bf16(cpu, gpu)
- scale loss: for fp16
- quantized optimizer: int8 optimizer states
Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). NOTE:
torch.autocast does not cover every op; some ops cannot be autocast and still run in fp32.
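A minimal training-loop sketch of fp16 mixed precision with dynamic loss scaling (GradScaler is needed for fp16 but not for bf16); `model`, `loss_fn`, `optimizer`, and `loader` are assumed to exist.

```python
import torch

scaler = torch.cuda.amp.GradScaler()                 # dynamic loss scaling for fp16

for inputs, targets in loader:                       # assumed dataloader
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)       # eligible ops run in fp16
    scaler.scale(loss).backward()                    # scale up to avoid fp16 grad underflow
    scaler.step(optimizer)                           # unscales grads; skips step on inf/nan
    scaler.update()                                  # adjust the scale factor
```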
- Quantization
- post-training quantization: weights, buffers
- quantization-aware training: weights, buffers, activations
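A small post-training (dynamic) quantization sketch on a toy model: nn.Linear weights are stored as int8 and activations are quantized on the fly; this is only an illustration, not the repo's quantization path.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# post-training dynamic quantization: Linear weights -> int8
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 512)
print(quantized(x).shape)   # same output shape, smaller Linear weights
```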
- timeline for CUDA streams
- default stream for computation, non-default stream for communication across ranks
- a separate stream per micro-batch, synchronized at the all-reduce step; micro-batches have no dependencies on each other, so they can run asynchronously
- overlap recomputation with communication: recompute the forward pass while receiving gradients from the next rank
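A sketch of this compute/communication overlap on CUDA streams; `model_chunk`, `grad_buf`, and `comm_op` (an async all-reduce or recv) are assumptions.

```python
import torch

comm_stream = torch.cuda.Stream()                    # non-default stream for communication

def overlapped_step(model_chunk, micro_batch, grad_buf, comm_op):
    with torch.cuda.stream(comm_stream):             # launch comm on the side stream
        comm_op(grad_buf)                            # e.g. an async all_reduce or recv
    grad_buf.record_stream(comm_stream)              # tell the caching allocator the side stream uses this buffer

    out = model_chunk(micro_batch)                   # compute/recompute overlaps on the default stream

    torch.cuda.current_stream().wait_stream(comm_stream)  # sync before consuming grad_buf
    return out
```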
- compute graph
- define op function: recompute, communication
- use autograd and CUDA streams to arrange ops and micro-batches in order
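For the recompute op, a minimal sketch with torch.utils.checkpoint (full mode: activations are dropped in forward and rebuilt during backward); `stage_layers` is an assumed list of this stage's modules.

```python
import torch
from torch.utils.checkpoint import checkpoint

def staged_forward(stage_layers, hidden):
    for layer in stage_layers:
        # full activation recomputation: forward runs twice, activation memory is freed per layer
        hidden = checkpoint(layer, hidden)
    return hidden
```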
- system debug & monitor
- torch.profile + tensorboard
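A profiling sketch with torch.profiler that writes a trace viewable in TensorBoard; `train_step`, `loader`, and the log directory are assumptions.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),  # assumed log dir
    profile_memory=True,        # track GPU memory allocations
    record_shapes=True,
) as prof:
    for step, batch in enumerate(loader):   # `loader` / `train_step` assumed to exist
        train_step(batch)
        prof.step()                         # advance the wait/warmup/active schedule
        if step >= 5:
            break
```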
- sparse transformer
- sparse attention
- longer sequence
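A mask-only sketch of a fixed sparse attention pattern (local window plus strided global positions). It only illustrates which query/key pairs are kept; the matmul here is still dense O(N^2), so a real implementation needs a sparse or block-sparse kernel to get the sub-quadratic cost. `window` and `stride` are illustrative parameters.

```python
import math
import torch

def sparse_causal_mask(n: int, window: int = 64, stride: int = 64) -> torch.Tensor:
    i = torch.arange(n).unsqueeze(1)        # query positions
    j = torch.arange(n).unsqueeze(0)        # key positions
    causal = j <= i
    local = (i - j) < window                # recent tokens
    strided = (j % stride) == 0             # periodic "summary" tokens
    return causal & (local | strided)       # True = this pair may attend

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```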
- open source training data
- GPTNeoX
- Pythia7B
- compute graph for staged models: on top of PyTorch's compute graph, wrap the communication between model stages on different machines as autograd Functions (sketch below)
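A sketch of wrapping the cross-stage send as a torch.autograd.Function so the forward send pairs with a backward recv of the gradient; `SendToNextStage` is an illustrative name, not the engine's actual op.

```python
import torch
import torch.distributed as dist

class SendToNextStage(torch.autograd.Function):
    """Hypothetical boundary op: send activations forward, receive their gradient in backward."""

    @staticmethod
    def forward(ctx, activations: torch.Tensor, next_rank: int):
        ctx.next_rank = next_rank
        ctx.meta = (activations.shape, activations.dtype, activations.device)
        dist.send(activations.contiguous(), dst=next_rank)
        return activations

    @staticmethod
    def backward(ctx, grad_output):
        # the real gradient at the stage boundary comes from the next rank,
        # so the local placeholder grad_output is ignored
        shape, dtype, device = ctx.meta
        grad = torch.empty(shape, dtype=dtype, device=device)
        dist.recv(grad, src=ctx.next_rank)
        return grad, None
```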