WarpDrive

An LLM training/inference engine for cluster and CS (client-server) environments.

tasks:

  1. 🏗️ topology:

    • ✅ cluster mode
    • 🛑 CS mode
  2. 🏗️ distributed communication (optimize c10d)

    • ✅ group, subgroup
    • ✅ fix _store_based_barrier
    • ✅ P2P comm
    • ✅ collective comm
    • 🏗️ timeline for cuda stream sync
  3. 🏗️ models

    • ✅ Pythia7B
    • 🏗️ sparse transformer: sparse attention ($\mathcal{O}(N\log N)$) to enable much longer sequences (len >> 2048)
    • 🛑 parallel models
  4. 🏗️ pipeline parallel

    • ✅ staged model
    • ✅ sequence pipeline schedule
    • 🛑 1f1b, interleave schedule
  5. 🏗️ activation recomputation

    • ✅ full mode
    • 🛑 selective mode
  6. 🛑 data parallel

  7. 🛑 tensor parallel

  8. 🛑 sequence parallel

  9. ✅ training data (open source)

    • ✅ OIG
    • ✅ streaming style dataset, w/o padding
  10. 🏗️ llm training

    • ✅ pretrain
    • 🛑 RLHF
    • 🛑 RLAI
  11. 🛑 llm evaluation

  12. 🏗️ model compression

    • ✅ empty model init, device map, sequential loading
    • ✅ mixed precision training
      • ✅ fp16 (GPU), loss scale
      • ✅ bf16 (CPU, GPU), no loss scale
      • 🏗️ quantization ops
    • 🏗️ quantization
    • 🛑 pruning
  13. 🏗️ optimizer

    • 🏗️ 8-bit Adam
  14. 🛑 Adapter

    • 🛑 LoRA, QLoRA
  15. 🏗️ system debug & monitor

    • 🏗️ GPU memory profile
    • 🏗️ comm data sync debug
    • 🏗️ loss convergence
  16. 🛑 compute graph for distributed computing

GPU topology

Uniform layout in the cluster environment: world_size = pp_size * dp_size * tp_size

e.g.:

```
world_size = 12
pp_size = 3
dp_size = 2
tp_size = 2
pp groups: [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
dp groups: [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11]]
tp groups: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
```
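
As a rough illustration of how such a uniform layout can be enumerated, here is a minimal sketch; `build_cluster_groups` is a hypothetical helper, not WarpDrive's actual topology code:

```python
def build_cluster_groups(world_size, pp_size, dp_size, tp_size):
    """Enumerate pp/dp/tp rank groups for the uniform cluster layout."""
    assert world_size == pp_size * dp_size * tp_size
    stage_size = dp_size * tp_size  # ranks per pipeline stage

    # pipeline groups: same offset in every stage, strided by stage_size
    pp_groups = [list(range(r, world_size, stage_size)) for r in range(stage_size)]
    # tensor groups: contiguous blocks of tp_size ranks
    tp_groups = [list(range(r, r + tp_size)) for r in range(0, world_size, tp_size)]
    # data groups: ranks of one stage that share the same tp index
    dp_groups = [
        list(range(stage + t, stage + stage_size, tp_size))
        for stage in range(0, world_size, stage_size)
        for t in range(tp_size)
    ]
    return pp_groups, dp_groups, tp_groups


# reproduces the example above
pp, dp, tp = build_cluster_groups(world_size=12, pp_size=3, dp_size=2, tp_size=2)
```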

Non-uniform layout in the CS environment: GPUs are grouped per pipeline stage.

```
gpus = [1, 3, 3]
world_size = sum(gpus)
pp_size = len(gpus)
dp_size * tp_size = max(gpus)
ppg: [(0, 1, 4), (0, 2, 5), (0, 3, 6)]
dpg: [(1, 2, 3), (4, 5, 6)]
```
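
A hedged sketch of the same idea for the non-uniform CS layout (`build_cs_groups` is a hypothetical helper; the sequential rank-to-stage assignment is an assumption):

```python
def build_cs_groups(gpus):
    """gpus[i] = number of GPUs serving pipeline stage i, e.g. [1, 3, 3]."""
    starts = [sum(gpus[:i]) for i in range(len(gpus))]  # first rank of each stage
    width = max(gpus)                                   # dp_size * tp_size

    # pipeline groups: the j-th GPU of every stage; a smaller stage reuses its
    # ranks, so the single GPU of stage 0 joins every pipeline group
    ppg = [tuple(starts[i] + j % gpus[i] for i in range(len(gpus)))
           for j in range(width)]
    # data-parallel groups: all ranks inside one stage (single-GPU stages skipped)
    dpg = [tuple(range(s, s + n)) for s, n in zip(starts, gpus) if n > 1]
    return ppg, dpg


# gpus = [1, 3, 3] -> ppg [(0, 1, 4), (0, 2, 5), (0, 3, 6)], dpg [(1, 2, 3), (4, 5, 6)]
```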

run

  1. training

```
bash scripts/pretrain_pythia7B.sh
```

Concept

  1. process group
  • main group: unlike the single default pg in pytorch.distributed, there can be multiple main groups
  • subgroup: a main group can contain multiple subgroups; pp/dp/tp modes map to different subgroups
  2. operations
  • model forward, backward
  • activation recomputation
  • communication: p2p, collective comm (see the sketch below)

Communication patterns:
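
A minimal sketch of these concepts with stock torch.distributed calls (illustrative only, not WarpDrive's optimized c10d layer; assumes dist.init_process_group has already been called):

```python
import torch
import torch.distributed as dist


def demo_groups_and_comm(pp_group_ranks, dp_group_ranks):
    """Subgroup creation plus p2p and collective communication, illustrative only."""
    rank = dist.get_rank()

    # every rank must create every subgroup, even the ones it does not belong to
    pp_groups = [dist.new_group(ranks) for ranks in pp_group_ranks]  # back the p2p calls below
    dp_groups = [dist.new_group(ranks) for ranks in dp_group_ranks]

    t = torch.ones(4)
    # p2p: send activations to the next pipeline rank, receive from the previous
    if rank == 0:
        dist.send(t, dst=1)
    elif rank == 1:
        dist.recv(t, src=0)

    # collective: all-reduce gradients inside this rank's data-parallel group
    for ranks, group in zip(dp_group_ranks, dp_groups):
        if rank in ranks:
            dist.all_reduce(t, op=dist.ReduceOp.SUM, group=group)
```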

  3. schedule
  • pipeline schedule (see the sketch after this list):

    • sequence
    • 1f1b w/o interleave
    • 1f1b with interleave
  • learning rate schedule:
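
To make the pipeline-schedule options concrete, here is a small sketch that emits the per-stage op order for the sequence schedule and the non-interleaved 1f1b schedule (a sketch of the general technique, not the engine's scheduler; num_warmup is typically pp_size - stage_rank - 1):

```python
def sequence_schedule(num_microbatches):
    """Sequence (GPipe-style) schedule: all forwards, then all backwards."""
    ops = [("fwd", m) for m in range(num_microbatches)]
    # backward order is an implementation choice; reversed here
    ops += [("bwd", m) for m in reversed(range(num_microbatches))]
    return ops


def one_f_one_b_schedule(num_microbatches, num_warmup):
    """Non-interleaved 1f1b: warm-up forwards, then alternate fwd/bwd, then drain."""
    ops = [("fwd", m) for m in range(num_warmup)]
    fwd, bwd = num_warmup, 0
    while bwd < num_microbatches:
        if fwd < num_microbatches:
            ops.append(("fwd", fwd))
            fwd += 1
        ops.append(("bwd", bwd))
        bwd += 1
    return ops


# e.g. one_f_one_b_schedule(4, 2) -> f0 f1 f2 b0 f3 b1 b2 b3
```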

  4. mixed precision training:
    • lower precision: fp32 --> fp16 (gpu), bf16 (cpu, gpu)
    • loss scaling: required for fp16
    • quantized optimizer: int8 optimizer states

Mixed precision training:

Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). NOTE:
torch.autocast does not cover every op; some operations cannot be autocast and must be handled explicitly.
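
A minimal stock-PyTorch sketch of the distinction above (an illustrative training step, not WarpDrive's loop): fp16 goes through a GradScaler so small gradients do not underflow, while bf16 skips loss scaling because it keeps fp32's exponent range.

```python
import torch


def train_step(model, batch, target, optimizer, scaler, dtype=torch.float16):
    """One autocast training step; scaler is only used for fp16."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=dtype):
        loss = torch.nn.functional.mse_loss(model(batch), target)

    if dtype == torch.float16:
        scaler.scale(loss).backward()  # scale up so fp16 grads do not underflow
        scaler.step(optimizer)         # unscales grads, skips the step on inf/nan
        scaler.update()
    else:  # bf16: same exponent range as fp32, no loss scaling needed
        loss.backward()
        optimizer.step()
    return loss.detach()
```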

  5. Quantization
  • post-training quantization: weights, buffers
  • quantization-aware training: weights, buffers, activations
  6. timeline for CUDA streams
  • default stream for computation, non-default streams for communication across ranks
  • a separate stream for each micro-batch, synchronized at the all-reduce step; since micro-batches are independent, they can run asynchronously (see the stream sketch after this list)
  • overlap recomputation and communication: recompute the forward pass while receiving gradients from the next rank
  7. compute graph
  • define op functions: recomputation, communication
  • use autograd and CUDA streams to arrange ops and micro-batches in order
  8. system debug & monitor
  • torch.profiler + TensorBoard
  9. sparse transformer
  • sparse attention
  • longer sequences
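
For the stream-per-micro-batch timeline in item 6, a minimal sketch with plain torch.cuda streams (illustrative only; real code must also manage tensor lifetimes across streams and the separate communication stream):

```python
import torch


def run_microbatches(model, microbatches):
    """Launch each micro-batch on its own CUDA stream, join before all-reduce."""
    streams = [torch.cuda.Stream() for _ in microbatches]
    outputs = [None] * len(microbatches)

    for i, (mb, stream) in enumerate(zip(microbatches, streams)):
        stream.wait_stream(torch.cuda.current_stream())  # wait for prior default-stream work
        with torch.cuda.stream(stream):
            outputs[i] = model(mb)  # independent micro-batches run asynchronously

    # synchronize all micro-batch streams before the gradient all-reduce step
    for stream in streams:
        torch.cuda.current_stream().wait_stream(stream)
    return outputs
```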

training data

Open source training data

models

  1. GPTNeoX
  • Pythia7B

todo

  • compute graph for staged models: on top of PyTorch's compute graph, wrap the communication between model stages on different machines as autograd Functions (see the sketch below)
  • sparse transformer
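
A hedged sketch of that todo item: wrap the cross-stage send as a torch.autograd.Function whose backward receives the gradient from the next rank (hypothetical class, not the repository's implementation; assumes torch.distributed is initialized):

```python
import torch
import torch.distributed as dist


class SendToNextStage(torch.autograd.Function):
    """Forward sends activations to the next pipeline rank; backward receives
    the corresponding gradient back from that rank and feeds the local graph."""

    @staticmethod
    def forward(ctx, activation, next_rank):
        ctx.next_rank = next_rank
        dist.send(activation.contiguous(), dst=next_rank)
        return activation.view_as(activation)  # new autograd node for this op

    @staticmethod
    def backward(ctx, grad_output):
        grad = torch.empty_like(grad_output)
        dist.recv(grad, src=ctx.next_rank)  # gradient produced by the next stage
        return grad, None                   # None for the non-tensor next_rank arg


# usage on the sending stage (hypothetical names):
#   out = SendToNextStage.apply(stage_output, next_rank)
#   out.backward(torch.zeros_like(out))  # placeholder grad; the real one arrives via recv
```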
