An LLM training/inference engine for cluster and CS (Client-Server) environments
-
🏗️ topology:
- ✅ cluster mode
- 🛑 CS mode
-
🏗️ distributed communication (optimize c10d)
- ✅ group, subgroup
- ✅ fix _store_based_barrier
- ✅ P2P comm
- ✅ collective comm
- 🏗️ timeline for cuda stream sync
-
🏗️ models
- ✅ Pythia7B
- 🏗️ sparse transformer: sparse attention ( $\mathcal{O}(N\log{N})$ ), enabling much longer sequences (len >> 2048)
- 🛑 parallel models
-
🏗️ pipeline parallel
- ✅ staged model
- ✅ sequence pipeline schedule
- 🛑 1F1B, interleaved schedule
-
🏗️ activation recomputation
- ✅ full mode
- 🛑 selective mode
-
🛑 data parallel
-
🛑 tensor parallel
-
🛑 sequence parallel
-
✅ training data (open source)
- ✅ OIG
- ✅ streaming style dataset, w/o padding
-
🏗️ llm training
- ✅ pretrain
- 🛑 RLHF
- 🛑 RLAI
-
🛑 llm evaluation
-
🏗️ model compression
- ✅ empty model init, device map, sequential loading
- ✅ mixed precision training
- ✅ fp16 (GPU), loss scale
- ✅ bf16 (CPU, GPU), no loss scale
- 🏗️ quantize operation
- 🏗️ quantization
- 🛑 pruning
-
🏗️ optimizer
- 🏗️ 8-bit Adam
-
🛑 Adapter
- 🛑 LoRA, QLoRA
-
🏗️ system debug & monitor
- 🏗️ GPU memory profile
- 🏗️ comm data sync debug
- 🏗️ loss convergence
-
🛑 compute graph for distributed computing
Evenly partitioned layout in the cluster environment: world_size = pp_size * dp_size * tp_size
e.g.:
world_size = 12
pp_size = 3
dp_size = 2
tp_size = 2
pp groups: [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
dp groups: [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11]]
tp groups: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
Unevenly partitioned layout (GPUs per pipeline stage), e.g.:
gpus = [1, 3, 3]
world_size = sum(gpus)
pp_size = len(gpus)
dp_size*tp_size = max(gpus)
ppg: [(0,1,4), (0,2,5), (0,3,6)]
dpg: [(1,2,3), (4,5,6)]
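A minimal sketch (hypothetical helper, not the repository's code) that reproduces the even mapping above; ranks are laid out tp-innermost, then dp, then pp.

```python
# Hypothetical helper reproducing the even rank-to-group mapping above.
def build_groups(pp_size: int, dp_size: int, tp_size: int):
    world_size = pp_size * dp_size * tp_size
    stage = dp_size * tp_size  # ranks per pipeline stage

    tp_groups = [list(range(r, r + tp_size))
                 for r in range(0, world_size, tp_size)]
    dp_groups = [[s * stage + d * tp_size + t for d in range(dp_size)]
                 for s in range(pp_size) for t in range(tp_size)]
    pp_groups = [[s * stage + i for s in range(pp_size)]
                 for i in range(stage)]
    return pp_groups, dp_groups, tp_groups

pp, dp, tp = build_groups(pp_size=3, dp_size=2, tp_size=2)
print(pp)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(dp)  # [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11]]
print(tp)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
```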
- training
bash scripts/pretrain_pythia7B.sh
- process group
- main group: distinct from pytorch.distributed's default process group; there can be multiple main groups
- subgroup: a main group can contain multiple subgroups; the pp/dp/tp modes map to different subgroups (see the sketch below)
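A minimal sketch of building these subgroups with torch.distributed, using the rank lists from the example above; `init_subgroups` is an illustrative name. Note that `dist.new_group` is a collective call, so every rank must create every group and keep only the ones it belongs to.

```python
import torch.distributed as dist

def init_subgroups(pp_groups, dp_groups, tp_groups, backend="nccl"):
    dist.init_process_group(backend=backend)          # the default / main group
    rank = dist.get_rank()

    mine = {}
    for name, groups in (("pp", pp_groups), ("dp", dp_groups), ("tp", tp_groups)):
        for ranks in groups:
            g = dist.new_group(ranks=ranks)           # collective: every rank must call this
            if rank in ranks:
                mine[name] = g                        # the subgroup this rank belongs to
    return mine                                       # e.g. {"pp": ..., "dp": ..., "tp": ...}
```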
- operations
- model forward, backward
- activations recomputing
- communication: p2p, collective comm (see the sketch after this list)
- schedule
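A sketch of the two communication styles, assuming the dp subgroup created in the previous snippet; the function names here are illustrative, not the engine's API.

```python
import torch
import torch.distributed as dist

def send_activations(act: torch.Tensor, next_rank: int):
    # p2p: ship this stage's activations to the next pipeline stage
    dist.send(act.contiguous(), dst=next_rank)

def recv_activations(shape, dtype, prev_rank: int, device):
    # p2p: receive activations from the previous stage into a preallocated buffer
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=prev_rank)
    return buf

def allreduce_grads(params, dp_group):
    # collective: average gradients across the data-parallel subgroup
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=dp_group)
            p.grad /= dist.get_world_size(group=dp_group)
```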
- mixed precision training:
- lower precision: fp32 --> fp16(gpu), bf16(cpu, gpu)
- scale loss: for fp16
- quantized optimizer: int8 optimizer states
Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). NOTE:
torch.autocast does not cover every op; some ops cannot be autocast and still run in fp32.
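A minimal training-loop sketch of fp16 mixed precision with dynamic loss scaling (GradScaler is needed for fp16 but not for bf16); `model`, `loss_fn`, `optimizer`, and `loader` are assumed to exist.

```python
import torch

scaler = torch.cuda.amp.GradScaler()                 # dynamic loss scaling for fp16

for inputs, targets in loader:                       # assumed dataloader
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)       # eligible ops run in fp16
    scaler.scale(loss).backward()                    # scale up to avoid fp16 grad underflow
    scaler.step(optimizer)                           # unscales grads; skips step on inf/nan
    scaler.update()                                  # adjust the scale factor
```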
- Quantization
- post-training quantization: weights, buffers
- quantization-aware training: weights, buffers, activations
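A small post-training (dynamic) quantization sketch on a toy model: nn.Linear weights are stored as int8 and activations are quantized on the fly; this is only an illustration, not the repo's quantization path.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()

# post-training dynamic quantization: Linear weights -> int8
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 512)
print(quantized(x).shape)   # same output shape, smaller Linear weights
```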
- timeline for CUDA streams
- default stream for computation, non-default stream for communication across ranks
- a separate stream per micro-batch, synchronized at the all-reduce step; micro-batches have no dependencies on each other, so they can run asynchronously
- overlap recomputation with communication: recompute the forward pass while receiving gradients from the next rank
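A sketch of this compute/communication overlap on CUDA streams; `model_chunk`, `grad_buf`, and `comm_op` (an async all-reduce or recv) are assumptions.

```python
import torch

comm_stream = torch.cuda.Stream()                    # non-default stream for communication

def overlapped_step(model_chunk, micro_batch, grad_buf, comm_op):
    with torch.cuda.stream(comm_stream):             # launch comm on the side stream
        comm_op(grad_buf)                            # e.g. an async all_reduce or recv
    grad_buf.record_stream(comm_stream)              # tell the caching allocator the side stream uses this buffer

    out = model_chunk(micro_batch)                   # compute/recompute overlaps on the default stream

    torch.cuda.current_stream().wait_stream(comm_stream)  # sync before consuming grad_buf
    return out
```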
- compute graph
- define op function: recompute, communication
- use autograd and CUDA streams to arrange ops and micro-batches in order
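For the recompute op, a minimal sketch with torch.utils.checkpoint (full mode: activations are dropped in forward and rebuilt during backward); `stage_layers` is an assumed list of this stage's modules.

```python
import torch
from torch.utils.checkpoint import checkpoint

def staged_forward(stage_layers, hidden):
    for layer in stage_layers:
        # full activation recomputation: forward runs twice, activation memory is freed per layer
        hidden = checkpoint(layer, hidden)
    return hidden
```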
- system debug & monitor
- torch.profile + tensorboard
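A profiling sketch with torch.profiler that writes a trace viewable in TensorBoard; `train_step`, `loader`, and the log directory are assumptions.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),  # assumed log dir
    profile_memory=True,        # track GPU memory allocations
    record_shapes=True,
) as prof:
    for step, batch in enumerate(loader):   # `loader` / `train_step` assumed to exist
        train_step(batch)
        prof.step()                         # advance the wait/warmup/active schedule
        if step >= 5:
            break
```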
- sparse transformer
- sparse attention
- longer sequence
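A mask-only sketch of a fixed sparse attention pattern (local window plus strided global positions). It only illustrates which query/key pairs are kept; the matmul here is still dense O(N^2), so a real implementation needs a sparse or block-sparse kernel to get the sub-quadratic cost. `window` and `stride` are illustrative parameters.

```python
import math
import torch

def sparse_causal_mask(n: int, window: int = 64, stride: int = 64) -> torch.Tensor:
    i = torch.arange(n).unsqueeze(1)        # query positions
    j = torch.arange(n).unsqueeze(0)        # key positions
    causal = j <= i
    local = (i - j) < window                # recent tokens
    strided = (j % stride) == 0             # periodic "summary" tokens
    return causal & (local | strided)       # True = this pair may attend

def sparse_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```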
- open source training data
- GPTNeoX
- Pythia7B
- compute graph for staged models: on top of PyTorch's compute graph, wrap the communication between model stages on different machines as autograd Functions (sketch below)
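A sketch of wrapping the cross-stage send as a torch.autograd.Function so the forward send pairs with a backward recv of the gradient; `SendToNextStage` is an illustrative name, not the engine's actual op.

```python
import torch
import torch.distributed as dist

class SendToNextStage(torch.autograd.Function):
    """Hypothetical boundary op: send activations forward, receive their gradient in backward."""

    @staticmethod
    def forward(ctx, activations: torch.Tensor, next_rank: int):
        ctx.next_rank = next_rank
        ctx.meta = (activations.shape, activations.dtype, activations.device)
        dist.send(activations.contiguous(), dst=next_rank)
        return activations

    @staticmethod
    def backward(ctx, grad_output):
        # the real gradient at the stage boundary comes from the next rank,
        # so the local placeholder grad_output is ignored
        shape, dtype, device = ctx.meta
        grad = torch.empty(shape, dtype=dtype, device=device)
        dist.recv(grad, src=ctx.next_rank)
        return grad, None
```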