Galvatron is an automatic distributed training system designed for Transformer models, including Large Language Models (LLMs). It leverages advanced automatic parallelism techniques to deliver exceptional training efficiency. This repository houses the official implementation of Galvatron-2, our latest version enriched with several new features.
Galvatron incorporates multiple popular parallelism dimensions of distributed training, including DP (Data Parallelism), SDP (Sharded Data Parallelism, supporting ZeRO-1, ZeRO-2, and ZeRO-3), PP (Pipeline Parallelism, supporting both GPipe and Pipedream-flush / 1F1B schedules), TP (Tensor Parallelism), and SP (Sequence Parallelism, supporting Megatron-SP and DeepSpeed-Ulysses). It also incorporates CKPT (Activation Checkpointing) as a special parallelism dimension.
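As a concrete illustration of how these dimensions compose, the short sketch below (plain Python with made-up numbers, not tied to any specific Galvatron configuration) shows that the parallel degrees chosen for a hybrid strategy must factor the total device count:

```python
# Illustrative only: how DP/TP/PP degrees of a hybrid strategy must
# partition the available GPUs. Numbers are hypothetical.
total_gpus = 16

# A hypothetical strategy: 2-way pipeline parallelism, 4-way tensor
# parallelism inside each stage, 2-way data parallelism, with
# activation checkpointing enabled as an extra dimension.
strategy = {"pp": 2, "tp": 4, "dp": 2, "ckpt": True}

assert strategy["pp"] * strategy["tp"] * strategy["dp"] == total_gpus, \
    "parallel degrees must exactly partition the available GPUs"

print(f"each data-parallel replica spans {strategy['pp'] * strategy['tp']} GPUs")
```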
Galvatron's approach to hybrid parallelism represents a significant advancement in distributed training optimization. Rather than applying a one-size-fits-all strategy, the system enables layer-wise parallelization, allowing each transformer layer to utilize an independent combination of parallel strategies. This granular approach ensures optimal resource utilization by adapting to the specific computational and memory requirements of each layer.
The system dynamically combines multiple parallelism types, carefully considering the trade-offs between computation, memory usage, and communication overhead. This hybrid approach is particularly powerful when dealing with complex model architectures, where different layers may benefit from different parallelization strategies.
The heart of Galvatron's efficiency lies in its sophisticated optimization engine. Through careful cost modeling, the system accurately estimates computation requirements, predicts memory usage patterns, and models communication overhead for different parallelization strategies. This comprehensive modeling enables intelligent decision-making in strategy selection.
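To give a flavor of this kind of cost modeling, here is a minimal sketch assuming a simple alpha-beta model for communication and a linear model for per-layer compute; all function names and coefficients are illustrative placeholders, not Galvatron's actual cost model:

```python
# Toy cost model in the spirit described above; all coefficients and
# formulas are illustrative, not values measured or used by Galvatron.

def comm_time(message_bytes: float, bandwidth_gbps: float, latency_s: float = 5e-6) -> float:
    """Alpha-beta model: fixed latency plus message size over bandwidth."""
    return latency_s + message_bytes / (bandwidth_gbps * 1e9 / 8)

def allreduce_time(grad_bytes: float, dp_size: int, bandwidth_gbps: float) -> float:
    """A ring all-reduce moves roughly 2*(n-1)/n of the data per rank."""
    volume = 2 * (dp_size - 1) / dp_size * grad_bytes
    return comm_time(volume, bandwidth_gbps)

def layer_time(flops: float, device_flops: float, grad_bytes: float,
               dp_size: int, bandwidth_gbps: float) -> float:
    """Estimated per-layer time: compute plus gradient synchronization."""
    return flops / device_flops + allreduce_time(grad_bytes, dp_size, bandwidth_gbps)

# Example: a 1 GFLOP layer with 100 MB of gradients on 8 data-parallel ranks.
print(layer_time(1e9, 100e12, 100e6, dp_size=8, bandwidth_gbps=200))
```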
The optimization process employs dynamic-programming-based search algorithms that consider multiple objectives simultaneously, including memory efficiency and communication costs, and automatically adapts to hardware constraints while maximizing training performance.
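The following sketch conveys the dynamic-programming idea in a heavily simplified form: for each layer, pick the candidate strategy that minimizes total estimated time subject to a memory budget. It assumes per-layer (strategy, time, memory) tables as input and is a didactic simplification, not Galvatron's actual search algorithm:

```python
# Simplified layer-wise strategy search: minimize total estimated time
# under a memory budget. The per-layer candidate costs are assumed inputs
# (e.g. from a profiler); this is a didactic sketch, not Galvatron's code.
import math

def search(layer_costs, memory_budget_mb):
    """layer_costs: per layer, a list of (strategy, time_s, memory_mb) tuples."""
    # dp[m] = minimal total time using exactly m MB of the budget so far
    dp = [0.0] + [math.inf] * memory_budget_mb
    choice = [[None] * (memory_budget_mb + 1) for _ in layer_costs]

    for i, candidates in enumerate(layer_costs):
        new_dp = [math.inf] * (memory_budget_mb + 1)
        for m in range(memory_budget_mb + 1):
            if math.isinf(dp[m]):
                continue
            for strategy, time_s, mem in candidates:
                if m + mem <= memory_budget_mb and dp[m] + time_s < new_dp[m + mem]:
                    new_dp[m + mem] = dp[m] + time_s
                    choice[i][m + mem] = (strategy, m)
        dp = new_dp

    best_m = min(range(memory_budget_mb + 1), key=lambda m: dp[m])
    if math.isinf(dp[best_m]):
        return None, math.inf
    # Backtrack the chosen strategy for each layer.
    plan, m = [], best_m
    for i in reversed(range(len(layer_costs))):
        strategy, m = choice[i][m]
        plan.append(strategy)
    return list(reversed(plan)), dp[best_m]

# Example with two layers and made-up (strategy, time, memory) candidates.
layers = [[("tp4", 1.0, 2), ("dp4", 0.8, 5)],
          [("tp4", 1.2, 2), ("dp4", 0.9, 5)]]
print(search(layers, memory_budget_mb=7))  # -> (['tp4', 'dp4'], 1.9)
```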
Galvatron's versatility extends across the entire spectrum of Transformer architectures. In the realm of language models, it excels at handling everything from traditional BERT-style encoders and GPT decoders to complex T5-style encoder-decoder models. For Large Language Models (LLMs), the system provides specialized optimizations that enable efficient training of models with trillions of parameters, carefully managing memory and computational resources.
The system's capabilities extend beyond language models to vision transformers. Galvatron maintains its efficiency while adapting to the unique requirements of each architecture. In the future, Galvatron will also support multi-modal architectures.
Despite its sophisticated underlying technology, Galvatron prioritizes user accessibility. Users can begin training with minimal code changes, supported by comprehensive documentation and practical examples. The system also offers seamless integration with the dataloaders of popular frameworks, alongside robust checkpoint management capabilities, making it a practical choice for both research and production environments.
Galvatron's architecture consists of three tightly integrated core modules that work together to deliver efficient distributed training:
The Profiler serves as the foundation of the system, conducting comprehensive analysis of both hardware capabilities and model characteristics. On the hardware side, it measures inter-device communication bandwidth and computational throughput of each device. For model profiling, it analyzes computation patterns, memory requirements, and communication needs of different model components. This detailed profiling information forms the basis for intelligent strategy decisions.
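For intuition, here is a minimal standalone sketch of the kind of measurement the hardware profiling performs, written directly with plain torch.distributed. It is not Galvatron's profiler code (Galvatron drives nccl-tests for this, as described below), just an illustration of timing an all-reduce to estimate bus bandwidth:

```python
# Minimal bandwidth-measurement sketch with plain PyTorch; illustrative
# only -- Galvatron's own profiler is more thorough and uses nccl-tests.
# Launch with: torchrun --nproc_per_node=<gpus> this_script.py
import os
import torch
import torch.distributed as dist

def measure_allreduce_bandwidth(num_bytes: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")

    tensor = torch.ones(num_bytes // 4, dtype=torch.float32, device="cuda")
    for _ in range(5):                      # warm-up iterations
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000 / iters
    n = dist.get_world_size()
    # Ring all-reduce bus bandwidth: each rank moves ~2*(n-1)/n of the data.
    return 2 * (n - 1) / n * num_bytes / seconds / 1e9

if __name__ == "__main__":
    bw = measure_allreduce_bandwidth()
    if dist.get_rank() == 0:
        print(f"approx. all-reduce bus bandwidth: {bw:.1f} GB/s")
```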
The Search Engine represents the brain of the system, leveraging the profiling data to discover optimal parallelization strategies. It employs sophisticated algorithms to explore the vast space of possible parallel configurations and automatically determine the most efficient combination of parallelism strategies for each layer of the model.
The Runtime Framework implements the execution layer, translating the high-level parallelization strategies into efficient distributed operations. The framework provides a robust and flexible execution environment that adapts to different hardware configurations and model architectures.
These three modules work seamlessly together to simplify the distributed training process. Users only need to provide the hardware environment and the Transformer model configuration.
The system automatically handles all aspects of distributed training optimization, from initial profiling through strategy selection to efficient execution. This architecture ensures both ease of use and high performance, making sophisticated distributed training accessible to a broader range of users while maintaining the flexibility needed for advanced applications.
Through this modular design, Galvatron achieves a balance between automation and customization, enabling both simple deployment for standard cases and detailed control for specialized requirements.
Requirements:
- PyTorch >= 2.1.0
To install Galvatron:
pip install hetu-galvatron
Alternatively, you can install Galvatron from source with pip install .
To use FlashAttention-2 features in Galvatron-2, you can either:
- Install FlashAttention-2 manually and then run:
pip install hetu-galvatron
- Alternatively, install Galvatron-2 together with FlashAttention-2 as follows:
  - Make sure that PyTorch, packaging (pip install packaging), and ninja are installed.
  - Install Galvatron-2 with FlashAttention-2:
GALVATRON_FLASH_ATTN_INSTALL=TRUE pip install hetu-galvatron
The first step in using Galvatron is to profile the hardware environment and the model computation time. Galvatron will automatically save the profiled results into config files.
(1) Firstly, to profile the hardware environment, cd galvatron/profile_hardware, write the host addresses into hostfile, set NUM_NODES, NUM_GPUS_PER_NODE, and MPI_PATH in scripts/profile_hardware.sh, and run:
sh scripts/profile_hardware.sh
Galvatron will call nccl-tests to profile the communication bandwidth.
(2) Secondly, to profile the model computation time, cd galvatron/models/model_name and run:
sh scripts/profile_computation.sh
After profiling the environment, Galvatron is able to automatically optimize the parallelism strategy for the given Transformer model. Given a memory budget, Galvatron provides the fine-grained hybrid parallel strategy with maximum throughput. The optimized parallelism strategy will be saved in galvatron/models/model_name/configs for training. Users can train the model with the provided optimal strategy to obtain the optimal throughput.
To conduct parallelism optimization, cd galvatron/models/model_name, customize NUM_NODES, NUM_GPUS_PER_NODE, and MEMORY in scripts/search_dist.sh, and run:
sh scripts/search_dist.sh
See more usage details of the customized parallelism optimization in Galvatron Model Usage.
Galvatron provides a simple way to train Transformer models in a fine-grained hybrid parallelism fashion. Users can either train Transformer models with the searched optimal parallel strategy by specifying the argument galvatron_config_path to obtain the optimal throughput, or use any parallel strategies they like. Galvatron supports two hybrid parallel config modes, including JSON config mode and GLOBAL config mode. Users can specify parallel strategies by modifying only a few arguments.
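For intuition only, below is a sketch of what a layer-wise hybrid parallel plan could conceptually contain. The key names are hypothetical and illustrative, not Galvatron's actual JSON schema; consult the config files generated under galvatron/models/model_name/configs for the real format expected by galvatron_config_path:

```python
# Hypothetical illustration of a layer-wise hybrid parallel plan. The key
# names are NOT Galvatron's actual JSON schema -- see the generated files
# under galvatron/models/<model_name>/configs for the real format.
import json

hybrid_parallel_plan = {
    "pipeline_parallel_size": 2,
    "layers": [
        # Each transformer layer may use its own combination of strategies.
        {"layer": 0, "tp": 4, "dp": 2, "zero_stage": 2, "checkpoint": False},
        {"layer": 1, "tp": 2, "dp": 4, "zero_stage": 3, "checkpoint": True},
    ],
}

with open("hypothetical_hybrid_parallel_plan.json", "w") as f:
    json.dump(hybrid_parallel_plan, f, indent=2)
```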
To train the model with Galvatron, cd galvatron/models/model_name, set NUM_NODES, NUM_GPUS_PER_NODE, MASTER_ADDR, MASTER_PORT, and NODE_RANK, and run:
sh scripts/train_dist.sh
See detailed guidance and more customized training options in Galvatron Model Usage.
(Logos: Huawei | ZTE | Alibaba | ByteDance)
Check our release plan for upcoming features.
File an issue or contact us via email: Xinyi Liu ([email protected]), Yujie Wang ([email protected]), or Shenhan Zhu ([email protected]).
Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui; VLDB 2023, CCF-A. [paper] [arxiv]
Improving automatic parallel training via balanced memory workload optimization. Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui; TKDE 2024, CCF-A. [paper] [arxiv]
FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism. Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui; ASPLOS 2025, CCF-A. [arxiv]
If you use Galvatron in your research, please cite the following paper:
@article{DBLP:journals/pvldb/MiaoWJSNZ022,
author = {Xupeng Miao and
Yujie Wang and
Youhe Jiang and
Chunan Shi and
Xiaonan Nie and
Hailin Zhang and
Bin Cui},
title = {Galvatron: Efficient Transformer Training over Multiple GPUs Using
Automatic Parallelism},
journal = {Proc. {VLDB} Endow.},
volume = {16},
number = {3},
pages = {470--479},
year = {2022},
url = {https://www.vldb.org/pvldb/vol16/p470-miao.pdf},
}