- Reuse fused parameter tensors in fuse_step (#410)
- Call step closure in qadam optimizer step (#432)
- Fix need_reset condition (#454)
- Do negotiation in async native op (#447)
- Fix find_unused_parameters (#452)
- Fix qadam non-deterministic behavior (#459)
- Add `LIBRARY_PATH` env in `install_master.sh` (#465)
- Fix typo in `install_master.sh` (#471)
- Fix CUDA 11.5 being unable to get the nccl package (#415)
- Fix process group compatibility with torch 1.6.0 (#413)
- Fix ci random fail (#445)
- Fix async algorithm (#479)
- Initial support for C interface (#325)
- Support NODE_RANK environment variable (#426)
- Choose bagua service port dynamically (#431)
- Use bagua_module_name to identify different modules (#438)
- Add algorithm registry (#433)
- Add compatibility for NCCL version under 2.10 (#449)
- Add broadcast object api (#437)
- Support qadam in fused optimizer (#477)
- Support PyTorch DDP compatible distributed training API (#312); see the usage sketch after this group of entries
- Support torch-api-compatible all_reduce (#377)
- Associate PyTorch Process Group with Bagua Process Group using cache (#402)
- Support find_unused_parameters on BaguaDDP (#409)
- Add `BAGUA_AUTOTUNE_SERVER_WAIT_TIME` env (#474)
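A minimal sketch of the PyTorch-DDP-style entry point added above, assuming the documented `bagua.torch_api` API and a launch via `python -m bagua.distributed.launch`; the model and optimizer are placeholders:

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import gradient_allreduce

# Initialize Bagua's process group (reads env vars set by the launcher,
# e.g. NODE_RANK from #426).
bagua.init_process_group()
torch.cuda.set_device(bagua.get_local_rank())

model = torch.nn.Linear(16, 2).cuda()                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# with_bagua plays the role torch's DistributedDataParallel wrapper plays.
model = model.with_bagua(
    [optimizer], gradient_allreduce.GradientAllReduceAlgorithm()
)
```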
- Fix fused optimizer OOM and make it stateless (#207)
- Fix `to_bagua_tensor` compatibility with torch 1.6.0 (#355)
- Use separate process group for async communication thread to avoid potential hangs (#298)
- Do not fail if checkpoints path exist (#305)
- Fix is_moe_param (#306)
- Change `to_bagua_tensor` API to support PyTorch 1.10 (#338)
- Fix fused optimizer with multiple param groups (#356)
- Support switching between different algorithms (#299)
- Separate algorithm declaration and implementation (#246)
- Support process group in `with_bagua` and hierarchical communication in the bytegrad algorithm (#300); see the process group sketch after this group of entries
- Support mutable bucket tensors (#271)
- Support all_to_all_single (#361)
- Use single bucket for decentralized algorithm to improve performance (#275)
- Support process group (#228)
- Add barrier api (#290)
- Support moe (#208)
- Support checkpointing for moe (#242)
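A hedged sketch of process groups (#228, #300) and the barrier api (#290). The `new_group`/`barrier` helpers follow the torch.distributed-style naming these entries imply, and the `process_group=` keyword of `with_bagua` is an assumption:

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api import communication as comm
from bagua.torch_api.algorithms import gradient_allreduce

bagua.init_process_group()
torch.cuda.set_device(bagua.get_local_rank())

model = torch.nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Restrict communication to a subgroup of the first two ranks.
group = comm.new_group(ranks=[0, 1])
model = model.with_bagua(
    [optimizer],
    gradient_allreduce.GradientAllReduceAlgorithm(),
    process_group=group,  # assumed keyword, per "Support process group in with_bagua"
)

comm.barrier()  # barrier api: wait for all ranks before continuing
```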
- Only run publish once on git tag
- Fix compressed buffer failing to scatter to an odd number of ranks
- Fix ci pypi versioning
- Remove `__init__.py` and python version, use cargo version
- Move import bagua_install_library to install library function
- Merge bagua_install_library and setup.py, remove nccl<=2.6 support
- Fix alltoall_v parameter (#17)
- Reduce and allgather python interface
- Fix decompress incorrect pointer and typo in error msg
- Fix python gil deadlock during getting data ptr
- Fix benchmark script requirements
- Fix alltoall_v parameter types (#27)
- Always mark bagua padding tensor as ready
- Make compress/decompress of BaguaTensor `method` string consistent (#33)
- Fix scatter and reduce_scatter implementation (#40)
- Fix subtraction overflow error for decentralized op (#39)
- Fix QADAM params (#17)
- Fix assert precision (#18)
- Replace mutex with atomic bool for async op and add Aluminum submodule update (#67)
- Fix duplicated dependency downloading during installation (#77)
- Fix async algorithm aborting and hanging (#78, #81)
- Fix qadam algorithm call (#20)
- Fix missing symbols in the zip library (#24)
- Fix random autotune server hang (#206)
- Fix Bagua-Net library path mismatch, and make `--enable_bagua_net` argument style consistent with other args (#218)
- Fix random autotune-service hang
- Handle conflicts caused by sklearn upgrade (#225)
- Only publish pypi for master commits
- Add async model average algorithm (#110); see the sketch below
- Add cached dataset wrapper (#148)
- Support sync batchnorm (#151)
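A hedged sketch of the async model average algorithm (#110), following the names used in Bagua's async examples; the `sync_interval_ms` value is illustrative (cf. the sync interval param in #19):

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms import async_model_average

bagua.init_process_group()
torch.cuda.set_device(bagua.get_local_rank())

model = torch.nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Model averaging runs in a background communication thread.
algorithm = async_model_average.AsyncModelAverageAlgorithm(sync_interval_ms=500)
model = model.with_bagua([optimizer], algorithm)

# ... training loop ...

model.bagua_algorithm.abort(model)  # stop the background averaging thread
```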
- Add `--enable-bagua-net` option in launcher (#183)
- Add pytorch examples for MNIST, ImageNet, SQuAD training (#1)
- Add requirements.txt, only download dataset on local rank 0 (#2)
- Add python packaging related files
- Add `__version__` variable
- Install nccl deps in bagua core and add generated `__version__` variable
- Add version.py placeholder to prevent file not found error
- Initial support for python op (#2)
- Add 5 min timeout for buckets' comm op (#5)
- Replace NCCL with Aluminum (#7)
- Add synthetic benchmark script (#5)
- Add elastic training example (#7)
- Support alltoall_v (vector alltoall) (#14)
- Add reduce and allgather python interface
- Support reduce and allgather op with Reduction op enum
- Support creating BaguaTensor by passing torch tensor directly (#19)
- Compatible mode for getting pytorch tensor info with Python interpreter
- Better debug log including tensor info when executing ops
- Add native low precision decentralized operator (#26)
- Add (scatter, gather, scatter_reduce) and in-place versions of all communication primitives (#37); see the communication sketch after this group of entries
- Make full precision decentralized op stateless (#36)
- Add communication_primitives example (#12)
- Use nccl 2.10 avg op for all algorithms using averaging (#46, #45)
- Add opentelemetry to report tensor ready order (#42)
- Add deterministic flag (#15)
- Add native async model average algorithm (#41)
- Add examples for async model average algorithm (#14)
- Support packet splitting and multi-stream parallel transmission (#5)
- Support ncclnet v3 and remove the dependency on nccl in the installation environment (#17)
- Add sync interval param to async examples (#19)
- Support tokio backend (#21)
- Support bagua-net (#89)
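A hedged sketch of the torch-style communication primitives listed above (allreduce, allgather, etc.); exact signatures have varied across versions, so the send/recv-tensor form below is an assumption:

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api import communication as comm

bagua.init_process_group()
torch.cuda.set_device(bagua.get_local_rank())

send = torch.ones(4).cuda()
recv = torch.zeros(4).cuda()
comm.allreduce(send, recv)  # reduce across all ranks; the default op differs by version

# allgather concatenates one slot per rank into the receive tensor.
gathered = torch.zeros(4 * bagua.get_world_size()).cuda()
comm.allgather(send, gathered)
```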
- Broadcast scalars for optimizers (#202)
- Make compress/decompress of BaguaTensor `method` string consistent (#33)
- Fix scatter and reduce_scatter implementation (#40)
- Fix subtraction overflow error for decentralized op (#39)
- Fix autotune api conflict (#131)
- Fix autotune pytest running forever (#132)
- Fix bagua.distributed.run --is_output_autotune_log parsing (#145)
- Fix QADAM params (#17)
- Fix assert precision (#18)
- Fix torch version check (#150)
- Add native low precision decentralized operator (#26)
- Add low precision decentralized algorithm (#103)
- Add (scatter, gather, scatter_reduce) and in-place versions of all communication primitives (#37)
- Add all communication primitives such as send recv to communication module (#128)
- Make full precision decentralized op stateless (#126)
- Make full precision decentralized op stateless (#36)
- Add communication_primitives example (#12)
- Support duplicated parameters across different modules (#147)
- Support nccl 2.10 ReduceOp.AVG (#149)
- Support nccl 2.10 ncclAvg (#45)
- Use nccl 2.10 avg op for all algorithms using averaging (#46)
- Add opentelemetry to report tensor ready order (#42)
- Add support for reporting tensor completion order (#146)
- Add deterministic flag (#15)
- Autotune service defaults to a fixed random seed (#117)
- Improve autotune speed metrics measurement for better accuracy (#86)
- `install.sh` will not install Rust if it already exists on the system
- `install.sh` upgrades an existing bagua installation
- Sort q_adam variables for better performance (#102)
- Better debug log including tensor info when executing ops
- Support multiple models on autotune service (#107)
- Support multiple models in buckets registration (#113)
- Support different ssh port on different nodes (#93)
- Fix QAdam gradient not being a BaguaTensor during the first stage
- Fix alltoall_v parameter types (#27)
- Fix BaguaBucket.clear_ops() return value
- Always mark bagua padding tensor as ready
- Fix append python op callable reference
- BaguaBucket.tensors should only contain the originally passed-in tensors
- Add append op methods to python `BaguaBucket` class (#87)
- Wrap python op in communication stream context by default
- Broadcast model parameters on every algorithm reset
- Add QAdam algorithm (#92); usage sketch below
- The environment variable LOCAL_SIZE has been renamed to LOCAL_WORLD_SIZE (#51)
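A hedged usage sketch for QAdam (#92), mirroring the flow in Bagua's QAdam documentation; `warmup_steps` controls how many full-precision Adam steps run before compressed communication kicks in, and the value here is illustrative:

```python
import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms.q_adam import QAdamAlgorithm, QAdamOptimizer

bagua.init_process_group()
torch.cuda.set_device(bagua.get_local_rank())

model = torch.nn.Linear(16, 2).cuda()

# QAdamOptimizer replaces torch.optim.Adam; QAdamAlgorithm wires it into Bagua.
optimizer = QAdamOptimizer(model.parameters(), lr=1e-3, warmup_steps=100)
model = model.with_bagua([optimizer], QAdamAlgorithm(optimizer))
```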
- Fix alltoall_v parameter (#17)
- Reduce and allgather python interface
- Fix decompress incorrect pointer and typo in error msg
- Fix python gil deadlock during getting data ptr
- Auto installation for centos (#66)
- Fix algorithm pre-forward hook not being returned
- Fix benchmark script requirements
- Add synthetic benchmark script (#5)
- Auto installation supports centos (#50)
- Add elastic training example (#7)
- Support alltoall_v (vector alltoall) (#14)
- Add reduce and allgather python interface
- Support reduce and allgather op with Reduction op enum
- Support reduction op and reduce
- Support creating BaguaTensor by passing torch tensor directly (#19)
- Compatible mode for getting pytorch tensor info with Python interpreter
- Add algorithm import in bagua.torch_api
- Add all algorithms import in bagua.torch_api.algorithms
- Do not set up python dependencies when performing CodeQL check
- Remove logging in load balancing dataloader to avoid deadlock (#35)
- Add back user interfacing imports in `__init__.py` (#38)
- Fix bucket size switch not effective (#48)
- Add broadcast_buffer in bagua_init (#29)
- Elastic training (#31)
- Add 5 min timeout for buckets' comm op (#5)
- Replace NCCL with Aluminum (#7)
- Add dependency installation script for ubuntu (#41)
- Fix ci pypi versioning
- Remove `__init__.py` and python version, use cargo version
- Only run publish once on git tag
- Fix baguaelastic launcher
- Fix baguaelastic launch script
- Fix setup.py for low version setuptools (#14)
- Move import bagua_install_library to install library function
- Merge bagua_install_library and setup.py, remove nccl<=2.6 support
- Add pytorch examples for MNIST, ImageNet, SQuAD training (#1)
- Add requirements.txt, only download dataset on local rank 0 (#2)
- Initial commit of bagua core impl
- Add python packaging related files
- Only publish pypi for master commits
- Add version variable
- Install nccl deps in bagua core and add generated version variable
- Initial public release of bagua python code
- Update interface and docs for the load-balanced dataloader and add docs for the fused optimizer (#17); see the fused optimizer sketch at the end of this list
- Add version.py placeholder to prevent file not found error
- Initial support for python op (#2)
- Support new backend with python op support (#26)
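A hedged sketch of the fused optimizer referenced throughout this changelog (#207, #410, #477); `fuse_optimizer` and `fuse_step` appear in these entries, but the exact module path below is an assumption:

```python
import torch
from bagua.torch_api.contrib import fuse_optimizer  # assumed location

model = torch.nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Fusing flattens parameters into contiguous buffers so one kernel can
# update many small tensors at once.
optimizer = fuse_optimizer(optimizer)

# In the training loop, fuse_step() stands in for step() (#410 reuses the
# fused parameter tensors across calls).
loss = model(torch.randn(8, 16).cuda()).sum()
loss.backward()
optimizer.fuse_step()
```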