Releases · axonn-ai/axonn
v0.2.0
What's Changed
- Update README by @bhatele in #16
- fix evaluation bug for inter-layer by @siddharth9820 in #18
- Support for intra-layer parallelism by @siddharth9820 in #21
- add checkpointing and post backward hook support by @siddharth9820 in #24
- docs: fix readthedocs.org build issues by @bhatele in #26
- fix g_intra print by @zsat in #27
- Tests: convert memopt to int before bool by @adityaranjan in #28
- Docs: installation and running MNIST test by @adityaranjan in #29
- add 2D tensor parallelism for FC layers by @siddharth9820 in #30
- readme: add slack link by @bhatele in #31
- CI/CD tests for intra-layer parallelism by @siddharth9820 in #33
- add AxoNN logo by @bhatele in #34
- changes to the intra-layer API for the GPT benchmark by @siddharth9820 in #36
- add dependencies between workflows by @bhatele in #41
- [WIP] ILP Conv Layer support by @prajwal1210 in #38
- Intra-layer - Overlap communication in backward pass by @siddharth9820 in #44
- [WIP] A tensor parallel API for beginners by @siddharth9820 in #40
- first iteration of 3D tensor parallelism by @siddharth9820 in #49
- Initialize layers on the GPU by @siddharth9820 in #51
- add option to change batch dimension in drop by @siddharth9820 in #52
- change outer variables by @siddharth9820 in #53
- A context manager to optimize communication by @siddharth9820 in #54
- Rebase axonn-cpu to master by @Avuxon in #56
- More communication optimizations by @siddharth9820 in #57
- Parallel transformers by @jwendlan in #59
- Added Depth Tensor Parallelism to Conv Layer by @prajwal1210 in #60
- Change overlap for depth tp and do not initialize MPI unless absolutely needed by @siddharth9820 in #62
- removed mpi4py dependency by @S-Mahua in #63
- adding parallelize context for opt by @jwendlan in #65
- Removing the drop and gathers in depth tensor parallelism for the easy API by @siddharth9820 in #66
- change parallelize context to use AutoConfig by @siddharth9820 in #67
- Bugfix: Initialize grad_input, grad_weight to None by @adityaranjan in #68
- docs: fix build issues and add sub-sections by @bhatele in #69
- added automatic_parallelism by @S-Mahua in #70
- Shard the dataloader across both depth and data parallel ranks by @siddharth9820 in #74
- Make monkeypatching more efficient and change easy API to a single argument by @siddharth9820 in #72
- Add API for tensor parallel model checkpointing by @siddharth9820 in #77
- Changes to fix issues in IFT by @siddharth9820 in #78
- AxonnStrategy for Lightning Fabric backend by @anishbh in #76
- initial doc for EasyAPI, Accelerate, and FT example by @jwendlan in #73
- User guide changes by @siddharth9820 in #80
- Update advanced.rst by @siddharth9820 in #81
- More Lightning features by @siddharth9820 in #82
- Supporting init_module, load/save checkpoint by @siddharth9820 in #83
- make no-grad-sync yield None by @siddharth9820 in #88
- create an engine for all things pipelining and deprecate custom mixed precision by @siddharth9820 in #91
- Tensor parallel embedding by @siddharth9820 in #93
- Reduce AxoNN's memory consumption by @siddharth9820 in #95
- Correct URL of CI tests badge by @siddharth9820 in #99
- reorg code and first implementation of the new easy API by @siddharth9820 in #96 (see the sketch after this list)
- Minor changes for Release 0.2.0 by @siddharth9820 in #100
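Several of the entries above (#70, #72, #96) introduce an "easy API" for tensor parallelism. The sketch below shows roughly how such an interface would be used; the import paths, `auto_parallelize`, and the `init()` arguments are assumptions inferred from the PR titles, not a confirmed interface, so consult the AxoNN documentation for the released API.

```python
# Hypothetical sketch of the easy API referenced above (PRs #70, #72, #96).
# The module paths, auto_parallelize, and the init() arguments are assumptions
# inferred from the PR titles; consult the AxoNN docs for the released interface.
import torch
from axonn import axonn as ax                    # assumed entry point
from axonn.intra_layer import auto_parallelize   # assumed easy-API context manager

# Set up AxoNN's process groups (argument names are assumptions).
ax.init(G_data=2, G_inter=1, G_intra_r=2, G_intra_c=1, G_intra_d=1)

# Building the model inside the context manager lets AxoNN monkeypatch
# torch.nn.Linear into its tensor-parallel equivalent (PR #72).
with auto_parallelize():
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()
```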
New Contributors
- @zsat made their first contribution in #27
- @adityaranjan made their first contribution in #28
- @prajwal1210 made their first contribution in #38
- @jwendlan made their first contribution in #59
- @S-Mahua made their first contribution in #63
- @anishbh made their first contribution in #76
Full Changelog: v0.1.0...v0.2.0
AxoNN 0.1.0
AxoNN is a parallel framework for training deep neural networks.
Features:
- Offers a hybrid of inter-layer parallelism with pipelining and data parallelism.
- Supports both 16-bit mixed precision and 32-bit full precision training (illustrated in the generic sketch after this list).
- A highly efficient and scalable implementation of inter-layer parallelism with pipelining, which uses asynchronous MPI-based communication and message-driven scheduling to achieve significant overlap of computation and communication.
- Memory optimizations that can reduce model state memory consumption by 5x for mixed-precision training with the Adam optimizer, indirectly improving hardware efficiency as well.
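For orientation, here is a minimal sketch of the data-parallel, 16-bit mixed-precision training pattern the features above refer to, written with plain PyTorch (`DistributedDataParallel` and `torch.cuda.amp`) rather than AxoNN's own API; AxoNN layers pipelined inter-layer parallelism and its memory optimizations on top of this pattern.

```python
# Generic PyTorch sketch (not AxoNN's API) of data parallelism plus 16-bit
# mixed precision, the two building blocks listed in the features above.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = DDP(torch.nn.Linear(1024, 1024).cuda())  # data parallelism: all-reduce gradients
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()             # loss scaling for fp16

x = torch.randn(32, 1024, device="cuda")
with torch.cuda.amp.autocast():                  # 16-bit mixed-precision forward pass
    loss = model(x).square().mean()
scaler.scale(loss).backward()                    # backward on the scaled loss
scaler.step(optimizer)                           # unscale gradients, apply Adam update
scaler.update()
```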