forked from microsoft/msccl
Enable msccl capability (microsoft#3) (microsoft#4)
* initial checkin
* add submodule nccl-tests and update the readme
* Update README with MSCCL scheduler
* update submodule to latest

---------

Co-authored-by: Ziyue Yang <[email protected]>
Co-authored-by: root <root@liand-h100-validation-vmss000002.wxea2wklo2jenp1trbnjn0dkpb.jx.internal.cloudapp.net>
1 parent 56b1667, commit 2c98bd4
Showing 5 changed files with 110 additions and 29 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
[submodule "executor/msccl-executor-nccl"] | ||
path = executor/msccl-executor-nccl | ||
url = https://github.com/Azure/msccl-executor-nccl.git | ||
branch = main | ||
[submodule "tests/msccl-tests-nccl"] | ||
path = tests/msccl-tests-nccl | ||
url = https://github.com/Azure/msccl-tests-nccl.git | ||
branch = main |
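
The two submodule entries above are only fetched automatically when the repository is cloned with `--recurse-submodules`. The following is a minimal sketch, not part of the commit, for checking an existing clone; the helper name `list_submodule_paths` is hypothetical, and it assumes a `.gitmodules` laid out like the one shown in this diff.

```sh
# Hypothetical helper: print the submodule paths declared in a .gitmodules
# file (defaults to ./.gitmodules). It only reads "path = ..." lines.
list_submodule_paths() {
    sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' "${1:-.gitmodules}"
}

# In a clone made without --recurse-submodules, the submodules can be fetched with:
#   git submodule update --init --recursive
# Afterwards, report any declared path that is still empty:
#   for p in $(list_submodule_paths); do
#       [ -n "$(ls -A "$p" 2>/dev/null)" ] || echo "missing: $p"
#   done
```

For the file above, the helper would be expected to print `executor/msccl-executor-nccl` and `tests/msccl-tests-nccl`, one per line.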
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,100 @@ | ||
# Project | ||
# MSCCL | ||
|
||
> This repo has been populated by an initial template to help get you started. Please | ||
> make sure to update the content to build a great experience for community-building. | ||
Microsoft Collective Communication Library (MSCCL) is a platform to execute custom collective communication algorithms for multiple accelerators supported by Microsoft Azure. The research prototype of this project is [microsoft/msccl](https://github.com/microsoft/msccl). | ||
|
||
As the maintainer of this project, please make a few updates: | ||
## Introduction | ||
|
||
- Improving this README.MD file to provide a great experience | ||
- Updating SUPPORT.MD with content about this project's support experience | ||
- Understanding the security reporting process in SECURITY.MD | ||
- Remove this section from the README | ||
MSCCL's vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms across multiple accelerators. To achieve this, MSCCL has multiple components: | ||
|
||
- [MSCCL toolkit](https://github.com/microsoft/msccl-tools): Interconnects among accelerators have different latencies and bandwidths. Therefore, a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. To provide this flexibility, the MSCCL toolkit allows a user to write a hyper-optimized collective communication algorithm for a given topology and buffer size. The MSCCL toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor ([msccl-executor-nccl](https://github.com/Azure/msccl-executor-nccl)) to run on the backend. [Example](#example) shows how the MSCCL toolkit works with the runtime. Please refer to [MSCCL toolkit](https://github.com/microsoft/msccl-tools) for more information. | ||
|
||
- [MSCCL scheduler](https://github.com/microsoft/msccl-scheduler): MSCCL scheduler provides an example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors. | ||
|
||
- MSCCL executor ([msccl-executor-nccl](https://github.com/Azure/msccl-executor-nccl)): msccl-executor-nccl is an inter-accelerator communication framework that is built on top of [NCCL](https://github.com/nvidia/nccl) and uses its building blocks to execute custom-written collective communication algorithms. | ||
|
||
- MSCCL test toolkit ([msccl-tests-nccl](https://github.com/Azure/msccl-tests-nccl)): These tests check both the performance and the correctness of MSCCL operations. | ||
|
||
## Example | ||
|
||
In order to use MSCCL, you may follow these steps to use two different MSCCL algorithms for AllReduce on Azure NDv4 which has 8xA100 GPUs: | ||
|
||
Follow the steps below to download the source code of msccl and its submodules: | ||
|
||
```sh | ||
$ git clone https://github.com/microsoft/msccl.git --recurse-submodules | ||
``` | ||
|
||
Steps to install MSCCL executor: | ||
|
||
```sh | ||
$ git clone https://github.com/microsoft/msccl.git --recurse-submodules | ||
$ cd msccl/executor/msccl-executor-nccl | ||
$ make -j src.build | ||
$ cd ../ | ||
$ cd ../ | ||
``` | ||
|
||
Then, follow these steps to install msccl-tests-nccl for performance evaluation: | ||
|
||
```sh | ||
$ cd tests/msccl-tests-nccl/ | ||
$ make MPI=1 NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j | ||
$ cd ../ | ||
$ cd ../ | ||
``` | ||
|
||
Next install [MSCCL toolkit](https://github.com/microsoft/msccl-tools) to compile a few custom algorithms: | ||
|
||
```sh | ||
$ git clone https://github.com/microsoft/msccl-tools.git | ||
$ cd msccl-tools/ | ||
$ pip install . | ||
$ cd ../ | ||
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml | ||
$ cd ../ | ||
``` | ||
|
||
The compiler's generated code is an XML file (`test.xml`) that is fed to the MSCCL runtime. To evaluate its performance, copy `test.xml` to `msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/` and execute the following command line on an Azure NDv4 node or any 8xA100 system: | ||
|
||
```sh | ||
$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0 | ||
``` | ||
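
The copy step described before the `mpirun` command can be scripted. This is a minimal sketch rather than anything shipped with MSCCL: the function name `install_msccl_algorithm` is hypothetical, and it assumes the algorithm directory sits under the submodule path declared in `.gitmodules` (`executor/msccl-executor-nccl`). It creates the directory if it does not exist, which may hide a build that has not been run yet.

```sh
# Hypothetical helper: copy a compiled algorithm XML into the executor's
# msccl-algorithms directory.
#   $1 = path to the XML file (e.g. test.xml)
#   $2 = root of the msccl checkout (assumed default: ./msccl)
install_msccl_algorithm() {
    xml=$1
    root=${2:-msccl}
    dest="$root/executor/msccl-executor-nccl/build/lib/msccl-algorithms"
    # Refuse to continue when the XML file is missing.
    [ -f "$xml" ] || { echo "no such file: $xml" >&2; return 1; }
    mkdir -p "$dest" && cp "$xml" "$dest/"
}

# Example (from the directory that contains the msccl checkout):
#   install_msccl_algorithm test.xml msccl
```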
|
||
If everything is installed correctly, you should see the following output in the log: | ||
|
||
```sh | ||
[0] NCCL INFO Connected 1 MSCCL algorithms | ||
``` | ||
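
The check for that log line can be automated. The sketch below assumes the `mpirun` output was saved to a file or is piped in; the function name `msccl_algorithms_loaded` is hypothetical, and matching any algorithm count (not just `1`) is an assumption about how the message varies.

```sh
# Hypothetical helper: succeed if the given log files (or stdin when no
# file is given) contain the "Connected N MSCCL algorithms" message.
msccl_algorithms_loaded() {
    grep -Eq 'NCCL INFO Connected [0-9]+ MSCCL algorithms' "$@"
}

# Example, with run.log standing in for the captured mpirun output:
#   msccl_algorithms_loaded run.log || echo "MSCCL algorithm was not loaded" >&2
```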
|
||
You may evaluate the performance of `test.xml` by comparing in-place (the new algorithm) vs. out-of-place (default ring algorithm) results; the new algorithm should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. [MSCCL toolkit](https://github.com/microsoft/msccl-tools) has a rich set of algorithms for different Azure SKUs and collective operations with significant speedups over vanilla NCCL. | ||
|
||
## Build | ||
|
||
To build the library: | ||
|
||
```sh | ||
$ cd msccl/executor/msccl-executor-nccl | ||
$ make -j src.build | ||
``` | ||
|
||
If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with: | ||
|
||
```sh | ||
$ make src.build CUDA_HOME=<path to cuda install> | ||
``` | ||
|
||
MSCCL will be compiled and installed in `build/` unless `BUILDDIR` is set. | ||
|
||
By default, MSCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining `NVCC_GENCODE` (defined in `makefiles/common.mk`) to only include the architecture of the target platform: | ||
```sh | ||
$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" | ||
``` | ||
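
The three overrides mentioned in this section (`CUDA_HOME`, `BUILDDIR`, `NVCC_GENCODE`) can be combined in one `make` invocation. The sketch below only prints the command so it can be inspected first; the function name `msccl_make_cmd` is hypothetical, and it assumes the values contain no spaces apart from the quoted `NVCC_GENCODE`.

```sh
# Hypothetical helper: print (not run) a make command for msccl-executor-nccl.
# Reads CUDA_HOME, BUILDDIR and NVCC_GENCODE from the environment; unset or
# empty values are left out so the Makefile defaults apply.
msccl_make_cmd() {
    cmd="make -j src.build"
    if [ -n "${CUDA_HOME:-}" ]; then cmd="$cmd CUDA_HOME=$CUDA_HOME"; fi
    if [ -n "${BUILDDIR:-}" ]; then cmd="$cmd BUILDDIR=$BUILDDIR"; fi
    if [ -n "${NVCC_GENCODE:-}" ]; then cmd="$cmd NVCC_GENCODE=\"$NVCC_GENCODE\""; fi
    printf '%s\n' "$cmd"
}

# Example: review the command, then run it from the executor directory.
#   NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" msccl_make_cmd
#   (cd msccl/executor/msccl-executor-nccl && eval "$(msccl_make_cmd)")
```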
|
||
## Contributing | ||
|
||
This project welcomes contributions and suggestions. Most contributions require you to agree to a | ||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us | ||
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. | ||
the rights to use your contribution. For details, visit [CLA](https://cla.opensource.microsoft.com). | ||
|
||
When you submit a pull request, a CLA bot will automatically determine whether you need to provide | ||
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions | ||
|
@@ -26,8 +106,8 @@ contact [[email protected]](mailto:[email protected]) with any additio | |
|
||
## Trademarks | ||
|
||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft | ||
trademarks or logos is subject to and must follow | ||
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft | ||
trademarks or logos is subject to and must follow | ||
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). | ||
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. | ||
Any use of third-party trademarks or logos are subject to those third-party's policies. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,16 @@ | ||
# TODO: The maintainer of this repo has not yet edited this file | ||
|
||
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project? | ||
|
||
- **No CSS support:** Fill out this template with information about how to file issues and get help. | ||
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps. | ||
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide. | ||
|
||
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.* | ||
|
||
# Support | ||
|
||
## How to file issues and get help | ||
## How to file issues and get help | ||
|
||
This project uses GitHub Issues to track bugs and feature requests. Please search the existing | ||
issues before filing new issues to avoid duplicates. For new issues, file your bug or | ||
feature request as a new Issue. | ||
This project uses [GitHub Issues] to track bugs and feature requests. Please search the existing | ||
issues before filing new issues to avoid duplicates. For new issues, file your bug or | ||
feature request as a new issue. | ||
|
||
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE | ||
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER | ||
CHANNEL. WHERE WILL YOU HELP PEOPLE?**. | ||
For help and questions about using this project, please create a new post in [GitHub Discussions]. | ||
|
||
## Microsoft Support Policy | ||
## Microsoft Support Policy | ||
|
||
Support for this **PROJECT or PRODUCT** is limited to the resources listed above. | ||
|
||
[GitHub Issues]: https://github.com/Azure/msccl/issues | ||
[GitHub Discussions]: https://github.com/Azure/msccl/discussions |
Submodule msccl-executor-nccl added at 58c893
Submodule msccl-tests-nccl added at 958998