From 2c98bd424bcae01230d12729489b3c51640c675a Mon Sep 17 00:00:00 2001
From: Andy li
Date: Mon, 10 Jul 2023 20:26:13 +0800
Subject: [PATCH] Enable msccl capability (#3) (#4)

* initial checkin
* add submodule nccl-tests and update the readme
* Update README with MSCCL scheduler
* update submodule to latest

---------

Co-authored-by: Ziyue Yang
Co-authored-by: root
---
 .gitmodules                  |   8 +++
 README.md                    | 102 +++++++++++++++++++++++++++++++----
 SUPPORT.md                   |  27 ++++------
 executor/msccl-executor-nccl |   1 +
 tests/msccl-tests-nccl       |   1 +
 5 files changed, 110 insertions(+), 29 deletions(-)
 create mode 100644 .gitmodules
 create mode 160000 executor/msccl-executor-nccl
 create mode 160000 tests/msccl-tests-nccl

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..a942512
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,8 @@
+[submodule "executor/msccl-executor-nccl"]
+	path = executor/msccl-executor-nccl
+	url = https://github.com/Azure/msccl-executor-nccl.git
+	branch = main
+[submodule "tests/msccl-tests-nccl"]
+	path = tests/msccl-tests-nccl
+	url = https://github.com/Azure/msccl-tests-nccl.git
+	branch = main
diff --git a/README.md b/README.md
index 5cd7cec..59f3f09 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,100 @@
-# Project
+# MSCCL
 
-> This repo has been populated by an initial template to help get you started. Please
-> make sure to update the content to build a great experience for community-building.
+Microsoft Collective Communication Library (MSCCL) is a platform to execute custom collective communication algorithms for multiple accelerators supported by Microsoft Azure. The research prototype of this project is [microsoft/msccl](https://github.com/microsoft/msccl).
 
-As the maintainer of this project, please make a few updates:
+## Introduction
 
-- Improving this README.MD file to provide a great experience
-- Updating SUPPORT.MD with content about this project's support experience
-- Understanding the security reporting process in SECURITY.MD
-- Remove this section from the README
+MSCCL's vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms across multiple accelerators. To achieve this, MSCCL has multiple components:
+
+- [MSCCL toolkit](https://github.com/microsoft/msccl-tools): Interconnects among accelerators have different latencies and bandwidths. Therefore, a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. To provide this flexibility, the MSCCL toolkit allows a user to write a hyper-optimized collective communication algorithm for a given topology and buffer size. The MSCCL toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor ([msccl-executor-nccl](https://github.com/Azure/msccl-executor-nccl)) to run on the backend. [Example](#Example) shows how the MSCCL toolkit works with the runtime. Please refer to [MSCCL toolkit](https://github.com/microsoft/msccl-tools) for more information.
+
+- [MSCCL scheduler](https://github.com/microsoft/msccl-scheduler): The MSCCL scheduler provides an example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors.
+
+- MSCCL executor ([msccl-executor-nccl](https://github.com/Azure/msccl-executor-nccl)): msccl-executor-nccl is an inter-accelerator communication framework that is built on top of [NCCL](https://github.com/nvidia/nccl) and uses its building blocks to execute custom-written collective communication algorithms.
+
+- MSCCL test toolkit ([msccl-tests-nccl](https://github.com/Azure/msccl-tests-nccl)): These tests check both the performance and the correctness of MSCCL operations.
+
+## Example
+
+In order to use MSCCL, you may follow these steps to use two different MSCCL algorithms for AllReduce on Azure NDv4, which has 8xA100 GPUs:
+
+Follow the steps below to download the source code of msccl and related submodules:
+
+```sh
+$ git clone https://github.com/microsoft/msccl.git --recurse-submodules
+```
+
+Steps to install MSCCL executor:
+
+```sh
+$ git clone https://github.com/microsoft/msccl.git --recurse-submodules
+$ cd msccl/executor/msccl-executor-nccl
+$ make -j src.build
+$ cd ../
+$ cd ../
+```
+
+Then, follow these steps to install msccl-tests-nccl for performance evaluation:
+
+```sh
+$ cd tests/msccl-tests-nccl/
+$ make MPI=1 NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j
+$ cd ../
+$ cd ../
+```
+
+Next, install [MSCCL toolkit](https://github.com/microsoft/msccl-tools) to compile a few custom algorithms:
+
+```sh
+$ git clone https://github.com/microsoft/msccl-tools.git
+$ cd msccl-tools/
+$ pip install .
+$ cd ../
+$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
+$ cd ../
+```
+
+The compiler's generated code is an XML file (`test.xml`) that is fed to the MSCCL runtime. To evaluate its performance, copy `test.xml` to `msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/` and execute the following command line on an Azure NDv4 node or any 8xA100 system:
+
+```sh
+$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
+```
+
+If everything is installed correctly, you should see the following output in the log:
+
+```sh
+[0] NCCL INFO Connected 1 MSCCL algorithms
+```
+
+You may evaluate the performance of `test.xml` by comparing in-place (the new algorithm) vs out-of-place (default ring algorithm); it should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. [MSCCL toolkit](https://github.com/microsoft/msccl-tools) has a rich set of algorithms for different Azure SKUs and collective operations with significant speedups over vanilla NCCL.
+
+## Build
+
+To build the library:
+
+```sh
+$ cd msccl/executor/msccl-executor-nccl
+$ make -j src.build
+```
+
+If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:
+
+```sh
+$ make src.build CUDA_HOME=
+```
+
+MSCCL will be compiled and installed in `build/` unless `BUILDDIR` is set.
+
+By default, MSCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining `NVCC_GENCODE` (defined in `makefiles/common.mk`) to only include the architecture of the target platform:
+```sh
+$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
+```
 
 ## Contributing
 
 This project welcomes contributions and suggestions. Most contributions require you to agree to a
 Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
-the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+the rights to use your contribution. For details, visit [CLA](https://cla.opensource.microsoft.com).
 
 When you submit a pull request, a CLA bot will automatically determine whether you need to provide
 a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
@@ -26,8 +106,8 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio
 
 ## Trademarks
 
-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
-trademarks or logos is subject to and must follow 
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
+trademarks or logos is subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
diff --git a/SUPPORT.md b/SUPPORT.md
index 291d4d4..dce5fdb 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -1,25 +1,16 @@
-# TODO: The maintainer of this repo has not yet edited this file
-
-**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
-
-- **No CSS support:** Fill out this template with information about how to file issues and get help.
-- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
-- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
-
-*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
-
 # Support
 
-## How to file issues and get help  
+## How to file issues and get help
 
-This project uses GitHub Issues to track bugs and feature requests. Please search the existing 
-issues before filing new issues to avoid duplicates. For new issues, file your bug or 
-feature request as a new Issue.
+This project uses [GitHub Issues] to track bugs and feature requests. Please search the existing
+issues before filing new issues to avoid duplicates. For new issues, file your bug or
+feature request as a new issue.
 
-For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE 
-FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
-CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
+For help and questions about using this project, please create a new post in [GitHub Discussions].
 
-## Microsoft Support Policy  
+## Microsoft Support Policy
 
 Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
+
+[GitHub Issues]: https://github.com/Azure/msccl/issues
+[GitHub Discussions]: https://github.com/Azure/msccl/discussions
\ No newline at end of file
diff --git a/executor/msccl-executor-nccl b/executor/msccl-executor-nccl
new file mode 160000
index 0000000..58c893b
--- /dev/null
+++ b/executor/msccl-executor-nccl
@@ -0,0 +1 @@
+Subproject commit 58c893bc4c3a2b428b970279a399a33185882dd5
diff --git a/tests/msccl-tests-nccl b/tests/msccl-tests-nccl
new file mode 160000
index 0000000..958998b
--- /dev/null
+++ b/tests/msccl-tests-nccl
@@ -0,0 +1 @@
+Subproject commit 958998b7395bc2502cceb6f1b5b12d1083acc6b7
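
Note on the submodules added by this patch: the executor and the tests are recorded as gitlinks (mode 160000), so a checkout made without `--recurse-submodules` leaves both directories empty. The sketch below is not part of the patch. It assumes only that `git` is installed, and the scratch directory and the variable name `submodule_paths` are illustrative. It recreates the `.gitmodules` content from the hunk above and lists the registered submodule paths with `git config --file`:

```sh
#!/bin/sh
# Illustrative sketch, not part of the patch: inspect the .gitmodules entries
# added above without cloning anything.
set -eu

# Scratch directory (hypothetical), removed again when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Same entries as the .gitmodules hunk in this patch.
cat > "$workdir/.gitmodules" <<'EOF'
[submodule "executor/msccl-executor-nccl"]
    path = executor/msccl-executor-nccl
    url = https://github.com/Azure/msccl-executor-nccl.git
    branch = main
[submodule "tests/msccl-tests-nccl"]
    path = tests/msccl-tests-nccl
    url = https://github.com/Azure/msccl-tests-nccl.git
    branch = main
EOF

# --get-regexp prints "<key> <value>" for every key matching the pattern;
# cut keeps only the value, i.e. the submodule path.
submodule_paths=$(git config --file "$workdir/.gitmodules" \
    --get-regexp '^submodule\..*\.path$' | cut -d' ' -f2)

printf '%s\n' "$submodule_paths"
```

In an existing clone that was made without `--recurse-submodules`, running `git submodule update --init --recursive` from the repository root should populate both paths at the commits pinned by this patch.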