<a href="https://github.com/ParCoreLab/Uniconn" class="text-xl font-semibold font-sans visited:text-teal-700">Unified Communication Library</a>
</div>
<p class="text-lg">Modern HPC and AI systems increasingly rely on multi-GPU clusters, where communication libraries such as MPI, NCCL/RCCL, and NVSHMEM enable data movement across GPUs. While these libraries are widely used in frameworks and solver packages, their distinct APIs, synchronization models, and integration mechanisms introduce programming complexity and limit portability. Performance also varies across workloads and system architectures, making it difficult to achieve consistent efficiency. These issues present a significant obstacle to writing portable, high-performance code for large-scale GPU systems. We present Uniconn, a unified, portable, high-level C++ communication library that supports both point-to-point and collective operations across GPU clusters. Uniconn enables seamless switching between backends and APIs (host or device) with minimal or no changes to application code. We describe its design and core constructs, and evaluate its performance using network benchmarks, a Jacobi solver, and a Conjugate Gradient solver. Across three supercomputers, we compare Uniconn's overhead against CUDA/ROCm-aware MPI, NCCL/RCCL, and NVSHMEM on up to 64 GPUs. In most cases, Uniconn incurs negligible overhead, typically under 1% for the Jacobi solver and under 2% for the Conjugate Gradient solver.</p>
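<p class="text-lg">To make the idea concrete, the sketch below shows the kind of backend-agnostic interface such a library can expose. It is an illustration only: the names (<code>unc::Communicator</code>, <code>sendrecv</code>, and so on) are hypothetical placeholders, not Uniconn's actual API.</p>
<pre><code>// Hypothetical sketch of a unified communication interface; every name
// here is illustrative, not Uniconn's real API. The backend is picked at
// construction time, and application code stays unchanged across backends.
#include &lt;cstddef&gt;

namespace unc {

enum class Backend { MPI, NCCL, NVSHMEM };

class Communicator {
public:
    explicit Communicator(Backend b);   // initialize the chosen backend
    int rank() const;                   // this process's rank
    int size() const;                   // total number of ranks
    // Point-to-point: a combined send/receive, identical on every backend.
    void sendrecv(const double *sendbuf, int dst,
                  double *recvbuf, int src, std::size_t n);
    // Collective: sum-reduce into recvbuf on every rank.
    void allreduce_sum(const double *sendbuf, double *recvbuf, std::size_t n);
};

} // namespace unc

// A ring halo exchange written once against the interface runs on any backend.
void exchange_halos(unc::Communicator &amp;comm,
                    const double *send_halo, double *recv_halo, std::size_t n) {
    int right = (comm.rank() + 1) % comm.size();
    int left  = (comm.rank() + comm.size() - 1) % comm.size();
    comm.sendrecv(send_halo, right, recv_halo, left, n);
}
</code></pre>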
<p>
<a href="https://github.com/ParCoreLab/Uniconn" class="text-xl font-semibold font-sans visited:text-teal-700">More details and the project's Git repository.</a>
<a href="https://github.com/ParCoreLab/aCG" class="text-xl font-semibold font-sans visited:text-teal-700">CPU- and GPU-initiated Communication Strategies for CG Methods</a>
</div>
<p class="text-lg">This work revisits Conjugate Gradient (CG) parallelization for large-scale multi-GPU systems, addressing challenges from low computational intensity and communication overhead. We develop scalable CG and pipelined CG solvers for NVIDIA and AMD GPUs, employing GPU-aware MPI, NCCL/RCCL, and NVSHMEM for both CPU- and GPU-initiated communication. A monolithic GPU-offloaded variant further enables fully device-driven execution, removing CPU involvement. Optimizations across all designs reduce data transfers and synchronization costs. Evaluations on SuiteSparse matrices and a real finite element application show 8–14% gains over state-of-the-art on single GPUs and 5–15% improvements in strong scaling tests on over 1,000 GPUs. While CPU-driven variants currently benefit from stronger library support, results highlight the promising scalability of GPU-initiated execution for future large-scale systems.</p>
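<p class="text-lg">For a flavor of GPU-initiated communication, the minimal NVSHMEM sketch below (illustrative only, not the aCG solver code) has each PE push its local dot-product partial to every peer directly from a kernel, keeping the CPU out of the exchange.</p>
<pre><code>// Minimal NVSHMEM sketch of GPU-initiated communication (illustrative;
// not the aCG solver code). Each PE pushes its local dot-product partial
// to every peer straight from the kernel, so the CPU never enters the path.
// Build with nvcc against NVSHMEM; run with nvshmemrun.
#include &lt;cuda_runtime.h&gt;
#include &lt;nvshmem.h&gt;
#include &lt;nvshmemx.h&gt;
#include &lt;cstdio&gt;

__global__ void publish_partial(double *partials, double local,
                                int mype, int npes) {
    if (threadIdx.x == 0) {
        for (int pe = 0; pe &lt; npes; ++pe)
            nvshmem_double_p(&amp;partials[mype], local, pe); // device-initiated put
        nvshmem_quiet();  // order the puts before the kernel completes
    }
}

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    // Symmetric buffer: one slot per PE, allocated on every PE.
    double *partials = (double *)nvshmem_malloc(npes * sizeof(double));
    double local = 1.0 + mype;  // stand-in for this PE's local dot product

    publish_partial&lt;&lt;&lt;1, 32&gt;&gt;&gt;(partials, local, mype, npes);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();      // all partials are now visible on every PE

    // Each PE can now finish the reduction on device; omitted for brevity.
    printf("PE %d of %d published its partial\n", mype, npes);

    nvshmem_free(partials);
    nvshmem_finalize();
    return 0;
}
</code></pre>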
<p>
<a href="https://github.com/ParCoreLab/aCG" class="text-xl font-semibold font-sans visited:text-teal-700">More details and the project's Git repository.</a>
<a href="https://github.com/ParCoreLab/" class="text-xl font-semibold font-sans visited:text-teal-700">CPU-Free Model Compiler</a>
</div>
<p class="text-lg">We're actively crafting a compiler to empower developers to write high-level Python code that compiles into efficient CPU-free device code. This compiler integrates GPU-initiated communication libraries, NVSHMEM for NVIDIA and ROC_SHMEM for AMD, enabling GPU communication directly within Python code. With automatic generation of GPU-initiated communication calls and persistent kernels, we aim to streamline development workflows.</p>
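<p class="text-lg">The sketch below illustrates the style of device code such a compiler targets: a hand-written example (not actual compiler output) of a persistent Jacobi kernel, where the whole iteration loop runs on the GPU and grid-wide barriers replace per-step kernel launches. Generated code would additionally issue NVSHMEM or ROC_SHMEM calls for inter-GPU halo exchange.</p>
<pre><code>// Hand-written sketch of the style of code such a compiler could emit
// (illustrative; not actual compiler output). One persistent kernel owns
// the whole 1D Jacobi iteration loop; grid.sync() replaces per-iteration
// kernel launches, so the CPU is idle after launch. The kernel must be
// started with cudaLaunchCooperativeKernel for grid.sync() to be valid.
#include &lt;cooperative_groups.h&gt;
namespace cg = cooperative_groups;

__global__ void jacobi_persistent(float *u_old, float *u_new,
                                  int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    const int tid    = (int)grid.thread_rank();
    const int stride = (int)grid.size();
    for (int it = 0; it &lt; iters; ++it) {
        // Grid-stride update of the interior points.
        for (int i = tid + 1; i &lt; n - 1; i += stride)
            u_new[i] = 0.5f * (u_old[i - 1] + u_old[i + 1]);
        grid.sync();  // device-wide barrier, no CPU round-trip
        float *tmp = u_old; u_old = u_new; u_new = tmp;  // identical local swap on all threads
    }
}
</code></pre>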
<p>
<a href="https://github.com/ParCoreLab/CPU-Free-model" class="text-xl font-semibold font-sans visited:text-teal-700">More details and the project's Git repository.</a>
<div class="card text-lg"> Javid Baydamirli, Tal Ben Nun, Didem Unat (2024) <a href="https://ieeexplore.ieee.org/abstract/document/10820747">Autonomous Execution for Multi-GPU Systems:
Compiler Support</a> <a href="https://docs.google.com/presentation/d/1nBsANrcLh0Tnc2qqqDL_-6khqo-Y-_mX5kfJbmRwawE/edit?slide=id.p#slide=id.p">(presentation)</a>. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. <a class="italic" download href="./assets/preprint-pdfs/P3HPC_____Autonomous_Execution_for_Multi_GPU_Systems__Compiler_Support-2 (1).pdf">preprint pdf</a>
</div>
<div class="card text-lg"> Javid Baydamirli, Tal Ben Nun, Didem Unat (2024) <a href="https://sc24.supercomputing.org/proceedings/workshops/workshop_pages/ws_p3hpc108.html">Autonomous Execution for Multi-GPU Systems:
Compiler Support</a> <a href="https://sc24.conference-program.com/presentation/?id=ws_p3hpc108&sess=sess751">(presentation)</a>. In the 2024 International Workshop on Performance, Portability, and Productivity in HPC. <a class="italic" download href="./assets/preprint-pdfs/sc24-workshop-autonomous-execution-for-multi-gpu-systems-compiler-support.pdf">preprint pdf</a>