\documentclass{article}
\usepackage{amsmath, amssymb}
\usepackage[left=1in, right=1in, top=1in, bottom=1in]{geometry}
\usepackage{hyperref}
\usepackage[utf8]{inputenc}
\title{Draft: Testing FSDP as a viable alternative to DeepSpeed}
\author{}
\begin{document}
\maketitle
Training large language models (LLMs) at scale is at the forefront of
distributed computing research. As both the data and the models grow larger,
parallelization schemes for each become critical to training efficiency. Among
the available frameworks, DeepSpeed and PyTorch FSDP are leading candidates
for deploying distributed LLM training. It is therefore timely to compare the
performance and scaling efficiency of FSDP and DeepSpeed on ALCF machines.
This effort is important for ALCF in order to better support users who will
run distributed LLM training on ALCF systems, and it will also benefit the
AuroraGPT project.
\section*{Concrete Tasks}
\begin{itemize}
\item Explore the scope of adopting additional parallelization schemes with
      FSDP, such as tensor parallelism and sequence parallelism; this would
      require development work (a minimal FSDP sketch follows this list).
\item Explore compute and communication overheads at scale.
\item Profile both frameworks in detail to identify bottlenecks (a hedged
      profiling sketch is also given below).
\item Identify opportunities to leverage the available system architectures.
\end{itemize}
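As a concrete starting point for the first task, the following is a minimal
sketch of wrapping a model with PyTorch FSDP. The toy model, batch, and launch
setup (one process per GPU, e.g.\ via \texttt{torchrun}, with an NCCL backend
on NVIDIA hardware) are illustrative assumptions only; the backend and
wrapping policy would be tuned for the actual ALCF systems:

\begin{verbatim}
# Minimal FSDP sketch; the toy model and batch stand in for a real LLM.
# Assumes one process per GPU (e.g., launched via torchrun) and NCCL.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(),
                      nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # shards parameters, gradients, and optimizer state

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")  # synthetic batch
loss = model(x).pow(2).mean()
loss.backward()      # FSDP performs all-gather / reduce-scatter here
optimizer.step()
dist.destroy_process_group()
\end{verbatim}

For the profiling task, a similarly hedged sketch using the built-in
\texttt{torch.profiler} is shown below. Here \texttt{train\_step} and
\texttt{loader} are hypothetical names for whatever training loop is being
measured; the same wrapper would apply around a DeepSpeed engine step:

\begin{verbatim}
# Hedged profiling sketch; train_step and loader are placeholders.
import torch
from torch.profiler import profile, schedule, ProfilerActivity

prof_schedule = schedule(wait=1, warmup=1, active=3)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=prof_schedule,
             on_trace_ready=torch.profiler.tensorboard_trace_handler(
                 "./log")) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # advance the wait/warmup/active schedule
\end{verbatim}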
\end{document}