End-to-End LLM Model Development with Torchtitan and Torchtune #341
Open. KeitaW wants to merge 856 commits into `main` from `torchtitan-torchtune`.
0e7a427
Update README.md
KeitaW d6e56b2
Update README
KeitaW 5c9ecae
update scripts
KeitaW b30d471
update log file name
KeitaW 5d160c6
Remove 14k log lines
736a029
Merge pull request #216 from aws-samples/reduce-build-log-efa-exporter
KeitaW ae6b020
Merge pull request #215 from aws-samples/pytorch-cpu-ddp-conda-enroot
KeitaW 61ddfa5
Merge pull request #214 from aws-samples/smph-fix-dcgm-exporter-gpu-util
mhuguesaws 098c222
Change nccl version to 2.20.3
mhuguesaws 0e77ca5
Merge pull request #217 from aws-samples/nccl_tests_version_changes
verdimrc 1a88359
smp v2 llama2 training example using fp8
arunkumarl87 318f9d9
Update 3.container-train.sbatch
KeitaW 3dc358c
Merge pull request #221 from aws-samples/KeitaW-patch-1
verdimrc bea1b68
Added second subnet for other AWS services which require multi-AZ
shimomut da7a51d
Removed FSXSecurityGroup as it is unused
shimomut 1de3e5a
Renamed resources to Primary/Backup Subnet
shimomut 018f4e9
Revert "Removed FSXSecurityGroup as it is unused"
shimomut a474f65
Merge branch 'hyperpod_backup_subnet_20240326'
shimomut 447d45c
Rename 0.crate-conda-env.sh to 0.create-conda-env.sh
sean-smith c6a146b
Merge pull request #225 from aws-samples/sean-smith-patch-2
KeitaW 67d6af7
Deleted unused security group FSXSecurityGroup
shimomut 8d59eef
Merge pull request #222 from shimomut/main
shimomut ac8f5bd
Added comments to conda setup scripts
arunkumarl87 44701fd
Merge pull request #218 from aruncs2005/main
aruncs2005 73b2ccb
Update 1.conda-train.sbatch
KeitaW 4aa19c5
Update 3.container-train.sbatch
KeitaW 437783a
Merge pull request #229 from aws-samples/KeitaW-patch-1
KeitaW c608899
updated pytorch version to 2.2
johnbensnyder 3adaa5c
Validate Json in preflight check
sean-smith 7d25c4a
Merge pull request #233 from aws-samples/validate-json
KeitaW 7e76b5d
Adding for ActiveDirectory/LDAPS integration for HyperPod (#224)
shimomut 6fc2e9d
DCGM exporter Updates - added responses to comments. create systemd s…
nghtm 5debc4f
updated comments with references
johnbensnyder 3cf0ea2
Merge pull request #230 from johnbensnyder/fsdp_version_update
verdimrc 13a8b32
smhp: shorter wget log (less 3k lines)
a069026
smhp: increase apt lock timeout
0bb48b3
Merge pull request #236 from aws-samples/smhp-shorter-log
KeitaW a477f22
Merge pull request #227 from nghtm/exporter-updates
mhuguesaws d46de06
Update setup_conda_env.sh
aruncs2005 5b748ab
Merge pull request #237 from aruncs2005/main
perifaws d4c68bb
Merge pull request #235 from aws-samples/smhp-apt-lock-timeout
perifaws d08ba66
Deprecate chrony timesync
d53342d
Driver for apply hotfix
605d9e6
smhp: add hotfix to hold lustre client
4967a8a
smhp: hotfix to mock gpu .deb package
b156771
Revert "DCGM exporter Updates - added responses to comments. create s…
mhuguesaws 1e04b44
Enabled Prometheus agent mode.
giuseppeporcelli 43b0f84
Merge pull request #238 from giuseppeporcelli/main
verdimrc 621ac21
Merge pull request #226 from aws-samples/upstream-ppc-v2403.02
verdimrc 69734b2
updated pytorch version to 2.2
johnbensnyder 8cc9fb3
updated comments with references
johnbensnyder 8d5daac
smhp: shorter wget log (less 3k lines)
bdf9d10
DCGM exporter Updates - added responses to comments. create systemd s…
nghtm 96818b5
Update setup_conda_env.sh
aruncs2005 521b1b9
smhp: increase apt lock timeout
046270b
Revert "DCGM exporter Updates - added responses to comments. create s…
mhuguesaws 7b6ce70
Revert "DCGM exporter Updates - added responses to comments. create s…
mhuguesaws 29eb8be
Enabled Prometheus agent mode.
giuseppeporcelli afa41f3
Enabled Prometheus agent mode.
giuseppeporcelli d256f8f
Deprecate chrony timesync
306c8b4
Deprecate chrony timesync
15c8108
Driver for apply hotfix
5590534
Driver for apply hotfix
2621a2c
smhp: add hotfix to hold lustre client
c537bbc
smhp: add hotfix to hold lustre client
d0fe50d
smhp: hotfix to mock gpu .deb package
df1b0e3
smhp: hotfix to mock gpu .deb package
08fd1b6
Bump pillow from 10.2.0 to 10.3.0 in /3.test_cases/4.DDP
dependabot[bot] 13a441d
use docker restart always for DCGM and EFA NODE containers, to sustai…
nghtm 0625a02
use docker restart always for DCGM and EFA NODE containers, to sustai…
nghtm ba7a748
Merge branch 'aws-samples:main' into exporter-updates
nghtm 577aace
Merge branch 'aws-samples:main' into exporter-updates
nghtm 84999f2
Merge pull request #240 from nghtm/exporter-updates
mhuguesaws a73df1b
Merge pull request #240 from nghtm/exporter-updates
mhuguesaws 23f6963
Adding comment why we uninstall ec2-instance-connect (#241)
shimomut 2cd2b1e
Adding comment why we uninstall ec2-instance-connect (#241)
shimomut 8f874e0
Merge pull request #239 from aws-samples/dependabot/pip/3.test_cases/…
perifaws cb2c9d6
Merge pull request #239 from aws-samples/dependabot/pip/3.test_cases/…
perifaws 06f78ee
pcluster: add a small util script to fetch config from a running cluster
6343a90
pcluster: add a small util script to fetch config from a running cluster
6cb9781
smhp: fix issue #243
6e90775
smhp: fix issue #243
a73ac66
Merge pull request #244 from aws-samples/smhp-dpkg-retry
verdimrc fdd246c
Merge pull request #244 from aws-samples/smhp-dpkg-retry
verdimrc 57415ae
start adding deepspeed example
KeitaW 32fe215
start adding deepspeed example
KeitaW f130f6b
Activate conda environment
sean-smith 3cdb653
Activate conda environment
sean-smith bdf2487
Merge pull request #245 from aws-samples/sean-smith-patch-2
KeitaW 8c4dae9
Merge pull request #245 from aws-samples/sean-smith-patch-2
KeitaW a8b8aad
adopt code from megotron-deepspeed repository
KeitaW 29da5fb
adopt code from megotron-deepspeed repository
KeitaW 2da855d
Fix typo in 15.gpt-neox README
KeitaW 58bb612
Fix typo in 15.gpt-neox README
KeitaW eb891e4
Update README.md
KeitaW 3c88a6f
Update README.md
KeitaW 34ba897
cleanup
KeitaW e682407
cleanup
KeitaW 74b8479
update readme
KeitaW d09696c
update readme
KeitaW ea15006
update
KeitaW ab93f3c
update
KeitaW 0b214b1
update
KeitaW 525159e
update
KeitaW e0aa9d0
cleanup
KeitaW 1a39144
cleanup
KeitaW 7d0dfab
Update 2.train-mpt-manual-distributed.sbatch
KeitaW c05056d
Update 2.train-mpt-manual-distributed.sbatch
KeitaW 18fd145
Merge pull request #248 from aws-samples/KeitaW-patch-2
verdimrc 55d9397
Merge pull request #248 from aws-samples/KeitaW-patch-2
verdimrc 7fffe50
Merge pull request #242 from aws-samples/pcluster-util-fetch-config
verdimrc b38410a
Merge pull request #242 from aws-samples/pcluster-util-fetch-config
verdimrc a6f9581
update
KeitaW 538d64f
update
KeitaW 00c07ac
update
KeitaW fa68fc3
update
KeitaW 6c99cba
update
KeitaW 580e825
update
KeitaW c0b2a85
update
KeitaW 82a69ea
update
KeitaW a5ce2b2
removed
KeitaW 8f35ef6
removed
KeitaW 6b28131
Skip incomplete checkpoints in FSDP sample app (#251)
shimomut d16967a
Skip incomplete checkpoints in FSDP sample app (#251)
shimomut 41052b7
Enable Auto-resume
sean-smith 429fe94
Enable Auto-resume
sean-smith 35af783
Validate provisioning_parameters.json
sean-smith ef4b597
Validate provisioning_parameters.json
sean-smith 53ba90e
Merge pull request #253 from aws-samples/validate-config
KeitaW db79703
Merge pull request #253 from aws-samples/validate-config
KeitaW bffd4e6
Merge pull request #247 from aws-samples/deepspeed
KeitaW aae94b1
Merge pull request #247 from aws-samples/deepspeed
KeitaW ea0a581
Bump transformers in /3.test_cases/12.SM-dataparallel-FSDP/scripts
dependabot[bot] 7d2fc9c
Bump transformers in /3.test_cases/12.SM-dataparallel-FSDP/scripts
dependabot[bot] dac63aa
Bump transformers in /3.test_cases/13.SM-dataparallel-deepspeed/code
dependabot[bot] a361538
Bump transformers in /3.test_cases/13.SM-dataparallel-deepspeed/code
dependabot[bot] e7824e8
Merge pull request #255 from aws-samples/dependabot/pip/3.test_cases/…
KeitaW 313712d
Merge pull request #255 from aws-samples/dependabot/pip/3.test_cases/…
KeitaW dc4c301
Merge pull request #254 from aws-samples/dependabot/pip/3.test_cases/…
KeitaW b680627
Merge pull request #254 from aws-samples/dependabot/pip/3.test_cases/…
KeitaW b7f6ff8
Merge pull request #246 from aws-samples/KeitaW-patch-1
verdimrc 284ea5a
Merge pull request #246 from aws-samples/KeitaW-patch-1
verdimrc efbbe53
nemo-launcher: support nemo-launcher with patch version; increase ver…
1bca0ff
nemo-launcher: support nemo-launcher with patch version; increase ver…
654ba82
Merge pull request #258 from aws-samples/nemo-launcher-bcm
KeitaW b6461fb
Merge pull request #258 from aws-samples/nemo-launcher-bcm
KeitaW 2eca998
torchtune usecase
pbelevich db51efe
torchtune usecase
pbelevich 6002abb
torchtune usecase
pbelevich 6416560
add initial draft
KeitaW a8e5bba
add initial draft
KeitaW 0105a19
add initial draft
KeitaW d5f3555
add docs
KeitaW 345b729
add docs
KeitaW 001a09d
add docs
KeitaW 0b4e8e5
update
KeitaW 309ef58
update
KeitaW 2003183
update
KeitaW ab5d3d5
reorganize
KeitaW 2726b58
reorganize
KeitaW 1e8d3e8
reorganize
KeitaW f98accf
update
KeitaW 4c0c69d
update
KeitaW c907933
update
KeitaW 678f985
current state
KeitaW 8b65168
current state
KeitaW ac4c45f
current state
KeitaW 27d6967
update
KeitaW 94a0dbc
update
KeitaW e9ac7d2
update
KeitaW 5b91caf
Make *.sh files executable
pbelevich e63f623
Make *.sh files executable
pbelevich 244db77
Make *.sh files executable
pbelevich 04adb95
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW 9a3160a
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW 28d43c2
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW 762f21e
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW 7ca1fc9
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW 5f7eb84
Update 3.test_cases/torchtitan-torchtune/slurm/README.md
KeitaW f1da782
local change
KeitaW 690df84
local change
KeitaW 83a7e2d
local change
KeitaW 5d8b9b7
update README.md
KeitaW e341ec3
update README.md
KeitaW fbdb034
update README.md
KeitaW c9f8ac7
update README
KeitaW 45ac55d
update README
KeitaW e5ac23f
update README
KeitaW fa15546
separate libraries
KeitaW f826581
separate libraries
KeitaW 0d15a9d
separate libraries
KeitaW dd13ba0
update README
KeitaW daa65b6
update README
KeitaW bfb35c6
update README
KeitaW ff02c72
move container image
KeitaW f55f60d
move container image
KeitaW 06e6df0
move container image
KeitaW 0adc47f
update
KeitaW 948418a
update
KeitaW 9af9bb9
update
KeitaW a55c333
update to make it compatible with SMHP
KeitaW 13a5aff
update to make it compatible with SMHP
KeitaW e24b21a
update to make it compatible with SMHP
KeitaW 08342fc
update readme
KeitaW 9db7487
update readme
KeitaW c7a8cf0
update readme
KeitaW 6cb0dc3
update
KeitaW 5276eda
update
KeitaW fbd278e
update
KeitaW 523de6e
update
KeitaW b322db3
update
KeitaW 3eba59c
update
KeitaW 19e4cba
update README
KeitaW a604cf0
update README
KeitaW 392d28e
update README
KeitaW 6a940fb
update script
KeitaW fe62d87
update script
KeitaW 1be9da2
update script
KeitaW 6d4da01
remove torchtitan
KeitaW 19dd02c
remove torchtitan
KeitaW 1e5aedd
remove torchtitan
KeitaW 7b292ab
update tutorials
KeitaW c8ceac8
update tutorials
KeitaW e6c47cf
update tutorials
KeitaW 89b4927
update LoRA part WIP
KeitaW b024895
update LoRA part WIP
KeitaW b195f0b
update LoRA part WIP
KeitaW fbabf38
update
KeitaW 1026a36
update
KeitaW eacd729
update
KeitaW 33e99e2
update
KeitaW ebf995c
update
KeitaW e9abb4e
update
KeitaW 3083036
clean up
KeitaW c0397a4
clean up
KeitaW 7eb0f66
clean up
KeitaW 3b0d9e6
update
KeitaW 332285e
update
KeitaW d4029d2
update
KeitaW ae98bf9
update
KeitaW 64e0724
update
KeitaW 00dfbf5
update
KeitaW b929043
Merge branch 'torchtitan-torchtune' of github.com:aws-samples/awsome-…
KeitaW 4ac5496
Merge branch 'torchtitan-torchtune' of github.com:aws-samples/awsome-…
KeitaW 952eba3
update
KeitaW 563e807
update
KeitaW 71c33f6
Merge branch 'main' into torchtitan-torchtune
KeitaW 77d4908
Update 3.test_cases/torchtune/slurm/README.md
KeitaW 0133094
Update 3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-developm…
KeitaW f8833b7
Update 3.test_cases/torchtune/slurm/README.md
KeitaW
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
|
||
**Torchtitan** is a pioneering library for large-scale LLM training utilizing native PyTorch. It highlights PyTorch's latest distributed training features through a clean, minimalistic codebase. | ||
|
||
Characteristics of Torchtitan include: | ||
|
||
* User-friendly design, making it easy to understand, use, and extend for various training purposes. | ||
* Minimal modifications required to the model code for applying 1D, 2D, or upcoming 3D parallelism. | ||
* A modular approach over a monolithic codebase, facilitating quick start-ups. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
#!/bin/bash | ||
|
||
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
# SPDX-License-Identifier: MIT-0 | ||
|
||
#SBATCH --job-name=pretrain | ||
#SBATCH --nodes=2 | ||
#SBATCH --ntasks=2 | ||
#SBATCH --gpus-per-node=8 # Number of GPU per node | ||
#SBATCH --output=logs/%x_%j.out # logfile for stdout | ||
#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs | ||
#SBATCH --wait-all-nodes=1 | ||
#SBATCH --exclusive | ||
set -euxo pipefail | ||
|
||
################################################################## | ||
############# Load environment variables ######################### | ||
################################################################## | ||
# Load environment variables | ||
if [ ! -f .env ] | ||
then | ||
echo "Please create a .env file with the required environment variables" | ||
exit 1 | ||
else | ||
source .env | ||
fi | ||
|
||
################################################################## | ||
######### Define EFA/NCCL/Slurm environment variables ############ | ||
################################################################## | ||
## EFA settings | ||
export FI_LOG_LEVEL=1 | ||
export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons | ||
export FI_EFA_USE_HUGE_PAGE=0 | ||
# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352 | ||
# https://github.com/pytorch/pytorch/issues/68893 | ||
export NCCL_SOCKET_IFNAME=en | ||
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 | ||
export NCCL_DEBUG=INFO | ||
export HOSTNAMES=$(scontrol show hostnames "$SLURM_JOB_NODELIST") | ||
export COUNT_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l) | ||
# Bash cannot export arrays, so the node list is kept as a plain shell array. | ||
NODES_ARRAY=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") ) | ||
export HEAD_NODE=${NODES_ARRAY[0]} | ||
# The batch script runs on the first allocated node, so its IP address is used as the master address. | ||
export MASTER_ADDR=$(hostname --ip-address) | ||
export MASTER_PORT=$RANDOM | ||
export NNODES=$SLURM_JOB_NUM_NODES | ||
export NPROC=$SLURM_GPUS_PER_NODE | ||
export WORLD_SIZE=$(( $NNODES * $NPROC )) | ||
|
||
################################################################## | ||
############### Create train config ############################## | ||
################################################################## | ||
|
||
if [ ! -d ${FSX_PATH}/tmp ]; then | ||
mkdir -p ${FSX_PATH}/tmp | ||
fi | ||
cat ${PWD}/train_configs/pretrain_llama3_70b.toml | envsubst > ${FSX_PATH}/tmp/pretrain_llama3_70b.toml | ||
|
||
################################################################## | ||
################# Set arguments ################################## | ||
################################################################## | ||
|
||
: "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}" | ||
declare -a SRUN_ARGS=( | ||
--container-image $ENROOT_IMAGE | ||
--container-mounts $CONTAINER_MOUNT | ||
) | ||
declare -a TORCHRUN_ARGS=( | ||
--master_addr $MASTER_ADDR | ||
--master_port $MASTER_PORT | ||
# change this to match the number of gpus per node: | ||
--nproc_per_node=8 | ||
--nnodes=$NNODES | ||
--rdzv_backend=c10d | ||
--rdzv_endpoint=$(hostname) | ||
) | ||
declare -a TRAIN_ARGS=( | ||
--job.config_file ${FSX_PATH}/tmp/pretrain_llama3_70b.toml | ||
) | ||
|
||
srun -l "${SRUN_ARGS[@]}" \ | ||
torchrun "${TORCHRUN_ARGS[@]}" ${PWD}/../torchtitan/train.py "${TRAIN_ARGS[@]}" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,234 @@ | ||
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
# SPDX-License-Identifier: MIT-0 | ||
|
||
#################################################################################################### | ||
# This is a sample Dockerfile, with optional stanzas. Please read through this Dockerfile, | ||
# understand what it does, then create your own Dockerfile. | ||
# | ||
# Sample build instructions: | ||
# | ||
# docker build --progress=plain -t nvidia-pt-od:latest -f 0.nvcr-pytorch-aws.dockerfile . | ||
# rm /fsx/nvidia-pt-od__latest.sqsh ; enroot import -o /fsx/nvidia-pt-od__latest.sqsh dockerd://nvidia-pt-od:latest | ||
# | ||
# Compute nodes (aka build nodes) are transient, so we need to keep the docker image on shared fs, | ||
# which head node can load into its local registry. | ||
# | ||
# # Build node: save image to file | ||
# docker save nvidia-pt-od:latest > /fsx/nvidia-pt-od__latest.tar | ||
# | ||
# # Load image to local docker registry -> on head node, or new compute/build node. | ||
# docker load < /fsx/nvidia-pt-od__latest.tar | ||
#################################################################################################### | ||
FROM nvcr.io/nvidia/pytorch:24.04-py3 | ||
ENV DEBIAN_FRONTEND=noninteractive | ||
|
||
# The three must-be-built packages. | ||
# Efa-installer>=1.29.0 required for nccl>=2.19.0 to avoid libfabric NCCL error. | ||
ARG EFA_INSTALLER_VERSION=1.31.0 | ||
ARG AWS_OFI_NCCL_VERSION=v1.8.1-aws | ||
ARG NCCL_TESTS_VERSION=2.13.9 | ||
ARG NCCL_VERSION=2.20.3-1 | ||
|
||
RUN apt-get update -y | ||
RUN apt-get remove -y --allow-change-held-packages \ | ||
libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1 | ||
|
||
# We noticed that since 23.09, we can't just delete the whole /opt/hpcx/, otherwise `import torch` | ||
# complains about missing libuc?.so. | ||
RUN rm -rf /opt/hpcx/ompi \ | ||
&& rm -rf /usr/local/mpi \ | ||
&& rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \ | ||
&& ldconfig | ||
ENV OPAL_PREFIX= | ||
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \ | ||
git \ | ||
gcc \ | ||
vim \ | ||
kmod \ | ||
openssh-client \ | ||
openssh-server \ | ||
build-essential \ | ||
curl \ | ||
autoconf \ | ||
libtool \ | ||
gdb \ | ||
automake \ | ||
cmake \ | ||
apt-utils \ | ||
libhwloc-dev \ | ||
aptitude && \ | ||
DEBIAN_FRONTEND=noninteractive apt autoremove -y | ||
|
||
# EFA | ||
RUN apt-get update && \ | ||
cd /tmp && \ | ||
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \ | ||
tar -xf aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \ | ||
cd aws-efa-installer && \ | ||
# ONLY add `--skip-kmod`, `--no-verify` and `--skip-limit-conf` flags to container image. | ||
# Those three flags must NOT be used on the host. | ||
# | ||
# Explanations: | ||
# - to build EFA in the Dockerfile, we added --skip-kmod and --no-verify. Without these flags, | ||
# the Dockerfile will fail to build. If installing EFA on the host and not in a container, | ||
# please remove these flags. | ||
# - The --skip-limit-conf can be retained in Dockerfile, but it's redundant as the host already | ||
# has these limits set by efa_installer. | ||
./efa_installer.sh -y -g -d --skip-kmod --no-verify --skip-limit-conf && \ | ||
ldconfig && \ | ||
rm -rf /tmp/aws-efa-installer /var/lib/apt/lists/* | ||
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH | ||
ENV PATH=/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:$PATH | ||
|
||
|
||
#################################################################################################### | ||
# [CUSTOM_NCCL_OPTION_1] Uncomment below stanza to install another NCCL version using the official | ||
# binaries. | ||
# | ||
# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the | ||
# aws-ofi-nccl. | ||
#################################################################################################### | ||
# RUN cd /opt && \ | ||
# wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb && \ | ||
# dpkg -i cuda-keyring_1.0-1_all.deb && \ | ||
# apt update && \ | ||
# apt install -y libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} && \ | ||
# echo NCCL_SOCKET_IFNAME=^docker0,lo >> /etc/nccl.conf | ||
|
||
|
||
#################################################################################################### | ||
# [CUSTOM_NCCL_OPTION_2] Install NCCL from source to the same location as the built-in ones. The | ||
# benefits of installing to the same location as the built-in version are: | ||
# | ||
# 1. There's only ever a single libnccl version offered by this image, preventing applications from | ||
# mistakenly choosing the wrong version. | ||
# 2. No longer needing extra settings for LD_LIBRARY_PATH or LD_PRELOAD. | ||
# | ||
# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the | ||
# aws-ofi-nccl. | ||
#################################################################################################### | ||
RUN cd /tmp \ | ||
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION} \ | ||
&& cd nccl \ | ||
&& make -j src.build BUILDDIR=/usr \ | ||
# Build for p4 & p5. | ||
NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80" \ | ||
&& rm -rf /tmp/nccl \ | ||
&& echo NCCL_SOCKET_IFNAME=^docker0,lo >> /etc/nccl.conf | ||
|
||
|
||
#################################################################################################### | ||
# Rebuild OpenMPI with custom PMIX version. E.g., to match what host's Slurm is built with (see | ||
# /opt/pmix/ on host, or run pmix_info on host). | ||
# | ||
# May be needed on rare occasions when `srun --mpi=pmix --container-image=... <mpi_application>` | ||
# mysteriously crashes. | ||
# | ||
# NCCL EFA plugin (aws-ofi-nccl) depends on mpi, hence we must rebuild openmpi before building the | ||
# aws-ofi-nccl. | ||
#################################################################################################### | ||
ENV OPEN_MPI_PATH=/opt/amazon/openmpi | ||
|
||
# OpenMPI build script claims PMIX_VERSION, and complains if we use it. | ||
ENV CUSTOM_PMIX_VERSION=4.2.6 | ||
RUN apt-get update && apt-get install -y libevent-dev \ | ||
&& cd /tmp \ | ||
&& wget https://github.com/openpmix/openpmix/releases/download/v${CUSTOM_PMIX_VERSION}/pmix-${CUSTOM_PMIX_VERSION}.tar.gz \ | ||
&& tar -xzf pmix-${CUSTOM_PMIX_VERSION}.tar.gz \ | ||
&& rm pmix-${CUSTOM_PMIX_VERSION}.tar.gz \ | ||
&& cd pmix-${CUSTOM_PMIX_VERSION}/ \ | ||
&& ./autogen.pl \ | ||
&& ./configure --prefix=/opt/pmix \ | ||
&& make -j \ | ||
&& make install \ | ||
&& echo /opt/pmix/lib > /etc/ld.so.conf.d/pmix.conf \ | ||
&& ldconfig \ | ||
&& cd / \ | ||
&& rm -fr /tmp/pmix-${CUSTOM_PMIX_VERSION}/ | ||
# To silence this runtime error message: | ||
# [p4de-st-p4de-2:110912] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168 | ||
ENV PMIX_GDS_MODULE=^ds12 \ | ||
PMIX_MCA_gds=^ds12 | ||
|
||
# Rebuild openmpi with DLC style (which it remarks as "without libfabric"), with the above pmix. | ||
ENV OMPI_VERSION=4.1.6 | ||
RUN rm -fr ${OPEN_MPI_PATH} \ | ||
&& mkdir /tmp/openmpi \ | ||
&& cd /tmp/openmpi \ | ||
&& wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \ | ||
&& tar zxf openmpi-${OMPI_VERSION}.tar.gz \ | ||
&& rm openmpi-${OMPI_VERSION}.tar.gz \ | ||
&& cd openmpi-${OMPI_VERSION} \ | ||
&& ./configure --enable-orterun-prefix-by-default --prefix=$OPEN_MPI_PATH --with-cuda=${CUDA_HOME} --with-slurm --with-pmix=/opt/pmix \ | ||
&& make -j $(nproc) all \ | ||
&& make install \ | ||
&& ldconfig \ | ||
&& cd / \ | ||
&& rm -rf /tmp/openmpi \ | ||
&& ompi_info --parsable --all | grep mpi_built_with_cuda_support:value \ | ||
# Verify pmix from /opt/pmix/ | ||
&& ldd /opt/amazon/openmpi/lib/openmpi/mca_pmix_ext3x.so | grep '/opt/pmix/lib/libpmix.so.* ' > /opt/amazon/openmpi-pmix.txt | ||
#################################################################################################### | ||
|
||
|
||
## NCCL EFA Plugin | ||
#RUN mkdir -p /tmp && \ | ||
# cd /tmp && \ | ||
# curl -LO https://github.com/aws/aws-ofi-nccl/archive/refs/tags/v${AWS_OFI_NCCL_VERSION}.tar.gz && \ | ||
# tar -xzf /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \ | ||
# rm /tmp/v${AWS_OFI_NCCL_VERSION}.tar.gz && \ | ||
# mv aws-ofi-nccl-${AWS_OFI_NCCL_VERSION} aws-ofi-nccl && \ | ||
# cd /tmp/aws-ofi-nccl && \ | ||
# ./autogen.sh && \ | ||
# ./configure --prefix=/opt/amazon/efa \ | ||
# --with-libfabric=/opt/amazon/efa \ | ||
# --with-cuda=/usr/local/cuda \ | ||
# --enable-platform-aws \ | ||
# --with-mpi=/opt/amazon/openmpi && \ | ||
# make -j$(nproc) install && \ | ||
# rm -rf /tmp/aws-ofi-nccl | ||
|
||
################################################### | ||
## Install AWS-OFI-NCCL plugin | ||
RUN apt-get install libtool autoconf cmake nasm unzip pigz parallel nfs-common build-essential hwloc libhwloc-dev libjemalloc2 libnuma-dev numactl libjemalloc-dev preload htop iftop liblapack-dev libgfortran5 ipcalc wget curl devscripts debhelper check libsubunit-dev fakeroot pkg-config dkms -y | ||
RUN export OPAL_PREFIX="" \ | ||
&& git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \ | ||
&& cd /opt/aws-ofi-nccl \ | ||
&& git checkout ${AWS_OFI_NCCL_VERSION} \ | ||
&& ./autogen.sh \ | ||
&& ./configure --prefix=/opt/aws-ofi-nccl/install \ | ||
--with-mpi=/opt/amazon/openmpi \ | ||
--with-libfabric=/opt/amazon/efa \ | ||
--with-cuda=/usr/local/cuda \ | ||
--enable-platform-aws \ | ||
&& make -j $(nproc) && make install | ||
|
||
|
||
# Do this to minimize the ld path env vars that users need to define when running this image. | ||
RUN echo "/usr/local/lib" >> /etc/ld.so.conf.d/local.conf && \ | ||
echo "/opt/amazon/openmpi/lib" >> /etc/ld.so.conf.d/efa.conf && \ | ||
ldconfig | ||
|
||
ENV OMPI_MCA_pml=^cm,ucx \ | ||
OMPI_MCA_btl=tcp,self \ | ||
OMPI_MCA_btl_tcp_if_exclude=lo,docker0 \ | ||
OPAL_PREFIX=/opt/amazon/openmpi \ | ||
# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352 | ||
# https://github.com/pytorch/pytorch/issues/68893 | ||
NCCL_SOCKET_IFNAME=^docker,lo | ||
|
||
ENV LD_LIBRARY_PATH="/usr/local/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}" | ||
|
||
# NCCL-tests: always good to include this as a diagnostic tool. | ||
RUN git clone https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \ | ||
&& cd /opt/nccl-tests \ | ||
&& git checkout v${NCCL_TESTS_VERSION} \ | ||
&& make MPI=1 \ | ||
MPI_HOME=/opt/amazon/openmpi \ | ||
CUDA_HOME=/usr/local/cuda \ | ||
NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_80,code=sm_80" | ||
|
||
|
||
RUN pip install accelerate appdirs loralib bitsandbytes datasets fire peft "transformers>=4.40.0" sentencepiece wandb vllm gradio openai | ||
RUN pip install hydra-core huggingface_hub safetensors tiktoken "blobfile>=2" tqdm torchao==0.1 "lm_eval==0.4.*" | ||
RUN pip uninstall -y transformer-engine |
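Review note on the `pip install` lines: a version specifier containing `>` needs quoting inside a shell-form `RUN`, because the shell parses an unquoted `>=4.40.0` as "redirect stdout to a file named `=4.40.0`" and the constraint never reaches pip. The sketch below illustrates the effect; `echo` stands in for `pip install`, and the temporary directory and variable names are assumptions made for the demonstration, not part of the Dockerfile.

```shell
#!/bin/bash
# Illustration only: `echo` stands in for `pip install`.
workdir=$(mktemp -d)
cd "$workdir"

# Unquoted: the shell treats `>=4.40.0` as a redirection to a file named "=4.40.0",
# so the command itself only sees "install transformers".
echo install transformers>=4.40.0
unquoted_args=$(cat "=4.40.0")

# Quoted: the whole requirement string reaches the command as a single argument.
quoted_args=$(echo install "transformers>=4.40.0")

echo "unquoted: $unquoted_args"
echo "quoted:   $quoted_args"
```

The same applies to any specifier using `>`, `<`, or glob characters such as `*`.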
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
torchtune | ||
.env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# End-to-End LLM Model Development with Torchtune <!-- omit in toc --> | ||
|
||
This guide demonstrates the comprehensive process of developing a Large Language Model (LLM) from start to finish using [Torchtune](https://github.com/pytorch/torchtune). The journey of creating an LLM encompasses five pivotal steps: | ||
|
||
![LLMOps](docs/LLMOps.png) | ||
|
||
1. **(Continuous) Pretraining the Language Model**: The language model is first pretrained on a vast corpus of text data so that it learns the general patterns and structures of language. This step can be skipped when starting from an already pretrained model. Refer to the `torchtitan` test case for large-scale pretraining with recent techniques such as 3D parallelism and `torch.compile`. | ||
|
||
2. **Instruction Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application. | ||
|
||
3. **Alignment**: The instruction-tuned model is further adjusted so that its responses match human preferences, typically by training on preference data (for example with RLHF or DPO). | ||
|
||
4. **Evaluation**: Evaluating the LLM's performance is a critical step. It involves using various metrics to assess the model's accuracy and effectiveness. This step is vital for validating new techniques and objectively comparing different model releases. | ||
|
||
5. **Deployment**: Upon achieving the desired performance, the model is deployed as an API. This deployment enables the model's integration into applications, making it accessible to users and other systems. | ||
|
||
Following these steps allows for the iterative development and refinement of a Large Language Model to meet specific needs and ensure its successful deployment. This guide specifically addresses all steps except the initial data preparation. The pretraining phase is facilitated by Torchtitan, while Torchtune manages the fine-tuning and evaluation phases. | ||
|
||
**Torchtune** is a PyTorch-native library for easily authoring, fine-tuning, and experimenting with LLMs. It is in its alpha release. | ||
|
||
Features of Torchtune encompass: | ||
|
||
* Native-PyTorch implementations of renowned LLMs using composable and modular building blocks. | ||
* Straightforward and adaptable training recipes for popular fine-tuning techniques such as LoRA and QLoRA, emphasizing a PyTorch-centric approach without the need for trainers or frameworks. | ||
* YAML configurations for simplifying the setup of training, evaluation, quantization, or inference recipes. | ||
* Comprehensive support for numerous popular dataset formats and prompt templates, ensuring a smooth start to training endeavors. | ||
|
||
This case study provides examples for two schedulers, Slurm and Kubernetes, with detailed instructions available in the `slurm` and `kubernetes` subdirectories. |
--nnodes twice
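One possible fix is to pass `--nnodes` exactly once in `TORCHRUN_ARGS`. This is a minimal sketch, assuming the c10d rendezvous settings are the ones intended to take effect. The fallback defaults (`:-2`, `:-8`), the use of `$HOSTNAME` in place of `$(hostname)`, and the counting loop are illustrative additions so the snippet can run outside a Slurm allocation; they are not part of the PR.

```shell
#!/bin/bash
# Sketch of a deduplicated torchrun argument list.
# Fallback values are assumptions for running outside Slurm.
NNODES=${SLURM_JOB_NUM_NODES:-2}
NPROC=${SLURM_GPUS_PER_NODE:-8}

declare -a TORCHRUN_ARGS=(
    --nproc_per_node="$NPROC"
    --nnodes="$NNODES"                          # specified once
    --rdzv_backend=c10d
    --rdzv_endpoint="${HOSTNAME:-localhost}"    # stands in for $(hostname)
)

# Count how many entries set --nnodes.
nnodes_count=0
for arg in "${TORCHRUN_ARGS[@]}"; do
    if [[ $arg == --nnodes* ]]; then
        nnodes_count=$((nnodes_count + 1))
    fi
done
echo "--nnodes occurrences: $nnodes_count"
```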