diff --git a/3.test_cases/torchtune/README.md b/3.test_cases/torchtune/README.md
index 6737f64a..4c2a1b85 100644
--- a/3.test_cases/torchtune/README.md
+++ b/3.test_cases/torchtune/README.md
@@ -4,11 +4,11 @@ This guide demonstrates the comprehensive process of developing a Large Language

![LLMOps](docs/LLMOps.png)

-1. **Data Preparation**: The journey begins with the collection and preparation of data for training. This step is crucial as it involves exploring the data's characteristics, performing necessary cleaning, and applying preprocessing techniques to ensure the data is in the right shape for model training.
+1. **(Continuous) Pretraining the Language Model**: The language model undergoes pretraining on a vast corpus of text data. This step can be bypassed if starting with an already pretrained model. Pretraining is essential for the model to learn the general patterns and structures of language. Refer to the `torchtitan` test case for large-scale pretraining with the latest techniques such as 3D parallelism and `torch.compile`.

-2. **Pretraining the Language Model**: Next, the language model undergoes pretraining on a vast corpus of text data. This step can be bypassed if starting with an already pretrained model. Pretraining is essential for the model to learn the general patterns and structures of language. Refer `torchtitan` test case for the large scale pretraining with the latest techniques such as 3D parallelism and `torch.compile`.
+2. **Instruction Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application.

-3. **Fine-Tuning**: The pretrained model is then fine-tuned to cater to specific tasks by updating its parameters with a new dataset. This process involves partially retraining the model with samples that exemplify the desired behavior, thus refining the model weights for the particular application.
+3. **Alignment**: The instruction-tuned model is then aligned with human preferences, typically using techniques such as RLHF or DPO, so that its responses are helpful, safe, and consistent with the intended behavior.

4. **Evaluation**: Evaluating the LLM's performance is a critical step. It involves using various metrics to assess the model's accuracy and effectiveness. This step is vital for validating new techniques and objectively comparing different model releases.

diff --git a/3.test_cases/torchtune/docs/LLMOps.png b/3.test_cases/torchtune/docs/LLMOps.png
index 97796275..fbdb82cd 100644
Binary files a/3.test_cases/torchtune/docs/LLMOps.png and b/3.test_cases/torchtune/docs/LLMOps.png differ
diff --git a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md
index 583759c4..ec65f64e 100644
--- a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md
+++ b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md
@@ -1,10 +1,11 @@
 # End-to-End LLama3-70B model development with Torchtune
 In this tutorial, you will see how to:
-* Pretrain
-* Finetune
-* Evaluate
-* Deploy
+* Continuous Pretraining
+* Instruction Finetuning
+* Alignment
+* Evaluation
+* Deployment

## 1. Prerequisites
Before starting, ensure you have requested access to Meta-Llama-3-70B by visiting [Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) on Hugging Face and following the access request instructions. Additionally, make sure all prerequisites described in the [slurm](..) directory are set up.
@@ -64,16 +65,16 @@ This output confirms that the `torchtune download` command has been executed wit
By following these steps, you ensure that the necessary model components are in place, setting the stage for subsequent tasks such as pretraining, finetuning, evaluation, and deployment.

-## 3. Full-parameter finetuning
+## 3. Continuous Pretraining

-WIP In this step, you will author Llama3 model using c4 dataset.
+In this step, you will continue pretraining the Llama3 model on the c4 dataset. This uses full-parameter training, which updates all of the parameters in the original model.

```bash
sbatch tutorials/e2e-llama3-70b-development/pretrain.sbatch
```

-## 4. Lora parameter efficient finetuning
+## 4. Instruction-tuning

In this step, you will fine-tune the LLaMA model using Low-Rank Adaptation (LoRA) with the Alpaca dataset. We will first cover the basic concepts and relevant configurations found in the [config file](configs/lora_finetune_distributed.yaml), followed by a detailed fine-tuning tutorial.

@@ -111,6 +112,10 @@ dataset:

As the config suggests, we use a predefined dataset class prepared in torchtune.

+## 5. Alignment
+
+
+
### Submit Finetuning job

You can submit the finetuning job with the following command:

@@ -226,15 +231,33 @@ quantizer:
 groupsize: 256
```

-`Int4WeightOnlyQuantizer` performs per-axis group quantization, which means it quantizes weights in groups rather than individually. This helps maintain a balance between compression and model accuracy.
+`Int4WeightOnlyQuantizer` performs per-axis group quantization, which means it quantizes weights in groups rather than individually. By adjusting the `groupsize`, one can control the trade-off between compression ratio and accuracy. Smaller group sizes typically lead to higher accuracy but lower compression, while larger group sizes achieve higher compression at the potential cost of accuracy.

```bash
sbatch quantize.sbatch
```

+```bash
+Executing following command:
+torchtune run quantize --config /fsx/ubuntu/awsome-distributed-training/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml tokenizer.path=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B/original/tokenizer.model checkpointer.checkpoint_dir=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-tuned checkpointer.output_dir=/fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-quantized
+```
+
+The resulting quantized weights are saved as follows:
+
+```bash
+0: 2024-05-31:02:10:46,964 DEBUG [seed.py:60] Setting manual seed to local seed 1234. Local seed is seed + rank = 1234 + 0
+0: 2024-05-31:02:18:17,728 INFO [quantize.py:90] Model is initialized with precision torch.bfloat16.
+0: 2024-05-31:02:20:33,576 INFO [quantize.py:98] Time for quantization: 133.08 sec
+0: 2024-05-31:02:20:33,577 INFO [quantize.py:99] Memory used: 40.03 GB
+0: 2024-05-31:02:21:18,609 INFO [quantize.py:112] Model checkpoint of size 37.94 GB saved to /fsx/ubuntu/models/torchtune/meta-llama/Meta-Llama-3-70B-quantized/hf_model_0001_0-4w.pt
+```
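If you want to experiment with the compression/accuracy trade-off described above, the group size can be changed without editing `quantize.yaml`. The sketch below reuses the `TRAIN_ARGS` array from `quantize.sbatch` and appends a `quantizer.groupsize` override in the same dotted `key=value` style used for `tokenizer.path` and `checkpointer.*` elsewhere in this tutorial; the value `128` is purely illustrative, and applying the override to the quantizer this way is an assumption rather than a tested recipe.

```bash
# Hypothetical sketch: pass a smaller quantization group size to the quantize recipe.
# Smaller groups usually improve accuracy at the cost of a larger quantized checkpoint.
declare -a TRAIN_ARGS=(
    --config ${PWD}/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
    tokenizer.path=${MODEL_PATH}/${HF_MODEL}/original/tokenizer.model
    checkpointer.checkpoint_dir=${MODEL_PATH}/${HF_MODEL}-tuned
    checkpointer.output_dir=${MODEL_PATH}/${HF_MODEL}-quantized
    quantizer.groupsize=128   # illustrative override; the default in quantize.yaml is 256
)
```

After editing `quantize.sbatch` this way, resubmit it with `sbatch quantize.sbatch` and compare the resulting checkpoint size and evaluation quality against the `groupsize: 256` run.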
## 7. Generation

+Now that you have a production-ready quantized model, this last step tests text generation using the model.
+
```bash
sbatch generate.sbatch --config configs/generate_llama3.yaml --prompt "Hello, my name is"
```
diff --git a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
index 61344ca9..1060a081 100644
--- a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
+++ b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
@@ -12,42 +12,42 @@ checkpointer:
 _component_: torchtune.utils.FullModelHFCheckpointer
 checkpoint_dir: ${MODEL_PATH}/${HF_MODEL}
 checkpoint_files: [
- model-00001-of-00030.safetensors,
- model-00002-of-00030.safetensors,
- model-00003-of-00030.safetensors,
- model-00004-of-00030.safetensors,
- model-00005-of-00030.safetensors,
- model-00006-of-00030.safetensors,
- model-00007-of-00030.safetensors,
- model-00008-of-00030.safetensors,
- model-00009-of-00030.safetensors,
- model-00010-of-00030.safetensors,
- model-00011-of-00030.safetensors,
- model-00012-of-00030.safetensors,
- model-00013-of-00030.safetensors,
- model-00014-of-00030.safetensors,
- model-00015-of-00030.safetensors,
- model-00016-of-00030.safetensors,
- model-00017-of-00030.safetensors,
- model-00018-of-00030.safetensors,
- model-00019-of-00030.safetensors,
- model-00020-of-00030.safetensors,
- model-00021-of-00030.safetensors,
- model-00022-of-00030.safetensors,
- model-00023-of-00030.safetensors,
- model-00024-of-00030.safetensors,
- model-00025-of-00030.safetensors,
- model-00026-of-00030.safetensors,
- model-00027-of-00030.safetensors,
- model-00028-of-00030.safetensors,
- model-00029-of-00030.safetensors,
- model-00030-of-00030.safetensors,
+ hf_model_0001_0.pt,
+ hf_model_0002_0.pt,
+ hf_model_0003_0.pt,
+ hf_model_0004_0.pt,
+ hf_model_0005_0.pt,
+ hf_model_0006_0.pt,
+ hf_model_0007_0.pt,
+ hf_model_0008_0.pt,
+ hf_model_0009_0.pt,
+ hf_model_0010_0.pt,
+ hf_model_0011_0.pt,
+ hf_model_0012_0.pt,
+ hf_model_0013_0.pt,
+ hf_model_0014_0.pt,
+ hf_model_0015_0.pt,
+ hf_model_0016_0.pt,
+ hf_model_0017_0.pt,
+ hf_model_0018_0.pt,
+ hf_model_0019_0.pt,
+ hf_model_0020_0.pt,
+ hf_model_0021_0.pt,
+ hf_model_0022_0.pt,
+ hf_model_0023_0.pt,
+ hf_model_0024_0.pt,
+ hf_model_0025_0.pt,
+ hf_model_0026_0.pt,
+ hf_model_0027_0.pt,
+ hf_model_0028_0.pt,
+ hf_model_0029_0.pt,
+ hf_model_0030_0.pt,
 ]
 recipe_checkpoint: null
 output_dir: ${MODEL_PATH}/${HF_MODEL}-quantized
 model_type: LLAMA3
-device: cuda
+device: cpu
 dtype: bf16
 seed: 1234
diff --git a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
index e69de29b..239c8b02 100644
--- a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
+++ b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
@@ -0,0 +1,93 @@
+#!/bin/bash
+
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+
+#SBATCH --job-name=full-finetuning
+#SBATCH --nodes=2
+#SBATCH --ntasks=2
+#SBATCH --gpus-per-node=8 # Number of GPU per node
+#SBATCH --output=logs/%x_%j.out # logfile for stdout
+#SBATCH --error=logs/%x_%j.err # logfile for stderr, remove it to merge both outputs
+#SBATCH --wait-all-nodes=1
+#SBATCH --exclusive
+set -euxo pipefail
+
+##################################################################
+########### Check current working directory ######################
+##################################################################
+if [ $(basename $(pwd)) != "slurm" ]
+then
+    echo "Please run this script from the slurm directory"
+    exit 1
+fi
+##################################################################
+############# Load environment variables #########################
+##################################################################
+# Load environment variables
+if [ ! -f .env ]
+then
+    echo "Please create a .env file with the required environment variables"
+    exit 1
+else
+    source .env
+fi
+
+##################################################################
+######### Define EFA/NCCL/Slurm environment variables ############
+##################################################################
+## EFA settings
+export FI_LOG_LEVEL=1
+export FI_PROVIDER=efa # change to eth if you want to use ENA for comparisons
+export FI_EFA_USE_HUGE_PAGE=0
+# https://discuss.pytorch.org/t/nccl-network-is-unreachable-connection-refused-when-initializing-ddp/137352
+# https://github.com/pytorch/pytorch/issues/68893
+export NCCL_SOCKET_IFNAME=en
+export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
+export NCCL_DEBUG=INFO
+export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
+export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`
+export NODES=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
+export NODES_ARRAY=($NODES)
+export HEAD_NODE=${NODES_ARRAY[0]}
+export MASTER_ADDR=$(hostname --ip-address)
+export MASTER_PORT=$RANDOM
+export NNODES=$SLURM_JOB_NUM_NODES
+export NPROC=$SLURM_GPUS_PER_NODE
+export WORLD_SIZE=$(( $NNODES * $NPROC ))
+
+##################################################################
+############# Set training arguments #############################
+##################################################################
+export HF_MODEL="meta-llama/Meta-Llama-3-70B"
+: "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"
+declare -a SRUN_ARGS=(
+    --container-image $ENROOT_IMAGE
+    --container-mounts $CONTAINER_MOUNT
+)
+declare -a TORCHRUN_ARGS=(
+    --master_addr $MASTER_ADDR
+    --master_port $MASTER_PORT
+    # change this to match the number of gpus per node:
+    --nproc_per_node=8
+    --nnodes $NNODES
+    --rdzv_backend=c10d
+    --rdzv_endpoint=$(hostname)
+)
+declare -a TRAIN_ARGS=(
+    --config ${PWD}/tutorials/e2e-llama3-70b-development/configs/lora_finetune_distributed.yaml
+    tokenizer.path=${MODEL_PATH}/${HF_MODEL}/original/tokenizer.model
+    checkpointer.checkpoint_dir=${MODEL_PATH}/${HF_MODEL}
+    checkpointer.output_dir=${MODEL_PATH}/${HF_MODEL}-tuned
+    output_dir=${MODEL_PATH}/${HF_MODEL}-tuned/log
+    metric_logger.log_dir=${MODEL_PATH}/${HF_MODEL}-tuned/log/metrics
+)
+##################################################################
+################# Run torchtune ##################################
+##################################################################
+export PYTHONPATH=${PWD}/torchtune
+export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
+export TORCHTUNE_COMMAND="full_finetune_distributed"
+echo "Executing following command:"
+echo "torchtune" "run" "${TORCHRUN_ARGS[@]}" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
+srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run "${TORCHRUN_ARGS[@]}" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
diff --git a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/7.generate.sbatch b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/generate.sbatch
similarity index 100%
rename from 3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/7.generate.sbatch
rename to 3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/generate.sbatch
diff --git a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/quantize.sbatch b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/quantize.sbatch
index 73e50462..c094e87b 100644
--- a/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/quantize.sbatch
+++ b/3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/quantize.sbatch
@@ -13,6 +13,14 @@
 #SBATCH --exclusive
 set -euxo pipefail
 
+##################################################################
+########### Check current working directory ######################
+##################################################################
+if [ $(basename $(pwd)) != "slurm" ]
+then
+    echo "Please run this script from the slurm directory"
+    exit 1
+fi
 ##################################################################
 ############# Load environment variables #########################
 ##################################################################
@@ -50,26 +58,26 @@ export NPROC=$SLURM_GPUS_PER_NODE
 export WORLD_SIZE=$(( $NNODES * $NPROC ))
 
 ##################################################################
-############### Create train config ##############################
-##################################################################
-if [ ! -d ${FSX_PATH}/tmp ]; then
-    mkdir -p ${FSX_PATH}/tmp
-fi
-cat ${PWD}/train_configs/quantize_llama3.yaml | envsubst > ${FSX_PATH}/tmp/quantize_llama3.yaml
-##################################################################
-################# Set arguments ##################################
+############# Set training arguments #############################
 ##################################################################
+export HF_MODEL="meta-llama/Meta-Llama-3-70B"
 : "${CONTAINER_MOUNT:=$FSX_PATH:$FSX_PATH}"
 declare -a SRUN_ARGS=(
     --container-image $ENROOT_IMAGE
     --container-mounts $CONTAINER_MOUNT
 )
 declare -a TRAIN_ARGS=(
-    --config ${FSX_PATH}/tmp/quantize_llama3.yaml
+    --config ${PWD}/tutorials/e2e-llama3-70b-development/configs/quantize.yaml
+    tokenizer.path=${MODEL_PATH}/${HF_MODEL}/original/tokenizer.model
+    checkpointer.checkpoint_dir=${MODEL_PATH}/${HF_MODEL}-tuned
+    checkpointer.output_dir=${MODEL_PATH}/${HF_MODEL}-quantized
 )
-
-export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
+##################################################################
+################# Run torchtune ##################################
+##################################################################
 export PYTHONPATH=${PWD}/torchtune
-
-#srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} cp generation /fsx/tmp/generate_llama3.yaml
-srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run quantize "${TRAIN_ARGS[@]}"
+export TORCHTUNE=${PWD}/torchtune/torchtune/_cli/tune.py
+export TORCHTUNE_COMMAND="quantize"
+echo "Executing following command:"
+echo "torchtune" "run" "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"
+srun -l "${SRUN_ARGS[@]}" python ${TORCHTUNE} run "${TORCHTUNE_COMMAND}" "${TRAIN_ARGS[@]}"