Update llama3 70b and remove neuron-top in the update_neuron_sdk.sh script #530
Conversation
Please resolve conflicts before approval.
Minor bug fix
Rebased the PR.
Thanks @jianyinglangaws! Left a few comments; will test the model training part by next week.
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/update_neuron_sdk.sh
Looks good to me.
```bash
# Install Python
sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++
```
Do we need Python 3.8? Why are we forcing a specific Python version, or even installing Python, on HyperPod?
The Neuron SDK depends on the default Python version for a specific OS (Ubuntu 20.04: python3.8). The wrong Python version can lead to Neuron SDK installation/upgrade failures.
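As a sketch of that constraint, a lifecycle script could check the distro-default interpreter before installing anything. This is illustrative only (not part of the PR); the 3.8 expectation applies to Ubuntu 20.04 as noted above:

```shell
# Sketch: warn if the default python3 is not the Ubuntu 20.04 default (3.8),
# since the Neuron SDK packages track the distro-default Python.
default_py="$(python3 --version 2>&1 | awk '{print $2}')"
case "$default_py" in
  3.8.*) echo "default python3 is $default_py (matches Ubuntu 20.04)" ;;
  *)     echo "warning: default python3 is $default_py; Neuron packages may mismatch" ;;
esac
```

A check like this would let the script fail fast with a clear message instead of surfacing the mismatch later as an SDK installation error.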
```bash
sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++

# Create Python venv
python3.8 -m venv /fsx/ubuntu/aws_neuron_venv_pytorch
```
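For context, creating and activating such a venv is a two-step affair. The sketch below is generic, not the PR's code: it uses plain `python3` and a placeholder path rather than the 3.8/FSx specifics above, with a `--without-pip` fallback for hosts missing `ensurepip`:

```shell
# Illustrative only: create and activate a virtual environment.
# VENV_DIR is a placeholder path, not the path used by the PR.
VENV_DIR="${VENV_DIR:-/tmp/aws_neuron_venv_pytorch}"
python3 -m venv "$VENV_DIR" 2>/dev/null || python3 -m venv --without-pip "$VENV_DIR"
# Activate the venv; subsequent `python` calls resolve inside it
. "$VENV_DIR/bin/activate"
python --version
```

After activation, the Neuron pip packages would be installed into this environment rather than the system Python.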
Let's trash the file system.
??
@@ -198,12 +213,14 @@

Next, we use the `convert_checkpoints.py` script to shard the checkpoints. Execu…

```bash
mkdir -p /fsx/ubuntu/llama3_70B/pretrained_weight
sbatch --job-name=convert-checkpoint --output=logs/convert-checkpoint.out \
```
Please write an sbatch script instead of wrapping stuff.
I can update this with an sbatch script. What is the main concern with wrapping stuff?
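For illustration, a standalone sbatch file equivalent to the wrapped command might look like the sketch below. The directives mirror the `--job-name`/`--output` flags from the snippet above; everything else (node count, the conversion command itself) is a placeholder, not code from the PR:

```shell
# Illustrative sketch: write the same job as a standalone sbatch file.
mkdir -p logs
cat > convert-checkpoint.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=convert-checkpoint
#SBATCH --output=logs/convert-checkpoint.out
#SBATCH --nodes=1

# Placeholder for the real conversion step, e.g.:
# python convert_checkpoints.py --hw_backend trn1 ...
echo "convert checkpoints here"
EOF
echo "wrote convert-checkpoint.sbatch"
```

Such a file would then be submitted with `sbatch convert-checkpoint.sbatch` instead of passing the command inline.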
It highly depends on the NxD core repo, and the scripts in it are frequently updated. So I'm against introducing sbatch scripts in ADR.
Besides, that is the way the official doc recommends it: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_pp.html
Let's move to containerization.
Address comments.
+1 to it, but containerization is something we should address in a separate PR. This PR is already becoming too big.
Approving; however, please ensure the latest updates have been tested successfully before rebasing and pushing. Thanks for the contribution!
Running … on the latest HP caused …
We need to install …; this error is supposed to be fixed in the original NxD repo. Created a PR here: aws-neuron/neuronx-distributed#39.
On a side note, I tried to update the test case with Llama3.1 but encountered aws-neuron/neuronx-distributed#40.
For Llama3.1, we need to use the config file from https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 instead of the default Hugging Face one. I recorded a demo of Llama3.1-8B continual pretraining: https://amazon.awsapps.com/workdocs-amazon/index.html#/document/f372f9fde32d4f66c906e7ac45ba619658f87b4cba8359ae1861c51369ea88fe. I am updating the SageMaker HyperPod workshop using this example. There are some comments on the transformers version update: https://amzn-aws.slack.com/archives/C02MFFXRP4Z/p1736895416720939
Yes, I saw this error during installation. Although it does not affect the functionality of the training, it would be better to fix it in the NxD doc.
Thanks @jianyinglangaws. No need to close the PR.
Sorry, this PR was closed by an accidental click. Let me reopen it.
LGTM! Tested the updated test case e2e on HP.
LGTM
…script (aws-samples#530)

* Update the Neuron SDK to 2.21.0
* Update the Llama3-70B pretraining with the Neuron SDK 2.21
* Fix a typo
* Add --hw_backend trn1 in the convert_checkpoint command
* More updates
* Update update_neuron_sdk.sh by removing the neuron-top check
* Keep enable_update_neuron_sdk as False by default
* Update automate-eks-cluster-creation.sh (aws-samples#529): minor bug fix
* Update according to the review comments
* Minor updates in doc

Co-authored-by: Aman Shanbhag <[email protected]>
Co-authored-by: Keita Watanabe <[email protected]>
Issue #, if available:
The llama3-70B training example is outdated.
Description of changes:
* Update the script to be compatible with the latest Neuron SDK 2.21.
* Remove neuron-top from the update_neuron_sdk.sh script so the Neuron SDK update takes effect during cluster setup.
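For illustration only (this is not the PR's code): if a lifecycle script still wants a device sanity check without neuron-top, `neuron-ls` is a Neuron tool that exits on its own, and a `command -v` guard keeps the check safe on hosts where the Neuron tools are not installed:

```shell
# Sketch: a guarded, non-blocking Neuron device check for a lifecycle script.
# neuron-ls exits immediately, unlike the interactive neuron-top monitor.
if command -v neuron-ls >/dev/null 2>&1; then
  neuron-ls || echo "neuron-ls found no devices"
else
  echo "neuron-ls not installed; skipping device check"
fi
check_done=1
```

Either branch prints a message and returns control to the script, so cluster setup is never blocked on an interactive tool.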
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.