Skip to content

Update llama3 70b and removes the neuron-top in update_neuron_sdk.sh script #530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 19, 2025

Conversation

jianyinglangaws
Copy link
Contributor

Issue #, if available:
The llama3-70B training example is out dated.

Description of changes:
Update the script to be compatible with the latest Neuron SDK 2.21
Remove the neuron-top from update_neuron_sdk.sh script so the Neuron SDK update takes effect during the cluster setup.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@amanshanbhag
Copy link
Contributor

Changes on update_neuron_sdk.sh look good and tested. Thanks Jianying!

@amanshanbhag
Copy link
Contributor

Please resolve conflicts before approval.

@jianyinglangaws
Copy link
Contributor Author

Please resolve conflicts before approval.

Rebased the PR.

Copy link
Contributor

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jianyinglangaws ! Left few comments, will test the model training part by next week.

Copy link
Contributor Author

@jianyinglangaws jianyinglangaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

# Install Python
sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need Python 3.8? Why are we forcing a specific python version or even install python on hyperpod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Neuron SDK depends on the default python version for a specific OS (Ubuntu20.04:python3.8). The wrong python version can lead to Neuron SDK installation/upgrade failure.

sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++

# Create Python venv
python3.8 -m venv /fsx/ubuntu/aws_neuron_venv_pytorch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's trash the file system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

@@ -198,12 +213,14 @@ Next, we use the `convert_checkpoints.py` script to shard the checkpoints. Execu
```bash
mkdir -p /fsx/ubuntu/llama3_70B/pretrained_weight
sbatch --job-name=convert-checkpoint --output=logs/convert-checkpoint.out \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write a sbatch script instead of wrapping stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update this with a sbatch script. What is the main concern for wrapping stuff?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's highly depends on NxD core repo and the scripts in it frequently being updated. So I'm against introducing sbatch scripts in ADR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move to containerization.
Address comments.

@KeitaW
Copy link
Contributor

KeitaW commented Jan 23, 2025

Let's move to containerization. Address comments.

+1 to it but containerization is something we should address in a separate PR. This PR is already becoming too big.

Copy link
Contributor

@nghtm nghtm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, however please ensure latest updates have been tested successfully before rebasing and pushing. Thanks for the contribution

@KeitaW
Copy link
Contributor

KeitaW commented Jan 26, 2025

Running

python -m pip install -r requirements.txt

On the latest HP caused

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.16.372.0+4a9b2326 requires requests<2.32.0, but you have requests 2.32.3 which is incompatible.

We need to install datasets==3.0.2 instead of the latest.

This error supposed to be fixed in the original NxD repo. Created PR here aws-neuron/neuronx-distributed#39.

@KeitaW
Copy link
Contributor

KeitaW commented Jan 26, 2025

On side note, I tried to update the test case with Llama3.1 but encountered aws-neuron/neuronx-distributed#40.

@jianyinglangaws
Copy link
Contributor Author

jianyinglangaws commented Jan 26, 2025

On side note, I tried to update the test case with Llama3.1 but encountered aws-neuron/neuronx-distributed#40.

For llama3.1, we need to use the config file of https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 instead of the default huggingface one. I recorded a demo for llama3.1-8B continual pretraining https://amazon.awsapps.com/workdocs-amazon/index.html#/document/f372f9fde32d4f66c906e7ac45ba619658f87b4cba8359ae1861c51369ea88fe .

I am updating the SageMaker HyperPod workshop using this example.

There are some comments on the transformers version update: https://amzn-aws.slack.com/archives/C02MFFXRP4Z/p1736895416720939

@jianyinglangaws
Copy link
Contributor Author

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.16.372.0+4a9b2326 requires requests<2.32.0, but you have requests 2.32.3 which is incompatible.

Yes, I saw this error during installation. Although it does not affect functionality of the training, it is better be fixed in NxD doc.

@KeitaW
Copy link
Contributor

KeitaW commented Jan 26, 2025

Thansk @jianyinglangaws . No need to close the PR.

@jianyinglangaws
Copy link
Contributor Author

Sorry, this PR was closed by an accident click. Let me reopen it.

Copy link
Contributor

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Tested the updated test case e2e on HP.

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mhuguesaws mhuguesaws merged commit 6e49278 into main Feb 19, 2025
@mhuguesaws mhuguesaws deleted the update_llama3_70b branch February 19, 2025 21:37
dongjin-ml pushed a commit to dongjin-ml/awsome-distributed-training that referenced this pull request Feb 20, 2025
…script (aws-samples#530)

* Update the Neuron SDK to 2.21.0

* Update the Llama3-70B pretraining with the Neuron SDK 2.21

* Fix a typo

* Add --hw_backend trn1 in the convert_checkpoint command

* More update

* Update the update_neuron_sdk.sh by removing the neuron-top check

* Keep enable_update_neuron_sdk as Flase by default

* Update automate-eks-cluster-creation.sh (aws-samples#529)

Minor bug fix

* Update according to the review comments.

* minor updates in doc

---------

Co-authored-by: Aman Shanbhag <[email protected]>
Co-authored-by: Keita Watanabe <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants