Update llama3 70b and remove neuron-top in the update_neuron_sdk.sh script #530
Conversation
Please resolve conflicts before approval.
Minor bug fix
Rebased the PR.
Thanks @jianyinglangaws! Left a few comments; will test the model training part by next week.
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/update_neuron_sdk.sh
Looks good to me.
```bash
# Install Python
sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++
```
Do we need Python 3.8? Why are we forcing a specific Python version, or even installing Python, on HyperPod?
The Neuron SDK depends on the default Python version for a specific OS (Ubuntu 20.04: python3.8). The wrong Python version can lead to Neuron SDK installation/upgrade failures.
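As a sketch of that constraint, a lifecycle script could check the distro-default interpreter before installing anything. This is illustrative only (not part of the PR); the 3.8 expectation applies to Ubuntu 20.04 as noted above:

```shell
# Sketch: warn if the default python3 is not the Ubuntu 20.04 default (3.8),
# since the Neuron SDK packages track the distro-default Python.
default_py="$(python3 --version 2>&1 | awk '{print $2}')"
case "$default_py" in
  3.8.*) echo "default python3 is $default_py (matches Ubuntu 20.04)" ;;
  *)     echo "warning: default python3 is $default_py; Neuron packages may mismatch" ;;
esac
```

A check like this would let the script fail fast with a clear message instead of surfacing the mismatch later as an SDK installation error.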
```bash
sudo apt-get install -y python3.8 python3.8-venv
# Install Python venv
sudo apt-get install -y python3.8-venv g++

# Create Python venv
python3.8 -m venv /fsx/ubuntu/aws_neuron_venv_pytorch
```
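For context, creating and activating such a venv is a two-step affair. The sketch below is generic, not the PR's code: it uses plain `python3` and a placeholder path rather than the 3.8/FSx specifics above, with a `--without-pip` fallback for hosts missing `ensurepip`:

```shell
# Illustrative only: create and activate a virtual environment.
# VENV_DIR is a placeholder path, not the path used by the PR.
VENV_DIR="${VENV_DIR:-/tmp/aws_neuron_venv_pytorch}"
python3 -m venv "$VENV_DIR" 2>/dev/null || python3 -m venv --without-pip "$VENV_DIR"
# Activate the venv; subsequent `python` calls resolve inside it
. "$VENV_DIR/bin/activate"
python --version
```

After activation, the Neuron pip packages would be installed into this environment rather than the system Python.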
Let's trash the file system.
??
@@ -198,12 +213,14 @@

Next, we use the `convert_checkpoints.py` script to shard the checkpoints. Execu…

```bash
mkdir -p /fsx/ubuntu/llama3_70B/pretrained_weight
sbatch --job-name=convert-checkpoint --output=logs/convert-checkpoint.out \
```
Please write an sbatch script instead of wrapping stuff.
I can update this with an sbatch script. What is the main concern with wrapping stuff?
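For illustration, a standalone sbatch file equivalent to the wrapped command might look like the sketch below. The directives mirror the `--job-name`/`--output` flags from the snippet above; everything else (node count, the conversion command itself) is a placeholder, not code from the PR:

```shell
# Illustrative sketch: write the same job as a standalone sbatch file.
mkdir -p logs
cat > convert-checkpoint.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=convert-checkpoint
#SBATCH --output=logs/convert-checkpoint.out
#SBATCH --nodes=1

# Placeholder for the real conversion step, e.g.:
# python convert_checkpoints.py --hw_backend trn1 ...
echo "convert checkpoints here"
EOF
echo "wrote convert-checkpoint.sbatch"
```

Such a file would then be submitted with `sbatch convert-checkpoint.sbatch` instead of passing the command inline.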
It highly depends on the NxD core repo, and the scripts in it are frequently updated. So I'm against introducing sbatch scripts in ADR.
Besides, that is the way the official doc recommends it: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_pp.html
Let's move to containerization.
Address comments.
+1 to it, but containerization is something we should address in a separate PR. This PR is already becoming too big.
Approving; however, please ensure the latest updates have been tested successfully before rebasing and pushing. Thanks for the contribution!
Running … on the latest HP caused …
We need to install …; this error is supposed to be fixed in the original NxD repo. Created a PR here: aws-neuron/neuronx-distributed#39.
On a side note, I tried to update the test case with Llama3.1 but encountered aws-neuron/neuronx-distributed#40.
For Llama3.1, we need to use the config file from https://github.com/aws-neuron/neuronx-distributed/tree/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3.1 instead of the default Hugging Face one. I recorded a demo of Llama3.1-8B continual pretraining: https://amazon.awsapps.com/workdocs-amazon/index.html#/document/f372f9fde32d4f66c906e7ac45ba619658f87b4cba8359ae1861c51369ea88fe. I am updating the SageMaker HyperPod workshop using this example. There are some comments on the transformers version update: https://amzn-aws.slack.com/archives/C02MFFXRP4Z/p1736895416720939
Yes, I saw this error during installation. Although it does not affect the functionality of the training, it would be better to fix it in the NxD doc.
Thanks @jianyinglangaws. No need to close the PR.
Sorry, this PR was closed by an accidental click. Let me reopen it.
LGTM! Tested the updated test case e2e on HP.
LGTM
…script (aws-samples#530)

* Update the Neuron SDK to 2.21.0
* Update the Llama3-70B pretraining with the Neuron SDK 2.21
* Fix a typo
* Add --hw_backend trn1 in the convert_checkpoint command
* More updates
* Update update_neuron_sdk.sh by removing the neuron-top check
* Keep enable_update_neuron_sdk as False by default
* Update automate-eks-cluster-creation.sh (aws-samples#529): minor bug fix
* Update according to the review comments
* Minor updates in doc

Co-authored-by: Aman Shanbhag <[email protected]>
Co-authored-by: Keita Watanabe <[email protected]>
Issue #, if available:
The llama3-70B training example is outdated.
Description of changes:
* Update the script to be compatible with the latest Neuron SDK 2.21.
* Remove neuron-top from the update_neuron_sdk.sh script so the Neuron SDK update takes effect during cluster setup.
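For illustration only (this is not the PR's code): if a lifecycle script still wants a device sanity check without neuron-top, `neuron-ls` is a Neuron tool that exits on its own, and a `command -v` guard keeps the check safe on hosts where the Neuron tools are not installed:

```shell
# Sketch: a guarded, non-blocking Neuron device check for a lifecycle script.
# neuron-ls exits immediately, unlike the interactive neuron-top monitor.
if command -v neuron-ls >/dev/null 2>&1; then
  neuron-ls || echo "neuron-ls found no devices"
else
  echo "neuron-ls not installed; skipping device check"
fi
check_done=1
```

Either branch prints a message and returns control to the script, so cluster setup is never blocked on an interactive tool.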
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.