How to run MaxText with XPK?

This document covers the steps required to set up XPK on a TPU VM. It assumes you have read the README and understand XPK basics.

Steps to set up XPK on a TPU VM

  • Verify you have these permissions for your account or service account

    Storage Admin
    Kubernetes Engine Admin

  • gcloud is installed on TPU VMs using the snap distribution package. Install kubectl using snap as well:

sudo apt-get update
sudo apt install snapd
sudo snap install kubectl --classic
  • Install gke-gcloud-auth-plugin
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

sudo apt update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
  • Authenticate your gcloud installation by running this command and following the prompts:
gcloud auth login
  • Run this command to configure Docker to use docker-credential-gcloud for the us-docker.pkg.dev registry:
gcloud auth configure-docker us-docker.pkg.dev
  • Test the installation by running
docker run hello-world
  • If you get a permission error, run
sudo usermod -aG docker $USER

Then log out and log back in to the machine so that the group change takes effect.
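
The setup steps above can be double-checked with a small preflight script. The following is a minimal sketch and not part of XPK or MaxText. The function names `missing_tools` and `missing_roles` are hypothetical, and the role IDs `roles/storage.admin` and `roles/container.admin` are assumed to correspond to the Storage Admin and Kubernetes Engine Admin permissions listed earlier. The gcloud query appears only as a comment because it depends on your project and account; `PROJECT_ID` and `ACCOUNT_EMAIL` are placeholders.

```shell
# Preflight sketch: report missing CLI tools and missing IAM roles.
# Both helpers are hypothetical and only print what is missing.

# Print each command from the argument list that is not found on PATH.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
)

# Print each required role that is absent from the granted-roles text.
#   $1: newline-separated role IDs granted to the account
#   remaining arguments: required role IDs
missing_roles() (
  granted=$1
  shift
  for role in "$@"; do
    printf '%s\n' "$granted" | grep -qxF "$role" || echo "$role"
  done
)

# Example calls (not run here):
#   missing_tools gcloud kubectl docker gke-gcloud-auth-plugin
#
#   granted=$(gcloud projects get-iam-policy "$PROJECT_ID" \
#     --flatten="bindings[].members" \
#     --filter="bindings.members:user:$ACCOUNT_EMAIL" \
#     --format="value(bindings.role)")
#   missing_roles "$granted" roles/storage.admin roles/container.admin
```

Empty output from both helpers means nothing was found missing. For a service account, the member prefix in the filter would be `serviceAccount:` instead of `user:`. Roles granted through a group or a custom role will not show up in this check, so treat a reported gap as a hint to investigate and not as proof.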

Build Docker Image for MaxText

  1. Clone the maxtext repository locally

    git clone https://github.com/google/maxtext.git
    cd maxtext
  2. Build the local MaxText Docker image

    This only needs to be rerun when you want to change your dependencies. The image may also expire, in which case you need to rerun the command below.

    # Default will pick stable versions of dependencies
    bash docker_build_dependency_image.sh

    Build MaxText Docker Image with JAX Stable Stack (Preview)

    MaxText Docker images can now be built on the JAX Stable Stack base image, available in preview for both TPUs and GPUs. This provides a more reliable and consistent build environment.

    What is JAX Stable Stack?

    JAX Stable Stack provides a consistent environment for MaxText by bundling JAX with core packages like orbax, flax, and optax, along with Google Cloud utilities and other essential tools. These libraries are tested to ensure compatibility, which gives a stable foundation for building and running MaxText and avoids conflicts caused by incompatible package versions.

    How to Use It

    Use the docker_build_dependency_image.sh script to build your MaxText Docker image with JAX Stable Stack. Set MODE to stable_stack and set BASEIMAGE to the desired base image. The DEVICE variable determines whether to build for TPUs or GPUs.

    For TPUs:
    # Example: bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.4.37-rev1
    bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE={{JAX_STABLE_STACK_TPU_BASEIMAGE}}
    

    You can find a list of available JAX Stable Stack base images here.

    [New] For GPUs:
    # Example: bash docker_build_dependency_image.sh DEVICE=gpu MODE=stable_stack BASEIMAGE=us-central1-docker.pkg.dev/deeplearning-images/jax-stable-stack/gpu:jax0.4.37-cuda_dl24.10-rev1
    bash docker_build_dependency_image.sh DEVICE=gpu MODE=stable_stack BASEIMAGE={{JAX_STABLE_STACK_BASEIMAGE}}
    

    You can find a list of available JAX Stable Stack base images here.

    Important Note: The JAX Stable Stack is currently in the experimental phase. We encourage you to try it out and provide feedback.
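
To avoid typos in the DEVICE, MODE, and BASEIMAGE arguments, the stable stack build command can be assembled by a small wrapper. This is a minimal sketch: `stable_stack_build_cmd` is a hypothetical helper, not part of MaxText, and it assumes that tpu and gpu are the only DEVICE values you intend to use. It prints the command and does not run it.

```shell
# Hypothetical helper: print the docker_build_dependency_image.sh invocation
# for a JAX Stable Stack build. It validates its inputs and runs nothing.
#   $1: device, either tpu or gpu
#   $2: base image reference
stable_stack_build_cmd() (
  device=$1
  baseimage=$2
  case "$device" in
    tpu|gpu) ;;
    *)
      echo "DEVICE must be tpu or gpu, got: $device" >&2
      exit 1
      ;;
  esac
  if [ -z "$baseimage" ]; then
    echo "BASEIMAGE must not be empty" >&2
    exit 1
  fi
  echo "bash docker_build_dependency_image.sh DEVICE=$device MODE=stable_stack BASEIMAGE=$baseimage"
)
```

For example, `stable_stack_build_cmd tpu us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.4.37-rev1` prints the TPU command from the example above. Review the printed command, then run it from the maxtext root directory.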

  3. After you build the dependency image maxtext_base_image, xpk can pick up changes in your working directory when you run xpk workload create with --base-docker-image.

    See details on docker images in xpk here: https://github.com/google/xpk/blob/main/README.md#how-to-add-docker-images-to-a-xpk-workload

    Using xpk to upload the image to your GCP project and run MaxText

    gcloud config set project $PROJECT_ID
    gcloud config set compute/zone $ZONE
    
    # See instructions in README.md to create the buckets below.
    BASE_OUTPUT_DIR=gs://output_bucket/
    DATASET_PATH=gs://dataset_bucket/
    
    # Install xpk
    pip install xpk
    
    # Make sure you are still in the maxtext github root directory when running this command
    xpk workload create \
    --cluster ${CLUSTER_NAME} \
    --base-docker-image maxtext_base_image \
    --workload ${USER}-first-job \
    --tpu-type=v5litepod-256 \
    --num-slices=1  \
    --command "python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=${BASE_OUTPUT_DIR} dataset_path=${DATASET_PATH} steps=100 per_device_batch_size=1"
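
The command above builds the workload name from ${USER}. As an assumption, the workload name ends up as a Kubernetes resource name, and such names are generally restricted to lowercase letters, digits, and hyphens. The sketch below shows one way to derive a conforming name from a user name that contains uppercase letters, underscores, or dots. `workload_name` is a hypothetical helper, not an xpk feature, and it does not enforce any length limit xpk may apply.

```shell
# Hypothetical helper: build a workload name from a user name and a suffix.
# Lowercases the text, replaces every character outside [a-z0-9-] with a
# hyphen, and strips leading and trailing hyphens.
#   $1: user name, for example "$USER"
#   $2: suffix, for example first-job
workload_name() (
  printf '%s\n' "$1-$2" \
    | tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sed -e 's/[^a-z0-9-]/-/g' -e 's/^-*//' -e 's/-*$//'
)
```

A call such as `workload_name "$USER" first-job` could then replace `${USER}-first-job` in the `--workload` argument, for example `--workload "$(workload_name "$USER" first-job)"`.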

    Using the xpk GitHub repo

    git clone https://github.com/google/xpk.git
    
    # Make sure you are still in the maxtext github root directory when running this command
    python3 xpk/xpk.py workload create \
    --cluster ${CLUSTER_NAME} \
    --base-docker-image maxtext_base_image \
    --workload ${USER}-first-job \
    --tpu-type=v5litepod-256 \
    --num-slices=1  \
    --command "python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=${BASE_OUTPUT_DIR} dataset_path=${DATASET_PATH} steps=100 per_device_batch_size=1"
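
Both launch commands above rely on several shell variables being set. A quick check before launching can catch an empty variable or a bucket path without the gs:// prefix. This is a minimal sketch: `check_xpk_settings` is a hypothetical helper, and the variable list is simply the set of names used in this document.

```shell
# Hypothetical helper: print one message per problem found in the variables
# used by the xpk commands in this document. Prints nothing when every
# variable is set and both bucket paths start with gs://.
check_xpk_settings() (
  for name in PROJECT_ID ZONE CLUSTER_NAME BASE_OUTPUT_DIR DATASET_PATH; do
    # Indirect lookup of the variable called $name (names are fixed above).
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name is not set"
  done
  for name in BASE_OUTPUT_DIR DATASET_PATH; do
    eval "value=\${$name:-}"
    case "$value" in
      ""|gs://?*) ;;  # empty values were already reported by the first loop
      *) echo "$name should be a gs:// path, got: $value" ;;
    esac
  done
)
```

Run `check_xpk_settings` in the same shell session in which you set the variables. Empty output means the values look usable. It does not confirm that the buckets or the cluster actually exist.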