Commit
add finetuning gemma on GKE with L4 GPUs example (#697)
* add finetuning gemma example
  Signed-off-by: Mofi Rahman <[email protected]>
* resolve comments
  Signed-off-by: Mofi Rahman <[email protected]>
* remove extra tag from title
  Signed-off-by: Mofi Rahman <[email protected]>
---------
Signed-off-by: Mofi Rahman <[email protected]>
Showing 5 changed files with 513 additions and 0 deletions.
30 changes: 30 additions & 0 deletions
tutorials-and-examples/genAI-LLM/finetuning-gemma-2b-on-l4/Dockerfile
```dockerfile
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    accelerate==0.30.1 bitsandbytes==0.43.1 \
    datasets==2.19.1 transformers==4.41.0 \
    peft==0.11.1 trl==0.8.6 torch==2.3.0

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED 1

CMD python3 /finetune.py --device cuda
```
174 changes: 174 additions & 0 deletions
tutorials-and-examples/genAI-LLM/finetuning-gemma-2b-on-l4/README.md
# Tutorial: Finetuning Gemma 2b on GKE using L4 GPUs

We’ll walk through fine-tuning a Gemma 2B model on GKE using 8 x L4 GPUs. L4 GPUs are suitable for many use cases beyond serving models. We will demonstrate how the L4 GPU is a great option for fine-tuning LLMs, at a fraction of the cost of using a higher-end GPU.

Let’s get started and fine-tune Gemma 2B on the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset using GKE. Parameter Efficient Fine Tuning (PEFT) and LoRA are used so fine-tuning is possible on GPUs with less GPU memory.
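The training itself is done by the `finetune.py` script that the Dockerfile copies into the container; that script is not reproduced on this page. The sketch below is only an illustration of how PEFT/LoRA fine-tuning typically looks with the libraries pinned in the Dockerfile (`transformers`, `peft`, `trl`, `bitsandbytes`); the model ID, prompt format, and hyperparameters are assumptions and may differ from the actual script.

```python
# Illustrative only -- not the tutorial's actual finetune.py.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "google/gemma-2b"  # assumed base model

# Load the base model quantized to 4-bit so it fits comfortably in L4 GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA trains a small set of low-rank adapter weights instead of all 2B parameters.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("b-mc2/sql-create-context", split="train")

def formatting_func(examples):
    # Assumed prompt format; it mirrors the curl example later in this tutorial.
    return [f"Question: {q} Context: {c} Answer: {a}"
            for q, c, a in zip(examples["question"], examples["context"], examples["answer"])]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=lora_config,
    formatting_func=formatting_func,
    max_seq_length=512,
    args=TrainingArguments(output_dir="outputs",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
)
trainer.train()
# Push the finetuned adapter to Hugging Face (assumed repo name; needs a Write token).
trainer.model.push_to_hub("gemma-2b-sql-finetuned")
```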
As part of this tutorial, you will do the following:

1. Prepare your environment with a GKE cluster in Autopilot mode.
2. Create a finetune container.
3. Use GPUs to finetune the Gemma 2B model and upload the model to Hugging Face.

## Prerequisites

* A terminal with `kubectl` and `gcloud` installed. Cloud Shell works great!
* Create a [Hugging Face](https://huggingface.co/) account, if you don't already have one.
* Ensure your project has sufficient quota for GPUs. To learn more, see [About GPUs](/kubernetes-engine/docs/concepts/gpus#gpu-quota) and [Allocation quotas](/compute/resource-usage#gpu_quota).
* To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token. Make sure the token has `Write` permission.
## Creating the GKE cluster with L4 nodepools

Let’s start by setting a few environment variables that will be used throughout this tutorial. You should modify these variables to meet your environment and needs.

Download the code and files used throughout the tutorial:

```bash
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/tutorials-and-examples/genAI-LLM/finetuning-gemma-2b-on-l4
```

Run the following commands to set the env variables and make sure to replace `<my-project-id>`:

```bash
gcloud config set project <my-project-id>
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export HF_TOKEN=<YOUR_HF_TOKEN>
export CLUSTER_NAME=finetune-gemma
```
> Note: You might have to rerun the export commands if for some reason you reset your shell and the variables are no longer set. This can happen, for example, when your Cloud Shell disconnects.

Create the GKE cluster by running:

```bash
gcloud container clusters create-auto ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --release-channel=rapid \
  --cluster-version=1.29
```
### Create a Kubernetes secret for Hugging Face credentials

In your shell session, do the following:

1. Configure `kubectl` to communicate with your cluster:

   ```sh
   gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
   ```

2. Create a Kubernetes Secret that contains the Hugging Face token:

   ```sh
   kubectl create secret generic hf-secret \
     --from-literal=hf_api_token=${HF_TOKEN} \
     --dry-run=client -o yaml | kubectl apply -f -
   ```
### Containerize the Code with Docker and Cloud Build

1. Create an Artifact Registry Docker repository:

   ```sh
   gcloud artifacts repositories create gemma \
     --project=${PROJECT_ID} \
     --repository-format=docker \
     --location=us \
     --description="Gemma Repo"
   ```

2. Execute the build to create the finetune container image:

   ```sh
   gcloud builds submit .
   ```
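Optionally, you can confirm the image was built and pushed by listing the images in the repository (the repository path assumes the `gemma` repository created above and the tag defined in `cloudbuild.yaml`):

```sh
gcloud artifacts docker images list us-docker.pkg.dev/${PROJECT_ID}/gemma
```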
## Run Finetune Job on GKE

1. Open the `finetune.yaml` manifest.
2. Edit the `image` name with the container image built with Cloud Build, and set the `NEW_MODEL` environment variable value. `NEW_MODEL` will be the name of the model you save as a public model in your Hugging Face account (see the sketch below).
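   The actual `finetune.yaml` from this directory is not reproduced on this page, so the following is only an illustrative sketch of where those fields typically sit in such a Job. The `app: gemma-finetune` label, the image tag, and the GPU count come from the surrounding steps; all other field values are assumptions.

   ```yaml
   # Illustrative sketch only -- check the finetune.yaml in this directory for the real manifest.
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: finetune-gemma
   spec:
     backoffLimit: 0
     template:
       metadata:
         labels:
           app: gemma-finetune   # matches the label used by `kubectl logs` below
       spec:
         restartPolicy: Never
         nodeSelector:
           cloud.google.com/gke-accelerator: nvidia-l4   # request L4 GPUs on Autopilot
         containers:
         - name: finetuner
           # Replace with the image you built with Cloud Build.
           image: us-docker.pkg.dev/<PROJECT_ID>/gemma/finetune-gemma-gpu:1.0.0
           resources:
             limits:
               nvidia.com/gpu: "8"
           env:
           # Replace with the name to publish the finetuned model under on Hugging Face.
           - name: NEW_MODEL
             value: gemma-2b-sql-finetuned
           - name: HF_TOKEN
             valueFrom:
               secretKeyRef:
                 name: hf-secret
                 key: hf_api_token
   ```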
3. Run the following command to create the finetune job:

   ```sh
   kubectl apply -f finetune.yaml
   ```

4. Monitor the job by running:

   ```sh
   watch kubectl get pods
   ```

5. You can check the logs of the job by running:

   ```sh
   kubectl logs -f -l app=gemma-finetune
   ```

6. Once the job is completed, you can check the model in Hugging Face.
## Serve the Finetuned Model on GKE

To deploy the finetuned model on GKE, you can follow the instructions for deploying a pre-trained Gemma model with [Hugging Face TGI](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi#deploy-pretrained) or [vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm#deploy-vllm). Follow the Gemma 2B instructions and change the `MODEL_ID` to `<YOUR_HUGGING_FACE_PROFILE>/gemma-2b-sql-finetuned`.
### Set up port forwarding

Once the model is deployed, run the following command to set up port forwarding to the model:

```sh
kubectl port-forward service/llm-service 8000:8000
```

The output is similar to the following:

```sh
Forwarding from 127.0.0.1:8000 -> 8000
```
### Interact with the model using curl

Once the model is deployed, open a new terminal session and use curl to chat with your model:

> The following example command is for TGI.

```sh
USER_PROMPT="Question: What is the total number of attendees with age over 30 at kubecon eu? Context: CREATE TABLE attendees (name VARCHAR, age INTEGER, kubecon VARCHAR)"
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
    "inputs": "${USER_PROMPT}",
    "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "max_new_tokens": 25
    }
}
EOF
```

The following output shows an example of the model response:

```sh
{"generated_text":" Answer: SELECT COUNT(age) FROM attendees WHERE age > 30 AND kubecon = 'eu'\n"}
```
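> If you deployed with vLLM instead of TGI, the request body differs. The following is only a sketch, assuming the vLLM deployment from the linked tutorial exposes vLLM's simple `/generate` API on the same forwarded port:

```sh
# Sketch for a vLLM deployment; field names follow vLLM's simple /generate API.
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}",
    "temperature": 0.1,
    "top_p": 0.95,
    "max_tokens": 25
}
EOF
```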
## Clean Up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

### Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

```sh
gcloud container clusters delete ${CLUSTER_NAME} \
  --region=${REGION}
```
5 changes: 5 additions & 0 deletions
tutorials-and-examples/genAI-LLM/finetuning-gemma-2b-on-l4/cloudbuild.yaml
```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'us-docker.pkg.dev/$PROJECT_ID/gemma/finetune-gemma-gpu:1.0.0', '.' ]
images:
- 'us-docker.pkg.dev/$PROJECT_ID/gemma/finetune-gemma-gpu:1.0.0'
```