# KServe on GKE Autopilot
KServe is a highly scalable, standards-based platform for model inference on Kubernetes. Installing KServe on GKE Autopilot can be challenging because of the security policies Autopilot enforces. This tutorial guides you step by step through installing KServe in a GKE Autopilot cluster.

Additionally, this tutorial includes an example of serving Gemma2 with vLLM in KServe, demonstrating how to use GPU resources in KServe on Google Kubernetes Engine (GKE).

## Before you begin

1. Ensure you have a Google Cloud project with billing enabled and have [enabled the GKE API](https://cloud.google.com/kubernetes-engine/docs/how-to/enable-gkee).

2. Ensure you have the following tools installed on your workstation:
* [gcloud CLI](https://cloud.google.com/sdk/docs/install)
* [kubectl](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl)
* [helm](https://helm.sh/docs/intro/install/)

## Set up your GKE Cluster

1. Set the default environment variables:
```bash
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export CLUSTER_NAME=kserve-demo
```

2. Create a GKE Autopilot cluster:
```bash
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID \
    --workload-policies=allow-net-admin

# Get credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --region ${REGION} \
    --project ${PROJECT_ID}
```
If you're using an existing cluster, ensure it is updated to allow net admin permissions. This is necessary for the installation of Istio later on:
```bash
gcloud container clusters update ${CLUSTER_NAME} \
    --region=${REGION} \
    --project=$PROJECT_ID \
    --workload-policies=allow-net-admin
```
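
As a quick, optional sanity check, confirm that kubectl is now pointed at the cluster:
```bash
# Both commands should return without errors if credentials are configured
kubectl cluster-info
kubectl get nodes
```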

## Install KServe
KServe relies on [Knative](https://knative.dev/docs/concepts/) and requires a networking layer. In this tutorial, we will use [Istio](https://istio.io/), the networking layer that integrates best with Knative.

1. Install Knative
```bash
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-core.yaml
```

> **Note**:
> Throughout this tutorial you will see warnings that Autopilot mutated the CRDs. These warnings are safe to ignore.

2. Install Istio
```bash
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
kubectl create namespace istio-system
helm install istio-base istio/base -n istio-system --set defaultRevision=default
helm install istiod istio/istiod -n istio-system --wait
helm install istio-ingressgateway istio/gateway -n istio-system

# Verify the installation
kubectl get deployments -n istio-system

# Example Output
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
istiod   1/1     1            1           58s
```
|
||
3. Install Knative-Istio | ||
```bash | ||
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.15.1/net-istio.yaml | ||
|
||
# Verify the installation | ||
kubectl get pods -n knative-serving | ||
|
||
# Example Output | ||
NAME READY STATUS RESTARTS AGE | ||
activator-749cf94f87-b7p9n 1/1 Running 0 17m | ||
autoscaler-5c764b5f7d-m8zvk 1/1 Running 1 (14m ago) 17m | ||
controller-5649f5bbb7-wvlmk 1/1 Running 4 (13m ago) 17m | ||
net-istio-controller-7f8dfbddb7-d8cmq 1/1 Running 0 18s | ||
net-istio-webhook-54ffc96585-cpgfl 2/2 Running 0 18s | ||
webhook-64c67b4fc-smdtl 1/1 Running 3 (13m ago) 17m | ||
``` | ||
|
||
4. Install DNS | ||
In this tutorial we use Magic DNS. To configure a real DNS, follow the steps [here](https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#__tabbed_2_2). | ||
```bash | ||
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-default-domain.yaml | ||
``` | ||
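
Optionally, confirm that Magic DNS configured the cluster domain. The manifest above runs a job that patches the `config-domain` ConfigMap with an sslip.io domain:
```bash
# An entry ending in sslip.io should appear once the job completes
kubectl get configmap config-domain -n knative-serving -o jsonpath='{.data}'
```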

5. Install [Cert Manager](https://cert-manager.io/docs/installation/), which is required to provision webhook certs for a production-grade installation.
```bash
# Add the Jetstack chart repository if you don't already have it
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.3 --set crds.enabled=true --set global.leaderElection.namespace=cert-manager
```

6. Install KServe and the KServe cluster serving runtimes
```bash
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve.yaml

# Wait until kserve-controller-manager is ready
kubectl rollout status deployment kserve-controller-manager -n kserve

# Install cluster runtimes
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve-cluster-resources.yaml

# View these runtimes
kubectl get ClusterServingRuntimes -n kserve
```

7. GKE Autopilot workloads request accelerators (GPUs) through a `nodeSelector` in the manifest. Knative disables the `nodeSelector` and `tolerations` pod spec fields by default, so enable both:
```bash
kubectl patch configmap/config-features \
    --namespace knative-serving \
    --type merge \
    --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled", "kubernetes.podspec-tolerations":"enabled"}}'

# Restart the Knative webhook so it picks up the new config
kubectl rollout restart deployment webhook -n knative-serving
```
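
To confirm the flags took effect, read them back from the ConfigMap (the backslash escapes the dots in the key name for jsonpath):
```bash
kubectl get configmap config-features -n knative-serving \
    -o jsonpath='{.data.kubernetes\.podspec-nodeselector}'
# Expected output: enabled
```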

After successfully installing KServe, you can explore examples such as a [first inference service](https://kserve.github.io/website/master/get_started/first_isvc/), a [canary rollout](https://kserve.github.io/website/master/modelserving/v1beta1/rollout/canary-example/), the inference [batcher](https://kserve.github.io/website/master/modelserving/batcher/batcher/), and [auto-scaling](https://kserve.github.io/website/master/modelserving/autoscaling/autoscaling/).
In the next step, we'll demonstrate how to deploy Gemma2 using vLLM in KServe on GKE Autopilot.

## Deploy Gemma2 served with vLLM
1. Generate a Hugging Face access token by following [these steps](https://huggingface.co/docs/hub/en/security-tokens). Specify a Name of your choice and a Role of at least **Read**.

2. Make sure you have [accepted the terms](https://huggingface.co/google/gemma-2-2b) for using Gemma2 on Hugging Face.

3. Create a Kubernetes secret with the Hugging Face token
```bash
kubectl create namespace kserve-test

# Specify your Hugging Face token.
export HF_TOKEN=XXX

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: kserve-test
type: Opaque
stringData:
  hf_api_token: ${HF_TOKEN}
EOF
```
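
Optionally, verify that the secret exists before deploying the model:
```bash
kubectl get secret hf-secret -n kserve-test
```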

4. Create the inference service
```bash
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-gemma2
  namespace: kserve-test
spec:
  predictor:
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-l4
      cloud.google.com/gke-accelerator-count: "1"
    model:
      modelFormat:
        name: huggingface
      args:
        - --enable_docs_url=True
        - --model_name=gemma2
        - --model_id=google/gemma-2-2b
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
```

Wait for the service to be ready:
```bash
kubectl get inferenceservice huggingface-gemma2 -n kserve-test
kubectl get pods -n kserve-test

# Replace POD_NAME with the correct pod name.
kubectl events --for pod/POD_NAME -n kserve-test --watch
```
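
If you'd rather block until the service reports Ready instead of polling, `kubectl wait` works against the InferenceService's `Ready` condition. The first startup pulls the serving image and downloads the model weights, so allow a generous timeout:
```bash
kubectl wait inferenceservice/huggingface-gemma2 \
    -n kserve-test \
    --for=condition=Ready \
    --timeout=15m
```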

## Test the Inference Service
1. Find the URL returned by `kubectl get inferenceservice`
```bash
URL=$(kubectl get inferenceservice huggingface-gemma2 -n kserve-test -o jsonpath='{.status.url}')

# The URL should look like this:
# http://huggingface-gemma2.kserve-test.34.121.87.225.sslip.io
```

2. Open the Swagger UI at `$URL/docs`.

3. Try the OpenAI chat API with the example input below. Click **Execute** to see the response.
```json
{
  "model": "gemma2",
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant that speaks like Shakespeare."
    },
    {
      "role": "user",
      "content": "Write a poem about colors"
    }
  ],
  "max_tokens": 30,
  "stream": false
}
```
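
You can also send the same request from the command line. The path below assumes the OpenAI-compatible route (`/openai/v1/chat/completions`) exposed by the KServe huggingface runtime:
```bash
curl -s "${URL}/openai/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gemma2",
      "messages": [
        {"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
        {"role": "user", "content": "Write a poem about colors"}
      ],
      "max_tokens": 30,
      "stream": false
    }'
```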

## Clean up
Delete the GKE cluster.
```bash
gcloud container clusters delete ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID
```