Update README.md to not rely on models chart (#225)
The models chart isn't being released correctly. We can switch back to
using the models chart once the issue is fixed.
samos123 authored Sep 19, 2024
1 parent a645e60 commit a915f96
Showing 1 changed file with 28 additions and 13 deletions.
41 changes: 28 additions & 13 deletions docs/README.md
@@ -61,22 +61,21 @@ Install KubeAI and wait for all components to be ready (may take a minute).
helm install kubeai kubeai/kubeai --wait --timeout 10m
```
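If you want an extra readiness check once `--wait` returns, Helm can report the release status (a quick sketch; assumes the release lives in your current namespace):

```bash
# Show the deployed release; STATUS should read "deployed".
helm status kubeai
```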

````diff
-Install some predefined models.
+Install Gemma 2B using CPU and Ollama:
 
 ```bash
-cat <<EOF > kubeai-models.yaml
-catalog:
-  gemma2-2b-cpu:
-    enabled: true
-    minReplicas: 1
-  qwen2-500m-cpu:
-    enabled: true
-  nomic-embed-text-cpu:
-    enabled: true
+kubectl apply -f - <<EOF
+apiVersion: kubeai.org/v1
+kind: Model
+metadata:
+  name: gemma2-2b-cpu
+spec:
+  features: [TextGeneration]
+  url: ollama://gemma2:2b
+  engine: OLlama
+  resourceProfile: cpu:2
+  minReplicas: 1
 EOF
-
-helm install kubeai-models kubeai/models \
-    -f ./kubeai-models.yaml
 ```
````
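To confirm the manifest applied, you can list the Model resources KubeAI now manages (a minimal sketch; assumes your kubectl context points at the quickstart cluster):

```bash
# gemma2-2b-cpu should appear in the list.
kubectl get models
```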

Before progressing to the next steps, start a watch on Pods in a standalone terminal to see how KubeAI deploys models.
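A typical watch command for that (one possible sketch; an equivalent such as `watch kubectl get pods` works just as well):

```bash
# Streams Pod changes so you can see KubeAI create a Pod per model replica.
kubectl get pods --watch
```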
@@ -99,6 +98,22 @@ Now open your browser to [localhost:8000](http://localhost:8000) and select the

#### Scale up Qwen2 from Zero

````diff
+Deploy Qwen2 with minReplicas set to 0:
+```
+kubectl apply -f - <<EOF
+apiVersion: kubeai.org/v1
+kind: Model
+metadata:
+  name: qwen2-500m-cpu
+spec:
+  features: [TextGeneration]
+  url: ollama://qwen2:0.5b
+  engine: OLlama
+  resourceProfile: cpu:1
+  minReplicas: 0
+EOF
+```
````
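Since `minReplicas` is 0, no Pod for this model should exist until the first request arrives; a quick way to see that before sending any traffic (a sketch, under the same cluster assumption as above):

```bash
# Expect the gemma2 Pod, but nothing yet for qwen2-500m-cpu.
kubectl get pods
```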

If you go back to the browser and start a chat with Qwen2, you will notice that it will take a while to respond at first. This is because we set `minReplicas: 0` for this model and KubeAI needs to spin up a new Pod (you can verify with `kubectl get models -oyaml qwen2-500m-cpu`).
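Any request addressed to the model works as the scale-from-zero trigger, not just the browser chat. Here is a hedged sketch of a request against KubeAI's OpenAI-compatible API; the `/openai/v1/chat/completions` path and the localhost:8000 port-forward are assumptions based on the quickstart setup, not confirmed by this diff:

```bash
# Hypothetical request: the first call blocks while KubeAI scales qwen2-500m-cpu from 0 to 1.
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-500m-cpu",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```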

NOTE: Autoscaling after the initial scale-from-zero is not yet supported for the Ollama backend used in this local quickstart. KubeAI relies on backend-specific metrics, and the Ollama project has an open issue: https://github.com/ollama/ollama/issues/3144. To see autoscaling in action, check out the [GKE install guide](./installation/gke.md), which uses the vLLM backend and autoscales across GPU resources.
