Merge pull request #121 from milo157/main: Added Cerebrium for self-deployment

Showing 1 changed file with 162 additions and 0 deletions.
---
id: cerebrium
title: Deploy with Cerebrium
sidebar_position: 3.34
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
[Cerebrium](https://www.cerebrium.ai/) is a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications. It offers serverless GPUs with low cold-start times across more than 12 varieties of GPU chips; instances auto-scale, and you only pay for the compute you use.
## Setup Cerebrium

First, install Cerebrium and log in to authenticate.
```bash
pip install cerebrium
cerebrium login
```
Then create your first project:
```bash
cerebrium init mistral-vllm
```
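The init command scaffolds the project files used in the rest of this guide. As a quick check (the exact listing may differ by Cerebrium version; **main.py** and **cerebrium.toml** are the two files we edit below):

```bash
cd mistral-vllm
ls
# cerebrium.toml  main.py
```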
## Setup Environment and Hardware

You set up your environment and hardware in the **cerebrium.toml** file that was created by the init command above. Here we use an Ampere A10 GPU.
You can read more [here](https://docs.cerebrium.ai/cerebrium/environments/custom-images).
```toml
[cerebrium.deployment]
name = "mistral-vllm"
python_version = "3.11"
docker_base_image_url = "debian:bookworm-slim"
include = "[./*, main.py, cerebrium.toml]"
exclude = "[.*]"

[cerebrium.hardware]
cpu = 2
memory = 14.0
compute = "AMPERE_A10"
gpu_count = 1
provider = "aws"
region = "us-east-1"

[cerebrium.dependencies.pip]
sentencepiece = "latest"
torch = ">=2.0.0"
vllm = "latest"
transformers = ">=4.35.0"
accelerate = "latest"
xformers = "latest"
```
## Setup inference

Running code in Cerebrium is like writing normal Python, with no special syntax. In your **main.py**, specify the following:
```python
from vllm import LLM, SamplingParams
from huggingface_hub import login
from cerebrium import get_secret

# Your Hugging Face token (HF_AUTH_TOKEN) should be stored in your project secrets on your Cerebrium dashboard
login(token=get_secret("HF_AUTH_TOKEN"))

# Initialize the model
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", dtype="bfloat16", max_model_len=20000, gpu_memory_utilization=0.9)
```
We need to add our Hugging Face token to our [Cerebrium Secrets](https://docs.cerebrium.ai/cerebrium/environments/using-secrets) since using the Mistral model requires authentication. Make sure the Hugging Face token you add has **WRITE** permissions. On the first deploy, the model is downloaded and stored on disk, so subsequent calls load it straight from disk.
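Before saving the token as a secret, you can sanity-check it locally with `huggingface_hub`; a minimal sketch, where the token string is a placeholder for your own:

```python
from huggingface_hub import login, whoami

login(token="hf_...")  # placeholder: paste your own Hugging Face token
print(whoami()["name"])  # prints your username if the token is valid
```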
Add the following to your main.py:
```python
def run(prompt: str, temperature: float = 0.8, top_p: float = 0.75, top_k: int = 40, max_tokens: int = 256, frequency_penalty: float = 1.0):
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
        frequency_penalty=frequency_penalty
    )

    # Generate a completion for the incoming prompt
    outputs = llm.generate([prompt], sampling_params)

    generated_text = []
    for output in outputs:
        generated_text.append(output.outputs[0].text)

    return {"result": generated_text}
```
Every function in Cerebrium is callable through an API endpoint. Code at the top level (i.e. not inside a function) runs only when the container first spins up, so subsequent calls simply run the code defined in the function you call.

Our final main.py should look like this:
```python
from vllm import LLM, SamplingParams
from huggingface_hub import login
from cerebrium import get_secret

# Your Hugging Face token (HF_AUTH_TOKEN) should be stored in your project secrets on your Cerebrium dashboard
login(token=get_secret("HF_AUTH_TOKEN"))

# Initialize the model
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3", dtype="bfloat16", max_model_len=20000, gpu_memory_utilization=0.9)

def run(prompt: str, temperature: float = 0.8, top_p: float = 0.75, top_k: int = 40, max_tokens: int = 256, frequency_penalty: float = 1.0):
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
        frequency_penalty=frequency_penalty
    )

    # Generate a completion for the incoming prompt
    outputs = llm.generate([prompt], sampling_params)

    generated_text = []
    for output in outputs:
        generated_text.append(output.outputs[0].text)

    return {"result": generated_text}
```
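Because this is plain Python, you can smoke-test the handler before deploying. A hedged sketch: it assumes a local GPU large enough for the model, and since `get_secret` resolves secrets on the Cerebrium platform, a purely local run would need the login line swapped for your token directly:

```python
# Hypothetical local smoke test: append to main.py and run `python main.py`
if __name__ == "__main__":
    print(run("What is the capital city of France?", max_tokens=32))
```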
## Run on the cloud

```bash
cerebrium deploy
```
You will see your application deploy, install pip packages, and download the model. Once complete, it will output a cURL request you can use to call your endpoint. Just remember to end the URL with the function you would like to call, in this case `/run`.
```bash
curl --location --request POST 'https://api.cortex.cerebrium.ai/v4/p-<YOUR PROJECT ID>/mistral-vllm/run' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "prompt": "What is the capital city of France?"
}'
```
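If you prefer calling the endpoint from Python, here is an equivalent request using the `requests` library; the project ID and token are placeholders, as in the cURL example, and the nested result lookup follows the response shape shown below:

```python
import requests

url = "https://api.cortex.cerebrium.ai/v4/p-<YOUR PROJECT ID>/mistral-vllm/run"
headers = {
    "Authorization": "Bearer <YOUR TOKEN HERE>",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json={"prompt": "What is the capital city of France?"})
data = response.json()
print(data["result"]["result"][0])  # the generated completions sit under result -> result
```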
You should then get a response like this:
```json
{
  "run_id": "nZL6mD8q66u4lHTXcqmPCc6pxxFwn95IfqQvEix0gHaOH4gkHUdz1w==",
  "message": "Finished inference request with run_id: `nZL6mD8q66u4lHTXcqmPCc6pxxFwn95IfqQvEix0gHaOH4gkHUdz1w==`",
  "result": {
    "result": ["\nA: Paris"]
  },
  "status_code": 200,
  "run_time_ms": 151.24988555908203
}
```