Ollama

This example uses Ollama (https://ollama.ai/) to deploy a model.

Get model weights from Triton

On Triton:

module use /scratch/shareddata/modules/core-models

# Choose the model that we want to use
module load model-llama2/13b-chat

# Choose the llama.cpp model quantization we want to use
module load model-llama.cpp/q4_1-2023-08-28

# Get the path to model weights
echo $MODEL_WEIGHTS
# Example output: /scratch/shareddata/LLMs_tools/models/llama2-llama.cpp-2023-08-28/llama-2-13b-chat/ggml-model-q4_1.gguf
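If you are not sure which models or quantizations are provided, you can list the matching modules after the module use line (standard Lmod command; the model- prefix simply follows the module names used above):

module avail model-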

On the machine where you want to run Ollama:

# Copy weights from Triton to your machine:
scp triton.aalto.fi:/scratch/shareddata/LLMs_tools/models/llama2-llama.cpp-2023-08-28/llama-2-13b-chat/ggml-model-q4_1.gguf llama-13b-chat-q4_1.gguf

# Get ollama
curl -L https://ollama.ai/download/ollama-linux-amd64 -o ollama
chmod +x ollama
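
# Optional sanity check: confirm the downloaded binary runs
./ollama --version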

Create a file called Modelfile with the following contents:

FROM ./llama-13b-chat-q4_1.gguf
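
The Modelfile can also set sampling parameters and a system prompt; a minimal sketch (the values below are illustrative, not tuned recommendations):

FROM ./llama-13b-chat-q4_1.gguf

# Sampling temperature: lower is more deterministic
PARAMETER temperature 0.7

# System prompt prepended to every conversation
SYSTEM You are a helpful assistant.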

In one terminal, start up ollama:

./ollama serve
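
By default the server listens on 127.0.0.1:11434. If that port is already in use (for example on a shared machine), you can bind to a different one with the OLLAMA_HOST environment variable; the same variable tells the client commands in the following steps where to find the server:

OLLAMA_HOST=127.0.0.1:11435 ./ollama serve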

In another terminal, register the model with the Ollama server:

./ollama create llama-13b-chat-q4_1 -f Modelfile

Run the model:

./ollama run llama-13b-chat-q4_1
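
The running server also exposes an HTTP API on the same port, so you can query the model from scripts instead of the interactive prompt. A minimal example with curl (the prompt text is just a placeholder):

# Send a prompt to the local Ollama server; the response is streamed as JSON lines
curl http://localhost:11434/api/generate -d '{
  "model": "llama-13b-chat-q4_1",
  "prompt": "Why is the sky blue?"
}'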

For more information on Ollama installation, see the Ollama website: https://ollama.ai/