Description
This PR adds support for the `"granite"` and `"granitemoe"` architectures in order to support IBM's Granite 3.0. The changes mirror those added in `llama.cpp` upstream:

- `"granite"`: IBM Granite Architecture ggerganov/llama.cpp#9412
- `"granitemoe"`: IBM Granite MoE Architecture ggerganov/llama.cpp#9438

These models are currently available via HuggingFace and Ollama:

- granite3-dense (`"granite"`): https://ollama.com/library/granite3-dense
- granite3-moe (`"granitemoe"`): https://ollama.com/library/granite3-moe
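If you just want to try them, the Ollama library names above can be pulled directly (the commands below are illustrative, taken from the library pages linked above; exact tags may differ):

```sh
# Pull the Granite 3.0 GGUFs from the Ollama library
ollama pull granite3-dense
ollama pull granite3-moe
```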
Testing
I did my development on a Mac M3 without `gmake` natively installed. To avoid a system-level install, I wrapped my dev environment in `docker` with the following two scripts:

- `build_dockerized.sh`
- `build_in_docker.sh`
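The scripts themselves aren't reproduced here; as a rough sketch of the idea (the image name, base image, and build command are my illustrative assumptions, not the actual script contents from this PR), they amount to something like:

```sh
# build_dockerized.sh (sketch): build a dev image with gmake and a C/C++
# toolchain so nothing has to be installed on the host.
docker build -t llamafile-dev - <<'EOF'
FROM debian:bookworm
RUN apt-get update && apt-get install -y build-essential git zip curl
WORKDIR /src
EOF

# build_in_docker.sh (sketch): run the build inside that image, mounting the
# source tree and a models directory; the real script also handles packaging
# the GGUF into the output llamafile, which is omitted here.
docker run --rm -it -v "$PWD:/src" -v "$HOME/models:/models" -w /src \
  llamafile-dev make -j8
```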
With these scripts, my workflow was:

1. Download the model (`ollama pull`, then grab the `$HOME/.ollama/models/blobs/...` blob for the GGUF file)
2. Set up the dockerized build environment (`./build_dockerized.sh`)
3. Build the `llamafile` inside the docker shell (`./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b`)
4. Run the `llamafile` outside the docker shell (`./granite3-dense-2b.llamafile -p "tell me a story"`)
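Put end-to-end, the loop looks roughly like this (the blob name is a placeholder and the `:2b` tag is illustrative, not something specified in this PR):

```sh
# End-to-end sketch of the workflow above; <blob-sha256> stands in for
# whichever blob under ~/.ollama/models/blobs holds the GGUF.
ollama pull granite3-dense:2b
cp "$HOME/.ollama/models/blobs/<blob-sha256>" /models/granite-3.0-2b-instruct.Q4_K_M.gguf
./build_dockerized.sh
./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b
./granite3-dense-2b.llamafile -p "tell me a story"
```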
Open Questions
Solved! I found the PR added after mine in `llama.cpp` to update the chat template to support `"granite"`: ggerganov/llama.cpp#10013

When running in interactive mode, the chat template seems to be using different special tokens besides those defined in the `chat_template` metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
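For reference, one way to see what the GGUF actually carries is to dump its metadata and look for `tokenizer.chat_template` and the special-token keys. This assumes the `gguf` Python package from llama.cpp's gguf-py is installed, which provides a `gguf-dump` command:

```sh
# Inspect the chat template and special-token metadata embedded in the GGUF
pip install gguf
gguf-dump /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep -i -A2 chat_template
gguf-dump /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep -i token
```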