Description
This PR adds support for the `"granite"` and `"granitemoe"` architectures in order to support IBM's Granite 3.0. The changes mirror those added in `llama.cpp` upstream:

- `"granite"`: IBM Granite Architecture ggerganov/llama.cpp#9412
- `"granitemoe"`: IBM Granite MoE Architecture ggerganov/llama.cpp#9438

These models are currently available via HuggingFace and Ollama:

- granite3-dense (`"granite"`): https://ollama.com/library/granite3-dense
- granite3-moe (`"granitemoe"`): https://ollama.com/library/granite3-moe
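If you just want to try them, the Ollama library names above can be pulled directly (the commands below are illustrative, taken from the library pages linked above; exact tags may differ):

```sh
# Pull the Granite 3.0 GGUFs from the Ollama library
ollama pull granite3-dense
ollama pull granite3-moe
```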
Testing
I did my development on a Mac M3 without `gmake` natively installed. To avoid a system-level install, I wrapped my dev environment in `docker` with the following two scripts:

- `build_dockerized.sh`
- `build_in_docker.sh`
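The scripts themselves aren't reproduced here; as a rough sketch of the idea (the image name, base image, and build command are my illustrative assumptions, not the actual script contents from this PR), they amount to something like:

```sh
# build_dockerized.sh (sketch): build a dev image with gmake and a C/C++
# toolchain so nothing has to be installed on the host.
docker build -t llamafile-dev - <<'EOF'
FROM debian:bookworm
RUN apt-get update && apt-get install -y build-essential git zip curl
WORKDIR /src
EOF

# build_in_docker.sh (sketch): run the build inside that image, mounting the
# source tree and a models directory; the real script also handles packaging
# the GGUF into the output llamafile, which is omitted here.
docker run --rm -it -v "$PWD:/src" -v "$HOME/models:/models" -w /src \
  llamafile-dev make -j8
```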
With these scripts, my workflow was:

1. Download the model (`ollama pull`, then grab the `$HOME/.ollama/models/blobs/...` blob for the GGUF file)
2. Set up the dockerized build environment (`./build_dockerized.sh`)
3. Build the `llamafile` inside the docker shell (`./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b`)
4. Run the `llamafile` outside the docker shell (`./granite3-dense-2b.llamafile -p "tell me a story"`)
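Put end-to-end, the loop looks roughly like this (the blob name is a placeholder and the `:2b` tag is illustrative, not something specified in this PR):

```sh
# End-to-end sketch of the workflow above; <blob-sha256> stands in for
# whichever blob under ~/.ollama/models/blobs holds the GGUF.
ollama pull granite3-dense:2b
cp "$HOME/.ollama/models/blobs/<blob-sha256>" /models/granite-3.0-2b-instruct.Q4_K_M.gguf
./build_dockerized.sh
./build_in_docker.sh /models/granite-3.0-2b-instruct.Q4_K_M.gguf granite3-dense-2b
./granite3-dense-2b.llamafile -p "tell me a story"
```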
Open Questions
Solved! I found the PR added after mine in `llama.cpp` to update the chat template to support `"granite"`: ggerganov/llama.cpp#10013

When running in interactive mode, the chat template seems to be using different special tokens besides those defined in the `chat_template` metadata in the GGUF file. I haven't dug in enough yet to understand whether this is something that can be pulled automatically from the GGUF, or whether there's an additional place where the Granite architectures will need to explicitly indicate their chat templates.
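For reference, one way to see what the GGUF actually carries is to dump its metadata and look for `tokenizer.chat_template` and the special-token keys. This assumes the `gguf` Python package from llama.cpp's gguf-py is installed, which provides a `gguf-dump` command:

```sh
# Inspect the chat template and special-token metadata embedded in the GGUF
pip install gguf
gguf-dump /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep -i -A2 chat_template
gguf-dump /models/granite-3.0-2b-instruct.Q4_K_M.gguf | grep -i token
```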